Add new hardreg PRE pass

Message ID 33dd6e39-3fb4-1836-271d-b1740e992a53@e124511.cambridge.arm.com
State New
Series: Add new hardreg PRE pass

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm fail Build failed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed

Commit Message

Andrew Carlotti Oct. 31, 2024, 6:29 p.m. UTC
  This pass is used to optimise assignments to the FPMR register in
aarch64.  I chose to implement this as a middle-end pass because it
mostly reuses the existing RTL PRE code within gcse.cc.
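
A target opts in by defining HARDREG_PRE_REGNOS in its backend header.  A
minimal sketch of what the aarch64 definition could look like is below; the
list is zero-terminated to match the loop in execute_hardreg_pre, and
FPM_REGNUM is an assumed name for the fpmr register number rather than
something taken from this patch:

  /* Hard registers to run hardreg PRE over, terminated by 0.
     FPM_REGNUM is assumed to be aarch64's register number for fpmr.  */
  #define HARDREG_PRE_REGNOS { FPM_REGNUM, 0 }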

Compared to RTL PRE, the key difference in this new pass is that we
insert new writes directly to the destination hardreg, instead of
writing to a new pseudo-register and copying the result later.  This
requires changes to the analysis portion of the pass, because sets
cannot be moved before existing instructions that set, use or clobber
the hardreg, and the value becomes unavailable after any uses or
clobbers of the hardreg.

This patch would currently break any debug instructions that use the
value of fpmr in a region of code where that value is changed by this
pass.  I haven't worked out the best way to fix this, but I suspect the
issue is uncommon and tricky enough that it would be best to just drop
those debug instructions.

I've bootstrapped and regression tested this on aarch64, and it should be NFC
on other targets.  Aside from this, my testing so far has involved hacking in a
single FP8 intrinsic and testing various parameters and control flow
structures, and checking both the codegen and the LCM bitmaps.  I intend to
write better and more comprehensive tests once there are some real intrinsic
implementations available to use.


Is this approach good?  Apart from fixing the debug instructions and
adding tests, is there anything else I need to change?


gcc/ChangeLog:

	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
	* gcse.cc (doing_hardreg_pre_p): New global variable.
	(current_hardreg_regno): Ditto.
	(compute_local_properties): Unset transp for hardreg clobbers.
	(prune_hardreg_uses): New.
	(want_to_gcse_p): Always return true for hardreg PRE.
	(hash_scan_set): Add checks for hardreg uses/clobbers.
	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
	(record_last_mem_set_info): Ditto.
	(compute_hash_table_work): Record hardreg uses.
	(prune_expressions): Mark hardreg sets as call-clobbered.
	(compute_pre_data): Add call to prune_hardreg_uses.
	(pre_expr_reaches_here_p_work): Add comment.
	(insert_insn_start_basic_block): New functions.
	(pre_edge_insert): Don't add hardreg sets to predecessor block.
	(pre_delete): Use hardreg for the reaching reg.
	(pre_gcse): Don't insert copies for hardreg PRE.
	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
	(execute_hardreg_pre): New.
	(class pass_hardreg_pre): New.
	(pass_hardreg_pre::gate): New.
	(make_pass_hardreg_pre): New.
	* passes.def (pass_hardreg_pre): New pass.
	* tree-pass.h (make_pass_hardreg_pre): New.
  

Comments

Richard Sandiford Nov. 12, 2024, 10:42 p.m. UTC | #1
Sorry for the slow review.  I think Jeff's much better placed to comment
on this than I am, but here's a stab.  Mostly it looks really good to me
FWIW.

Andrew Carlotti <andrew.carlotti@arm.com> writes:
> This pass is used to optimise assignments to the FPMR register in
> aarch64.  I chose to implement this as a middle-end pass because it
> mostly reuses the existing RTL PRE code within gcse.cc.
>
> Compared to RTL PRE, the key difference in this new pass is that we
> insert new writes directly to the destination hardreg, instead of
> writing to a new pseudo-register and copying the result later.  This
> requires changes to the analysis portion of the pass, because sets
> cannot be moved before existing instructions that set, use or clobber
> the hardreg, and the value becomes unavailable after any uses of
> clobbers of the hardreg.
>
> This patch would currently break any debug instructions that use the
> value of fpmr in a region of code where that value is changed by this
> pass.  I haven't worked out the best way to fix this, but I suspect the
> issue is uncommon and tricky enough that it would be best to just drop
> those debug instructions.

Yeah, good question, and pass on that :)  Will need to think more about it.

> I've bootstrapped and regression tested this on aarch64, and it should be NFC
> on other targets.  Aside from this, my testing so far has involved hacking in a
> single FP8 intrinsic and testing various parameters and control flow
> structures, and checking both the codegen and the LCM bitmaps.  I intend to
> write better and more comprehensive tests once there are some real intrinsic
> implementations available to use.
>
>
> Is this approach good?  Apart from fixing the debug instructions and
> adding tests, is there anything else I need to change?
>
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> 	(current_hardreg_regno): Ditto.
> 	(compute_local_properties): Unset transp for hardreg clobbers.
> 	(prune_hardreg_uses): New.
> 	(want_to_gcse_p): Always return true for hardreg PRE.
> 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> 	(record_last_mem_set_info): Ditto.
> 	(compute_hash_table_work): Record hardreg uses.
> 	(prune_expressions): Mark hardreg sets as call-clobbered.
> 	(compute_pre_data): Add call to prune_hardreg_uses.
> 	(pre_expr_reaches_here_p_work): Add comment.
> 	(insert_insn_start_basic_block): New functions.
> 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> 	(pre_delete): Use hardreg for the reaching reg.
> 	(pre_gcse): Don't insert copies for hardreg PRE.
> 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> 	(execute_hardreg_pre): New.
> 	(class pass_hardreg_pre): New.
> 	(pass_hardreg_pre::gate): New.
> 	(make_pass_hardreg_pre): New.
> 	* passes.def (pass_hardreg_pre): New pass.
> 	* tree-pass.h (make_pass_hardreg_pre): New.
>
> [...]
> @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>  	}
>      }
>  }
> +
> +/* A hardreg set is not transparent in a block if there are any uses of that
> +   hardreg.  This filters the results of compute_local_properties, after the
> +   result of that function has been used to define the kills bitmap.

I think this is mostly my ignorance of the code, and would be obvious
if I tried it out locally, but: why do we need to do this after
computing the kills bitmap?  For mode-switching, the kills bitmap
is the inverse of the transparency bitmap, but it sounds like here
you want the kills bitmap to be more selective.

> +
> +   TRANSP is the destination sbitmap to be updated.
> +
> +   TABLE controls which hash table to look at.  */
> +
> +static void
> +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> +{
> +  unsigned int i;
> +  gcc_assert (doing_hardreg_pre_p);
> +
> +  for (i = 0; i < table->size; i++)
> +    {
> +      struct gcse_expr *expr;
> +
> +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> +	{
> +	  int indx = expr->bitmap_index;
> +	  df_ref def;
> +
> +	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> +	       def;
> +	       def = DF_REF_NEXT_REG (def))
> +	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> +	}
> +    }
> +}
>  
>  /* Hash table support.  */
 

> @@ -747,6 +804,9 @@ static basic_block current_bb;
>  static bool
>  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
>  {
> +  if (doing_hardreg_pre_p)
> +    return true;
> +
>  #ifdef STACK_REGS
>    /* On register stack architectures, don't GCSE constants from the
>       constant pool, as the benefits are often swamped by the overhead
> @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
>        }
>  
>      case MEM:
> -      if (! flag_gcse_lm
> +      if (! flag_gcse_lm || doing_hardreg_pre_p

This test occurs often enough that I think it's worth splitting out.
Something like: !do_load_motion ()?

>  	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
>  				     x, avail_p))
>  	return false;
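
A minimal sketch of such a helper, assuming it would sit next to the other
small predicates in gcse.cc (do_load_motion is just the name suggested
above, not something in the patch):

  /* Return true if this pass should perform load motion.  The hardreg PRE
     pass never moves memory operations, so it always answers false.  */
  static bool
  do_load_motion ()
  {
    return flag_gcse_lm && !doing_hardreg_pre_p;
  }

The MEM case above would then start with
"if (!do_load_motion () || load_killed_in_block_p (...))".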
> [...]
> @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
>  	    }
>  
>  	  note_stores (insn, record_last_set_info, insn);
> +
> +	  if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> +	    {
> +	      /* We need to record the first use of a hardreg to determine if a
> +		 set of that hardreg is anticipatable.  */
> +	      df_ref ref;
> +	      FOR_EACH_INSN_USE (ref, insn)
> +		if (DF_REF_REGNO (ref) == current_hardreg_regno)
> +		  {
> +		    hardreg_last_bb = current_bb;
> +		    hardreg_first_use = DF_INSN_LUID (insn);
> +		  }
> +	    }
>  	}

Couldn't we instead check whether the register is live on entry to the block?
That would avoid the extra bit of state.
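
For reference, a rough sketch of that alternative, assuming df live-in
information is valid at this point in the pass (hypothetical code, not part
of the patch):

  /* Instead of tracking hardreg_last_bb/hardreg_first_use by hand, ask the
     df framework whether the hardreg's incoming value is live on entry to
     the current block.  */
  bool hardreg_live_on_entry
    = bitmap_bit_p (df_get_live_in (current_bb), current_hardreg_regno);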

>  
>        /* The next pass builds the hash table.  */
> @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
>      {
>        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
>  	{
> +	  /* For hardreg pre, we assume that all relevant hardregs are
> +	     call-clobbered, and set all bits in prune_exprs if the reg is call
> +	     clobbered.

Not sure I understand this.  But...

>                        If the hardreg were merely call-used, then we would
> +	     need to remove the expression from the anticipatable and
> +	     transparent bitmaps only (after using this to compute the kills
> +	     bitmap).  */
> +
> +	  if (doing_hardreg_pre_p)
> +	    {
> +	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
> +	      continue;
> +	    }
> +

...the effect seems to be to set every bit of prune_exprs, in which
case it might be easier to skip this loop entirely and adjust the later
one to use bitmap_set_range.
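
Something along these lines, perhaps (a sketch only; it assumes prune_exprs
still has expr_hash_table.n_elems bits, as in the current code):

  /* For hardreg PRE every candidate expression sets the call-clobbered
     hardreg, so all of them must be pruned around calls.  */
  if (doing_hardreg_pre_p)
    bitmap_set_range (prune_exprs, 0, expr_hash_table.n_elems);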

>  	  /* Note potentially trapping expressions.  */
>  	  if (may_trap_p (expr->expr))
>  	    {
> [...]
> @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
>    return 0;
>  }
>  
> +static unsigned int
> +execute_hardreg_pre (void)
> +{
> +  doing_hardreg_pre_p = true;
> +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> +  /* It's possible to avoid this loop, but it isn't worth doing so until
> +     hardreg PRE is used for multiple hardregs.  */

Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
to structure the code so that this just works?  Where are the main
difficulties?  Having to maintain a list of which expressions are
associated with which register, and therefore which expressions
mutually kill each other?

> +  for (int i = 0; regnos[i] != 0; i++)
> +    {
> +      int changed;
> +      current_hardreg_regno = regnos[i];
> +      if (dump_file)
> +	fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
> +		current_hardreg_regno);
> +      delete_unreachable_blocks ();
> +      df_analyze ();
> +      changed = one_pre_gcse_pass ();
> +      flag_rerun_cse_after_global_opts |= changed;

Is this appropriate for the new pass?  We're not really exposing general
CSE opportunities.

> +      if (changed)
> +	cleanup_cfg (0);
> +    }
> +  doing_hardreg_pre_p = false;
> +  return 0;
> +}
> +
>  static unsigned int
>  execute_rtl_hoist (void)
>  {
> @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
>  
>  namespace {
>  
> +const pass_data pass_data_hardreg_pre =
> +{
> +  RTL_PASS, /* type */
> +  "hardreg_pre", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_PRE, /* tv_id */
> +  PROP_cfglayout, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_hardreg_pre : public rtl_opt_pass
> +{
> +public:
> +  pass_hardreg_pre (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  bool gate (function *) final override;
> +  unsigned int execute (function *)  final override
> +  {
> +    return execute_hardreg_pre ();
> +  }
> +
> +}; // class pass_rtl_pre
> +
> +bool
> +pass_hardreg_pre::gate (function *fun)
> +{
> +#ifdef HARDREG_PRE_REGNOS
> +  return optimize > 0
> +    && !fun->calls_setjmp;

Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
have expected them to be needed now, since the modern cfg framework
should represent setjmp correctly.  Jeff, do you agree?  I'll try
removing them and see what breaks...

Thanks,
Richard

> +#else
> +  return false;
> +#endif
> +}
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_hardreg_pre (gcc::context *ctxt)
> +{
> +  return new pass_hardreg_pre (ctxt);
> +}
> +
> +namespace {
> +
>  const pass_data pass_data_rtl_hoist =
>  {
>    RTL_PASS, /* type */
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_rtl_cprop);
>        NEXT_PASS (pass_rtl_pre);
>        NEXT_PASS (pass_rtl_hoist);
> +      NEXT_PASS (pass_hardreg_pre);
>        NEXT_PASS (pass_rtl_cprop);
>        NEXT_PASS (pass_rtl_store_motion);
>        NEXT_PASS (pass_cse_after_global_opts);
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
  
Richard Biener Nov. 13, 2024, 9:15 a.m. UTC | #2
On Tue, 12 Nov 2024, Richard Sandiford wrote:

> Sorry for the slow review.  I think Jeff's much better placed to comment
> on this than I am, but here's a stab.  Mostly it looks really good to me
> FWIW.
> 
> Andrew Carlotti <andrew.carlotti@arm.com> writes:
> > This pass is used to optimise assignments to the FPMR register in
> > aarch64.  I chose to implement this as a middle-end pass because it
> > mostly reuses the existing RTL PRE code within gcse.cc.
> >
> > Compared to RTL PRE, the key difference in this new pass is that we
> > insert new writes directly to the destination hardreg, instead of
> > writing to a new pseudo-register and copying the result later.  This
> > requires changes to the analysis portion of the pass, because sets
> > cannot be moved before existing instructions that set, use or clobber
> > the hardreg, and the value becomes unavailable after any uses of
> > clobbers of the hardreg.
> >
> > This patch would currently break any debug instructions that use the
> > value of fpmr in a region of code where that value is changed by this
> > pass.  I haven't worked out the best way to fix this, but I suspect the
> > issue is uncommon and tricky enough that it would be best to just drop
> > those debug instructions.
> 
> Yeah, good question, and pass on that :)  Will need to think more about it.
> 
> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
> > on other targets.  Aside from this, my testing so far has involved hacking in a
> > single FP8 intrinsic and testing various parameters and control flow
> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
> > write better and more comprehensive tests once there are some real intrinsic
> > implementations available to use.
> >
> >
> > Is this approach good?  Apart from fixing the debug instructions and
> > adding tests, is there anything else I need to change?
> >
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> > 	(current_hardreg_regno): Ditto.
> > 	(compute_local_properties): Unset transp for hardreg clobbers.
> > 	(prune_hardreg_uses): New.
> > 	(want_to_gcse_p): Always return true for hardreg PRE.
> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> > 	(record_last_mem_set_info): Ditto.
> > 	(compute_hash_table_work): Record hardreg uses.
> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
> > 	(compute_pre_data): Add call to prune_hardreg_uses.
> > 	(pre_expr_reaches_here_p_work): Add comment.
> > 	(insert_insn_start_basic_block): New functions.
> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> > 	(pre_delete): Use hardreg for the reaching reg.
> > 	(pre_gcse): Don't insert copies for hardreg PRE.
> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> > 	(execute_hardreg_pre): New.
> > 	(class pass_hardreg_pre): New.
> > 	(pass_hardreg_pre::gate): New.
> > 	(make_pass_hardreg_pre): New.
> > 	* passes.def (pass_hardreg_pre): New pass.
> > 	* tree-pass.h (make_pass_hardreg_pre): New.
> >
> > [...]
> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >  	}
> >      }
> >  }
> > +
> > +/* A hardreg set is not transparent in a block if there are any uses of that
> > +   hardreg.  This filters the results of compute_local_properties, after the
> > +   result of that function has been used to define the kills bitmap.
> 
> I think this is mostly my ignorance of the code, and would be obvious
> if I tried it out locally, but: why do we need to do this after
> computing the kills bitmap?  For mode-switching, the kills bitmap
> is the inverse of the transparency bitmap, but it sounds like here
> you want the kills bitmap to be more selective.
> 
> > +
> > +   TRANSP is the destination sbitmap to be updated.
> > +
> > +   TABLE controls which hash table to look at.  */
> > +
> > +static void
> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> > +{
> > +  unsigned int i;
> > +  gcc_assert (doing_hardreg_pre_p);
> > +
> > +  for (i = 0; i < table->size; i++)
> > +    {
> > +      struct gcse_expr *expr;
> > +
> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> > +	{
> > +	  int indx = expr->bitmap_index;
> > +	  df_ref def;
> > +
> > +	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> > +	       def;
> > +	       def = DF_REF_NEXT_REG (def))
> > +	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> > +	}
> > +    }
> > +}
> >  
> >  /* Hash table support.  */
>  
> 
> > @@ -747,6 +804,9 @@ static basic_block current_bb;
> >  static bool
> >  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> >  {
> > +  if (doing_hardreg_pre_p)
> > +    return true;
> > +
> >  #ifdef STACK_REGS
> >    /* On register stack architectures, don't GCSE constants from the
> >       constant pool, as the benefits are often swamped by the overhead
> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
> >        }
> >  
> >      case MEM:
> > -      if (! flag_gcse_lm
> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
> 
> This test occurs often enough that I think it's worth splitting out.
> Something like: !do_load_motion ()?
> 
> >  	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
> >  				     x, avail_p))
> >  	return false;
> > [...]
> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> >  	    }
> >  
> >  	  note_stores (insn, record_last_set_info, insn);
> > +
> > +	  if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> > +	    {
> > +	      /* We need to record the first use of a hardreg to determine if a
> > +		 set of that hardreg is anticipatable.  */
> > +	      df_ref ref;
> > +	      FOR_EACH_INSN_USE (ref, insn)
> > +		if (DF_REF_REGNO (ref) == current_hardreg_regno)
> > +		  {
> > +		    hardreg_last_bb = current_bb;
> > +		    hardreg_first_use = DF_INSN_LUID (insn);
> > +		  }
> > +	    }
> >  	}
> 
> Couldn't we instead check whether the register is live on entry to the block?
> That would avoid the extra bit of state.
> 
> >  
> >        /* The next pass builds the hash table.  */
> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
> >      {
> >        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
> >  	{
> > +	  /* For hardreg pre, we assume that all relevant hardregs are
> > +	     call-clobbered, and set all bits in prune_exprs if the reg is call
> > +	     clobbered.
> 
> Not sure I understand this.  But...
> 
> >                        If the hardreg were merely call-used, then we would
> > +	     need to remove the expression from the anticipatable and
> > +	     transparent bitmaps only (after using this to compute the kills
> > +	     bitmap).  */
> > +
> > +	  if (doing_hardreg_pre_p)
> > +	    {
> > +	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
> > +	      continue;
> > +	    }
> > +
> 
> ...the effect seems to be to set every bit of prune_exprs, in which
> case it might be easier to skip this loop entirely and adjust the later
> one to use bitmap_set_range.
> 
> >  	  /* Note potentially trapping expressions.  */
> >  	  if (may_trap_p (expr->expr))
> >  	    {
> > [...]
> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
> >    return 0;
> >  }
> >  
> > +static unsigned int
> > +execute_hardreg_pre (void)
> > +{
> > +  doing_hardreg_pre_p = true;
> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
> > +     hardreg PRE is used for multiple hardregs.  */
> 
> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
> to structure the code so that this just works?  Where are the main
> difficulties?  Having to maintain a list of which expressions are
> associated with which register, and therefore which expressions
> mutually kill each other?
> 
> > +  for (int i = 0; regnos[i] != 0; i++)
> > +    {
> > +      int changed;
> > +      current_hardreg_regno = regnos[i];
> > +      if (dump_file)
> > +	fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
> > +		current_hardreg_regno);
> > +      delete_unreachable_blocks ();
> > +      df_analyze ();
> > +      changed = one_pre_gcse_pass ();
> > +      flag_rerun_cse_after_global_opts |= changed;
> 
> Is this appropriate for the new pass?  We're not really exposing general
> CSE opportunities.
> 
> > +      if (changed)
> > +	cleanup_cfg (0);
> > +    }
> > +  doing_hardreg_pre_p = false;
> > +  return 0;
> > +}
> > +
> >  static unsigned int
> >  execute_rtl_hoist (void)
> >  {
> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
> >  
> >  namespace {
> >  
> > +const pass_data pass_data_hardreg_pre =
> > +{
> > +  RTL_PASS, /* type */
> > +  "hardreg_pre", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_PRE, /* tv_id */
> > +  PROP_cfglayout, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_df_finish, /* todo_flags_finish */
> > +};
> > +
> > +class pass_hardreg_pre : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_hardreg_pre (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  bool gate (function *) final override;
> > +  unsigned int execute (function *)  final override
> > +  {
> > +    return execute_hardreg_pre ();
> > +  }
> > +
> > +}; // class pass_rtl_pre
> > +
> > +bool
> > +pass_hardreg_pre::gate (function *fun)
> > +{
> > +#ifdef HARDREG_PRE_REGNOS
> > +  return optimize > 0
> > +    && !fun->calls_setjmp;
> 
> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> have expected them to be needed now, since the modern cfg framework
> should represent setjmp correctly.  Jeff, do you agree?  I'll try
> removing them and see what breaks...

The problem with setjmp on RTL is that we fail to preserve abnormal
edges during RTL expansion, and then face the impossible job of
recovering all those that are required for correctness (to prevent
code motion).  With PRE that only operates on hardregs, and in
particular does not introduce alternate register usages or move memory
ops, there might be no issue.

Richard.

> Thanks,
> Richard
> 
> > +#else
> > +  return false;
> > +#endif
> > +}
> > +
> > +} // anon namespace
> > +
> > +rtl_opt_pass *
> > +make_pass_hardreg_pre (gcc::context *ctxt)
> > +{
> > +  return new pass_hardreg_pre (ctxt);
> > +}
> > +
> > +namespace {
> > +
> >  const pass_data pass_data_rtl_hoist =
> >  {
> >    RTL_PASS, /* type */
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
> >        NEXT_PASS (pass_rtl_cprop);
> >        NEXT_PASS (pass_rtl_pre);
> >        NEXT_PASS (pass_rtl_hoist);
> > +      NEXT_PASS (pass_hardreg_pre);
> >        NEXT_PASS (pass_rtl_cprop);
> >        NEXT_PASS (pass_rtl_store_motion);
> >        NEXT_PASS (pass_cse_after_global_opts);
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
>
  
Andrew Carlotti Nov. 13, 2024, 6:15 p.m. UTC | #3
On Tue, Nov 12, 2024 at 10:42:50PM +0000, Richard Sandiford wrote:
> Sorry for the slow review.  I think Jeff's much better placed to comment
> on this than I am, but here's a stab.  Mostly it looks really good to me
> FWIW.
> 
> Andrew Carlotti <andrew.carlotti@arm.com> writes:
> > This pass is used to optimise assignments to the FPMR register in
> > aarch64.  I chose to implement this as a middle-end pass because it
> > mostly reuses the existing RTL PRE code within gcse.cc.
> >
> > Compared to RTL PRE, the key difference in this new pass is that we
> > insert new writes directly to the destination hardreg, instead of
> > writing to a new pseudo-register and copying the result later.  This
> > requires changes to the analysis portion of the pass, because sets
> > cannot be moved before existing instructions that set, use or clobber
> > the hardreg, and the value becomes unavailable after any uses of
> > clobbers of the hardreg.
> >
> > This patch would currently break any debug instructions that use the
> > value of fpmr in a region of code where that value is changed by this
> > pass.  I haven't worked out the best way to fix this, but I suspect the
> > issue is uncommon and tricky enough that it would be best to just drop
> > those debug instructions.
> 
> Yeah, good question, and pass on that :)  Will need to think more about it.
> 
> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
> > on other targets.  Aside from this, my testing so far has involved hacking in a
> > single FP8 intrinsic and testing various parameters and control flow
> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
> > write better and more comprehensive tests once there are some real intrinsic
> > implementations available to use.
> >
> >
> > Is this approach good?  Apart from fixing the debug instructions and
> > adding tests, is there anything else I need to change?
> >
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> > 	(current_hardreg_regno): Ditto.
> > 	(compute_local_properties): Unset transp for hardreg clobbers.
> > 	(prune_hardreg_uses): New.
> > 	(want_to_gcse_p): Always return true for hardreg PRE.
> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> > 	(record_last_mem_set_info): Ditto.
> > 	(compute_hash_table_work): Record hardreg uses.
> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
> > 	(compute_pre_data): Add call to prune_hardreg_uses.
> > 	(pre_expr_reaches_here_p_work): Add comment.
> > 	(insert_insn_start_basic_block): New functions.
> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> > 	(pre_delete): Use hardreg for the reaching reg.
> > 	(pre_gcse): Don't insert copies for hardreg PRE.
> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> > 	(execute_hardreg_pre): New.
> > 	(class pass_hardreg_pre): New.
> > 	(pass_hardreg_pre::gate): New.
> > 	(make_pass_hardreg_pre): New.
> > 	* passes.def (pass_hardreg_pre): New pass.
> > 	* tree-pass.h (make_pass_hardreg_pre): New.
> >
> > [...]
> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >  	}
> >      }
> >  }
> > +
> > +/* A hardreg set is not transparent in a block if there are any uses of that
> > +   hardreg.  This filters the results of compute_local_properties, after the
> > +   result of that function has been used to define the kills bitmap.
> 
> I think this is mostly my ignorance of the code, and would be obvious
> if I tried it out locally, but: why do we need to do this after
> computing the kills bitmap?  For mode-switching, the kills bitmap
> is the inverse of the transparency bitmap, but it sounds like here
> you want the kills bitmap to be more selective.

I had to work through the entire LCM algorithm before I understood how these
bitmaps were being used (and I intend to update the documentation to make this
more obvious).  In summary, the kills and avail bitmaps indicate whether the
result of an earlier expression is still available and up-to-date, whereas the
transparent and anticipatable bitmaps indicate whether a later assignment can
be moved earlier.

For the existing hoist/PRE passes these are the same - this is because new
pseudoregs are used to hold the result of relocated computations, so the only
obstruction is if the values of the inputs to the expression are changed.

For the new hardreg PRE pass the bitmaps are different in one case - if the
content of the hardreg is used, then the result of the expression remains
available after the use, but it isn't possible to anticipate a future
assignment by moving that assignment before the earlier use.
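
A small hypothetical sequence to illustrate the asymmetry (written here as a
comment, since the exact instructions don't matter):

  /* fpmr := X      -- the expression "fpmr := X" becomes available.
     ... use fpmr   -- X stays available afterwards (the use does not
                       kill it), but a later "fpmr := X" cannot be
                       anticipated (hoisted) above this use, so the
                       block is not transparent for it.
     fpmr := X      -- redundant: the earlier value still reaches here.  */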

> > +
> > +   TRANSP is the destination sbitmap to be updated.
> > +
> > +   TABLE controls which hash table to look at.  */
> > +
> > +static void
> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> > +{
> > +  unsigned int i;
> > +  gcc_assert (doing_hardreg_pre_p);
> > +
> > +  for (i = 0; i < table->size; i++)
> > +    {
> > +      struct gcse_expr *expr;
> > +
> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> > +	{
> > +	  int indx = expr->bitmap_index;
> > +	  df_ref def;
> > +
> > +	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> > +	       def;
> > +	       def = DF_REF_NEXT_REG (def))
> > +	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> > +	}
> > +    }
> > +}
> >  
> >  /* Hash table support.  */
>  
> 
> > @@ -747,6 +804,9 @@ static basic_block current_bb;
> >  static bool
> >  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> >  {
> > +  if (doing_hardreg_pre_p)
> > +    return true;
> > +
> >  #ifdef STACK_REGS
> >    /* On register stack architectures, don't GCSE constants from the
> >       constant pool, as the benefits are often swamped by the overhead
> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
> >        }
> >  
> >      case MEM:
> > -      if (! flag_gcse_lm
> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
> 
> This test occurs often enough that I think it's worth splitting out.
> Something like: !do_load_motion ()?
> 
> >  	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
> >  				     x, avail_p))
> >  	return false;
> > [...]
> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> >  	    }
> >  
> >  	  note_stores (insn, record_last_set_info, insn);
> > +
> > +	  if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> > +	    {
> > +	      /* We need to record the first use of a hardreg to determine if a
> > +		 set of that hardreg is anticipatable.  */
> > +	      df_ref ref;
> > +	      FOR_EACH_INSN_USE (ref, insn)
> > +		if (DF_REF_REGNO (ref) == current_hardreg_regno)
> > +		  {
> > +		    hardreg_last_bb = current_bb;
> > +		    hardreg_first_use = DF_INSN_LUID (insn);
> > +		  }
> > +	    }
> >  	}
> 
> Couldn't we instead check whether the register is live on entry to the block?
> That would avoid the extra bit of state.

That should work, and would be much neater.  The only reason I didn't do
that originally was because I didn't know there was a good interface for it.
 
> >  
> >        /* The next pass builds the hash table.  */
> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
> >      {
> >        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
> >  	{
> > +	  /* For hardreg pre, we assume that all relevant hardregs are
> > +	     call-clobbered, and set all bits in prune_exprs if the reg is call
> > +	     clobbered.
> 
> Not sure I understand this.  But...
> 
> >                        If the hardreg were merely call-used, then we would
> > +	     need to remove the expression from the anticipatable and
> > +	     transparent bitmaps only (after using this to compute the kills
> > +	     bitmap).  */
> > +
> > +	  if (doing_hardreg_pre_p)
> > +	    {
> > +	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
> > +	      continue;
> > +	    }
> > +
> 
> ...the effect seems to be to set every bit of prune_exprs, in which
> case it might be easier to skip this loop entirely and adjust the later
> one to use bitmap_set_range.

You're right - I considered writing a comment acknowledging that.  I think the
main argument for doing it this way is that it would make it easier to support
optimising multiple hardregs at the same time.
 
> >  	  /* Note potentially trapping expressions.  */
> >  	  if (may_trap_p (expr->expr))
> >  	    {
> > [...]
> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
> >    return 0;
> >  }
> >  
> > +static unsigned int
> > +execute_hardreg_pre (void)
> > +{
> > +  doing_hardreg_pre_p = true;
> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
> > +     hardreg PRE is used for multiple hardregs.  */
> 
> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
> to structure the code so that this just works?  Where are the main
> difficulties?  Having to maintain a list of which expressions are
> associated with which register, and therefore which expressions
> mutually kill each other?

I think the only obstruction might be that the expression hash table would need
to incorporate the destination of the set as part of the hash.  I suspect that
mutually incompatible code motion isn't possible, but I'd have to think about
the algorithm a bit more to be confident.
 
> > +  for (int i = 0; regnos[i] != 0; i++)
> > +    {
> > +      int changed;
> > +      current_hardreg_regno = regnos[i];
> > +      if (dump_file)
> > +	fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
> > +		current_hardreg_regno);
> > +      delete_unreachable_blocks ();
> > +      df_analyze ();
> > +      changed = one_pre_gcse_pass ();
> > +      flag_rerun_cse_after_global_opts |= changed;
> 
> Is this appropriate for the new pass?  We're not really exposing general
> CSE opportunities.

Probably not - I copied this without thinking too hard about it (and still
haven't thought hard about it).
 
> > +      if (changed)
> > +	cleanup_cfg (0);
> > +    }
> > +  doing_hardreg_pre_p = false;
> > +  return 0;
> > +}
> > +
> >  static unsigned int
> >  execute_rtl_hoist (void)
> >  {
> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
> >  
> >  namespace {
> >  
> > +const pass_data pass_data_hardreg_pre =
> > +{
> > +  RTL_PASS, /* type */
> > +  "hardreg_pre", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_PRE, /* tv_id */
> > +  PROP_cfglayout, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_df_finish, /* todo_flags_finish */
> > +};
> > +
> > +class pass_hardreg_pre : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_hardreg_pre (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  bool gate (function *) final override;
> > +  unsigned int execute (function *)  final override
> > +  {
> > +    return execute_hardreg_pre ();
> > +  }
> > +
> > +}; // class pass_rtl_pre
> > +
> > +bool
> > +pass_hardreg_pre::gate (function *fun)
> > +{
> > +#ifdef HARDREG_PRE_REGNOS
> > +  return optimize > 0
> > +    && !fun->calls_setjmp;
> 
> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> have expected them to be needed now, since the modern cfg framework
> should represent setjmp correctly.  Jeff, do you agree?  I'll try
> removing them and see what breaks...
> 
> Thanks,
> Richard
> 
> > +#else
> > +  return false;
> > +#endif
> > +}
> > +
> > +} // anon namespace
> > +
> > +rtl_opt_pass *
> > +make_pass_hardreg_pre (gcc::context *ctxt)
> > +{
> > +  return new pass_hardreg_pre (ctxt);
> > +}
> > +
> > +namespace {
> > +
> >  const pass_data pass_data_rtl_hoist =
> >  {
> >    RTL_PASS, /* type */
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
> >        NEXT_PASS (pass_rtl_cprop);
> >        NEXT_PASS (pass_rtl_pre);
> >        NEXT_PASS (pass_rtl_hoist);
> > +      NEXT_PASS (pass_hardreg_pre);
> >        NEXT_PASS (pass_rtl_cprop);
> >        NEXT_PASS (pass_rtl_store_motion);
> >        NEXT_PASS (pass_cse_after_global_opts);
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
  
Richard Sandiford Nov. 13, 2024, 6:52 p.m. UTC | #4
Richard Biener <rguenther@suse.de> writes:
> On Tue, 12 Nov 2024, Richard Sandiford wrote:
>
>> Sorry for the slow review.  I think Jeff's much better placed to comment
>> on this than I am, but here's a stab.  Mostly it looks really good to me
>> FWIW.
>> 
>> Andrew Carlotti <andrew.carlotti@arm.com> writes:
>> > This pass is used to optimise assignments to the FPMR register in
>> > aarch64.  I chose to implement this as a middle-end pass because it
>> > mostly reuses the existing RTL PRE code within gcse.cc.
>> >
>> > Compared to RTL PRE, the key difference in this new pass is that we
>> > insert new writes directly to the destination hardreg, instead of
>> > writing to a new pseudo-register and copying the result later.  This
>> > requires changes to the analysis portion of the pass, because sets
>> > cannot be moved before existing instructions that set, use or clobber
>> > the hardreg, and the value becomes unavailable after any uses of
>> > clobbers of the hardreg.
>> >
>> > This patch would currently break any debug instructions that use the
>> > value of fpmr in a region of code where that value is changed by this
>> > pass.  I haven't worked out the best way to fix this, but I suspect the
>> > issue is uncommon and tricky enough that it would be best to just drop
>> > those debug instructions.
>> 
>> Yeah, good question, and pass on that :)  Will need to think more about it.
>> 
>> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
>> > on other targets.  Aside from this, my testing so far has involved hacking in a
>> > single FP8 intrinsic and testing various parameters and control flow
>> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
>> > write better and more comprehensive tests once there are some real intrinsic
>> > implementations available to use.
>> >
>> >
>> > Is this approach good?  Apart from fixing the debug instructions and
>> > adding tests, is there anything else I need to change?
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
>> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
>> > 	(current_hardreg_regno): Ditto.
>> > 	(compute_local_properties): Unset transp for hardreg clobbers.
>> > 	(prune_hardreg_uses): New.
>> > 	(want_to_gcse_p): Always return true for hardreg PRE.
>> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
>> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
>> > 	(record_last_mem_set_info): Ditto.
>> > 	(compute_hash_table_work): Record hardreg uses.
>> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
>> > 	(compute_pre_data): Add call to prune_hardreg_uses.
>> > 	(pre_expr_reaches_here_p_work): Add comment.
>> > 	(insert_insn_start_basic_block): New functions.
>> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
>> > 	(pre_delete): Use hardreg for the reaching reg.
>> > 	(pre_gcse): Don't insert copies for hardreg PRE.
>> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
>> > 	(execute_hardreg_pre): New.
>> > 	(class pass_hardreg_pre): New.
>> > 	(pass_hardreg_pre::gate): New.
>> > 	(make_pass_hardreg_pre): New.
>> > 	* passes.def (pass_hardreg_pre): New pass.
>> > 	* tree-pass.h (make_pass_hardreg_pre): New.
>> >
>> > [...]
>> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>> >  	}
>> >      }
>> >  }
>> > +
>> > +/* A hardreg set is not transparent in a block if there are any uses of that
>> > +   hardreg.  This filters the results of compute_local_properties, after the
>> > +   result of that function has been used to define the kills bitmap.
>> 
>> I think this is mostly my ignorance of the code, and would be obvious
>> if I tried it out locally, but: why do we need to do this after
>> computing the kills bitmap?  For mode-switching, the kills bitmap
>> is the inverse of the transparency bitmap, but it sounds like here
>> you want the kills bitmap to be more selective.
>> 
>> > +
>> > +   TRANSP is the destination sbitmap to be updated.
>> > +
>> > +   TABLE controls which hash table to look at.  */
>> > +
>> > +static void
>> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
>> > +{
>> > +  unsigned int i;
>> > +  gcc_assert (doing_hardreg_pre_p);
>> > +
>> > +  for (i = 0; i < table->size; i++)
>> > +    {
>> > +      struct gcse_expr *expr;
>> > +
>> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
>> > +	{
>> > +	  int indx = expr->bitmap_index;
>> > +	  df_ref def;
>> > +
>> > +	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
>> > +	       def;
>> > +	       def = DF_REF_NEXT_REG (def))
>> > +	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
>> > +	}
>> > +    }
>> > +}
>> >  
>> >  /* Hash table support.  */
>>  
>> 
>> > @@ -747,6 +804,9 @@ static basic_block current_bb;
>> >  static bool
>> >  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
>> >  {
>> > +  if (doing_hardreg_pre_p)
>> > +    return true;
>> > +
>> >  #ifdef STACK_REGS
>> >    /* On register stack architectures, don't GCSE constants from the
>> >       constant pool, as the benefits are often swamped by the overhead
>> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
>> >        }
>> >  
>> >      case MEM:
>> > -      if (! flag_gcse_lm
>> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
>> 
>> This test occurs often enough that I think it's worth splitting out.
>> Something like: !do_load_motion ()?
>> 
>> >  	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
>> >  				     x, avail_p))
>> >  	return false;
>> > [...]
>> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
>> >  	    }
>> >  
>> >  	  note_stores (insn, record_last_set_info, insn);
>> > +
>> > +	  if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
>> > +	    {
>> > +	      /* We need to record the first use of a hardreg to determine if a
>> > +		 set of that hardreg is anticipatable.  */
>> > +	      df_ref ref;
>> > +	      FOR_EACH_INSN_USE (ref, insn)
>> > +		if (DF_REF_REGNO (ref) == current_hardreg_regno)
>> > +		  {
>> > +		    hardreg_last_bb = current_bb;
>> > +		    hardreg_first_use = DF_INSN_LUID (insn);
>> > +		  }
>> > +	    }
>> >  	}
>> 
>> Couldn't we instead check whether the register is live on entry to the block?
>> That would avoid the extra bit of state.
>> 
>> >  
>> >        /* The next pass builds the hash table.  */
>> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
>> >      {
>> >        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
>> >  	{
>> > +	  /* For hardreg pre, we assume that all relevant hardregs are
>> > +	     call-clobbered, and set all bits in prune_exprs if the reg is call
>> > +	     clobbered.
>> 
>> Not sure I understand this.  But...
>> 
>> >                        If the hardreg were merely call-used, then we would
>> > +	     need to remove the expression from the anticipatable and
>> > +	     transparent bitmaps only (after using this to compute the kills
>> > +	     bitmap).  */
>> > +
>> > +	  if (doing_hardreg_pre_p)
>> > +	    {
>> > +	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
>> > +	      continue;
>> > +	    }
>> > +
>> 
>> ...the effect seems to be to set every bit of prune_exprs, in which
>> case it might be easier to skip this loop entirely and adjust the later
>> one to use bitmap_set_range.
>> 
>> >  	  /* Note potentially trapping expressions.  */
>> >  	  if (may_trap_p (expr->expr))
>> >  	    {
>> > [...]
>> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
>> >    return 0;
>> >  }
>> >  
>> > +static unsigned int
>> > +execute_hardreg_pre (void)
>> > +{
>> > +  doing_hardreg_pre_p = true;
>> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
>> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
>> > +     hardreg PRE is used for multiple hardregs.  */
>> 
>> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
>> to structure the code so that this just works?  Where are the main
>> difficulties?  Having to maintain a list of which expressions are
>> associated with which register, and therefore which expressions
>> mutually kill each other?
>> 
>> > +  for (int i = 0; regnos[i] != 0; i++)
>> > +    {
>> > +      int changed;
>> > +      current_hardreg_regno = regnos[i];
>> > +      if (dump_file)
>> > +	fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
>> > +		current_hardreg_regno);
>> > +      delete_unreachable_blocks ();
>> > +      df_analyze ();
>> > +      changed = one_pre_gcse_pass ();
>> > +      flag_rerun_cse_after_global_opts |= changed;
>> 
>> Is this appropriate for the new pass?  We're not really exposing general
>> CSE opportunities.
>> 
>> > +      if (changed)
>> > +	cleanup_cfg (0);
>> > +    }
>> > +  doing_hardreg_pre_p = false;
>> > +  return 0;
>> > +}
>> > +
>> >  static unsigned int
>> >  execute_rtl_hoist (void)
>> >  {
>> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
>> >  
>> >  namespace {
>> >  
>> > +const pass_data pass_data_hardreg_pre =
>> > +{
>> > +  RTL_PASS, /* type */
>> > +  "hardreg_pre", /* name */
>> > +  OPTGROUP_NONE, /* optinfo_flags */
>> > +  TV_PRE, /* tv_id */
>> > +  PROP_cfglayout, /* properties_required */
>> > +  0, /* properties_provided */
>> > +  0, /* properties_destroyed */
>> > +  0, /* todo_flags_start */
>> > +  TODO_df_finish, /* todo_flags_finish */
>> > +};
>> > +
>> > +class pass_hardreg_pre : public rtl_opt_pass
>> > +{
>> > +public:
>> > +  pass_hardreg_pre (gcc::context *ctxt)
>> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
>> > +  {}
>> > +
>> > +  /* opt_pass methods: */
>> > +  bool gate (function *) final override;
>> > +  unsigned int execute (function *)  final override
>> > +  {
>> > +    return execute_hardreg_pre ();
>> > +  }
>> > +
>> > +}; // class pass_rtl_pre
>> > +
>> > +bool
>> > +pass_hardreg_pre::gate (function *fun)
>> > +{
>> > +#ifdef HARDREG_PRE_REGNOS
>> > +  return optimize > 0
>> > +    && !fun->calls_setjmp;
>> 
>> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
>> have expected them to be needed now, since the modern cfg framework
>> should represent setjmp correctly.  Jeff, do you agree?  I'll try
>> removing them and see what breaks...
>
> The problem with setjmp on RTL is that we fail to preserve abnormal
> edges during RTL expansion and fail in the impossible job to recover
> all that are required for correctness (to prevent code motion).

Ah, ok.  I'd wrongly thought that that was a solved problem now.

(Kind-of curious why it's impossible to recover the info.  I guess that's
a bit of a distraction though.  Leaning further into reconstructing
rather than preserving the cfg would be the wrong direction anyway.)

Thanks,
Richard

> With PRE that only operates on hardregs and in particular does not
> introduce alternate register usages or move memory ops there might
> be no issue.
>
> Richard.
>
>> Thanks,
>> Richard
>> 
>> > +#else
>> > +  return false;
>> > +#endif
>> > +}
>> > +
>> > +} // anon namespace
>> > +
>> > +rtl_opt_pass *
>> > +make_pass_hardreg_pre (gcc::context *ctxt)
>> > +{
>> > +  return new pass_hardreg_pre (ctxt);
>> > +}
>> > +
>> > +namespace {
>> > +
>> >  const pass_data pass_data_rtl_hoist =
>> >  {
>> >    RTL_PASS, /* type */
>> > diff --git a/gcc/passes.def b/gcc/passes.def
>> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
>> > --- a/gcc/passes.def
>> > +++ b/gcc/passes.def
>> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
>> >        NEXT_PASS (pass_rtl_cprop);
>> >        NEXT_PASS (pass_rtl_pre);
>> >        NEXT_PASS (pass_rtl_hoist);
>> > +      NEXT_PASS (pass_hardreg_pre);
>> >        NEXT_PASS (pass_rtl_cprop);
>> >        NEXT_PASS (pass_rtl_store_motion);
>> >        NEXT_PASS (pass_cse_after_global_opts);
>> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
>> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
>> > --- a/gcc/tree-pass.h
>> > +++ b/gcc/tree-pass.h
>> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
>> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
>>
  
Richard Sandiford Nov. 13, 2024, 7:03 p.m. UTC | #5
Andrew Carlotti <andrew.carlotti@arm.com> writes:
> On Tue, Nov 12, 2024 at 10:42:50PM +0000, Richard Sandiford wrote:
>> Sorry for the slow review.  I think Jeff's much better placed to comment
>> on this than I am, but here's a stab.  Mostly it looks really good to me
>> FWIW.
>> 
>> Andrew Carlotti <andrew.carlotti@arm.com> writes:
>> > This pass is used to optimise assignments to the FPMR register in
>> > aarch64.  I chose to implement this as a middle-end pass because it
>> > mostly reuses the existing RTL PRE code within gcse.cc.
>> >
>> > Compared to RTL PRE, the key difference in this new pass is that we
>> > insert new writes directly to the destination hardreg, instead of
>> > writing to a new pseudo-register and copying the result later.  This
>> > requires changes to the analysis portion of the pass, because sets
>> > cannot be moved before existing instructions that set, use or clobber
>> > the hardreg, and the value becomes unavailable after any uses of
>> > clobbers of the hardreg.
>> >
>> > This patch would currently break any debug instructions that use the
>> > value of fpmr in a region of code where that value is changed by this
>> > pass.  I haven't worked out the best way to fix this, but I suspect the
>> > issue is uncommon and tricky enough that it would be best to just drop
>> > those debug instructions.
>> 
>> Yeah, good question, and pass on that :)  Will need to think more about it.
>> 
>> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
>> > on other targets.  Aside from this, my testing so far has involved hacking in a
>> > single FP8 intrinsic and testing various parameters and control flow
>> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
>> > write better and more comprehensive tests once there are some real intrinsic
>> > implementations available to use.
>> >
>> >
>> > Is this approach good?  Apart from fixing the debug instructions and
>> > adding tests, is there anything else I need to change?
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
>> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
>> > 	(current_hardreg_regno): Ditto.
>> > 	(compute_local_properties): Unset transp for hardreg clobbers.
>> > 	(prune_hardreg_uses): New.
>> > 	(want_to_gcse_p): Always return true for hardreg PRE.
>> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
>> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
>> > 	(record_last_mem_set_info): Ditto.
>> > 	(compute_hash_table_work): Record hardreg uses.
>> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
>> > 	(compute_pre_data): Add call to prune_hardreg_uses.
>> > 	(pre_expr_reaches_here_p_work): Add comment.
>> > 	(insert_insn_start_basic_block): New functions.
>> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
>> > 	(pre_delete): Use hardreg for the reaching reg.
>> > 	(pre_gcse): Don't insert copies for hardreg PRE.
>> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
>> > 	(execute_hardreg_pre): New.
>> > 	(class pass_hardreg_pre): New.
>> > 	(pass_hardreg_pre::gate): New.
>> > 	(make_pass_hardreg_pre): New.
>> > 	* passes.def (pass_hardreg_pre): New pass.
>> > 	* tree-pass.h (make_pass_hardreg_pre): New.
>> >
>> > [...]
>> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>> >  	}
>> >      }
>> >  }
>> > +
>> > +/* A hardreg set is not transparent in a block if there are any uses of that
>> > +   hardreg.  This filters the results of compute_local_properties, after the
>> > +   result of that function has been used to define the kills bitmap.
>> 
>> I think this is mostly my ignorance of the code, and would be obvious
>> if I tried it out locally, but: why do we need to do this after
>> computing the kills bitmap?  For mode-switching, the kills bitmap
>> is the inverse of the transparency bitmap, but it sounds like here
>> you want the kills bitmap to be more selective.
>
> I had to work through the entire LCM algorithm before I understood how these
> bitmaps were being used (and I intend to update the documentation to make this
> more obvious).  In summary, the kills and avail bitmaps indicate whether the
> result of an earlier expression is still available and up-to-date, whereas the
> transparent and anticipatable bitmaps indicate whether a later assignment can
> be moved earlier.

Right.  That part is pretty standard.

> For the existing hoist/PRE passes these are the same - this is because new
> pseduoregs are used to hold the result of relocated computations, so the only
> obstruction is if the values of the inputs to the expression are changed.
>
> For the new hardreg PRE pass the bitmaps are different in one case - if the
> content of the hardreg is used, then the result of the expression remains
> available after the use, but it isn't possible to anticipate a future
> assignment by moving that assignment before the earlier use.

But what I meant was: doesn't an assignment to the hard register block
movement/reuse in both directions?  We can't move R:=X up through a block B
that requires R==Y (so X is not transparent in B).  We also can't
reuse R:=X after a block that requires R==Y (because B kills X).

That's why I was expecting the kill set to be updated too, not just the
transparency set.

Thanks,
Richard
  
Andrew Carlotti Nov. 14, 2024, 12:05 a.m. UTC | #6
On Wed, Nov 13, 2024 at 07:03:44PM +0000, Richard Sandiford wrote:
> Andrew Carlotti <andrew.carlotti@arm.com> writes:
> > On Tue, Nov 12, 2024 at 10:42:50PM +0000, Richard Sandiford wrote:
> >> Sorry for the slow review.  I think Jeff's much better placed to comment
> >> on this than I am, but here's a stab.  Mostly it looks really good to me
> >> FWIW.
> >> 
> >> Andrew Carlotti <andrew.carlotti@arm.com> writes:
> >> > This pass is used to optimise assignments to the FPMR register in
> >> > aarch64.  I chose to implement this as a middle-end pass because it
> >> > mostly reuses the existing RTL PRE code within gcse.cc.
> >> >
> >> > Compared to RTL PRE, the key difference in this new pass is that we
> >> > insert new writes directly to the destination hardreg, instead of
> >> > writing to a new pseudo-register and copying the result later.  This
> >> > requires changes to the analysis portion of the pass, because sets
> >> > cannot be moved before existing instructions that set, use or clobber
> >> > the hardreg, and the value becomes unavailable after any uses of
> >> > clobbers of the hardreg.
> >> >
> >> > This patch would currently break any debug instructions that use the
> >> > value of fpmr in a region of code where that value is changed by this
> >> > pass.  I haven't worked out the best way to fix this, but I suspect the
> >> > issue is uncommon and tricky enough that it would be best to just drop
> >> > those debug instructions.
> >> 
> >> Yeah, good question, and pass on that :)  Will need to think more about it.
> >> 
> >> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
> >> > on other targets.  Aside from this, my testing so far has involved hacking in a
> >> > single FP8 intrinsic and testing various parameters and control flow
> >> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
> >> > write better and more comprehensive tests once there are some real intrinsic
> >> > implementations available to use.
> >> >
> >> >
> >> > Is this approach good?  Apart from fixing the debug instructions and
> >> > adding tests, is there anything else I need to change?
> >> >
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> >> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> >> > 	(current_hardreg_regno): Ditto.
> >> > 	(compute_local_properties): Unset transp for hardreg clobbers.
> >> > 	(prune_hardreg_uses): New.
> >> > 	(want_to_gcse_p): Always return true for hardreg PRE.
> >> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> >> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> >> > 	(record_last_mem_set_info): Ditto.
> >> > 	(compute_hash_table_work): Record hardreg uses.
> >> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
> >> > 	(compute_pre_data): Add call to prune_hardreg_uses.
> >> > 	(pre_expr_reaches_here_p_work): Add comment.
> >> > 	(insert_insn_start_basic_block): New functions.
> >> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> >> > 	(pre_delete): Use hardreg for the reaching reg.
> >> > 	(pre_gcse): Don't insert copies for hardreg PRE.
> >> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> >> > 	(execute_hardreg_pre): New.
> >> > 	(class pass_hardreg_pre): New.
> >> > 	(pass_hardreg_pre::gate): New.
> >> > 	(make_pass_hardreg_pre): New.
> >> > 	* passes.def (pass_hardreg_pre): New pass.
> >> > 	* tree-pass.h (make_pass_hardreg_pre): New.
> >> >
> >> > [...]
> >> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >> >  	}
> >> >      }
> >> >  }
> >> > +
> >> > +/* A hardreg set is not transparent in a block if there are any uses of that
> >> > +   hardreg.  This filters the results of compute_local_properties, after the
> >> > +   result of that function has been used to define the kills bitmap.
> >> 
> >> I think this is mostly my ignorance of the code, and would be obvious
> >> if I tried it out locally, but: why do we need to do this after
> >> computing the kills bitmap?  For mode-switching, the kills bitmap
> >> is the inverse of the transparency bitmap, but it sounds like here
> >> you want the kills bitmap to be more selective.
> >
> > I had to work through the entire LCM algorithm before I understood how these
> > bitmaps were being used (and I intend to update the documentation to make this
> > more obvious).  In summary, the kills and avail bitmaps indicate whether the
> > result of an earlier expression is still available and up-to-date, whereas the
> > transparent and anticipatable bitmaps indicate whether a later assignment can
> > be moved earlier.
> 
> Right.  That part is pretty standard.
> 
> > For the existing hoist/PRE passes these are the same - this is because new
> > pseduoregs are used to hold the result of relocated computations, so the only
> > obstruction is if the values of the inputs to the expression are changed.
> >
> > For the new hardreg PRE pass the bitmaps are different in one case - if the
> > content of the hardreg is used, then the result of the expression remains
> > available after the use, but it isn't possible to anticipate a future
> > assignment by moving that assignment before the earlier use.
> 
> But what I meant was: doesn't an assignment to the hard register block
> movement/reuse in both directions?  We can't move R:=X up through a block B
> that requires R==Y (so X is not transparent in B).  We also can't
> reuse R:=X after a block that requires R==Y (because B kills X).
> 
> That's why I was expecting the kill set to be updated too, not just the
> transparency set.

An assignment to the hardreg does indeed block movement/reuse in both
directions, but this case is handled elsewhere.  The code here is specifically
to handle instructions that use the hardreg but do not modify it.
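
To illustrate the asymmetry with a made-up straight-line example (RTL dump
notation, pseudo register numbers invented):

  insn 1:  (set (reg:DI 84 fpmr) (reg:DI 100))    ;; fpmr := Y
  insn 2:  an FP8 instruction that reads fpmr     ;; use only, no set
  insn 3:  (set (reg:DI 84 fpmr) (reg:DI 101))    ;; fpmr := X  (candidate)

The set in insn 3 can't be hoisted above insn 2, because insn 2 needs to read
Y, so the expression isn't anticipatable across the use.  Conversely, if
insn 1 had instead set fpmr to the same X, that value would still be
available after insn 2 (a use doesn't kill it), and insn 3 could then be
deleted as redundant.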

> Thanks,
> Richard
  
Richard Biener Nov. 14, 2024, 8:17 a.m. UTC | #7
On Wed, 13 Nov 2024, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Tue, 12 Nov 2024, Richard Sandiford wrote:
> >
> >> Sorry for the slow review.  I think Jeff's much better placed to comment
> >> on this than I am, but here's a stab.  Mostly it looks really good to me
> >> FWIW.
> >> 
> >> Andrew Carlotti <andrew.carlotti@arm.com> writes:
> >> > This pass is used to optimise assignments to the FPMR register in
> >> > aarch64.  I chose to implement this as a middle-end pass because it
> >> > mostly reuses the existing RTL PRE code within gcse.cc.
> >> >
> >> > Compared to RTL PRE, the key difference in this new pass is that we
> >> > insert new writes directly to the destination hardreg, instead of
> >> > writing to a new pseudo-register and copying the result later.  This
> >> > requires changes to the analysis portion of the pass, because sets
> >> > cannot be moved before existing instructions that set, use or clobber
> >> > the hardreg, and the value becomes unavailable after any uses of
> >> > clobbers of the hardreg.
> >> >
> >> > This patch would currently break any debug instructions that use the
> >> > value of fpmr in a region of code where that value is changed by this
> >> > pass.  I haven't worked out the best way to fix this, but I suspect the
> >> > issue is uncommon and tricky enough that it would be best to just drop
> >> > those debug instructions.
> >> 
> >> Yeah, good question, and pass on that :)  Will need to think more about it.
> >> 
> >> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
> >> > on other targets.  Aside from this, my testing so far has involved hacking in a
> >> > single FP8 intrinsic and testing various parameters and control flow
> >> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
> >> > write better and more comprehensive tests once there are some real intrinsic
> >> > implementations available to use.
> >> >
> >> >
> >> > Is this approach good?  Apart from fixing the debug instructions and
> >> > adding tests, is there anything else I need to change?
> >> >
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> >> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> >> > 	(current_hardreg_regno): Ditto.
> >> > 	(compute_local_properties): Unset transp for hardreg clobbers.
> >> > 	(prune_hardreg_uses): New.
> >> > 	(want_to_gcse_p): Always return true for hardreg PRE.
> >> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> >> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> >> > 	(record_last_mem_set_info): Ditto.
> >> > 	(compute_hash_table_work): Record hardreg uses.
> >> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
> >> > 	(compute_pre_data): Add call to prune_hardreg_uses.
> >> > 	(pre_expr_reaches_here_p_work): Add comment.
> >> > 	(insert_insn_start_basic_block): New functions.
> >> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> >> > 	(pre_delete): Use hardreg for the reaching reg.
> >> > 	(pre_gcse): Don't insert copies for hardreg PRE.
> >> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> >> > 	(execute_hardreg_pre): New.
> >> > 	(class pass_hardreg_pre): New.
> >> > 	(pass_hardreg_pre::gate): New.
> >> > 	(make_pass_hardreg_pre): New.
> >> > 	* passes.def (pass_hardreg_pre): New pass.
> >> > 	* tree-pass.h (make_pass_hardreg_pre): New.
> >> >
> >> > [...]
> >> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >> >  	}
> >> >      }
> >> >  }
> >> > +
> >> > +/* A hardreg set is not transparent in a block if there are any uses of that
> >> > +   hardreg.  This filters the results of compute_local_properties, after the
> >> > +   result of that function has been used to define the kills bitmap.
> >> 
> >> I think this is mostly my ignorance of the code, and would be obvious
> >> if I tried it out locally, but: why do we need to do this after
> >> computing the kills bitmap?  For mode-switching, the kills bitmap
> >> is the inverse of the transparency bitmap, but it sounds like here
> >> you want the kills bitmap to be more selective.
> >> 
> >> > +
> >> > +   TRANSP is the destination sbitmap to be updated.
> >> > +
> >> > +   TABLE controls which hash table to look at.  */
> >> > +
> >> > +static void
> >> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
> >> > +{
> >> > +  unsigned int i;
> >> > +  gcc_assert (doing_hardreg_pre_p);
> >> > +
> >> > +  for (i = 0; i < table->size; i++)
> >> > +    {
> >> > +      struct gcse_expr *expr;
> >> > +
> >> > +      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
> >> > +	{
> >> > +	  int indx = expr->bitmap_index;
> >> > +	  df_ref def;
> >> > +
> >> > +	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
> >> > +	       def;
> >> > +	       def = DF_REF_NEXT_REG (def))
> >> > +	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> >> > +	}
> >> > +    }
> >> > +}
> >> >  
> >> >  /* Hash table support.  */
> >>  
> >> 
> >> > @@ -747,6 +804,9 @@ static basic_block current_bb;
> >> >  static bool
> >> >  want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> >> >  {
> >> > +  if (doing_hardreg_pre_p)
> >> > +    return true;
> >> > +
> >> >  #ifdef STACK_REGS
> >> >    /* On register stack architectures, don't GCSE constants from the
> >> >       constant pool, as the benefits are often swamped by the overhead
> >> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
> >> >        }
> >> >  
> >> >      case MEM:
> >> > -      if (! flag_gcse_lm
> >> > +      if (! flag_gcse_lm || doing_hardreg_pre_p
> >> 
> >> This test occurs often enough that I think it's worth splitting out.
> >> Something like: !do_load_motion ()?
> >> 
> >> >  	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
> >> >  				     x, avail_p))
> >> >  	return false;
> >> > [...]
> >> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> >> >  	    }
> >> >  
> >> >  	  note_stores (insn, record_last_set_info, insn);
> >> > +
> >> > +	  if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
> >> > +	    {
> >> > +	      /* We need to record the first use of a hardreg to determine if a
> >> > +		 set of that hardreg is anticipatable.  */
> >> > +	      df_ref ref;
> >> > +	      FOR_EACH_INSN_USE (ref, insn)
> >> > +		if (DF_REF_REGNO (ref) == current_hardreg_regno)
> >> > +		  {
> >> > +		    hardreg_last_bb = current_bb;
> >> > +		    hardreg_first_use = DF_INSN_LUID (insn);
> >> > +		  }
> >> > +	    }
> >> >  	}
> >> 
> >> Couldn't we instead check whether the register is live on entry to the block?
> >> That would avoid the extra bit of state.
> >> 
> >> >  
> >> >        /* The next pass builds the hash table.  */
> >> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
> >> >      {
> >> >        for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
> >> >  	{
> >> > +	  /* For hardreg pre, we assume that all relevant hardregs are
> >> > +	     call-clobbered, and set all bits in prune_exprs if the reg is call
> >> > +	     clobbered.
> >> 
> >> Not sure I understand this.  But...
> >> 
> >> >                        If the hardreg were merely call-used, then we would
> >> > +	     need to remove the expression from the anticipatable and
> >> > +	     transparent bitmaps only (after using this to compute the kills
> >> > +	     bitmap).  */
> >> > +
> >> > +	  if (doing_hardreg_pre_p)
> >> > +	    {
> >> > +	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
> >> > +	      continue;
> >> > +	    }
> >> > +
> >> 
> >> ...the effect seems to be to set every bit of prune_exprs, in which
> >> case it might be easier to skip this loop entirely and adjust the later
> >> one to use bitmap_set_range.
> >> 
> >> >  	  /* Note potentially trapping expressions.  */
> >> >  	  if (may_trap_p (expr->expr))
> >> >  	    {
> >> > [...]
> >> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
> >> >    return 0;
> >> >  }
> >> >  
> >> > +static unsigned int
> >> > +execute_hardreg_pre (void)
> >> > +{
> >> > +  doing_hardreg_pre_p = true;
> >> > +  unsigned int regnos[] = HARDREG_PRE_REGNOS;
> >> > +  /* It's possible to avoid this loop, but it isn't worth doing so until
> >> > +     hardreg PRE is used for multiple hardregs.  */
> >> 
> >> Yeah, sounds ok to me.  But out of curiosity, how difficult would it be
> >> to structure the code so that this just works?  Where are the main
> >> difficulties?  Having to maintain a list of which expressions are
> >> associated with which register, and therefore which expressions
> >> mutually kill each other?
> >> 
> >> > +  for (int i = 0; regnos[i] != 0; i++)
> >> > +    {
> >> > +      int changed;
> >> > +      current_hardreg_regno = regnos[i];
> >> > +      if (dump_file)
> >> > +	fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
> >> > +		current_hardreg_regno);
> >> > +      delete_unreachable_blocks ();
> >> > +      df_analyze ();
> >> > +      changed = one_pre_gcse_pass ();
> >> > +      flag_rerun_cse_after_global_opts |= changed;
> >> 
> >> Is this appropriate for the new pass?  We're not really exposing general
> >> CSE opportunities.
> >> 
> >> > +      if (changed)
> >> > +	cleanup_cfg (0);
> >> > +    }
> >> > +  doing_hardreg_pre_p = false;
> >> > +  return 0;
> >> > +}
> >> > +
> >> >  static unsigned int
> >> >  execute_rtl_hoist (void)
> >> >  {
> >> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
> >> >  
> >> >  namespace {
> >> >  
> >> > +const pass_data pass_data_hardreg_pre =
> >> > +{
> >> > +  RTL_PASS, /* type */
> >> > +  "hardreg_pre", /* name */
> >> > +  OPTGROUP_NONE, /* optinfo_flags */
> >> > +  TV_PRE, /* tv_id */
> >> > +  PROP_cfglayout, /* properties_required */
> >> > +  0, /* properties_provided */
> >> > +  0, /* properties_destroyed */
> >> > +  0, /* todo_flags_start */
> >> > +  TODO_df_finish, /* todo_flags_finish */
> >> > +};
> >> > +
> >> > +class pass_hardreg_pre : public rtl_opt_pass
> >> > +{
> >> > +public:
> >> > +  pass_hardreg_pre (gcc::context *ctxt)
> >> > +    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
> >> > +  {}
> >> > +
> >> > +  /* opt_pass methods: */
> >> > +  bool gate (function *) final override;
> >> > +  unsigned int execute (function *)  final override
> >> > +  {
> >> > +    return execute_hardreg_pre ();
> >> > +  }
> >> > +
> >> > +}; // class pass_rtl_pre
> >> > +
> >> > +bool
> >> > +pass_hardreg_pre::gate (function *fun)
> >> > +{
> >> > +#ifdef HARDREG_PRE_REGNOS
> >> > +  return optimize > 0
> >> > +    && !fun->calls_setjmp;
> >> 
> >> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> >> have expected them to be needed now, since the modern cfg framework
> >> should represent setjmp correctly.  Jeff, do you agree?  I'll try
> >> removing them and see what breaks...
> >
> > The problem with setjmp on RTL is that we fail to preserve abnormal
> > edges during RTL expansion and fail in the impossible job to recover
> > all that are required for correctness (to prevent code motion).
> 
> Ah, ok.  I'd wrongly thought that that was a solved problem now.
> 
> (Kind-of curious why it's impossible to recover the info.  I guess that's
> a bit of a distraction though.  Leaning further into reconstructing
> rather than preserving the cfg would be the wrong direction anyway.)

Let me refer you to PR57067, I'm not sure somebody spent serious time
in trying to fix reconstruction - it should be possible to recover
a conservative set of abnormal edges, but on the GIMPLE CFG side
we've already started with the required set before inlining and
manage to keep a more optimal set as one would conservatively
compute _after_ inlining.  You'd lose that, so I believe the best
thing would be to _not_ throw away abnormal edges but at least
preserve whether the currently expanding block had any outgoing
one and the set of target blocks.  The other issue is that the
target of abnormal edges has to be adjusted to the sub-block
which will actually receive it.

Richard.

> Thanks,
> Richard
> 
> > With PRE that only operates on hardregs and in particular does not
> > introduce alternate register usages or move memory ops there might
> > be no issue.
> >
> > Richard.
> >
> >> Thanks,
> >> Richard
> >> 
> >> > +#else
> >> > +  return false;
> >> > +#endif
> >> > +}
> >> > +
> >> > +} // anon namespace
> >> > +
> >> > +rtl_opt_pass *
> >> > +make_pass_hardreg_pre (gcc::context *ctxt)
> >> > +{
> >> > +  return new pass_hardreg_pre (ctxt);
> >> > +}
> >> > +
> >> > +namespace {
> >> > +
> >> >  const pass_data pass_data_rtl_hoist =
> >> >  {
> >> >    RTL_PASS, /* type */
> >> > diff --git a/gcc/passes.def b/gcc/passes.def
> >> > index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
> >> > --- a/gcc/passes.def
> >> > +++ b/gcc/passes.def
> >> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
> >> >        NEXT_PASS (pass_rtl_cprop);
> >> >        NEXT_PASS (pass_rtl_pre);
> >> >        NEXT_PASS (pass_rtl_hoist);
> >> > +      NEXT_PASS (pass_hardreg_pre);
> >> >        NEXT_PASS (pass_rtl_cprop);
> >> >        NEXT_PASS (pass_rtl_store_motion);
> >> >        NEXT_PASS (pass_cse_after_global_opts);
> >> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> >> > index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
> >> > --- a/gcc/tree-pass.h
> >> > +++ b/gcc/tree-pass.h
> >> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> >> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> >> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
> >> 
>
  
Andrew Carlotti Nov. 15, 2024, 6:42 p.m. UTC | #8
On Tue, Nov 12, 2024 at 10:42:50PM +0000, Richard Sandiford wrote:
> Sorry for the slow review.  I think Jeff's much better placed to comment
> on this than I am, but here's a stab.  Mostly it looks really good to me
> FWIW.
> 
> Andrew Carlotti <andrew.carlotti@arm.com> writes:
> > This pass is used to optimise assignments to the FPMR register in
> > aarch64.  I chose to implement this as a middle-end pass because it
> > mostly reuses the existing RTL PRE code within gcse.cc.
> >
> > Compared to RTL PRE, the key difference in this new pass is that we
> > insert new writes directly to the destination hardreg, instead of
> > writing to a new pseudo-register and copying the result later.  This
> > requires changes to the analysis portion of the pass, because sets
> > cannot be moved before existing instructions that set, use or clobber
> > the hardreg, and the value becomes unavailable after any uses of
> > clobbers of the hardreg.
> >
> > This patch would currently break any debug instructions that use the
> > value of fpmr in a region of code where that value is changed by this
> > pass.  I haven't worked out the best way to fix this, but I suspect the
> > issue is uncommon and tricky enough that it would be best to just drop
> > those debug instructions.
> 
> Yeah, good question, and pass on that :)  Will need to think more about it.

I've looked into this a bit more, and there are some interesting quirks in the
existing behaviour.  It looks like we might always be OK at the moment, but it
would be safer to add code to handle this properly.

My current idea for a conservative approach is: if I detect any debug insn
using the fpmr register, I could create a new debug variable to stand in for
it, and clone every existing assignment to the fpmr register as an assignment
to that debug variable.
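
A rough, untested sketch of the detection side (the function name is made up;
for brevity it just drops the offending debug insn, though binding a new debug
temporary as described above would be better):

static void
drop_hardreg_debug_uses (void)
{
  basic_block bb;
  rtx hardreg = regno_reg_rtx[current_hardreg_regno];
  FOR_EACH_BB_FN (bb, cfun)
    {
      rtx_insn *insn, *curr;
      FOR_BB_INSNS_SAFE (bb, insn, curr)
	/* Drop any debug insn whose location expression mentions the
	   hardreg being optimised.  */
	if (DEBUG_INSN_P (insn)
	    && reg_mentioned_p (hardreg, PATTERN (insn)))
	  delete_insn (insn);
    }
}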


Below are some dumps that helped me understand what's happening.  A couple of
points of interest:

1. I'm surprised that 045t.cddce1 is losing debug information for dead results
of intrinsic calls.  This is an issue for existing intrinsics as well (e.g.
vrndmq_f16).  I can vaguely see why this might be happening, but I wonder
whether there's anything we can do better here.  It isn't really relevant to
this patch, but it did present an extra barrier to getting the debug RTL I
wanted to examine.

2. I think 274r.cse1 is the first pass that can eliminate redundant fpmr
assignments, and it looks like this will also create debug variables for the
fpmr input to any debug_insns that use an fpmr value.  I don't think any of the
other passes between cse1 and hardreg-pre can break this, so I believe we get
lucky here.  However, as I say above, I think we should check for and handle
any debug_insn uses anyway.


------ Source ------
#include <arm_neon.h>

float16x8_t bat(float16x8_t a, mfloat8x16_t b, mfloat8x16_t c)
{
  a = vmlalbq_f16_mf8_fpm(a, b, c, 13);
  float16x8_t zxcx = vmlalbq_f16_mf8_fpm(a, b, c, 13);
  float16x8_t zxcy = vmlalbq_f16_mf8_fpm(zxcx, b, c, 13);
  return a;
}

------ Built with ------
gcc fp8-debug.cc -S -g -Og -march=armv8-a+fp8 -fdump-tree-all -fdump-rtl-all

------ 043t.mergephi1 ------
  <bb 2> :
  # DEBUG BEGIN_STMT
  a_6 = vmlalbq_f16_mf8_fpm (a_2(D), b_3(D), c_4(D), 13);
  # DEBUG a => a_6 
  # DEBUG BEGIN_STMT
  zxcx_8 = vmlalbq_f16_mf8_fpm (a_6, b_3(D), c_4(D), 13);
  # DEBUG zxcx => zxcx_8
  # DEBUG BEGIN_STMT
  zxcy_10 = vmlalbq_f16_mf8_fpm (zxcx_8, b_3(D), c_4(D), 13);
  # DEBUG zxcy => zxcy_10
  # DEBUG BEGIN_STMT
  return a_6;

------ 045t.cddce1 ------
  <bb 2> :
  # DEBUG BEGIN_STMT
  a_6 = vmlalbq_f16_mf8_fpm (a_2(D), b_3(D), c_4(D), 13);
  # DEBUG a => a_6 
  # DEBUG BEGIN_STMT
  zxcx_8 = vmlalbq_f16_mf8_fpm (a_6, b_3(D), c_4(D), 13);
  # DEBUG zxcx => zxcx_8
  # DEBUG BEGIN_STMT
  vmlalbq_f16_mf8_fpm (zxcx_8, b_3(D), c_4(D), 13);
  # DEBUG BEGIN_STMT
  return a_6;

[Note that we've lost debug information for zxcy here.]

------ 270r.into_cfglayout ------
(insn 15 14 16 2 (set (reg:DI 109) 
        (const_int 13 [0xd])) "fp8-debug.cc":6:41 70 {*movdi_aarch64}
     (nil))
(insn 16 15 17 2 (set (reg:DI 84 fpmr)
        (reg:DI 109)) "fp8-debug.cc":6:41 70 {*movdi_aarch64}
     (nil)) 
(insn 17 16 18 2 (set (reg:V8HF 108)
        (unspec:V8HF [
                (reg/v:V8HF 102 [ <retval> ])
                (reg/v:V16QI 104 [ b ])
                (reg/v:V16QI 105 [ c ])
                (reg:DI 84 fpmr)
            ] UNSPEC_FP8TEST)) "fp8-debug.cc":6:41 5277 {fp8test}
     (nil)) 
(insn 18 17 19 2 (set (reg/v:V8HF 101 [ zxcx ])
        (reg:V8HF 108)) "fp8-debug.cc":6:41 1272 {*aarch64_simd_movv8hf}
     (nil))
(debug_insn 19 18 20 2 (var_location:V8HF zxcx (reg/v:V8HF 101 [ zxcx ])) "fp8-debug.cc":6:41 -1
     (nil)) 
(debug_insn 20 19 21 2 (debug_marker) "fp8-debug.cc":7:3 -1
     (nil))
(insn 21 20 22 2 (set (reg:DI 111)
        (const_int 13 [0xd])) "fp8-debug.cc":7:41 70 {*movdi_aarch64}
     (nil))
(insn 22 21 23 2 (set (reg:DI 84 fpmr)
        (reg:DI 111)) "fp8-debug.cc":7:41 70 {*movdi_aarch64}
     (nil))
(insn 23 22 24 2 (set (reg:V8HF 110) 
        (unspec:V8HF [
                (reg/v:V8HF 101 [ zxcx ])
                (reg/v:V16QI 104 [ b ])
                (reg/v:V16QI 105 [ c ])
                (reg:DI 84 fpmr)
            ] UNSPEC_FP8TEST)) "fp8-debug.cc":7:41 5277 {fp8test}
     (nil))


------ 271r.jump = 273r.dfinit ------
(insn 15 14 16 2 (set (reg:DI 109) 
        (const_int 13 [0xd])) "fp8-debug.cc":6:41 70 {*movdi_aarch64}
     (nil)) 
(insn 16 15 32 2 (set (reg:DI 84 fpmr)
        (reg:DI 109)) "fp8-debug.cc":6:41 70 {*movdi_aarch64}
     (nil))
(debug_insn 32 16 31 2 (var_location:V8HF D#2 (unspec:V8HF [
            (reg/v:V8HF 102 [ <retval> ])
            (reg/v:V16QI 104 [ b ])
            (reg/v:V16QI 105 [ c ])
            (reg:DI 84 fpmr)
        ] UNSPEC_FP8TEST)) -1
     (nil))
(debug_insn 31 32 19 2 (var_location:V8HF D#1 (debug_expr:V8HF D#2)) -1
     (nil)) 
(debug_insn 19 31 20 2 (var_location:V8HF zxcx (debug_expr:V8HF D#1)) "fp8-debug.cc":6:41 -1     
     (nil))

[The last UNSPEC_FP8TEST instruction was optimised away here, and the previous
one was converted to a debug_insn.  We're temporarily using fpmr directly in a
debug_insn.]

------ 274r.cse1 ------
(debug_insn 34 14 33 2 (var_location:DI D#4 (const_int 13 [0xd])) -1
     (nil))
(debug_insn 33 34 32 2 (var_location:DI D#3 (debug_expr:DI D#4)) -1
     (nil))
(debug_insn 32 33 31 2 (var_location:V8HF D#2 (unspec:V8HF [
            (reg:V8HF 106)
            (debug_expr:V16QI D#6)
            (debug_expr:V16QI D#5)
            (debug_expr:DI D#3)
        ] UNSPEC_FP8TEST)) -1
     (nil))
(debug_insn 31 32 19 2 (var_location:V8HF D#1 (debug_expr:V8HF D#2)) -1
     (nil))
(debug_insn 19 31 20 2 (var_location:V8HF zxcx (debug_expr:V8HF D#1)) "fp8-debug.cc":6:41 -1
     (nil))

[After cse, the fpmr assignment is also optimised away, at which point the
debug_insn uses a debug variable for fpmr instead.]
  
Jeff Law Dec. 1, 2024, 10:24 p.m. UTC | #9
On 11/12/24 3:42 PM, Richard Sandiford wrote:
> Sorry for the slow review.  I think Jeff's much better placed to comment
> on this than I am, but here's a stab.  Mostly it looks really good to me
> FWIW.
Digging out.  I'll try to get a good looksie this afternoon/evening.

jeff
  
Jeff Law Dec. 1, 2024, 10:32 p.m. UTC | #10
On 11/12/24 3:42 PM, Richard Sandiford wrote:

>> +
>> +bool
>> +pass_hardreg_pre::gate (function *fun)
>> +{
>> +#ifdef HARDREG_PRE_REGNOS
>> +  return optimize > 0
>> +    && !fun->calls_setjmp;
> 
> Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> have expected them to be needed now, since the modern cfg framework
> should represent setjmp correctly.  Jeff, do you agree?  I'll try
> removing them and see what breaks...
So back in '98 our CFG wasn't accurate (IIRC this code was a lot of what 
motivated making the CFG available before flow).  In addition to not 
being accurate, I don't think we had all of rth's bits to kill 
expressions on abnormal edges, which saves us from trying to split an 
abnormal critical edge.

I'd think that if our CFG accurately represents that abnormal edge, 
we'd be OK these days.  But it's been a long time and there may 
always be something lurking.

Jeff
  
Andrew Pinski Dec. 1, 2024, 10:53 p.m. UTC | #11
On Sun, Dec 1, 2024 at 2:36 PM Jeff Law <jlaw@ventanamicro.com> wrote:
>
>
>
> On 11/12/24 3:42 PM, Richard Sandiford wrote:
>
> >> +
> >> +bool
> >> +pass_hardreg_pre::gate (function *fun)
> >> +{
> >> +#ifdef HARDREG_PRE_REGNOS
> >> +  return optimize > 0
> >> +    && !fun->calls_setjmp;
> >
> > Huh.  It looks like these setjmp exclusions go back to 1998.  I wouldn't
> > have expected them to be needed now, since the modern cfg framework
> > should represent setjmp correctly.  Jeff, do you agree?  I'll try
> > removing them and see what breaks...
> So back in '98 our CFG wasn't accurate (IIRC this code was a lot of what
> motivated making the CFG available before flow).  In addition to not
> being accurate, I don't think we had all of rth's bits to kill
> expressions on abnormal edges which saves us from trying to split an
> abnormal critical edge.
>
> I'd think that if our CFG is accurately representing that abnormal edge
> that we'd be OK these days.  But it's been a long time and there may
> always be something lurking.

I think for RTL CFG we still have issues with setjmp;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067  .

Thanks,
Andrew

>
> Jeff
>
  
Jeff Law Dec. 1, 2024, 10:54 p.m. UTC | #12
On 11/13/24 12:03 PM, Richard Sandiford wrote:
> Andrew Carlotti <andrew.carlotti@arm.com> writes:

>>>
>>> I think this is mostly my ignorance of the code, and would be obvious
>>> if I tried it out locally, but: why do we need to do this after
>>> computing the kills bitmap?  For mode-switching, the kills bitmap
>>> is the inverse of the transparency bitmap, but it sounds like here
>>> you want the kills bitmap to be more selective.
>>
>> I had to work through the entire LCM algorithm before I understood how these
>> bitmaps were being used (and I intend to update the documentation to make this
>> more obvious).  In summary, the kills and avail bitmaps indicate whether the
>> result of an earlier expression is still available and up-to-date, whereas the
>> transparent and anticipatable bitmaps indicate whether a later assignment can
>> be moved earlier.
> 
> Right.  That part is pretty standard.
> 
>> For the existing hoist/PRE passes these are the same - this is because new
>> pseduoregs are used to hold the result of relocated computations, so the only
>> obstruction is if the values of the inputs to the expression are changed.
>>
>> For the new hardreg PRE pass the bitmaps are different in one case - if the
>> content of the hardreg is used, then the result of the expression remains
>> available after the use, but it isn't possible to anticipate a future
>> assignment by moving that assignment before the earlier use.
> 
> But what I meant was: doesn't an assignment to the hard register block
> movement/reuse in both directions?  We can't move R:=X up through a block B
> that requires R==Y (so X is not transparent in B).  We also can't
> reuse R:=X after a block that requires R==Y (because B kills X).
> 
> That's why I was expecting the kill set to be updated too, not just the
> transparency set.
In general, yes, I would expect transparency and kill to be inverses of 
each other.

I suspect (but would have to do a fair amount of archaeology to be sure) 
that we probably had kills computed for some other problem (classic gcse 
or const/copy propagation perhaps) and we just inverted it to work 
with the LCM algorithm which wants to query transparency.  Flipping 
kills once into transparency seems better than using kills and having to 
flip it every time we visit a block during the global propagation step.

jeff
  
Jeff Law Dec. 2, 2024, 3:59 p.m. UTC | #13
On 10/31/24 12:29 PM, Andrew Carlotti wrote:
> This pass is used to optimise assignments to the FPMR register in
> aarch64.  I chose to implement this as a middle-end pass because it
> mostly reuses the existing RTL PRE code within gcse.cc.
> 
> Compared to RTL PRE, the key difference in this new pass is that we
> insert new writes directly to the destination hardreg, instead of
> writing to a new pseudo-register and copying the result later.  This
> requires changes to the analysis portion of the pass, because sets
> cannot be moved before existing instructions that set, use or clobber
> the hardreg, and the value becomes unavailable after any uses of
> clobbers of the hardreg.
> 
> This patch would currently break any debug instructions that use the
> value of fpmr in a region of code where that value is changed by this
> pass.  I haven't worked out the best way to fix this, but I suspect the
> issue is uncommon and tricky enough that it would be best to just drop
> those debug instructions.
> 
> I've bootstrapped and regression tested this on aarch64, and it should be NFC
> on other targets.  Aside from this, my testing so far has involved hacking in a
> single FP8 intrinsic and testing various parameters and control flow
> structures, and checking both the codegen and the LCM bitmaps.  I intend to
> write better and more comprehensive tests once there are some real intrinsic
> implementations available to use.
> 
> 
> Is this approach good?  Apart from fixing the debug instructions and
> adding tests, is there anything else I need to change?
> 
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> 	(current_hardreg_regno): Ditto.
> 	(compute_local_properties): Unset transp for hardreg clobbers.
> 	(prune_hardreg_uses): New.
> 	(want_to_gcse_p): Always return true for hardreg PRE.
> 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> 	(record_last_mem_set_info): Ditto.
> 	(compute_hash_table_work): Record hardreg uses.
> 	(prune_expressions): Mark hardreg sets as call-clobbered.
> 	(compute_pre_data): Add call to prune_hardreg_uses.
> 	(pre_expr_reaches_here_p_work): Add comment.
> 	(insert_insn_start_basic_block): New functions.
> 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> 	(pre_delete): Use hardreg for the reaching reg.
> 	(pre_gcse): Don't insert copies for hardreg PRE.
> 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> 	(execute_hardreg_pre): New.
> 	(class pass_hardreg_pre): New.
> 	(pass_hardreg_pre::gate): New.
> 	(make_pass_hardreg_pre): New.
> 	* passes.def (pass_hardreg_pre): New pass.
> 	* tree-pass.h (make_pass_hardreg_pre): New.
So at a 30k foot level, one thing to be very leery of is extending the 
lifetime of any hard register.  It's probably not a big deal on aarch, 
but it can cause all kinds of headaches on other targets.

Essentially you probably need to avoid PRE on a hard register that's in 
a likely spilled class.


> 
> 

> diff --git a/gcc/gcse.cc b/gcc/gcse.cc
> index 31b92f30fa1ba6c519429d4b7bc55547b2d71c01..ce4ebe420c02d78fcde3144eed595e22212aaa0b 100644
> --- a/gcc/gcse.cc
> +++ b/gcc/gcse.cc

> @@ -693,10 +698,29 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>   	     We start by assuming all are transparent [none are killed], and
>   	     then reset the bits for those that are.  */
>   	  if (transp)
> -	    compute_transp (expr->expr, indx, transp,
> -			    blocks_with_calls,
> -			    modify_mem_list_set,
> -			    canon_modify_mem_list);
> +	    {
> +	      compute_transp (expr->expr, indx, transp,
> +			      blocks_with_calls,
> +			      modify_mem_list_set,
> +			      canon_modify_mem_list);
> +
> +	      if (doing_hardreg_pre_p)
> +		{
> +		  /* We also need to check whether the destination hardreg is
> +		     set or call-clobbered in each BB.  We'll check for hardreg
> +		     uses later.  */
> +		  df_ref def;
> +		  for (def = DF_REG_DEF_CHAIN (current_hardreg_regno);
> +		       def;
> +		       def = DF_REF_NEXT_REG (def))
> +		    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> +
> +		  bitmap_iterator bi;
> +		  unsigned bb_index;
> +		  EXECUTE_IF_SET_IN_BITMAP (blocks_with_calls, 0, bb_index, bi)
> +		    bitmap_clear_bit (transp[bb_index], indx);
> +		}
> +	    }
It's been a long time since I looked at the code, but is there code 
already in the pass to walk down the FUSAGE notes attached to calls? 
You'll definitely need that since it can have uses/clobbers of hard regs 
that are potentially outside the set normally clobbered by calls.




>   
>   	  /* The occurrences recorded in antic_occr are exactly those that
>   	     we want to set to nonzero in ANTLOC.  */
> @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
>   	}
>       }
>   }
> +
> +/* A hardreg set is not transparent in a block if there are any uses of that
> +   hardreg.  This filters the results of compute_local_properties, after the
> +   result of that function has been used to define the kills bitmap.
> +
> +   TRANSP is the destination sbitmap to be updated.
> +
> +   TABLE controls which hash table to look at.  */
Sorry, that comment doesn't make much sense to me.  A use doesn't 
traditionally impact transparency.


>   
>   /* Hash table support.  */
>   
> @@ -739,6 +794,8 @@ struct reg_avail_info
>   };
>   
>   static struct reg_avail_info *reg_avail_info;
> +static basic_block hardreg_last_bb;
> +static int hardreg_first_use;
>   static basic_block current_bb;
>   
>   /* See whether X, the source of a set, is something we want to consider for
> @@ -747,6 +804,9 @@ static basic_block current_bb;
>   static bool
>   want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
>   {
> +  if (doing_hardreg_pre_p)
> +    return true;
This seems overly aggressive, though perhaps it's not so bad in practice.





> @@ -1286,12 +1348,33 @@ hash_scan_set (rtx set, rtx_insn *insn, struct gcse_hash_table_d *table)
>   	     able to handle code motion of insns with multiple sets.  */
>   	  bool antic_p = (oprs_anticipatable_p (src, insn)
>   			  && !multiple_sets (insn));
> +	  if (doing_hardreg_pre_p)
> +	    {
> +	      /* An hardreg assignment is anticipatable only if the hardreg is
> +		 neither set nor used prior to this assignment.  */
> +	      auto info = reg_avail_info[current_hardreg_regno];
> +	      if ((info.last_bb == current_bb
> +		   && info.first_set < DF_INSN_LUID (insn))
> +		  || (hardreg_last_bb == current_bb
> +		      && hardreg_first_use <= DF_INSN_LUID (insn)))
> +		antic_p = false;
> +	    }
> +
>   	  /* An expression is not available if its operands are
>   	     subsequently modified, including this insn.  It's also not
>   	     available if this is a branch, because we can't insert
>   	     a set after the branch.  */
>   	  bool avail_p = (oprs_available_p (src, insn)
>   			  && ! JUMP_P (insn));
> +	  if (doing_hardreg_pre_p)
> +	    {
> +	      /* An hardreg assignment is only available if the hardreg is
> +		 not set later in the BB.  Uses of the hardreg are allowed. */
> +	      auto info = reg_avail_info[current_hardreg_regno];
> +	      if (info.last_bb == current_bb
> +		  && info.last_set > DF_INSN_LUID (insn))
> +		antic_p = false;
Did you mean to set antic_p here, or should it have been avail_p?



> @@ -1537,6 +1623,18 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
>   	      EXECUTE_IF_SET_IN_HARD_REG_SET (callee_clobbers, 0, regno, hrsi)
>   		record_last_reg_set_info (insn, regno);
>   
> +	      if (doing_hardreg_pre_p)
> +		{
> +		  /* This is covered by the above clobbers, but let's
> +		     conservatively make this work as well for hardregs that
> +		     are call-used but not call-clobbered.  */
> +		  record_last_reg_set_info (insn, current_hardreg_regno);
> +
> +		  /* Mark this block as containing a call-clobber.  */
> +		  bitmap_set_bit (blocks_with_calls,
> +				  BLOCK_FOR_INSN (insn)->index);
> +	
I don't think this is wrong, but isn't it redundant?  ie, shouldn't we 
have populated that bitmap elsewhere?




> @@ -2159,13 +2335,24 @@ pre_edge_insert (struct edge_list *edge_list, struct gcse_expr **index_map)
>   
>   			/* We can't insert anything on an abnormal and
>   			   critical edge, so we insert the insn at the end of
> -			   the previous block. There are several alternatives
> +			   the previous block.  There are several alternatives
>   			   detailed in Morgans book P277 (sec 10.5) for
>   			   handling this situation.  This one is easiest for
> -			   now.  */
> +			   now.
>   
> +			   For hardreg PRE, this would add an unwanted clobber
> +			   of the hardreg, so we instead insert in the
> +			   successor block, which may be partially redundant
> +			   but is at least correct.  */
But if it's abnormal critical, then doesn't this result in clobbers on 
paths where we didn't have them before?  Is that actually safe?

That's what I've got after a first pass over the bits.

Jeff
  
Andrew Carlotti Dec. 5, 2024, 3:45 p.m. UTC | #14
On Mon, Dec 02, 2024 at 08:59:20AM -0700, Jeff Law wrote:
> 
> 
> On 10/31/24 12:29 PM, Andrew Carlotti wrote:
> > This pass is used to optimise assignments to the FPMR register in
> > aarch64.  I chose to implement this as a middle-end pass because it
> > mostly reuses the existing RTL PRE code within gcse.cc.
> > 
> > Compared to RTL PRE, the key difference in this new pass is that we
> > insert new writes directly to the destination hardreg, instead of
> > writing to a new pseudo-register and copying the result later.  This
> > requires changes to the analysis portion of the pass, because sets
> > cannot be moved before existing instructions that set, use or clobber
> > the hardreg, and the value becomes unavailable after any uses of
> > clobbers of the hardreg.
> > 
> > This patch would currently break any debug instructions that use the
> > value of fpmr in a region of code where that value is changed by this
> > pass.  I haven't worked out the best way to fix this, but I suspect the
> > issue is uncommon and tricky enough that it would be best to just drop
> > those debug instructions.
> > 
> > I've bootstrapped and regression tested this on aarch64, and it should be NFC
> > on other targets.  Aside from this, my testing so far has involved hacking in a
> > single FP8 intrinsic and testing various parameters and control flow
> > structures, and checking both the codegen and the LCM bitmaps.  I intend to
> > write better and more comprehensive tests once there are some real intrinsic
> > implementations available to use.
> > 
> > 
> > Is this approach good?  Apart from fixing the debug instructions and
> > adding tests, is there anything else I need to change?
> > 
> > 
> > gcc/ChangeLog:
> > 
> > 	* config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
> > 	* gcse.cc (doing_hardreg_pre_p): New global variable.
> > 	(current_hardreg_regno): Ditto.
> > 	(compute_local_properties): Unset transp for hardreg clobbers.
> > 	(prune_hardreg_uses): New.
> > 	(want_to_gcse_p): Always return true for hardreg PRE.
> > 	(hash_scan_set): Add checks for hardreg uses/clobbers.
> > 	(oprs_unchanged_p): Disable load motion for hardreg PRE pass.
> > 	(record_last_mem_set_info): Ditto.
> > 	(compute_hash_table_work): Record hardreg uses.
> > 	(prune_expressions): Mark hardreg sets as call-clobbered.
> > 	(compute_pre_data): Add call to prune_hardreg_uses.
> > 	(pre_expr_reaches_here_p_work): Add comment.
> > 	(insert_insn_start_basic_block): New functions.
> > 	(pre_edge_insert): Don't add hardreg sets to predecessor block.
> > 	(pre_delete): Use hardreg for the reaching reg.
> > 	(pre_gcse): Don't insert copies for hardreg PRE.
> > 	(one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
> > 	(execute_hardreg_pre): New.
> > 	(class pass_hardreg_pre): New.
> > 	(pass_hardreg_pre::gate): New.
> > 	(make_pass_hardreg_pre): New.
> > 	* passes.def (pass_hardreg_pre): New pass.
> > 	* tree-pass.h (make_pass_hardreg_pre): New.
> So at a 30k foot level, one thing to be very leery of is extending the
> lifetime of any hard register.  It's probably not a big deal on aarch, but
> it can cause all kinds of headaches on other targets.
> 
> Essentially you probably need to avoid PRE on a hard register that's in a
> likely spilled class.

This is not intended to be used for ordinary registers, so that shouldn't be a
concern.  The use case is essentially as a form of mode-switching, where the
active mode is specified by a register that can take arbitrary values at
runtime.
 
> > 
> > 
> 
> > diff --git a/gcc/gcse.cc b/gcc/gcse.cc
> > index 31b92f30fa1ba6c519429d4b7bc55547b2d71c01..ce4ebe420c02d78fcde3144eed595e22212aaa0b 100644
> > --- a/gcc/gcse.cc
> > +++ b/gcc/gcse.cc
> 
> > @@ -693,10 +698,29 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >   	     We start by assuming all are transparent [none are killed], and
> >   	     then reset the bits for those that are.  */
> >   	  if (transp)
> > -	    compute_transp (expr->expr, indx, transp,
> > -			    blocks_with_calls,
> > -			    modify_mem_list_set,
> > -			    canon_modify_mem_list);
> > +	    {
> > +	      compute_transp (expr->expr, indx, transp,
> > +			      blocks_with_calls,
> > +			      modify_mem_list_set,
> > +			      canon_modify_mem_list);
> > +
> > +	      if (doing_hardreg_pre_p)
> > +		{
> > +		  /* We also need to check whether the destination hardreg is
> > +		     set or call-clobbered in each BB.  We'll check for hardreg
> > +		     uses later.  */
> > +		  df_ref def;
> > +		  for (def = DF_REG_DEF_CHAIN (current_hardreg_regno);
> > +		       def;
> > +		       def = DF_REF_NEXT_REG (def))
> > +		    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
> > +
> > +		  bitmap_iterator bi;
> > +		  unsigned bb_index;
> > +		  EXECUTE_IF_SET_IN_BITMAP (blocks_with_calls, 0, bb_index, bi)
> > +		    bitmap_clear_bit (transp[bb_index], indx);
> > +		}
> > +	    }
> It's been a long time since I looked at the code, but is there code already
> in the pass to walk down the FUSAGE notes attached to calls? You'll
> definitely need that since it can have uses/clobbers of hard regs that are
> potentially outside the set normally clobbered by calls.
> 
> 
> 
> 
> >   	  /* The occurrences recorded in antic_occr are exactly those that
> >   	     we want to set to nonzero in ANTLOC.  */
> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
> >   	}
> >       }
> >   }
> > +
> > +/* A hardreg set is not transparent in a block if there are any uses of that
> > +   hardreg.  This filters the results of compute_local_properties, after the
> > +   result of that function has been used to define the kills bitmap.
> > +
> > +   TRANSP is the destination sbitmap to be updated.
> > +
> > +   TABLE controls which hash table to look at.  */
> Sorry, that comment doesn't make much sense to me.  A use doesn't
> traditionally impact transparency.

This is not quite the traditional GCSE scenario.

In traditional GCSE/LCM we're just moving an abstract computation (and
temporarily storing the result in a newly-created pseudoreg so that it can't
interfere with existing uses of the destination registers).

In this new hardreg PRE pass we want to move the actual assignment to the
hardreg, so we have to additionally check whether moving the assignment earlier
(and not just moving the computation earlier) would conflict with existing uses
of the hardreg.  The transparency (and antic) bitmaps are used to indicate this
property in LCM; perhaps the name is less accurate when applied to hardreg PRE.

> >   
> >   /* Hash table support.  */
> > @@ -739,6 +794,8 @@ struct reg_avail_info
> >   };
> >   static struct reg_avail_info *reg_avail_info;
> > +static basic_block hardreg_last_bb;
> > +static int hardreg_first_use;
> >   static basic_block current_bb;
> >   /* See whether X, the source of a set, is something we want to consider for
> > @@ -747,6 +804,9 @@ static basic_block current_bb;
> >   static bool
> >   want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
> >   {
> > +  if (doing_hardreg_pre_p)
> > +    return true;
> This seems overly aggressive, though perhaps it's not so bad in practice.

This should be fine - want_to_gcse_p aims to filter out cases where the
lifetime of the newly-created pseudoregs would increase register pressure to
the extent that the resulting spills would outweigh the benefits of the code
motion.

For hardreg PRE we don't create new pseudoregs, but instead use specific
hardregs that aren't used by RA, and we check for conflicting uses in
determining the extent of the code motion.  So the checks in want_to_gcse_p are
irrelevant here.
 
> > @@ -1286,12 +1348,33 @@ hash_scan_set (rtx set, rtx_insn *insn, struct gcse_hash_table_d *table)
> >   	     able to handle code motion of insns with multiple sets.  */
> >   	  bool antic_p = (oprs_anticipatable_p (src, insn)
> >   			  && !multiple_sets (insn));
> > +	  if (doing_hardreg_pre_p)
> > +	    {
> > +	      /* An hardreg assignment is anticipatable only if the hardreg is
> > +		 neither set nor used prior to this assignment.  */
> > +	      auto info = reg_avail_info[current_hardreg_regno];
> > +	      if ((info.last_bb == current_bb
> > +		   && info.first_set < DF_INSN_LUID (insn))
> > +		  || (hardreg_last_bb == current_bb
> > +		      && hardreg_first_use <= DF_INSN_LUID (insn)))
> > +		antic_p = false;
> > +	    }
> > +
> >   	  /* An expression is not available if its operands are
> >   	     subsequently modified, including this insn.  It's also not
> >   	     available if this is a branch, because we can't insert
> >   	     a set after the branch.  */
> >   	  bool avail_p = (oprs_available_p (src, insn)
> >   			  && ! JUMP_P (insn));
> > +	  if (doing_hardreg_pre_p)
> > +	    {
> > +	      /* An hardreg assignment is only available if the hardreg is
> > +		 not set later in the BB.  Uses of the hardreg are allowed. */
> > +	      auto info = reg_avail_info[current_hardreg_regno];
> > +	      if (info.last_bb == current_bb
> > +		  && info.last_set > DF_INSN_LUID (insn))
> > +		antic_p = false;
> Did you mean to set antic_p here, or should it have been avail_p?

Oops - that definitely should have been avail_p.  This will be fixed in the
next version I send.
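
To be explicit, that hunk should instead end with:

	      auto info = reg_avail_info[current_hardreg_regno];
	      if (info.last_bb == current_bb
		  && info.last_set > DF_INSN_LUID (insn))
		avail_p = false;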

> > @@ -1537,6 +1623,18 @@ compute_hash_table_work (struct gcse_hash_table_d *table)
> >   	      EXECUTE_IF_SET_IN_HARD_REG_SET (callee_clobbers, 0, regno, hrsi)
> >   		record_last_reg_set_info (insn, regno);
> > +	      if (doing_hardreg_pre_p)
> > +		{
> > +		  /* This is covered by the above clobbers, but let's
> > +		     conservatively make this work as well for hardregs that
> > +		     are call-used but not call-clobbered.  */
> > +		  record_last_reg_set_info (insn, current_hardreg_regno);
> > +
> > +		  /* Mark this block as containing a call-clobber.  */
> > +		  bitmap_set_bit (blocks_with_calls,
> > +				  BLOCK_FOR_INSN (insn)->index);
> > +	
> I don't think this is wrong, but isn't it redundant?  ie, shouldn't we have
> populated that bitmap elsewhere?

The existing code only uses blocks_with_calls for load motion, to indicate when
memory might be clobbered by a call.  The existing computation of blocks_with_calls is in
record_last_mem_set_info_common, but I bypass this with an added early exit
condition in record_last_mem_set_info (in the same way that I bypass all the
other code specific to load motion).

In the hardreg PRE pass we don't need to check for clobbered memory, but we do
need to check whether the hardreg might be clobbered by a call.  It seemed
sensible to reuse the existing suitably named bitmap to store this information,
but because I bypassed the existing computation, I needed to add the
computation back in elsewhere.
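
Concretely, the bypass is just an extra condition on the early exit in
record_last_mem_set_info, mirroring the oprs_unchanged_p hunk quoted earlier -
i.e. something like:

  if (! flag_gcse_lm || doing_hardreg_pre_p)
    return;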

> > @@ -2159,13 +2335,24 @@ pre_edge_insert (struct edge_list *edge_list, struct gcse_expr **index_map)
> >   			/* We can't insert anything on an abnormal and
> >   			   critical edge, so we insert the insn at the end of
> > -			   the previous block. There are several alternatives
> > +			   the previous block.  There are several alternatives
> >   			   detailed in Morgans book P277 (sec 10.5) for
> >   			   handling this situation.  This one is easiest for
> > -			   now.  */
> > +			   now.
> > +			   For hardreg PRE, this would add an unwanted clobber
> > +			   of the hardreg, so we instead insert in the
> > +			   successor block, which may be partially redundant
> > +			   but is at least correct.  */
> But if it's abnormal critical, then doesn't this result in clobbers on paths
> where we didn't have them before?  Is that actually safe?

This should be entirely safe - the other paths will merely have a redundant set
with the same value that the hardreg already contains.

These paths will either have an earlier assignment on a different edge (that
will also have been marked for insertion by LCM), or will have the result of
the computation already available in the hardreg from an assignment that
existed before the pass.

> That's what I've got after a first pass over the bits.
> 
> Jeff
>
  
Andrew Carlotti Dec. 5, 2024, 4:16 p.m. UTC | #15
On Sun, Dec 01, 2024 at 03:54:25PM -0700, Jeff Law wrote:
> 
> 
> On 11/13/24 12:03 PM, Richard Sandiford wrote:
> > Andrew Carlotti <andrew.carlotti@arm.com> writes:
> 
> > > > 
> > > > I think this is mostly my ignorance of the code, and would be obvious
> > > > if I tried it out locally, but: why do we need to do this after
> > > > computing the kills bitmap?  For mode-switching, the kills bitmap
> > > > is the inverse of the transparency bitmap, but it sounds like here
> > > > you want the kills bitmap to be more selective.
> > > 
> > > I had to work through the entire LCM algorithm before I understood how these
> > > bitmaps were being used (and I intend to update the documentation to make this
> > > more obvious).  In summary, the kills and avail bitmaps indicate whether the
> > > result of an earlier expression is still available and up-to-date, whereas the
> > > transparent and anticipatable bitmaps indicate whether a later assignment can
> > > be moved earlier.
> > 
> > Right.  That part is pretty standard.
> > 
> > > For the existing hoist/PRE passes these are the same - this is because new
> > > pseduoregs are used to hold the result of relocated computations, so the only
> > > obstruction is if the values of the inputs to the expression are changed.
> > > 
> > > For the new hardreg PRE pass the bitmaps are different in one case - if the
> > > content of the hardreg is used, then the result of the expression remains
> > > available after the use, but it isn't possible to anticipate a future
> > > assignment by moving that assignment before the earlier use.
> > 
> > But what I meant was: doesn't an assignment to the hard register block
> > movement/reuse in both directions?  We can't move R:=X up through a block B
> > that requires R==Y (so X is not transparent in B).  We also can't
> > reuse R:=X after a block that requires R==Y (because B kills X).
> > 
> > That's why I was expecting the kill set to be updated too, not just the
> > transparency set.
> In general, yes, I would expect transparency and kill to be inverses of each
> other.

Kills and transparency are genuinely different bitmaps in the context of LCM.
In the context of GCC terminology (I don't know whether the naming of our
bitmaps is standard elsewhere) we have:

- Transparency is the extension of anticipatability to entire basic blocks.
- Kills is the bitwise inverse of the extension of availability to entire basic blocks.

For the existing GCSE passes, where we use a brand new pseudoreg to convey
computation results between original and new locations, it turns out that the
local conditions for anticipatability and availability (and hence transparency
and ~kills) are the same (namely that we don't change any of the inputs to the
computation).  The only difference is that anticipatability ranges must end at
an existing instance of the computation, and availability ranges start at an
existing instance of the computation.

For hardreg PRE the endpoint conditions are essentially the same, but the
local conditions for anticipatability and availability surviving an instruction
are now different.  This is because we are directly extending the range over
which the hardreg values are live, so we need extra checks to make sure these
don't conflict.  For availability, we only need to check that the destination
hardreg hasn't been modified, but for anticipatability we also need to check
that the destination hardreg hasn't been used.  This is where the asymmetry
between kills and transparency arises for hardreg PRE.
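
A minimal instruction-level sketch of that asymmetry (using fpmr as the
hardreg):

insn 1:  fpmr = 1
insn 2:  USE fpmr

After insn 2 the value written by insn 1 is still available - the hardreg
hasn't been modified - so the use must not contribute to the kills bitmap for
`fpmr = 1`.  But a later assignment such as `fpmr = 2` cannot be anticipated
above insn 2, because hoisting it past the use would change the value that
insn 2 reads; so the same use does make the block non-transparent for that
assignment.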

> I suspect (but would have to do a fair amount of archaeology to be sure)
> that we probably had kills computed for some other problem (classic gcse  or
> const/copy propagation perhaps) and we just inverted it to work with the LCM
> algorithm which wants to query transparency.  Flipping kills once into
> transparency seems better than using kills and having to flip it every time
> we visit a block during the global propagation step.
> 
> jeff
  
Andrew Carlotti Dec. 5, 2024, 7:03 p.m. UTC | #16
On Thu, Dec 05, 2024 at 04:16:22PM +0000, Andrew Carlotti wrote:
> On Sun, Dec 01, 2024 at 03:54:25PM -0700, Jeff Law wrote:
> > 
> > 
> > On 11/13/24 12:03 PM, Richard Sandiford wrote:
> > > Andrew Carlotti <andrew.carlotti@arm.com> writes:
> > 
> > > > > 
> > > > > I think this is mostly my ignorance of the code, and would be obvious
> > > > > if I tried it out locally, but: why do we need to do this after
> > > > > computing the kills bitmap?  For mode-switching, the kills bitmap
> > > > > is the inverse of the transparency bitmap, but it sounds like here
> > > > > you want the kills bitmap to be more selective.

To add a bit more context to the mode-switching comparison - I think we can
view this code in that pass as doing the following:

1. Add a mode switch (assignment) to every instruction that needs (uses) a
specific mode.

2. Remove redundant assignments within a basic block (this is equivalent to
grouping together instructions with the same mode into blocks that can be
treated as a single instruction for subsequent analysis).

3. Do LCM analysis, and add/remove mode switches according to the LCM output.

The actual code is slightly different - I think 1. and 2. effectively run at
the same time, and produce a list of necessary mode switches, but these aren't
added until after the analysis in 3. has run, and the switches marked for
deletion by LCM have been removed from that list.

The result of this flow is that mode uses and mode assignments are always
(effectively) in the same place when running LCM, so there are no differences
in the transparency and kills bitmaps arising from preexisting uses without
defs.

The difference for hardreg switching is that the mode register assignments are
created during expand, and have already been subject to some inter-BB
optimisation by this point, so we now have some mode register uses that don't
have corresponding mode register assignments in the same BB.



A concrete example that will hopefully help is the following (all edges are
fallthrough edges):

BB2:
fpmr = 1
USE fpmr

BB3:
USE fpmr

BB4:
fpmr = 1
USE fpmr

BB5:
USE fpmr

BB6:
fpmr = 2
USE fpmr

If we marked `fpmr = 2` as transparent in BB5, then LCM would be allowed** to
move the assignment to the edge (4 -> 5), which would change the value of fpmr
observed in BB5.  So `fpmr = 2` is not transparent when a block contains a use
of fpmr.

If we marked `fpmr = 1` as killed in BB3, then LCM would be unable to see that
this value is still available in BB4, so it would be unable to remove the
assignment in that block.  The value is still available, so we should mark
`fpmr = 1` as not killed in BB3.

So in a block that looks like BB3 or BB5, we should mark the hardreg assignment
as neither transparent nor killed.



** In practice LCM doesn't do this, because after finding the earliest possible
insertion points for the computations it moves, it then tries to move them a
bit later again.  I think this means that for any RTL that we will generate for
fpmr (ignoring weird unsupported stuff with inline asm), LCM would give the
same result even if we incorrectly claimed that the assignment was transparent
in blocks like BB3 and BB5.  But that would be playing with fire for no benefit
other than allowing us to remove a couple of blocks of code, and would make
things harder for anyone trying to use this pass to optimise special hardregs
in a different context.

> > > > 
> > > > I had to work through the entire LCM algorithm before I understood how these
> > > > bitmaps were being used (and I intend to update the documentation to make this
> > > > more obvious).  In summary, the kills and avail bitmaps indicate whether the
> > > > result of an earlier expression is still available and up-to-date, whereas the
> > > > transparent and anticipatable bitmaps indicate whether a later assignment can
> > > > be moved earlier.
> > > 
> > > Right.  That part is pretty standard.
> > > 
> > > > For the existing hoist/PRE passes these are the same - this is because new
> > > > pseduoregs are used to hold the result of relocated computations, so the only
> > > > obstruction is if the values of the inputs to the expression are changed.
> > > > 
> > > > For the new hardreg PRE pass the bitmaps are different in one case - if the
> > > > content of the hardreg is used, then the result of the expression remains
> > > > available after the use, but it isn't possible to anticipate a future
> > > > assignment by moving that assignment before the earlier use.
> > > 
> > > But what I meant was: doesn't an assignment to the hard register block
> > > movement/reuse in both directions?  We can't move R:=X up through a block B
> > > that requires R==Y (so X is not transparent in B).  We also can't
> > > reuse R:=X after a block that requires R==Y (because B kills X).
> > > 
> > > That's why I was expecting the kill set to be updated too, not just the
> > > transparency set.
> > In general, yes, I would expect transparency and kill to be inverses of each
> > other.
> 
> Kills and transparency are genuinely different bitmaps in the context of LCM.
> In the context of GCC terminology (I don't know whether the naming of our
> bitmaps is standard elsewhere) we have:
> 
> - Transparency is the extension of anticipatability to entire basic blocks.
> - Kills is the bitwise inverse of the extension of availability to entire basic blocks.
> 
> For the existing GCSE passes, where we use a brand new pseudoreg to convey
> computation results between original and new locations, it turns out that the
> local conditions for anticipatability and availability (and hence transparency
> and ~kills) are the same (namely that we don't change any of the inputs to the
> computation).  The only difference is that anticipatability ranges must end at
> an existing instance of the computation, and availability ranges start at an
> existing instance of the computation.
> 
> For hardreg PRE the endpoint conditions are essentially the same, but the
> local conditions for anticipatability and availability surviving an instruction
> are now different.  This is because we are directly extending the range over
> which the hardreg values are live, so we need extra checks to make sure these
> don't conflict.  For availability, we only need to check that the destination
> hardreg hasn't been modified, but for anticipatability we also need to check
> that the destination hardreg hasn't been used.  This is where the asymmetry
> between kills and transparency arises for hardreg PRE.
> 
> > I suspect (but would have to do a fair amount of archaeology to be sure)
> > that we probably had kills computed for some other problem (classic gcse  or
> > const/copy propagation perhaps) and we just inverted it to work with the LCM
> > algorithm which wants to query transparency.  Flipping kills once into
> > transparency seems better than using kills and having to flip it every time
> > we visit a block during the global propagation step.
> > 
> > jeff
  

Patch

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 593319fd4723626bf95f475e79c1c7b12238b2dd..860e29b3b24dfca656740c85ef0ac0445f9848cd 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1589,6 +1589,10 @@  enum class aarch64_tristate_mode : int { NO, YES, MAYBE };
   { int (aarch64_tristate_mode::MAYBE), \
     int (aarch64_local_sme_state::ANY) }
 
+/* Zero terminated list of regnos for which hardreg PRE should be
+   applied.  */
+#define HARDREG_PRE_REGNOS { FPM_REGNUM, 0 }
+
 #endif
 
 #endif /* GCC_AARCH64_H */
diff --git a/gcc/gcse.cc b/gcc/gcse.cc
index 31b92f30fa1ba6c519429d4b7bc55547b2d71c01..ce4ebe420c02d78fcde3144eed595e22212aaa0b 100644
--- a/gcc/gcse.cc
+++ b/gcc/gcse.cc
@@ -415,6 +415,11 @@  static int gcse_create_count;
 
 /* Doing code hoisting.  */
 static bool doing_code_hoisting_p = false;
+
+/* Doing hardreg_pre.  */
+static bool doing_hardreg_pre_p = false;
+
+static unsigned int current_hardreg_regno;
 
 /* For available exprs */
 static sbitmap *ae_kill;
@@ -693,10 +698,29 @@  compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
 	     We start by assuming all are transparent [none are killed], and
 	     then reset the bits for those that are.  */
 	  if (transp)
-	    compute_transp (expr->expr, indx, transp,
-			    blocks_with_calls,
-			    modify_mem_list_set,
-			    canon_modify_mem_list);
+	    {
+	      compute_transp (expr->expr, indx, transp,
+			      blocks_with_calls,
+			      modify_mem_list_set,
+			      canon_modify_mem_list);
+
+	      if (doing_hardreg_pre_p)
+		{
+		  /* We also need to check whether the destination hardreg is
+		     set or call-clobbered in each BB.  We'll check for hardreg
+		     uses later.  */
+		  df_ref def;
+		  for (def = DF_REG_DEF_CHAIN (current_hardreg_regno);
+		       def;
+		       def = DF_REF_NEXT_REG (def))
+		    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
+
+		  bitmap_iterator bi;
+		  unsigned bb_index;
+		  EXECUTE_IF_SET_IN_BITMAP (blocks_with_calls, 0, bb_index, bi)
+		    bitmap_clear_bit (transp[bb_index], indx);
+		}
+	    }
 
 	  /* The occurrences recorded in antic_occr are exactly those that
 	     we want to set to nonzero in ANTLOC.  */
@@ -728,6 +752,37 @@  compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
 	}
     }
 }
+
+/* A hardreg set is not transparent in a block if there are any uses of that
+   hardreg.  This filters the results of compute_local_properties, after the
+   result of that function has been used to define the kills bitmap.
+
+   TRANSP is the destination sbitmap to be updated.
+
+   TABLE controls which hash table to look at.  */
+
+static void
+prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
+{
+  unsigned int i;
+  gcc_assert (doing_hardreg_pre_p);
+
+  for (i = 0; i < table->size; i++)
+    {
+      struct gcse_expr *expr;
+
+      for (expr = table->table[i]; expr != NULL; expr = expr->next_same_hash)
+	{
+	  int indx = expr->bitmap_index;
+	  df_ref def;
+
+	  for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
+	       def;
+	       def = DF_REF_NEXT_REG (def))
+	    bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
+	}
+    }
+}
 
 /* Hash table support.  */
 
@@ -739,6 +794,8 @@  struct reg_avail_info
 };
 
 static struct reg_avail_info *reg_avail_info;
+static basic_block hardreg_last_bb;
+static int hardreg_first_use;
 static basic_block current_bb;
 
 /* See whether X, the source of a set, is something we want to consider for
@@ -747,6 +804,9 @@  static basic_block current_bb;
 static bool
 want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
 {
+  if (doing_hardreg_pre_p)
+    return true;
+
 #ifdef STACK_REGS
   /* On register stack architectures, don't GCSE constants from the
      constant pool, as the benefits are often swamped by the overhead
@@ -911,7 +971,7 @@  oprs_unchanged_p (const_rtx x, const rtx_insn *insn, bool avail_p)
       }
 
     case MEM:
-      if (! flag_gcse_lm
+      if (! flag_gcse_lm || doing_hardreg_pre_p
 	  || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
 				     x, avail_p))
 	return false;
@@ -1258,8 +1318,10 @@  hash_scan_set (rtx set, rtx_insn *insn, struct gcse_hash_table_d *table)
 	  && want_to_gcse_p (XEXP (note, 0), GET_MODE (dest), NULL))
 	src = XEXP (note, 0), set = gen_rtx_SET (dest, src);
 
-      /* Only record sets of pseudo-regs in the hash table.  */
-      if (regno >= FIRST_PSEUDO_REGISTER
+      /* Only record sets of pseudo-regs in the hash table, unless we're
+	 currently doing hardreg PRE.  */
+      if ((doing_hardreg_pre_p ? regno == current_hardreg_regno
+				     : regno >= FIRST_PSEUDO_REGISTER)
 	  /* Don't GCSE something if we can't do a reg/reg copy.  */
 	  && can_copy_p (GET_MODE (dest))
 	  /* GCSE commonly inserts instruction after the insn.  We can't
@@ -1286,12 +1348,33 @@  hash_scan_set (rtx set, rtx_insn *insn, struct gcse_hash_table_d *table)
 	     able to handle code motion of insns with multiple sets.  */
 	  bool antic_p = (oprs_anticipatable_p (src, insn)
 			  && !multiple_sets (insn));
+	  if (doing_hardreg_pre_p)
+	    {
+	      /* A hardreg assignment is anticipatable only if the hardreg is
+		 neither set nor used prior to this assignment.  */
+	      auto info = reg_avail_info[current_hardreg_regno];
+	      if ((info.last_bb == current_bb
+		   && info.first_set < DF_INSN_LUID (insn))
+		  || (hardreg_last_bb == current_bb
+		      && hardreg_first_use <= DF_INSN_LUID (insn)))
+		antic_p = false;
+	    }
+
 	  /* An expression is not available if its operands are
 	     subsequently modified, including this insn.  It's also not
 	     available if this is a branch, because we can't insert
 	     a set after the branch.  */
 	  bool avail_p = (oprs_available_p (src, insn)
 			  && ! JUMP_P (insn));
+	  if (doing_hardreg_pre_p)
+	    {
+	      /* A hardreg assignment is only available if the hardreg is
+		 not set later in the BB.  Uses of the hardreg are allowed.  */
+	      auto info = reg_avail_info[current_hardreg_regno];
+	      if (info.last_bb == current_bb
+		  && info.last_set > DF_INSN_LUID (insn))
+		avail_p = false;
+	    }
 
 	  insert_expr_in_table (src, GET_MODE (dest), insn, antic_p, avail_p,
 				max_distance, table);
@@ -1300,7 +1383,10 @@  hash_scan_set (rtx set, rtx_insn *insn, struct gcse_hash_table_d *table)
   /* In case of store we want to consider the memory value as available in
      the REG stored in that memory. This makes it possible to remove
      redundant loads from due to stores to the same location.  */
-  else if (flag_gcse_las && REG_P (src) && MEM_P (dest))
+  else if (flag_gcse_las
+	   && !doing_hardreg_pre_p
+	   && REG_P (src)
+	   && MEM_P (dest))
     {
       unsigned int regno = REGNO (src);
       HOST_WIDE_INT max_distance = 0;
@@ -1460,7 +1546,7 @@  record_last_reg_set_info (rtx_insn *insn, int regno)
 static void
 record_last_mem_set_info (rtx_insn *insn)
 {
-  if (! flag_gcse_lm)
+  if (!flag_gcse_lm || doing_hardreg_pre_p)
     return;
 
   record_last_mem_set_info_common (insn, modify_mem_list,
@@ -1537,6 +1623,18 @@  compute_hash_table_work (struct gcse_hash_table_d *table)
 	      EXECUTE_IF_SET_IN_HARD_REG_SET (callee_clobbers, 0, regno, hrsi)
 		record_last_reg_set_info (insn, regno);
 
+	      if (doing_hardreg_pre_p)
+		{
+		  /* This is covered by the above clobbers, but let's
+		     conservatively make this work as well for hardregs that
+		     are call-used but not call-clobbered.  */
+		  record_last_reg_set_info (insn, current_hardreg_regno);
+
+		  /* Mark this block as containing a call-clobber.  */
+		  bitmap_set_bit (blocks_with_calls,
+				  BLOCK_FOR_INSN (insn)->index);
+		}
+
 	      if (! RTL_CONST_OR_PURE_CALL_P (insn)
 		  || RTL_LOOPING_CONST_OR_PURE_CALL_P (insn)
 		  || can_throw_external (insn))
@@ -1544,6 +1642,19 @@  compute_hash_table_work (struct gcse_hash_table_d *table)
 	    }
 
 	  note_stores (insn, record_last_set_info, insn);
+
+	  if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
+	    {
+	      /* We need to record the first use of a hardreg to determine if a
+		 set of that hardreg is anticipatable.  */
+	      df_ref ref;
+	      FOR_EACH_INSN_USE (ref, insn)
+		if (DF_REF_REGNO (ref) == current_hardreg_regno)
+		  {
+		    hardreg_last_bb = current_bb;
+		    hardreg_first_use = DF_INSN_LUID (insn);
+		  }
+	    }
 	}
 
       /* The next pass builds the hash table.  */
@@ -1714,6 +1825,19 @@  prune_expressions (bool pre_p)
     {
       for (expr = expr_hash_table.table[ui]; expr; expr = expr->next_same_hash)
 	{
+	  /* For hardreg pre, we assume that all relevant hardregs are
+	     call-clobbered, and set all bits in prune_exprs if the reg is call
+	     clobbered.  If the hardreg were merely call-used, then we would
+	     need to remove the expression from the anticipatable and
+	     transparent bitmaps only (after using this to compute the kills
+	     bitmap).  */
+
+	  if (doing_hardreg_pre_p)
+	    {
+	      bitmap_set_bit (prune_exprs, expr->bitmap_index);
+	      continue;
+	    }
+
 	  /* Note potentially trapping expressions.  */
 	  if (may_trap_p (expr->expr))
 	    {
@@ -1884,6 +2008,9 @@  compute_pre_data (void)
       bitmap_not (ae_kill[bb->index], ae_kill[bb->index]);
     }
 
+  if (doing_hardreg_pre_p)
+    prune_hardreg_uses (transp, &expr_hash_table);
+
   edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc,
 			    ae_kill, &pre_insert_map, &pre_delete_map);
   sbitmap_vector_free (antloc);
@@ -1938,7 +2065,10 @@  pre_expr_reaches_here_p_work (basic_block occr_bb, struct gcse_expr *expr,
 
 	  visited[pred_bb->index] = 1;
 	}
-      /* Ignore this predecessor if it kills the expression.  */
+      /* Ignore this predecessor if it kills the expression.
+
+	 If this were used for hardreg pre, then it would need to use the kills
+	 bitmap.  */
       else if (! bitmap_bit_p (transp[pred_bb->index], expr->bitmap_index))
 	visited[pred_bb->index] = 1;
 
@@ -2109,6 +2239,51 @@  insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)
     }
 }
 
+/* Insert instruction PAT at the start of basic block BB, and return the
+   newly added insn.  */
+
+rtx_insn *
+insert_insn_start_basic_block (rtx_insn *pat, basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  gcc_assert (pat && INSN_P (pat));
+  rtx_insn *new_insn = emit_insn_before_noloc (pat, insn, bb);
+
+  while (pat != NULL_RTX)
+    {
+      if (INSN_P (pat))
+	add_label_notes (PATTERN (pat), new_insn);
+      pat = NEXT_INSN (pat);
+    }
+
+  return new_insn;
+}
+
+/* Add EXPR to the start of basic block BB.
+
+   This is used by hardreg PRE.  */
+
+static void
+insert_insn_start_basic_block (struct gcse_expr *expr, basic_block bb)
+{
+  rtx reg = expr->reaching_reg;
+  int regno = REGNO (reg);
+
+  rtx_insn *insn = process_insert_insn (expr);
+  rtx_insn *new_insn = insert_insn_start_basic_block (insn, bb);
+
+  gcse_create_count++;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "hardreg PRE: start of bb %d, insn %d, ",
+	       bb->index, INSN_UID (new_insn));
+      fprintf (dump_file, "copying expression %d to reg %d\n",
+	       expr->bitmap_index, regno);
+    }
+}
+
 /* Insert partially redundant expressions on edges in the CFG to make
    the expressions fully redundant.  */
 
@@ -2130,7 +2305,8 @@  pre_edge_insert (struct edge_list *edge_list, struct gcse_expr **index_map)
   for (e = 0; e < num_edges; e++)
     {
       int indx;
-      basic_block bb = INDEX_EDGE_PRED_BB (edge_list, e);
+      basic_block pred_bb = INDEX_EDGE_PRED_BB (edge_list, e);
+      basic_block succ_bb = INDEX_EDGE_SUCC_BB (edge_list, e);
 
       for (i = indx = 0; i < set_size; i++, indx += SBITMAP_ELT_BITS)
 	{
@@ -2159,13 +2335,24 @@  pre_edge_insert (struct edge_list *edge_list, struct gcse_expr **index_map)
 
 			/* We can't insert anything on an abnormal and
 			   critical edge, so we insert the insn at the end of
-			   the previous block. There are several alternatives
+			   the previous block.  There are several alternatives
 			   detailed in Morgans book P277 (sec 10.5) for
 			   handling this situation.  This one is easiest for
-			   now.  */
+			   now.
 
+			   For hardreg PRE, this would add an unwanted clobber
+			   of the hardreg, so we instead insert in the
+			   successor block, which may be partially redundant
+			   but is at least correct.  */
 			if (eg->flags & EDGE_ABNORMAL)
-			  insert_insn_end_basic_block (index_map[j], bb);
+			  {
+			    if (doing_hardreg_pre_p)
+			      insert_insn_start_basic_block (index_map[j],
+							     succ_bb);
+			    else
+			      insert_insn_end_basic_block (index_map[j],
+							   pred_bb);
+			  }
 			else
 			  {
 			    insn = process_insert_insn (index_map[j]);
@@ -2175,8 +2362,8 @@  pre_edge_insert (struct edge_list *edge_list, struct gcse_expr **index_map)
 			if (dump_file)
 			  {
 			    fprintf (dump_file, "PRE: edge (%d,%d), ",
-				     bb->index,
-				     INDEX_EDGE_SUCC_BB (edge_list, e)->index);
+				     pred_bb->index,
+				     succ_bb->index);
 			    fprintf (dump_file, "copy expression %d\n",
 				     expr->bitmap_index);
 			  }
@@ -2491,13 +2678,24 @@  pre_delete (void)
 		&& (set = single_set (insn)) != 0
                 && dbg_cnt (pre_insn))
 	      {
-		/* Create a pseudo-reg to store the result of reaching
-		   expressions into.  Get the mode for the new pseudo from
-		   the mode of the original destination pseudo.  */
+		rtx dest = SET_DEST (set);
 		if (expr->reaching_reg == NULL)
-		  expr->reaching_reg = gen_reg_rtx_and_attrs (SET_DEST (set));
+		  {
+		    if (doing_hardreg_pre_p)
+		      /* Use the hardreg as the reaching register.  The
+			 deleted sets will be replaced with noop moves.
+
+			 FIXME: This may change the value of the hardreg in
+			 some debug instructions.  */
+		      expr->reaching_reg = dest;
+		    else
+		      /* Create a pseudo-reg to store the result of reaching
+			 expressions into.  Get the mode for the new pseudo from
+			 the mode of the original destination pseudo.  */
+		      expr->reaching_reg = gen_reg_rtx_and_attrs (SET_DEST (set));
+		  }
 
-		gcse_emit_move_after (SET_DEST (set), expr->reaching_reg, insn);
+		gcse_emit_move_after (dest, expr->reaching_reg, insn);
 		delete_insn (insn);
 		occr->deleted_p = 1;
 		changed = true;
@@ -2561,10 +2759,12 @@  pre_gcse (struct edge_list *edge_list)
 
   changed = pre_delete ();
   did_insert = pre_edge_insert (edge_list, index_map);
-
   /* In other places with reaching expressions, copy the expression to the
-     specially allocated pseudo-reg that reaches the redundant expr.  */
-  pre_insert_copies ();
+     specially allocated pseudo-reg that reaches the redundant expr.  This
+     isn't needed for hardreg PRE.  */
+  if (!doing_hardreg_pre_p)
+    pre_insert_copies ();
+
   if (did_insert)
     {
       commit_edge_insertions ();
@@ -2601,11 +2801,11 @@  one_pre_gcse_pass (void)
 
   alloc_hash_table (&expr_hash_table);
   add_noreturn_fake_exit_edges ();
-  if (flag_gcse_lm)
+  if (flag_gcse_lm && !doing_hardreg_pre_p)
     compute_ld_motion_mems ();
 
   compute_hash_table (&expr_hash_table);
-  if (flag_gcse_lm)
+  if (flag_gcse_lm && !doing_hardreg_pre_p)
     trim_ld_motion_mems ();
   if (dump_file)
     dump_hash_table (dump_file, "Expression", &expr_hash_table);
@@ -2621,7 +2821,7 @@  one_pre_gcse_pass (void)
       free_pre_mem ();
     }
 
-  if (flag_gcse_lm)
+  if (flag_gcse_lm && !doing_hardreg_pre_p)
     free_ld_motion_mems ();
   remove_fake_exit_edges ();
   free_hash_table (&expr_hash_table);
@@ -4028,6 +4228,31 @@  execute_rtl_pre (void)
   return 0;
 }
 
+static unsigned int
+execute_hardreg_pre (void)
+{
+  doing_hardreg_pre_p = true;
+  unsigned int regnos[] = HARDREG_PRE_REGNOS;
+  /* It's possible to avoid this loop, but it isn't worth doing so until
+     hardreg PRE is used for multiple hardregs.  */
+  for (int i = 0; regnos[i] != 0; i++)
+    {
+      int changed;
+      current_hardreg_regno = regnos[i];
+      if (dump_file)
+	fprintf (dump_file, "Entering hardreg PRE for regno %d\n",
+		current_hardreg_regno);
+      delete_unreachable_blocks ();
+      df_analyze ();
+      changed = one_pre_gcse_pass ();
+      flag_rerun_cse_after_global_opts |= changed;
+      if (changed)
+	cleanup_cfg (0);
+    }
+  doing_hardreg_pre_p = false;
+  return 0;
+}
+
 static unsigned int
 execute_rtl_hoist (void)
 {
@@ -4096,6 +4321,56 @@  make_pass_rtl_pre (gcc::context *ctxt)
 
 namespace {
 
+const pass_data pass_data_hardreg_pre =
+{
+  RTL_PASS, /* type */
+  "hardreg_pre", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_PRE, /* tv_id */
+  PROP_cfglayout, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_hardreg_pre : public rtl_opt_pass
+{
+public:
+  pass_hardreg_pre (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *) final override;
+  unsigned int execute (function *) final override
+  {
+    return execute_hardreg_pre ();
+  }
+
+}; // class pass_hardreg_pre
+
+bool
+pass_hardreg_pre::gate (function *fun)
+{
+#ifdef HARDREG_PRE_REGNOS
+  return optimize > 0
+    && !fun->calls_setjmp;
+#else
+  return false;
+#endif
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_hardreg_pre (gcc::context *ctxt)
+{
+  return new pass_hardreg_pre (ctxt);
+}
+
+namespace {
+
 const pass_data pass_data_rtl_hoist =
 {
   RTL_PASS, /* type */
diff --git a/gcc/passes.def b/gcc/passes.def
index 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -462,6 +462,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_rtl_pre);
       NEXT_PASS (pass_rtl_hoist);
+      NEXT_PASS (pass_hardreg_pre);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_rtl_store_motion);
       NEXT_PASS (pass_cse_after_global_opts);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -572,6 +572,7 @@  extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);