[v4] aarch64: New RTL optimization pass avoid-store-forwarding.

Message ID 20231204180042.12450-1-manos.anagnostakis@vrull.eu
State New
Series [v4] aarch64: New RTL optimization pass avoid-store-forwarding.

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Testing passed

Commit Message

Manos Anagnostakis Dec. 4, 2023, 6 p.m. UTC
This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).

This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
through testing on ampere1/ampere1a machines.

For example, it can transform cases like

str  d5, [sp, #320]
fmul d5, d31, d29
ldp  d31, d17, [sp, #312] # Large load from small store

to

str  d5, [sp, #320]
fmul d5, d31, d29
ldr  d31, [sp, #312]
ldr  d17, [sp, #320]
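
For a source-level picture, C code of roughly this shape can produce the
sequence above (an illustrative sketch modelled on the ldp_ssll_overlap.c
testcase added by this patch; the function name is made up):

double
ssll_overlap (double *st_arr, double *ld_arr, double i, double dummy)
{
  st_arr[0] = i;                  /* the small store (str) */
  ld_arr[0] = dummy;              /* possibly-aliasing store in between */
  return st_arr[0] + st_arr[1];   /* adjacent loads, fused into the ldp */
}

The pass is exercised with e.g. -O2 -mavoid-store-forwarding, optionally
together with --param=aarch64-store-forwarding-threshold=N.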

Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.

If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
and possibly on other architectures as well, without needing to be turned on by this option.

Bootstrapped and regtested on aarch64-linux.

gcc/ChangeLog:

        * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
        * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
        * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
        * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
	(aarch64-store-forwarding-threshold): New param.
        * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
        * doc/invoke.texi: Document new option and new param.
        * config/aarch64/aarch64-store-forwarding.cc: New file.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
        * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
        * gcc.target/aarch64/ldp_ssll_overlap.c: New test.

Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
---
Changes in v4:
	- I had problems making cselib_subst_to_values work correctly,
	  so I used cselib_lookup to implement the exact same behaviour and
	  record the store value at the time we iterate over it.
	- Removed the store/load_mem_addr check from is_forwarding as
	  unnecessary.
	- The pass is called on all optimization levels right now.
	- The threshold check should remain as it is, as we only care for
	  the front element of the list. The comment above the check explains
	  why a single if is enough.
	- Made the requested documentation changes.
	- Adjusted a comment.

 gcc/config.gcc                                |   1 +
 gcc/config/aarch64/aarch64-passes.def         |   1 +
 gcc/config/aarch64/aarch64-protos.h           |   1 +
 .../aarch64/aarch64-store-forwarding.cc       | 321 ++++++++++++++++++
 gcc/config/aarch64/aarch64.opt                |   9 +
 gcc/config/aarch64/t-aarch64                  |  10 +
 gcc/doc/invoke.texi                           |  11 +-
 .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
 .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
 .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
 10 files changed, 452 insertions(+), 1 deletion(-)
 create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c

--
2.41.0
  

Comments

Richard Sandiford Dec. 4, 2023, 7:21 p.m. UTC | #1
Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
>
> This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
> through testing on ampere1/ampere1a machines.
>
> For example, it can transform cases like
>
> str  d5, [sp, #320]
> fmul d5, d31, d29
> ldp  d31, d17, [sp, #312] # Large load from small store
>
> to
>
> str  d5, [sp, #320]
> fmul d5, d31, d29
> ldr  d31, [sp, #312]
> ldr  d17, [sp, #320]
>
> Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
>
> If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
> and possibly on other architectures as well, without needing to be turned on by this option.
>
> Bootstrapped and regtested on aarch64-linux.
>
> gcc/ChangeLog:
>
>         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
>         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
>         * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
>         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> 	(aarch64-store-forwarding-threshold): New param.
>         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
>         * doc/invoke.texi: Document new option and new param.
>         * config/aarch64/aarch64-store-forwarding.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
>         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
>         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
>
> Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> ---
> Changes in v4:
> 	- I had problems making cselib_subst_to_values work correctly,
> 	  so I used cselib_lookup to implement the exact same behaviour and
> 	  record the store value at the time we iterate over it.
> 	- Removed the store/load_mem_addr check from is_forwarding as
> 	  unnecessary.
> 	- The pass is called on all optimization levels right now.
> 	- The threshold check should remain as it is, as we only care for
> 	  the front element of the list. The comment above the check explains
> 	  why a single if is enough.

I still think this is structurally better as a while.  There's no reason
in principle why we wouldn't want to record the stores in:

	stp	x0, x1, [x4, #8]
	ldp	x0, x1, [x4, #0]
	ldp	x2, x3, [x4, #16]

and then the two stores should have the same distance value.
I realise we don't do that yet, but still.
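
Concretely, that's another reason why the threshold pop wants to be a
loop: both halves of the stp would be recorded with the same insn_cnt,
so more than one front entry can go stale at the same time.  The change
would just be (same condition as in the patch, untested sketch):

  while (!store_exprs.empty ()
         && (insn_cnt - store_exprs.front ().insn_cnt
             > (unsigned int) aarch64_store_forwarding_threshold_param))
    store_exprs.pop_front ();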

> 	- Made the requested documentation changes.
> 	- Adjusted a comment.
>
>  gcc/config.gcc                                |   1 +
>  gcc/config/aarch64/aarch64-passes.def         |   1 +
>  gcc/config/aarch64/aarch64-protos.h           |   1 +
>  .../aarch64/aarch64-store-forwarding.cc       | 321 ++++++++++++++++++
>  gcc/config/aarch64/aarch64.opt                |   9 +
>  gcc/config/aarch64/t-aarch64                  |  10 +
>  gcc/doc/invoke.texi                           |  11 +-
>  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
>  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
>  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
>  10 files changed, 452 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 748430194f3..2ee3b61c4fa 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -350,6 +350,7 @@ aarch64*-*-*)
>  	cxx_target_objs="aarch64-c.o"
>  	d_target_objs="aarch64-d.o"
>  	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> +	extra_objs="${extra_objs} aarch64-store-forwarding.o"
>  	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>  	target_has_targetm_common=yes
>  	;;
> diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> index 6ace797b738..fa79e8adca8 100644
> --- a/gcc/config/aarch64/aarch64-passes.def
> +++ b/gcc/config/aarch64/aarch64-passes.def
> @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
>  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
>  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
>  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index d2718cc87b3..7d9dfa06af9 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1050,6 +1050,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
>  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
>  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
>  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
>
>  poly_uint64 aarch64_regmode_natural_size (machine_mode);
>
> diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> new file mode 100644
> index 00000000000..ae3cbe519cd
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> @@ -0,0 +1,321 @@
> +/* Avoid store forwarding optimization pass.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   Contributed by VRULL GmbH.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define IN_TARGET_CODE 1
> +
> +#include "config.h"
> +#define INCLUDE_LIST
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "alias.h"
> +#include "rtlanal.h"
> +#include "tree-pass.h"
> +#include "cselib.h"
> +
> +/* This is an RTL pass that detects store forwarding from stores to larger
> +   loads (load pairs). For example, it can transform cases like
> +
> +   str  d5, [sp, #320]
> +   fmul d5, d31, d29
> +   ldp  d31, d17, [sp, #312] # Large load from small store
> +
> +   to
> +
> +   str  d5, [sp, #320]
> +   fmul d5, d31, d29
> +   ldr  d31, [sp, #312]
> +   ldr  d17, [sp, #320]
> +
> +   Design: The pass follows a straightforward design.  It starts by
> +   initializing the alias analysis and the cselib.  Both of these are used to
> +   find stores and larger loads with overlapping addresses, which are
> +   candidates for store forwarding optimizations.  It then scans at the basic
> +   block level to find stores that forward to larger loads and handles them
> +   accordingly as described in the above example.  Finally, the alias analysis
> +   and the cselib library are closed.  */
> +
> +typedef struct
> +{
> +  rtx_insn *store_insn;
> +  rtx store_mem_addr;
> +  unsigned int insn_cnt;
> +} store_info;
> +
> +typedef std::list<store_info> list_store_info;
> +
> +/* Statistics counters.  */
> +static unsigned int stats_store_count = 0;
> +static unsigned int stats_ldp_count = 0;
> +static unsigned int stats_ssll_count = 0;
> +static unsigned int stats_transformed_count = 0;
> +
> +/* Default.  */
> +static rtx dummy;
> +static bool is_load (rtx expr, rtx &op_1=dummy);
> +
> +/* Return true if SET expression EXPR is a store; otherwise false.  */
> +
> +static bool
> +is_store (rtx expr)
> +{
> +  return MEM_P (SET_DEST (expr));
> +}
> +
> +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> +   contain the MEM operand of the load.  */
> +
> +static bool
> +is_load (rtx expr, rtx &op_1)
> +{
> +  op_1 = SET_SRC (expr);
> +
> +  if (GET_CODE (op_1) == ZERO_EXTEND
> +      || GET_CODE (op_1) == SIGN_EXTEND)
> +    op_1 = XEXP (op_1, 0);
> +
> +  return MEM_P (op_1);
> +}
> +
> +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> +   STORE_MEM_ADDR.  */
> +
> +static bool
> +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> +{
> +  /* Sometimes we do not have the proper value.  */
> +  if (!CSELIB_VAL_PTR (store_mem_addr))
> +    return false;
> +
> +  gcc_checking_assert (MEM_P (load_mem));
> +
> +  rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
> +  machine_mode load_mem_mode = GET_MODE (load_mem);
> +  load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
> +				 load_mem_mode)->val_rtx;

Like I said in the previous review, it shouldn't be necessary to do any
manual lookup on the load address.  rtx_equal_for_cselib_1 does the
lookup itself.  Does that not work?

The patch is OK with the four lines above deleted, if that works,
and with s/if/while/.  But please reply if that combination doesn't work.
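
If it does work, I'd expect the whole function to reduce to something
like the following (an untested sketch of the suggestion, nothing more):

  static bool
  is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
  {
    /* Sometimes we do not have the proper value.  */
    if (!CSELIB_VAL_PTR (store_mem_addr))
      return false;

    gcc_checking_assert (MEM_P (load_mem));

    /* rtx_equal_for_cselib_1 looks up the load address itself, so no
       explicit get_addr/cselib_lookup is needed here.  */
    return rtx_equal_for_cselib_1 (store_mem_addr, XEXP (load_mem, 0),
                                   store_mem_mode, 0);
  }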

Thanks,
Richard

> +  return rtx_equal_for_cselib_1 (store_mem_addr,
> +				 load_mem_addr,
> +				 store_mem_mode, 0);
> +}
> +
> +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> +   otherwise false.  STORE_EXPRS contains the stores.  */
> +
> +static bool
> +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> +{
> +  unsigned int load_count = 0;
> +  bool forwarding = false;
> +  rtx expr = PATTERN (insn);
> +
> +  if (GET_CODE (expr) != PARALLEL
> +      || XVECLEN (expr, 0) != 2)
> +    return false;
> +
> +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> +    {
> +      rtx op_1;
> +      rtx out_exp = XVECEXP (expr, 0, i);
> +
> +      if (GET_CODE (out_exp) != SET)
> +	continue;
> +
> +      if (!is_load (out_exp, op_1))
> +	continue;
> +
> +      load_count++;
> +
> +      for (store_info str : store_exprs)
> +	{
> +	  rtx store_insn = str.store_insn;
> +
> +	  if (!is_forwarding (str.store_mem_addr, op_1,
> +			      GET_MODE (SET_DEST (PATTERN (store_insn)))))
> +	    continue;
> +
> +	  if (dump_file)
> +	    {
> +	      fprintf (dump_file,
> +		       "Store forwarding to PARALLEL with loads:\n");
> +	      fprintf (dump_file, "  From: ");
> +	      print_rtl_single (dump_file, store_insn);
> +	      fprintf (dump_file, "  To: ");
> +	      print_rtl_single (dump_file, insn);
> +	    }
> +
> +	  forwarding = true;
> +       }
> +    }
> +
> +  if (load_count == 2)
> +    stats_ldp_count++;
> +
> +  return load_count == 2 && forwarding;
> +}
> +
> +/* Break a load pair into its 2 distinct loads, except if the base source
> +   address to load from is overwritten in the first load.  INSN should be the
> +   PARALLEL of the load pair.  */
> +
> +static void
> +break_ldp (rtx_insn *insn)
> +{
> +  rtx expr = PATTERN (insn);
> +
> +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> +
> +  rtx load_0 = XVECEXP (expr, 0, 0);
> +  rtx load_1 = XVECEXP (expr, 0, 1);
> +
> +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> +
> +  /* The base address was overwritten in the first load.  */
> +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> +    return;
> +
> +  emit_insn_before (load_0, insn);
> +  emit_insn_before (load_1, insn);
> +  remove_insn (insn);
> +
> +  stats_transformed_count++;
> +}
> +
> +static void
> +scan_and_transform_bb_level ()
> +{
> +  rtx_insn *insn, *next;
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      list_store_info store_exprs;
> +      unsigned int insn_cnt = 0;
> +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> +	{
> +	  next = NEXT_INSN (insn);
> +
> +	  /* If we cross a CALL_P insn, clear the list, because the
> +	     small-store-to-large-load is unlikely to cause a performance
> +	     difference.  */
> +	  if (CALL_P (insn))
> +	    store_exprs.clear ();
> +
> +	  if (!NONJUMP_INSN_P (insn))
> +	    continue;
> +
> +	  cselib_process_insn (insn);
> +
> +	  rtx expr = single_set (insn);
> +
> +	  /* If a store is encountered, append it to the store_exprs list to
> +	     check it later.  */
> +	  if (expr && is_store (expr))
> +	    {
> +	      rtx store_mem = SET_DEST (expr);
> +	      rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> +	      machine_mode store_mem_mode = GET_MODE (store_mem);
> +	      store_mem_addr = cselib_lookup (store_mem_addr,
> +					      store_mem_mode, 1,
> +					      store_mem_mode)->val_rtx;
> +	      store_exprs.push_back ({ insn, store_mem_addr, insn_cnt++ });
> +	      stats_store_count++;
> +	    }
> +
> +	  /* Check for small-store-to-large-load.  */
> +	  if (is_small_store_to_large_load (store_exprs, insn))
> +	    {
> +	      stats_ssll_count++;
> +	      break_ldp (insn);
> +	    }
> +
> +	  /* Pop the first store from the list if its distance crosses the
> +	     maximum accepted threshold.  The list contains unique values
> +	     sorted in ascending order, meaning that only one distance can be
> +	     off at a time.  */
> +	  if (!store_exprs.empty ()
> +	      && (insn_cnt - store_exprs.front ().insn_cnt
> +		 > (unsigned int) aarch64_store_forwarding_threshold_param))
> +	    store_exprs.pop_front ();
> +	}
> +    }
> +}
> +
> +static void
> +execute_avoid_store_forwarding ()
> +{
> +  init_alias_analysis ();
> +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> +  scan_and_transform_bb_level ();
> +  end_alias_analysis ();
> +  cselib_finish ();
> +  statistics_counter_event (cfun, "Number of stores identified: ",
> +			    stats_store_count);
> +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> +			    stats_ldp_count);
> +  statistics_counter_event (cfun,
> +			    "Number of forwarding cases identified: ",
> +			    stats_ssll_count);
> +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> +			    stats_transformed_count);
> +}
> +
> +const pass_data pass_data_avoid_store_forwarding =
> +{
> +  RTL_PASS, /* type.  */
> +  "avoid_store_forwarding", /* name.  */
> +  OPTGROUP_NONE, /* optinfo_flags.  */
> +  TV_NONE, /* tv_id.  */
> +  0, /* properties_required.  */
> +  0, /* properties_provided.  */
> +  0, /* properties_destroyed.  */
> +  0, /* todo_flags_start.  */
> +  0 /* todo_flags_finish.  */
> +};
> +
> +class pass_avoid_store_forwarding : public rtl_opt_pass
> +{
> +public:
> +  pass_avoid_store_forwarding (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return aarch64_flag_avoid_store_forwarding;
> +    }
> +
> +  virtual unsigned int execute (function *)
> +    {
> +      execute_avoid_store_forwarding ();
> +      return 0;
> +    }
> +
> +}; // class pass_avoid_store_forwarding
> +
> +/* Create a new avoid store forwarding pass instance.  */
> +
> +rtl_opt_pass *
> +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> +{
> +  return new pass_avoid_store_forwarding (ctxt);
> +}
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a1..e4498d53b46 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -304,6 +304,10 @@ moutline-atomics
>  Target Var(aarch64_flag_outline_atomics) Init(2) Save
>  Generate local calls to out-of-line atomic operations.
>
> +mavoid-store-forwarding
> +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> +Avoid store forwarding to load pairs.
> +
>  -param=aarch64-sve-compare-costs=
>  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
>  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
>
>  EnumValue
>  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> +
> +-param=aarch64-store-forwarding-threshold=
> +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> +Maximum instruction distance allowed between a store and a load pair for this to be
> +considered a candidate to avoid when using -mavoid-store-forwarding.
> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index a9a244ab6d6..7639b50358d 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
>  	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>  		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>
> +aarch64-store-forwarding.o: \
> +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> +    $(srcdir)/config/aarch64/aarch64-protos.h
> +	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +		$(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> +
>  comma=,
>  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
>  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 2b51ff304f6..39dbc04207e 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -798,7 +798,7 @@ Objective-C and Objective-C++ Dialects}.
>  -moverride=@var{string}  -mverbose-cost-dump
>  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
>  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> --moutline-atomics }
> +-moutline-atomics -mavoid-store-forwarding}
>
>  @emph{Adapteva Epiphany Options}
>  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> @@ -16738,6 +16738,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
>  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
>  source pointer is aligned to at least double the alignment of the type.
>
> +@item aarch64-store-forwarding-threshold
> +Maximum allowed instruction distance between a store and a load pair for
> +this to be considered a candidate to avoid when using
> +@option{-mavoid-store-forwarding}.
> +
>  @item aarch64-loop-vect-issue-rate-niters
>  The tuning for some AArch64 CPUs tries to take both latencies and issue
>  rates into account when deciding whether a loop should be vectorized
> @@ -20763,6 +20768,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
>  the compiler from using floating-point and Advanced SIMD registers but will not
>  impose any restrictions on the assembler.
>
> +@item -mavoid-store-forwarding
> +@itemx -mno-avoid-store-forwarding
> +Avoid store forwarding to load pairs.
> +
>  @opindex mlittle-endian
>  @item -mlittle-endian
>  Generate little-endian code.  This is the default when GCC is configured for an
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> new file mode 100644
> index 00000000000..b77de6c64b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Different address, same offset, no overlap  */
> +
> +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> +	TYPE r, y; \
> +	st_arr[0] = i; \
> +	ld_arr[0] = dummy; \
> +	r = st_arr_2[0]; \
> +	y = st_arr_2[1]; \
> +	return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> new file mode 100644
> index 00000000000..f1b3a66abfd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, different offset, no overlap  */
> +
> +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> +	TYPE r, y; \
> +	st_arr[0] = i; \
> +	ld_arr[0] = dummy; \
> +	r = st_arr[10]; \
> +	y = st_arr[11]; \
> +	return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> new file mode 100644
> index 00000000000..8d5ce5cc87e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, same offset, overlap  */
> +
> +#define LDP_SSLL_OVERLAP(TYPE) \
> +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> +	TYPE r, y; \
> +	st_arr[0] = i; \
> +	ld_arr[0] = dummy; \
> +	r = st_arr[0]; \
> +	y = st_arr[1]; \
> +	return r + y; \
> +}
> +
> +LDP_SSLL_OVERLAP(uint32_t)
> +LDP_SSLL_OVERLAP(uint64_t)
> +LDP_SSLL_OVERLAP(int32_t)
> +LDP_SSLL_OVERLAP(int64_t)
> +LDP_SSLL_OVERLAP(int)
> +LDP_SSLL_OVERLAP(long)
> +LDP_SSLL_OVERLAP(float)
> +LDP_SSLL_OVERLAP(double)
> +LDP_SSLL_OVERLAP(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> --
> 2.41.0
  
Manos Anagnostakis Dec. 4, 2023, 7:43 p.m. UTC | #2
On Mon, Dec 4, 2023, 21:22 Richard Sandiford <richard.sandiford@arm.com> wrote:

> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> > [...]
> >       - The threshold check should remain as it is, as we only care for
> >         the front element of the list. The comment above the check explains
> >         why a single if is enough.
>
> I still think this is structurally better as a while.  There's no reason
> in principle why we wouldn't want to record the stores in:
>
>         stp     x0, x1, [x4, #8]
>         ldp     x0, x1, [x4, #0]
>         ldp     x2, x3, [x4, #16]
>
> and then the two stores should have the same distance value.
> I realise we don't do that yet, but still.
>
Ah, you mean forwarding from stp. I was a bit confused about what you meant
the previous time. This was not initially meant for this patch, but I think
it wouldn't take long to implement that before pushing this. It is your
call, of course, whether I should include it.

>
> > [...]
> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> > +   STORE_MEM_ADDR.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> > +{
> > +  /* Sometimes we do not have the proper value.  */
> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > +    return false;
> > +
> > +  gcc_checking_assert (MEM_P (load_mem));
> > +
> > +  rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
> > +  machine_mode load_mem_mode = GET_MODE (load_mem);
> > +  load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
> > +                              load_mem_mode)->val_rtx;
>
> Like I said in the previous review, it shouldn't be necessary to do any
> manual lookup on the load address.  rtx_equal_for_cselib_1 does the
> lookup itself.  Does that not work?
>
I thought you meant only that the if check was redundant here, which it
was. I'll try it and reply on whether cselib can handle the load all by
itself.

Thanks for the review!
Manos.

>
> The patch is OK with the four lines above deleted, if that works,
> and with s/if/while/.  But please reply if that combination doesn't work.
>
> Thanks,
> Richard
>
> > [...]
  
Richard Sandiford Dec. 4, 2023, 8:35 p.m. UTC | #3
Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> On Mon, Dec 4, 2023, 21:22 Richard Sandiford <richard.sandiford@arm.com> wrote:
>
>> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
>> > [...]
>> >       - The threshold check should remain as it is as we only care for
>> >         the front element of the list. The comment above the check
>> explains
>> >         why a single if is enough.
>>
>> I still think this is structurally better as a while.  There's no reason
>> in principle why we wouldn't want to record the stores in:
>>
>>         stp     x0, x1, [x4, #8]
>>         ldp     x0, x1, [x4, #0]
>>         ldp     x2, x3, [x4, #16]
>>
>> and then the two stores should have the same distance value.
>> I realise we don't do that yet, but still.
>>
> Ah, you mean forwarding from stp. I was a bit confused about what you meant
> the previous time. This was not initially meant for this patch, but I think
> it wouldn't take long to implement that before pushing this. It is of
> course your call whether I should include it.

No strong opinion either way, really.  It's definitely fine to check
only for STRs initially.  I'd just rather not bake that assumption into
places where we don't need to.

If you do check for STPs, it'd probably be worth committing the pass
without it at first, waiting until Alex's patch is in, and then doing
STP support as a separate follow-up patch.  That's because Alex's patch
will convert STPs to using a single double-width memory operand.

(Waiting shouldn't jeopardise the chances of the patch going in, since
the pass was posted well in time and the hold-up isn't your fault.)
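
For concreteness, the while-based form being suggested would look something
like this (a sketch against the v4 threshold check, not part of the posted
patch):

  /* Pop every store whose distance exceeds the threshold; once STP-style
     stores are recorded, several list entries could share one distance.  */
  while (!store_exprs.empty ()
	 && (insn_cnt - store_exprs.front ().insn_cnt
	     > (unsigned int) aarch64_store_forwarding_threshold_param))
    store_exprs.pop_front ();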

> [...]
>> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of
>> LOAD_MEM;
>> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx
>> containing
>> > +   STORE_MEM_ADDR.  */
>> > +
>> > +static bool
>> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode
>> store_mem_mode)
>> > +{
>> > +  /* Sometimes we do not have the proper value.  */
>> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
>> > +    return false;
>> > +
>> > +  gcc_checking_assert (MEM_P (load_mem));
>> > +
>> > +  rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
>> > +  machine_mode load_mem_mode = GET_MODE (load_mem);
>> > +  load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
>> > +                              load_mem_mode)->val_rtx;
>>
>> Like I said in the previous review, it shouldn't be necessary to do any
>> manual lookup on the load address.  rtx_equal_for_cselib_1 does the
>> lookup itself.  Does that not work?
>>
> I thought you meant only that the if check was redundant here, which it
> was.

Ah, no, I meant the full lookup, sorry.

> I'll reply if cselib can handle the load all by itself.

Thanks!

Richard
  
Manos Anagnostakis Dec. 5, 2023, 2:30 p.m. UTC | #4
Hi Richard,

the patch works correctly with the four lines deleted and "get_addr" on
"load_mem" inlined into the "rtx_equal_for_cselib_1" call.
I also changed store_info to str_info to avoid a duplicate-name warning
during bootstrap (one definition rule).
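
With those four lines gone, is_forwarding presumably reduces to something
like the following (a sketch of the v5 shape; the v5 posting itself is
authoritative):

static bool
is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
{
  /* Sometimes we do not have the proper value.  */
  if (!CSELIB_VAL_PTR (store_mem_addr))
    return false;

  gcc_checking_assert (MEM_P (load_mem));

  /* rtx_equal_for_cselib_1 performs the cselib lookup on the load address
     itself, so no manual cselib_lookup is needed here.  */
  return rtx_equal_for_cselib_1 (store_mem_addr,
				 get_addr (XEXP (load_mem, 0)),
				 store_mem_mode, 0);
}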

Rebased on top of your sme2 changes and submitted as v5.

Thanks!
Manos.

On Mon, Dec 4, 2023 at 10:35 PM Richard Sandiford <richard.sandiford@arm.com>
wrote:

> [...]
  

Patch

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 748430194f3..2ee3b61c4fa 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -350,6 +350,7 @@  aarch64*-*-*)
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
 	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
+	extra_objs="${extra_objs} aarch64-store-forwarding.o"
 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
 	target_has_targetm_common=yes
 	;;
diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 6ace797b738..fa79e8adca8 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -23,3 +23,4 @@  INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
 INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
+INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index d2718cc87b3..7d9dfa06af9 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1050,6 +1050,7 @@  rtl_opt_pass *make_pass_track_speculation (gcc::context *);
 rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
 rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
 rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
+rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);

 poly_uint64 aarch64_regmode_natural_size (machine_mode);

diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
new file mode 100644
index 00000000000..ae3cbe519cd
--- /dev/null
+++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
@@ -0,0 +1,321 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#define INCLUDE_LIST
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "alias.h"
+#include "rtlanal.h"
+#include "tree-pass.h"
+#include "cselib.h"
+
+/* This is an RTL pass that detects store forwarding from stores to larger
+   loads (load pairs).  For example, it can transform cases like
+
+   str  d5, [sp, #320]
+   fmul d5, d31, d29
+   ldp  d31, d17, [sp, #312] # Large load from small store
+
+   to
+
+   str  d5, [sp, #320]
+   fmul d5, d31, d29
+   ldr  d31, [sp, #312]
+   ldr  d17, [sp, #320]
+
+   Design: The pass follows a straightforward design.  It starts by
+   initializing the alias analysis and cselib.  Both of these are used to
+   find stores and larger loads with overlapping addresses, which are
+   candidates for store forwarding optimizations.  It then scans at the basic
+   block level to find stores that forward to larger loads and handles them
+   accordingly as described in the above example.  Finally, the alias analysis
+   and cselib are closed.  */
+
+typedef struct
+{
+  rtx_insn *store_insn;
+  rtx store_mem_addr;
+  unsigned int insn_cnt;
+} store_info;
+
+typedef std::list<store_info> list_store_info;
+
+/* Statistics counters.  */
+static unsigned int stats_store_count = 0;
+static unsigned int stats_ldp_count = 0;
+static unsigned int stats_ssll_count = 0;
+static unsigned int stats_transformed_count = 0;
+
+/* Dummy rtx used as the default for is_load's OP_1 argument when the
+   caller does not need the MEM operand.  */
+static rtx dummy;
+static bool is_load (rtx expr, rtx &op_1 = dummy);
+
+/* Return true if SET expression EXPR is a store; otherwise false.  */
+
+static bool
+is_store (rtx expr)
+{
+  return MEM_P (SET_DEST (expr));
+}
+
+/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
+   contain the MEM operand of the load.  */
+
+static bool
+is_load (rtx expr, rtx &op_1)
+{
+  op_1 = SET_SRC (expr);
+
+  if (GET_CODE (op_1) == ZERO_EXTEND
+      || GET_CODE (op_1) == SIGN_EXTEND)
+    op_1 = XEXP (op_1, 0);
+
+  return MEM_P (op_1);
+}
+
+/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
+   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
+   STORE_MEM_ADDR.  */
+
+static bool
+is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
+{
+  /* Sometimes we do not have the proper value.  */
+  if (!CSELIB_VAL_PTR (store_mem_addr))
+    return false;
+
+  gcc_checking_assert (MEM_P (load_mem));
+
+  rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
+  machine_mode load_mem_mode = GET_MODE (load_mem);
+  load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
+				 load_mem_mode)->val_rtx;
+
+  return rtx_equal_for_cselib_1 (store_mem_addr,
+				 load_mem_addr,
+				 store_mem_mode, 0);
+}
+
+/* Return true if INSN is a load pair, preceded by a store forwarding to it;
+   otherwise false.  STORE_EXPRS contains the stores.  */
+
+static bool
+is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
+{
+  unsigned int load_count = 0;
+  bool forwarding = false;
+  rtx expr = PATTERN (insn);
+
+  if (GET_CODE (expr) != PARALLEL
+      || XVECLEN (expr, 0) != 2)
+    return false;
+
+  for (int i = 0; i < XVECLEN (expr, 0); i++)
+    {
+      rtx op_1;
+      rtx out_exp = XVECEXP (expr, 0, i);
+
+      if (GET_CODE (out_exp) != SET)
+	continue;
+
+      if (!is_load (out_exp, op_1))
+	continue;
+
+      load_count++;
+
+      for (store_info str : store_exprs)
+	{
+	  rtx store_insn = str.store_insn;
+
+	  if (!is_forwarding (str.store_mem_addr, op_1,
+			      GET_MODE (SET_DEST (PATTERN (store_insn)))))
+	    continue;
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "Store forwarding to PARALLEL with loads:\n");
+	      fprintf (dump_file, "  From: ");
+	      print_rtl_single (dump_file, store_insn);
+	      fprintf (dump_file, "  To: ");
+	      print_rtl_single (dump_file, insn);
+	    }
+
+	  forwarding = true;
+	}
+    }
+
+  if (load_count == 2)
+    stats_ldp_count++;
+
+  return load_count == 2 && forwarding;
+}
+
+/* Break a load pair into its two distinct loads, except if the base source
+   address to load from is overwritten in the first load.  INSN should be the
+   PARALLEL of the load pair.  */
+
+static void
+break_ldp (rtx_insn *insn)
+{
+  rtx expr = PATTERN (insn);
+
+  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
+
+  rtx load_0 = XVECEXP (expr, 0, 0);
+  rtx load_1 = XVECEXP (expr, 0, 1);
+
+  gcc_checking_assert (is_load (load_0) && is_load (load_1));
+
+  /* The base address was overwritten in the first load.  */
+  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
+    return;
+
+  emit_insn_before (load_0, insn);
+  emit_insn_before (load_1, insn);
+  remove_insn (insn);
+
+  stats_transformed_count++;
+}
+
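+/* Scan the insns of each basic block, recording candidate stores and
+   breaking up any load pair that one of them forwards to.  */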
+static void
+scan_and_transform_bb_level ()
+{
+  rtx_insn *insn, *next;
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      list_store_info store_exprs;
+      unsigned int insn_cnt = 0;
+      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
+	{
+	  next = NEXT_INSN (insn);
+
+	  /* If we cross a CALL_P insn, clear the list, because a
+	     small-store-to-large-load across a call is unlikely to cause a
+	     performance difference.  */
+	  if (CALL_P (insn))
+	    store_exprs.clear ();
+
+	  if (!NONJUMP_INSN_P (insn))
+	    continue;
+
+	  cselib_process_insn (insn);
+
+	  rtx expr = single_set (insn);
+
+	  /* If a store is encountered, append it to the store_exprs list to
+	     check it later.  */
+	  if (expr && is_store (expr))
+	    {
+	      rtx store_mem = SET_DEST (expr);
+	      rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
+	      machine_mode store_mem_mode = GET_MODE (store_mem);
+	      store_mem_addr = cselib_lookup (store_mem_addr,
+					      store_mem_mode, 1,
+					      store_mem_mode)->val_rtx;
+	      store_exprs.push_back ({ insn, store_mem_addr, insn_cnt++ });
+	      stats_store_count++;
+	    }
+
+	  /* Check for small-store-to-large-load.  */
+	  if (is_small_store_to_large_load (store_exprs, insn))
+	    {
+	      stats_ssll_count++;
+	      break_ldp (insn);
+	    }
+
+	  /* Pop the first store from the list if its distance exceeds the
+	     maximum accepted threshold.  The list contains unique values
+	     sorted in ascending order, meaning that only one distance can be
+	     off at a time.  */
+	  if (!store_exprs.empty ()
+	      && (insn_cnt - store_exprs.front ().insn_cnt
+		 > (unsigned int) aarch64_store_forwarding_threshold_param))
+	    store_exprs.pop_front ();
+	}
+    }
+}
+
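+/* Set up alias analysis and cselib, run the transformation, then tear
+   both down again and record the statistics counters.  */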
+static void
+execute_avoid_store_forwarding ()
+{
+  init_alias_analysis ();
+  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
+  scan_and_transform_bb_level ();
+  end_alias_analysis ();
+  cselib_finish ();
+  statistics_counter_event (cfun, "Number of stores identified: ",
+			    stats_store_count);
+  statistics_counter_event (cfun, "Number of load pairs identified: ",
+			    stats_ldp_count);
+  statistics_counter_event (cfun,
+			    "Number of forwarding cases identified: ",
+			    stats_ssll_count);
+  statistics_counter_event (cfun, "Number of trasformed cases: ",
+			    stats_transformed_count);
+}
+
+const pass_data pass_data_avoid_store_forwarding =
+{
+  RTL_PASS, /* type.  */
+  "avoid_store_forwarding", /* name.  */
+  OPTGROUP_NONE, /* optinfo_flags.  */
+  TV_NONE, /* tv_id.  */
+  0, /* properties_required.  */
+  0, /* properties_provided.  */
+  0, /* properties_destroyed.  */
+  0, /* todo_flags_start.  */
+  0 /* todo_flags_finish.  */
+};
+
+class pass_avoid_store_forwarding : public rtl_opt_pass
+{
+public:
+  pass_avoid_store_forwarding (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return aarch64_flag_avoid_store_forwarding;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      execute_avoid_store_forwarding ();
+      return 0;
+    }
+
+}; // class pass_avoid_store_forwarding
+
+/* Create a new avoid store forwarding pass instance.  */
+
+rtl_opt_pass *
+make_pass_avoid_store_forwarding (gcc::context *ctxt)
+{
+  return new pass_avoid_store_forwarding (ctxt);
+}
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index f5a518202a1..e4498d53b46 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -304,6 +304,10 @@  moutline-atomics
 Target Var(aarch64_flag_outline_atomics) Init(2) Save
 Generate local calls to out-of-line atomic operations.

+mavoid-store-forwarding
+Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
+Avoid store forwarding to load pairs.
+
 -param=aarch64-sve-compare-costs=
 Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
 When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
@@ -360,3 +364,8 @@  Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)

 EnumValue
 Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
+
+-param=aarch64-store-forwarding-threshold=
+Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
+Maximum instruction distance allowed between a store and a load pair for this to be
+considered a candidate to avoid when using -mavoid-store-forwarding.
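
(Usage note, not part of the patch: the option and the param combine on the
command line as, for example,

  gcc -O2 -mavoid-store-forwarding --param=aarch64-store-forwarding-threshold=10 test.c

where 10 is an arbitrary illustrative value.)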
diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
index a9a244ab6d6..7639b50358d 100644
--- a/gcc/config/aarch64/t-aarch64
+++ b/gcc/config/aarch64/t-aarch64
@@ -176,6 +176,16 @@  aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc

+aarch64-store-forwarding.o: \
+    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
+    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
+    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
+    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
+    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
+    $(srcdir)/config/aarch64/aarch64-protos.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/aarch64/aarch64-store-forwarding.cc
+
 comma=,
 MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
 MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2b51ff304f6..39dbc04207e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -798,7 +798,7 @@  Objective-C and Objective-C++ Dialects}.
 -moverride=@var{string}  -mverbose-cost-dump
 -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
 -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
--moutline-atomics }
+-moutline-atomics -mavoid-store-forwarding}

 @emph{Adapteva Epiphany Options}
 @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
@@ -16738,6 +16738,11 @@  With @option{--param=aarch64-stp-policy=never}, do not emit stp.
 With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
 source pointer is aligned to at least double the alignment of the type.

+@item aarch64-store-forwarding-threshold
+Maximum allowed instruction distance between a store and a load pair for
+this to be considered a candidate to avoid when using
+@option{-mavoid-store-forwarding}.
+
 @item aarch64-loop-vect-issue-rate-niters
 The tuning for some AArch64 CPUs tries to take both latencies and issue
 rates into account when deciding whether a loop should be vectorized
@@ -20763,6 +20768,10 @@  Generate code which uses only the general-purpose registers.  This will prevent
 the compiler from using floating-point and Advanced SIMD registers but will not
 impose any restrictions on the assembler.

+@item -mavoid-store-forwarding
+@itemx -mno-avoid-store-forwarding
+Avoid store forwarding to load pairs.
+
 @opindex mlittle-endian
 @item -mlittle-endian
 Generate little-endian code.  This is the default when GCC is configured for an
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
new file mode 100644
index 00000000000..b77de6c64b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
@@ -0,0 +1,33 @@ 
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Different address, same offset, no overlap.  */
+
+#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
+TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
+	TYPE r, y; \
+	st_arr[0] = i; \
+	ld_arr[0] = dummy; \
+	r = st_arr_2[0]; \
+	y = st_arr_2[1]; \
+	return r + y; \
+}
+
+LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int)
+LDP_SSLL_NO_OVERLAP_ADDRESS(long)
+LDP_SSLL_NO_OVERLAP_ADDRESS(float)
+LDP_SSLL_NO_OVERLAP_ADDRESS(double)
+LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
new file mode 100644
index 00000000000..f1b3a66abfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
@@ -0,0 +1,33 @@ 
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Same address, different offset, no overlap.  */
+
+#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
+TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
+	TYPE r, y; \
+	st_arr[0] = i; \
+	ld_arr[0] = dummy; \
+	r = st_arr[10]; \
+	y = st_arr[11]; \
+	return r + y; \
+}
+
+LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int)
+LDP_SSLL_NO_OVERLAP_OFFSET(long)
+LDP_SSLL_NO_OVERLAP_OFFSET(float)
+LDP_SSLL_NO_OVERLAP_OFFSET(double)
+LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
new file mode 100644
index 00000000000..8d5ce5cc87e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
@@ -0,0 +1,33 @@ 
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Same address, same offset, overlap.  */
+
+#define LDP_SSLL_OVERLAP(TYPE) \
+TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
+	TYPE r, y; \
+	st_arr[0] = i; \
+	ld_arr[0] = dummy; \
+	r = st_arr[0]; \
+	y = st_arr[1]; \
+	return r + y; \
+}
+
+LDP_SSLL_OVERLAP(uint32_t)
+LDP_SSLL_OVERLAP(uint64_t)
+LDP_SSLL_OVERLAP(int32_t)
+LDP_SSLL_OVERLAP(int64_t)
+LDP_SSLL_OVERLAP(int)
+LDP_SSLL_OVERLAP(long)
+LDP_SSLL_OVERLAP(float)
+LDP_SSLL_OVERLAP(double)
+LDP_SSLL_OVERLAP(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */