[RFC] Prevent the scheduler from moving prefetch instructions when expanding __builtin_prefetch [PR 116713]

Message ID d50531d8-1d97-4fca-ae4b-d12729aebeb7@app.fastmail.com
State RFC
Headers
Series [RFC] Prevent the scheduler from moving prefetch instructions when expanding __builtin_prefetch [PR 116713] |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 fail Test failed

Commit Message

Pietro Monteiro Sept. 27, 2024, 4:26 a.m. UTC
  The prefetch instruction that is emitted by __builtin_prefetch is re-ordered on GCC, but not on clang[0]. GCC's behavior is surprising because when using the builtin you want the instruction to be placed at the exact point where you put it. Moving it around, specially across load/stores, may end up being a pessimization. Adding a blockage instruction before the prefetch prevents the scheduler from moving it.

[0] https://godbolt.org/z/Ycjr7Tq8b


-- 8< --
  

Comments

Oleg Endo Sept. 27, 2024, 5:31 a.m. UTC | #1
On Fri, 2024-09-27 at 00:26 -0400, Pietro Monteiro wrote:
> The prefetch instruction that is emitted by __builtin_prefetch is re-ordered on GCC, but not on clang[0]. GCC's behavior is surprising because when using the builtin you want the instruction to be placed at the exact point where you put it. Moving it around, specially across load/stores, may end up being a pessimization. Adding a blockage instruction before the prefetch prevents the scheduler from moving it.
> 
> [0] https://godbolt.org/z/Ycjr7Tq8b
> 
> 

As pointed out by Andrew in the PR, this issue has been around for a while:

see
https://gcc.gnu.org/legacy-ml/gcc-patches/2009-05/msg01233.html
and
https://gcc.gnu.org/legacy-ml/gcc-patches/2009-09/msg00130.html

However, those fix the reordering only for sched2.  Hypothetically the
reordering could happen also before sched2 by some other RTL pass.  So
perhaps it's better to insert the barrier right when the builtin is emitted.

Best regards,
Oleg Endo






> -- 8< --
> 
> 
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index 37c7c98e5c..fec751e0d6 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -1329,7 +1329,12 @@ expand_builtin_prefetch (tree exp)
>        create_integer_operand (&ops[1], INTVAL (op1));
>        create_integer_operand (&ops[2], INTVAL (op2));
>        if (maybe_expand_insn (targetm.code_for_prefetch, 3, ops))
> -	return;
> +        {
> +          /* Prevent the prefetch from being moved.  */
> +          rtx_insn *last = get_last_insn ();
> +          emit_insn_before (gen_blockage (), last);
> +          return;
> +        }
>      }
>  
>    /* Don't do anything with direct references to volatile memory, but
  
Richard Biener Sept. 27, 2024, 7 a.m. UTC | #2
On Fri, Sep 27, 2024 at 6:27 AM Pietro Monteiro
<pietro@sociotechnical.xyz> wrote:
>
> The prefetch instruction that is emitted by __builtin_prefetch is re-ordered on GCC, but not on clang[0]. GCC's behavior is surprising because when using the builtin you want the instruction to be placed at the exact point where you put it. Moving it around, specially across load/stores, may end up being a pessimization. Adding a blockage instruction before the prefetch prevents the scheduler from moving it.
>
> [0] https://godbolt.org/z/Ycjr7Tq8b

I think the testcase is quite broken (aka not real-world).  I would also suggest
that a hard scheduling barrier isn't the correct tool (see Olegs response), but
instead prefetch should properly model a data dependence so it only blocks
code motion for dependent accesses.  On the GIMPLE side disambiguation
happens solely based on pointer analysis, on RTL where prefetch is likely
an UNSPEC I would suggest to model the dependence as a USE/CLOBBER
pair of a MEM.

Richard.

> -- 8< --
>
>
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index 37c7c98e5c..fec751e0d6 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -1329,7 +1329,12 @@ expand_builtin_prefetch (tree exp)
>        create_integer_operand (&ops[1], INTVAL (op1));
>        create_integer_operand (&ops[2], INTVAL (op2));
>        if (maybe_expand_insn (targetm.code_for_prefetch, 3, ops))
> -       return;
> +        {
> +          /* Prevent the prefetch from being moved.  */
> +          rtx_insn *last = get_last_insn ();
> +          emit_insn_before (gen_blockage (), last);
> +          return;
> +        }
>      }
>
>    /* Don't do anything with direct references to volatile memory, but
  

Patch

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 37c7c98e5c..fec751e0d6 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -1329,7 +1329,12 @@  expand_builtin_prefetch (tree exp)
       create_integer_operand (&ops[1], INTVAL (op1));
       create_integer_operand (&ops[2], INTVAL (op2));
       if (maybe_expand_insn (targetm.code_for_prefetch, 3, ops))
-	return;
+        {
+          /* Prevent the prefetch from being moved.  */
+          rtx_insn *last = get_last_insn ();
+          emit_insn_before (gen_blockage (), last);
+          return;
+        }
     }
 
   /* Don't do anything with direct references to volatile memory, but