match.pd, v2: Avoid introducing UB in the a r<< (32-b) -> a r>> b optimization [PR117927]

Message ID: Z35aP1ndQSpNKxBe@tucnak
State: New
Series: match.pd, v2: Avoid introducing UB in the a r<< (32-b) -> a r>> b optimization [PR117927]

Checks

Context                                        | Check   | Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 | success | Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm     | success | Build passed

Commit Message

Jakub Jelinek Jan. 8, 2025, 10:58 a.m. UTC
  On Wed, Jan 08, 2025 at 10:17:59AM +0100, Richard Biener wrote:
> > As mentioned in the PR, the a r<< (bitsize-b) to a r>> b and similar
> > match.pd optimization, which was introduced in GCC 15, can introduce
> > UB which wasn't there before: in particular, if b is equal at runtime
> > to bitsize, then a r<< 0 is turned into a r>> bitsize.
> > 
> > The following patch fixes it by doing the optimization early only if
> > VRP tells us the count isn't equal to the bitsize, and late into
> > a r>> (b & (bitsize - 1)) if bitsize is a power of two and the
> > subtraction has a single use; on various targets the masking then goes
> > away because their rotate instructions do the masking already.  The
> > latter can't be done too early though, because then the
> > expr_not_equal_to case is basically useless and we would always
> > introduce the masking and couldn't find out anymore that there was
> > originally no masking.  Even a cfun->after_inlining check would be too
> > early, as there is a forwprop before vrp, so the patch introduces a new
> > PROP for the vrp2 finish, after which there is another forwprop.  Or do
> > you have ideas about what other condition could be used for late
> > matching only?
> > 
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> Maybe instead add PROP_last_full_fold and set that at the start of the
> last forwprop?  I think in the end we may want to have a special
> pre-expand (at ISEL time?) folding and have extra patterns enabled
> there (and some disabled eventually), but I think we don't want to
> introduce this now.
> 
> With the rest I agree.

So like this?  Works in quick testing, ok for trunk if it passes full
bootstrap/regtest?

2025-01-08  Jakub Jelinek  <jakub@redhat.com>
	    Andrew Pinski  <quic_apinski@quicinc.com>

	PR tree-optimization/117927
	* tree-pass.h (PROP_last_full_fold): Define.
	* passes.def: Add last= parameters to pass_forwprop.
	* tree-ssa-forwprop.cc (pass_forwprop): Add last_p non-static
	data member and initialize it in the ctor.
	(pass_forwprop::set_pass_param): New method.
	(pass_forwprop::execute): Set PROP_last_full_fold in curr_properties
	at the start if last_p.
	* match.pd (a rrotate (32-b) -> a lrotate b): Only optimize either
	if @2 is known not to be equal to prec or if during/after last
	forwprop the subtraction has single use and prec is power of two; in
	that case transform it into orotate by masked count.

	* gcc.dg/tree-ssa/pr117927.c: New test.



	Jakub
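
To make the hazard concrete, here is a minimal, self-contained C sketch
(the rotl32/rotr32 helpers are hypothetical, not part of the patch): the
swapped rotate is only an identity when the count stays below the bit
width, which is exactly what the masked late form guarantees.

#include <assert.h>

/* UB-free reference rotates with masked counts; they model the rotate
   semantics assumed above, where a count equal to the bit width is not
   allowed.  Hypothetical helpers, not part of the patch.  */
static inline unsigned
rotl32 (unsigned x, unsigned n)
{
  return (x << (n & 31)) | (x >> (-n & 31));
}

static inline unsigned
rotr32 (unsigned x, unsigned n)
{
  return (x >> (n & 31)) | (x << (-n & 31));
}

int
main (void)
{
  unsigned a = 0x12345678u;
  /* a r<< (32 - b) == a r>> (b & 31) for every b including b == 32,
     where the old unconditional rewrite to a r>> b would have used an
     out-of-range rotate count.  */
  for (unsigned b = 0; b <= 32; b++)
    assert (rotl32 (a, 32 - b) == rotr32 (a, b & 31));
  return 0;
}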
  

Comments

Richard Biener Jan. 8, 2025, 11:59 a.m. UTC | #1
On Wed, 8 Jan 2025, Jakub Jelinek wrote:

> On Wed, Jan 08, 2025 at 10:17:59AM +0100, Richard Biener wrote:
> > > As mentioned in the PR, the a r<< (bitsize-b) to a r>> b and similar
> > > match.pd optimization, which was introduced in GCC 15, can introduce
> > > UB which wasn't there before: in particular, if b is equal at runtime
> > > to bitsize, then a r<< 0 is turned into a r>> bitsize.
> > > 
> > > The following patch fixes it by doing the optimization early only if
> > > VRP tells us the count isn't equal to the bitsize, and late into
> > > a r>> (b & (bitsize - 1)) if bitsize is a power of two and the
> > > subtraction has a single use; on various targets the masking then
> > > goes away because their rotate instructions do the masking already.
> > > The latter can't be done too early though, because then the
> > > expr_not_equal_to case is basically useless and we would always
> > > introduce the masking and couldn't find out anymore that there was
> > > originally no masking.  Even a cfun->after_inlining check would be
> > > too early, as there is a forwprop before vrp, so the patch introduces
> > > a new PROP for the vrp2 finish, after which there is another
> > > forwprop.  Or do you have ideas about what other condition could be
> > > used for late matching only?
> > > 
> > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> > 
> > Maybe instead add PROP_last_full_fold and set that at the start of the
> > last forwprop?  I think in the end we may want to have a special
> > pre-expand (at ISEL time?) folding and have extra patterns enabled
> > there (and some disabled eventually), but I think we don't want to
> > introduce this now.
> > 
> > With the rest I agree.
> 
> So like this?  Works in quick testing, ok for trunk if it passes full
> bootstrap/regtest?

LGTM.

Thanks,
Richard.

> 2025-01-08  Jakub Jelinek  <jakub@redhat.com>
> 	    Andrew Pinski  <quic_apinski@quicinc.com>
> 
> 	PR tree-optimization/117927
> 	* tree-pass.h (PROP_last_full_fold): Define.
> 	* passes.def: Add last= parameters to pass_forwprop.
> 	* tree-ssa-forwprop.cc (pass_forwprop): Add last_p non-static
> 	data member and initialize it in the ctor.
> 	(pass_forwprop::set_pass_param): New method.
> 	(pass_forwprop::execute): Set PROP_last_full_fold in curr_properties
> 	at the start if last_p.
> 	* match.pd (a rrotate (32-b) -> a lrotate b): Only optimize either
> 	if @2 is known not to be equal to prec or if during/after last
> 	forwprop the subtraction has single use and prec is power of two; in
> 	that case transform it into orotate by masked count.
> 
> 	* gcc.dg/tree-ssa/pr117927.c: New test.
> 
> --- gcc/tree-pass.h.jj	2025-01-08 10:05:09.634203321 +0100
> +++ gcc/tree-pass.h	2025-01-08 11:22:56.356672128 +0100
> @@ -230,6 +230,7 @@ protected:
>  #define PROP_assumptions_done	(1 << 19)	/* Assume function kept
>  						   around.  */
>  #define PROP_gimple_lbitint	(1 << 20)       /* lowered large _BitInt */
> +#define PROP_last_full_fold	(1 << 21)	/* Start of last forwprop.  */
>  
>  #define PROP_gimple \
>    (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
> --- gcc/passes.def.jj	2025-01-02 11:23:21.976440640 +0100
> +++ gcc/passes.def	2025-01-08 11:24:47.865096546 +0100
> @@ -80,7 +80,7 @@ along with GCC; see the file COPYING3.
>  	  NEXT_PASS (pass_ccp, false /* nonzero_p */);
>  	  /* After CCP we rewrite no longer addressed locals into SSA
>  	     form if possible.  */
> -	  NEXT_PASS (pass_forwprop);
> +	  NEXT_PASS (pass_forwprop, /*last=*/false);
>            NEXT_PASS (pass_early_thread_jumps, /*first=*/true);
>  	  NEXT_PASS (pass_sra_early);
>  	  /* pass_build_ealias is a dummy pass that ensures that we
> @@ -214,7 +214,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_complete_unrolli);
>        NEXT_PASS (pass_backprop);
>        NEXT_PASS (pass_phiprop);
> -      NEXT_PASS (pass_forwprop);
> +      NEXT_PASS (pass_forwprop, /*last=*/false);
>        /* pass_build_alias is a dummy pass that ensures that we
>  	 execute TODO_rebuild_alias at this point.  */
>        NEXT_PASS (pass_build_alias);
> @@ -254,7 +254,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_isolate_erroneous_paths);
>        NEXT_PASS (pass_reassoc, true /* early_p */);
>        NEXT_PASS (pass_dce);
> -      NEXT_PASS (pass_forwprop);
> +      NEXT_PASS (pass_forwprop, /*last=*/false);
>        NEXT_PASS (pass_phiopt, false /* early_p */);
>        NEXT_PASS (pass_ccp, true /* nonzero_p */);
>        /* After CCP we rewrite no longer addressed locals into SSA
> @@ -356,7 +356,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_dce, true /* update_address_taken_p */, true /* remove_unused_locals */);
>        /* After late DCE we rewrite no longer addressed locals into SSA
>  	 form if possible.  */
> -      NEXT_PASS (pass_forwprop);
> +      NEXT_PASS (pass_forwprop, /*last=*/true);
>        NEXT_PASS (pass_sink_code, true /* unsplit edges */);
>        NEXT_PASS (pass_phiopt, false /* early_p */);
>        NEXT_PASS (pass_fold_builtins);
> --- gcc/tree-ssa-forwprop.cc.jj	2025-01-03 17:55:01.067219567 +0100
> +++ gcc/tree-ssa-forwprop.cc	2025-01-08 11:46:05.935096203 +0100
> @@ -4107,14 +4107,22 @@ class pass_forwprop : public gimple_opt_
>  {
>  public:
>    pass_forwprop (gcc::context *ctxt)
> -    : gimple_opt_pass (pass_data_forwprop, ctxt)
> +    : gimple_opt_pass (pass_data_forwprop, ctxt), last_p (false)
>    {}
>  
>    /* opt_pass methods: */
>    opt_pass * clone () final override { return new pass_forwprop (m_ctxt); }
> +  void set_pass_param (unsigned int n, bool param) final override
> +    {
> +      gcc_assert (n == 0);
> +      last_p = param;
> +    }
>    bool gate (function *) final override { return flag_tree_forwprop; }
>    unsigned int execute (function *) final override;
>  
> + private:
> +  /* Determines whether the pass instance should set PROP_last_full_fold.  */
> +  bool last_p;
>  }; // class pass_forwprop
>  
>  unsigned int
> @@ -4123,6 +4131,8 @@ pass_forwprop::execute (function *fun)
>    unsigned int todoflags = 0;
>  
>    cfg_changed = false;
> +  if (last_p)
> +    fun->curr_properties |= PROP_last_full_fold;
>  
>    calculate_dominance_info (CDI_DOMINATORS);
>  
> --- gcc/match.pd.jj	2025-01-08 10:05:24.515994525 +0100
> +++ gcc/match.pd	2025-01-08 11:30:24.822335431 +0100
> @@ -4950,14 +4950,32 @@ (define_operator_list SYNC_FETCH_AND_AND
>  			    build_int_cst (TREE_TYPE (@1),
>  					   element_precision (type)), @1); }))
>  
> -/* a rrotate (32-b) -> a lrotate b */
> -/* a lrotate (32-b) -> a rrotate b */
> +/* a rrotate (bitsize-b) -> a lrotate b */
> +/* a lrotate (bitsize-b) -> a rrotate b */
> +/* Only do the above when it is known that b != bitsize.
> +   Otherwise canonicalize to a orotate (b & mask) when the subtraction
> +   has single use and prec is a power of two.  */
>  (for rotate (lrotate rrotate)
>       orotate (rrotate lrotate)
>   (simplify
> -  (rotate @0 (minus INTEGER_CST@1 @2))
> -   (if (element_precision (TREE_TYPE (@0)) == wi::to_wide (@1))
> -     (orotate @0 @2))))
> +  (rotate @0 (minus@3 INTEGER_CST@1 @2))
> +  (with { auto prec = element_precision (TREE_TYPE (@0)); }
> +   (if (prec == wi::to_wide (@1))
> +    (switch
> +     (if (expr_not_equal_to (@2, wi::uhwi (prec,
> +					   TYPE_PRECISION (TREE_TYPE (@2)))))
> +      (orotate @0 @2))
> +     (if (single_use (@3)
> +	  && pow2p_hwi (prec)
> +	  /* Defer this case until last forwprop, so that VRP could be run and
> +	     expr_not_equal_to had a chance to match.  Otherwise we'd do
> +	     pretty much always just the second case.  */
> +	  && cfun
> +	  && ((cfun->curr_properties & PROP_last_full_fold) != 0
> +	      || !flag_tree_vrp
> +	      || optimize_debug))
> +      (orotate @0
> +	(bit_and @2 { build_int_cst (TREE_TYPE (@2), prec - 1); }))))))))
>  
>  /* Turn (a OP c1) OP c2 into a OP (c1+c2).  */
>  (for op (lrotate rrotate rshift lshift)
> --- gcc/testsuite/gcc.dg/tree-ssa/pr117927.c.jj	2025-01-08 11:22:04.773400985 +0100
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr117927.c	2025-01-08 11:22:04.773400985 +0100
> @@ -0,0 +1,97 @@
> +/* PR tree-optimization/117927 */
> +/* { dg-do compile { target { int32 && longlong64 } } } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { scan-tree-dump-times " r<< " 4 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " r>> " 4 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " & 31;" 2 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " & 63;" 2 "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "32 - " "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "64 - " "optimized" } } */
> +
> +static inline
> +unsigned lrotate32 (unsigned x, int t)
> +{
> +  unsigned tl = x << t;
> +  unsigned th = x >> (-t & 31);
> +  return tl | th;
> +}
> +
> +static inline
> +unsigned rrotate32 (unsigned x, int t)
> +{
> +  unsigned tl = x >> t;
> +  unsigned th = x << (-t & 31);
> +  return tl | th;
> +}
> +
> +static inline
> +unsigned long long lrotate64 (unsigned long long x, int t)
> +{
> +  unsigned long long tl = x << t;
> +  unsigned long long th = x >> (-t & 63);
> +  return tl | th;
> +}
> +
> +static inline
> +unsigned long long rrotate64 (unsigned long long x, int t)
> +{
> +  unsigned long long tl = x >> t;
> +  unsigned long long th = x << (-t & 63);
> +  return tl | th;
> +}
> +
> +unsigned
> +f1 (unsigned x, int t)
> +{
> +  return lrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f2 (unsigned long long x, int t)
> +{
> +  return lrotate64 (x, 64 - t);
> +}
> +
> +unsigned
> +f3 (unsigned x, int t)
> +{
> +  if (t == 32)
> +    __builtin_unreachable ();
> +  return lrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f4 (unsigned long long x, int t)
> +{
> +  if (t == 64)
> +    __builtin_unreachable ();
> +  return lrotate64 (x, 64 - t);
> +}
> +
> +unsigned
> +f5 (unsigned x, int t)
> +{
> +  return rrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f6 (unsigned long long x, int t)
> +{
> +  return rrotate64 (x, 64 - t);
> +}
> +
> +unsigned
> +f7 (unsigned x, int t)
> +{
> +  if (t == 32)
> +    __builtin_unreachable ();
> +  return rrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f8 (unsigned long long x, int t)
> +{
> +  if (t == 64)
> +    __builtin_unreachable ();
> +  return rrotate64 (x, 64 - t);
> +}
> 
> 
> 	Jakub
> 
>
  
Andrew Pinski Jan. 8, 2025, 9:27 p.m. UTC | #2
On Wed, Jan 8, 2025 at 3:08 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Jan 08, 2025 at 10:17:59AM +0100, Richard Biener wrote:
> > > As mentioned in the PR, the a r<< (bitsize-b) to a r>> b and similar
> > > match.pd optimization, which was introduced in GCC 15, can introduce
> > > UB which wasn't there before: in particular, if b is equal at runtime
> > > to bitsize, then a r<< 0 is turned into a r>> bitsize.
> > >
> > > The following patch fixes it by doing the optimization early only if
> > > VRP tells us the count isn't equal to the bitsize, and late into
> > > a r>> (b & (bitsize - 1)) if bitsize is a power of two and the
> > > subtraction has a single use; on various targets the masking then
> > > goes away because their rotate instructions do the masking already.
> > > The latter can't be done too early though, because then the
> > > expr_not_equal_to case is basically useless and we would always
> > > introduce the masking and couldn't find out anymore that there was
> > > originally no masking.  Even a cfun->after_inlining check would be
> > > too early, as there is a forwprop before vrp, so the patch introduces
> > > a new PROP for the vrp2 finish, after which there is another
> > > forwprop.  Or do you have ideas about what other condition could be
> > > used for late matching only?
> > >
> > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> >
> > Maybe instead add PROP_last_full_fold and set that at the start of the
> > last forwprop?  I think in the end we may want to have a special
> > pre-expand (at ISEL time?) folding and have extra patterns enabled
> > there (and some disabled eventually), but I think we don't want to
> > introduce this now.
> >
> > With the rest I agree.
>
> So like this?  Works in quick testing, ok for trunk if it passes full
> bootstrap/regtest?
>
> 2025-01-08  Jakub Jelinek  <jakub@redhat.com>
>             Andrew Pinski  <quic_apinski@quicinc.com>
>
>         PR tree-optimization/117927
>         * tree-pass.h (PROP_last_full_fold): Define.
>         * passes.def: Add last= parameters to pass_forwprop.
>         * tree-ssa-forwprop.cc (pass_forwprop): Add last_p non-static
>         data member and initialize it in the ctor.
>         (pass_forwprop::set_pass_param): New method.
>         (pass_forwprop::execute): Set PROP_last_full_fold in curr_properties
>         at the start if last_p.
>         * match.pd (a rrotate (32-b) -> a lrotate b): Only optimize either
>         if @2 is known not to be equal to prec or if during/after last
>         forwprop the subtraction has single use and prec is power of two; in
>         that case transform it into orotate by masked count.
>
>         * gcc.dg/tree-ssa/pr117927.c: New test.
>
> --- gcc/tree-pass.h.jj  2025-01-08 10:05:09.634203321 +0100
> +++ gcc/tree-pass.h     2025-01-08 11:22:56.356672128 +0100
> @@ -230,6 +230,7 @@ protected:
>  #define PROP_assumptions_done  (1 << 19)       /* Assume function kept
>                                                    around.  */
>  #define PROP_gimple_lbitint    (1 << 20)       /* lowered large _BitInt */
> +#define PROP_last_full_fold    (1 << 21)       /* Start of last forwprop.  */
>
>  #define PROP_gimple \
>    (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
> --- gcc/passes.def.jj   2025-01-02 11:23:21.976440640 +0100
> +++ gcc/passes.def      2025-01-08 11:24:47.865096546 +0100
> @@ -80,7 +80,7 @@ along with GCC; see the file COPYING3.
>           NEXT_PASS (pass_ccp, false /* nonzero_p */);
>           /* After CCP we rewrite no longer addressed locals into SSA
>              form if possible.  */
> -         NEXT_PASS (pass_forwprop);
> +         NEXT_PASS (pass_forwprop, /*last=*/false);
>            NEXT_PASS (pass_early_thread_jumps, /*first=*/true);
>           NEXT_PASS (pass_sra_early);
>           /* pass_build_ealias is a dummy pass that ensures that we
> @@ -214,7 +214,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_complete_unrolli);
>        NEXT_PASS (pass_backprop);
>        NEXT_PASS (pass_phiprop);
> -      NEXT_PASS (pass_forwprop);
> +      NEXT_PASS (pass_forwprop, /*last=*/false);
>        /* pass_build_alias is a dummy pass that ensures that we
>          execute TODO_rebuild_alias at this point.  */
>        NEXT_PASS (pass_build_alias);
> @@ -254,7 +254,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_isolate_erroneous_paths);
>        NEXT_PASS (pass_reassoc, true /* early_p */);
>        NEXT_PASS (pass_dce);
> -      NEXT_PASS (pass_forwprop);
> +      NEXT_PASS (pass_forwprop, /*last=*/false);
>        NEXT_PASS (pass_phiopt, false /* early_p */);
>        NEXT_PASS (pass_ccp, true /* nonzero_p */);
>        /* After CCP we rewrite no longer addressed locals into SSA
> @@ -356,7 +356,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_dce, true /* update_address_taken_p */, true /* remove_unused_locals */);
>        /* After late DCE we rewrite no longer addressed locals into SSA
>          form if possible.  */
> -      NEXT_PASS (pass_forwprop);
> +      NEXT_PASS (pass_forwprop, /*last=*/true);
>        NEXT_PASS (pass_sink_code, true /* unsplit edges */);
>        NEXT_PASS (pass_phiopt, false /* early_p */);
>        NEXT_PASS (pass_fold_builtins);

Since you set the last_p field to false by default below, you only need
to pass the parameter explicitly for the instance that needs a
non-default value.  But both ways are ok.
Also, there is a style difference here about whether the name of the
parameter goes before or after the value.  Both ways are currently used,
so this is something to clean up later on; I filed PR 118366 to remind
myself to fix this once stage1 opens up.

Otherwise LGTM.  Plus, PROP_last_full_fold was something I wanted to do
before but got distracted by other things.

Thanks,
Andrew

> --- gcc/tree-ssa-forwprop.cc.jj 2025-01-03 17:55:01.067219567 +0100
> +++ gcc/tree-ssa-forwprop.cc    2025-01-08 11:46:05.935096203 +0100
> @@ -4107,14 +4107,22 @@ class pass_forwprop : public gimple_opt_
>  {
>  public:
>    pass_forwprop (gcc::context *ctxt)
> -    : gimple_opt_pass (pass_data_forwprop, ctxt)
> +    : gimple_opt_pass (pass_data_forwprop, ctxt), last_p (false)
>    {}
>
>    /* opt_pass methods: */
>    opt_pass * clone () final override { return new pass_forwprop (m_ctxt); }
> +  void set_pass_param (unsigned int n, bool param) final override
> +    {
> +      gcc_assert (n == 0);
> +      last_p = param;
> +    }
>    bool gate (function *) final override { return flag_tree_forwprop; }
>    unsigned int execute (function *) final override;
>
> + private:
> +  /* Determines whether the pass instance should set PROP_last_full_fold.  */
> +  bool last_p;
>  }; // class pass_forwprop
>
>  unsigned int
> @@ -4123,6 +4131,8 @@ pass_forwprop::execute (function *fun)
>    unsigned int todoflags = 0;
>
>    cfg_changed = false;
> +  if (last_p)
> +    fun->curr_properties |= PROP_last_full_fold;
>
>    calculate_dominance_info (CDI_DOMINATORS);
>
> --- gcc/match.pd.jj     2025-01-08 10:05:24.515994525 +0100
> +++ gcc/match.pd        2025-01-08 11:30:24.822335431 +0100
> @@ -4950,14 +4950,32 @@ (define_operator_list SYNC_FETCH_AND_AND
>                             build_int_cst (TREE_TYPE (@1),
>                                            element_precision (type)), @1); }))
>
> -/* a rrotate (32-b) -> a lrotate b */
> -/* a lrotate (32-b) -> a rrotate b */
> +/* a rrotate (bitsize-b) -> a lrotate b */
> +/* a lrotate (bitsize-b) -> a rrotate b */
> +/* Only do the above when it is known that b != bitsize.
> +   Otherwise canonicalize to a orotate (b & mask) when the subtraction
> +   has single use and prec is a power of two.  */
>  (for rotate (lrotate rrotate)
>       orotate (rrotate lrotate)
>   (simplify
> -  (rotate @0 (minus INTEGER_CST@1 @2))
> -   (if (element_precision (TREE_TYPE (@0)) == wi::to_wide (@1))
> -     (orotate @0 @2))))
> +  (rotate @0 (minus@3 INTEGER_CST@1 @2))
> +  (with { auto prec = element_precision (TREE_TYPE (@0)); }
> +   (if (prec == wi::to_wide (@1))
> +    (switch
> +     (if (expr_not_equal_to (@2, wi::uhwi (prec,
> +                                          TYPE_PRECISION (TREE_TYPE (@2)))))
> +      (orotate @0 @2))
> +     (if (single_use (@3)
> +         && pow2p_hwi (prec)
> +         /* Defer this case until last forwprop, so that VRP could be run and
> +            expr_not_equal_to had a chance to match.  Otherwise we'd do
> +            pretty much always just the second case.  */
> +         && cfun
> +         && ((cfun->curr_properties & PROP_last_full_fold) != 0
> +             || !flag_tree_vrp
> +             || optimize_debug))
> +      (orotate @0
> +       (bit_and @2 { build_int_cst (TREE_TYPE (@2), prec - 1); }))))))))
>
>  /* Turn (a OP c1) OP c2 into a OP (c1+c2).  */
>  (for op (lrotate rrotate rshift lshift)
> --- gcc/testsuite/gcc.dg/tree-ssa/pr117927.c.jj 2025-01-08 11:22:04.773400985 +0100
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr117927.c    2025-01-08 11:22:04.773400985 +0100
> @@ -0,0 +1,97 @@
> +/* PR tree-optimization/117927 */
> +/* { dg-do compile { target { int32 && longlong64 } } } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { scan-tree-dump-times " r<< " 4 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " r>> " 4 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " & 31;" 2 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " & 63;" 2 "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "32 - " "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "64 - " "optimized" } } */
> +
> +static inline
> +unsigned lrotate32 (unsigned x, int t)
> +{
> +  unsigned tl = x << t;
> +  unsigned th = x >> (-t & 31);
> +  return tl | th;
> +}
> +
> +static inline
> +unsigned rrotate32 (unsigned x, int t)
> +{
> +  unsigned tl = x >> t;
> +  unsigned th = x << (-t & 31);
> +  return tl | th;
> +}
> +
> +static inline
> +unsigned long long lrotate64 (unsigned long long x, int t)
> +{
> +  unsigned long long tl = x << t;
> +  unsigned long long th = x >> (-t & 63);
> +  return tl | th;
> +}
> +
> +static inline
> +unsigned long long rrotate64 (unsigned long long x, int t)
> +{
> +  unsigned long long tl = x >> t;
> +  unsigned long long th = x << (-t & 63);
> +  return tl | th;
> +}
> +
> +unsigned
> +f1 (unsigned x, int t)
> +{
> +  return lrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f2 (unsigned long long x, int t)
> +{
> +  return lrotate64 (x, 64 - t);
> +}
> +
> +unsigned
> +f3 (unsigned x, int t)
> +{
> +  if (t == 32)
> +    __builtin_unreachable ();
> +  return lrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f4 (unsigned long long x, int t)
> +{
> +  if (t == 64)
> +    __builtin_unreachable ();
> +  return lrotate64 (x, 64 - t);
> +}
> +
> +unsigned
> +f5 (unsigned x, int t)
> +{
> +  return rrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f6 (unsigned long long x, int t)
> +{
> +  return rrotate64 (x, 64 - t);
> +}
> +
> +unsigned
> +f7 (unsigned x, int t)
> +{
> +  if (t == 32)
> +    __builtin_unreachable ();
> +  return rrotate32 (x, 32 - t);
> +}
> +
> +unsigned long long
> +f8 (unsigned long long x, int t)
> +{
> +  if (t == 64)
> +    __builtin_unreachable ();
> +  return rrotate64 (x, 64 - t);
> +}
>
>
>         Jakub
>
  

Patch

--- gcc/tree-pass.h.jj	2025-01-08 10:05:09.634203321 +0100
+++ gcc/tree-pass.h	2025-01-08 11:22:56.356672128 +0100
@@ -230,6 +230,7 @@  protected:
 #define PROP_assumptions_done	(1 << 19)	/* Assume function kept
 						   around.  */
 #define PROP_gimple_lbitint	(1 << 20)       /* lowered large _BitInt */
+#define PROP_last_full_fold	(1 << 21)	/* Start of last forwprop.  */
 
 #define PROP_gimple \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
--- gcc/passes.def.jj	2025-01-02 11:23:21.976440640 +0100
+++ gcc/passes.def	2025-01-08 11:24:47.865096546 +0100
@@ -80,7 +80,7 @@  along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_ccp, false /* nonzero_p */);
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
-	  NEXT_PASS (pass_forwprop);
+	  NEXT_PASS (pass_forwprop, /*last=*/false);
           NEXT_PASS (pass_early_thread_jumps, /*first=*/true);
 	  NEXT_PASS (pass_sra_early);
 	  /* pass_build_ealias is a dummy pass that ensures that we
@@ -214,7 +214,7 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_complete_unrolli);
       NEXT_PASS (pass_backprop);
       NEXT_PASS (pass_phiprop);
-      NEXT_PASS (pass_forwprop);
+      NEXT_PASS (pass_forwprop, /*last=*/false);
       /* pass_build_alias is a dummy pass that ensures that we
 	 execute TODO_rebuild_alias at this point.  */
       NEXT_PASS (pass_build_alias);
@@ -254,7 +254,7 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_isolate_erroneous_paths);
       NEXT_PASS (pass_reassoc, true /* early_p */);
       NEXT_PASS (pass_dce);
-      NEXT_PASS (pass_forwprop);
+      NEXT_PASS (pass_forwprop, /*last=*/false);
       NEXT_PASS (pass_phiopt, false /* early_p */);
       NEXT_PASS (pass_ccp, true /* nonzero_p */);
       /* After CCP we rewrite no longer addressed locals into SSA
@@ -356,7 +356,7 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_dce, true /* update_address_taken_p */, true /* remove_unused_locals */);
       /* After late DCE we rewrite no longer addressed locals into SSA
 	 form if possible.  */
-      NEXT_PASS (pass_forwprop);
+      NEXT_PASS (pass_forwprop, /*last=*/true);
       NEXT_PASS (pass_sink_code, true /* unsplit edges */);
       NEXT_PASS (pass_phiopt, false /* early_p */);
       NEXT_PASS (pass_fold_builtins);
--- gcc/tree-ssa-forwprop.cc.jj	2025-01-03 17:55:01.067219567 +0100
+++ gcc/tree-ssa-forwprop.cc	2025-01-08 11:46:05.935096203 +0100
@@ -4107,14 +4107,22 @@  class pass_forwprop : public gimple_opt_
 {
 public:
   pass_forwprop (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_forwprop, ctxt)
+    : gimple_opt_pass (pass_data_forwprop, ctxt), last_p (false)
   {}
 
   /* opt_pass methods: */
   opt_pass * clone () final override { return new pass_forwprop (m_ctxt); }
+  void set_pass_param (unsigned int n, bool param) final override
+    {
+      gcc_assert (n == 0);
+      last_p = param;
+    }
   bool gate (function *) final override { return flag_tree_forwprop; }
   unsigned int execute (function *) final override;
 
+ private:
+  /* Determines whether the pass instance should set PROP_last_full_fold.  */
+  bool last_p;
 }; // class pass_forwprop
 
 unsigned int
@@ -4123,6 +4131,8 @@  pass_forwprop::execute (function *fun)
   unsigned int todoflags = 0;
 
   cfg_changed = false;
+  if (last_p)
+    fun->curr_properties |= PROP_last_full_fold;
 
   calculate_dominance_info (CDI_DOMINATORS);
 
--- gcc/match.pd.jj	2025-01-08 10:05:24.515994525 +0100
+++ gcc/match.pd	2025-01-08 11:30:24.822335431 +0100
@@ -4950,14 +4950,32 @@  (define_operator_list SYNC_FETCH_AND_AND
 			    build_int_cst (TREE_TYPE (@1),
 					   element_precision (type)), @1); }))
 
-/* a rrotate (32-b) -> a lrotate b */
-/* a lrotate (32-b) -> a rrotate b */
+/* a rrotate (bitsize-b) -> a lrotate b */
+/* a lrotate (bitsize-b) -> a rrotate b */
+/* Only do the above when it is known that b != bitsize.
+   Otherwise canonicalize to a orotate (b & mask) when the subtraction
+   has single use and prec is a power of two.  */
 (for rotate (lrotate rrotate)
      orotate (rrotate lrotate)
  (simplify
-  (rotate @0 (minus INTEGER_CST@1 @2))
-   (if (element_precision (TREE_TYPE (@0)) == wi::to_wide (@1))
-     (orotate @0 @2))))
+  (rotate @0 (minus@3 INTEGER_CST@1 @2))
+  (with { auto prec = element_precision (TREE_TYPE (@0)); }
+   (if (prec == wi::to_wide (@1))
+    (switch
+     (if (expr_not_equal_to (@2, wi::uhwi (prec,
+					   TYPE_PRECISION (TREE_TYPE (@2)))))
+      (orotate @0 @2))
+     (if (single_use (@3)
+	  && pow2p_hwi (prec)
+	  /* Defer this case until last forwprop, so that VRP could be run and
+	     expr_not_equal_to had a chance to match.  Otherwise we'd do
+	     pretty much always just the second case.  */
+	  && cfun
+	  && ((cfun->curr_properties & PROP_last_full_fold) != 0
+	      || !flag_tree_vrp
+	      || optimize_debug))
+      (orotate @0
+	(bit_and @2 { build_int_cst (TREE_TYPE (@2), prec - 1); }))))))))
 
 /* Turn (a OP c1) OP c2 into a OP (c1+c2).  */
 (for op (lrotate rrotate rshift lshift)
--- gcc/testsuite/gcc.dg/tree-ssa/pr117927.c.jj	2025-01-08 11:22:04.773400985 +0100
+++ gcc/testsuite/gcc.dg/tree-ssa/pr117927.c	2025-01-08 11:22:04.773400985 +0100
@@ -0,0 +1,97 @@ 
+/* PR tree-optimization/117927 */
+/* { dg-do compile { target { int32 && longlong64 } } } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-times " r<< " 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " r>> " 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " & 31;" 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " & 63;" 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-not "32 - " "optimized" } } */
+/* { dg-final { scan-tree-dump-not "64 - " "optimized" } } */
+
+static inline
+unsigned lrotate32 (unsigned x, int t)
+{
+  unsigned tl = x << t;
+  unsigned th = x >> (-t & 31);
+  return tl | th;
+}
+
+static inline
+unsigned rrotate32 (unsigned x, int t)
+{
+  unsigned tl = x >> t;
+  unsigned th = x << (-t & 31);
+  return tl | th;
+}
+
+static inline
+unsigned long long lrotate64 (unsigned long long x, int t)
+{
+  unsigned long long tl = x << t;
+  unsigned long long th = x >> (-t & 63);
+  return tl | th;
+}
+
+static inline
+unsigned long long rrotate64 (unsigned long long x, int t)
+{
+  unsigned long long tl = x >> t;
+  unsigned long long th = x << (-t & 63);
+  return tl | th;
+}
+
+unsigned
+f1 (unsigned x, int t)
+{
+  return lrotate32 (x, 32 - t);
+}
+
+unsigned long long
+f2 (unsigned long long x, int t)
+{
+  return lrotate64 (x, 64 - t);
+}
+
+unsigned
+f3 (unsigned x, int t)
+{
+  if (t == 32)
+    __builtin_unreachable ();
+  return lrotate32 (x, 32 - t);
+}
+
+unsigned long long
+f4 (unsigned long long x, int t)
+{
+  if (t == 64)
+    __builtin_unreachable ();
+  return lrotate64 (x, 64 - t);
+}
+
+unsigned
+f5 (unsigned x, int t)
+{
+  return rrotate32 (x, 32 - t);
+}
+
+unsigned long long
+f6 (unsigned long long x, int t)
+{
+  return rrotate64 (x, 64 - t);
+}
+
+unsigned
+f7 (unsigned x, int t)
+{
+  if (t == 32)
+    __builtin_unreachable ();
+  return rrotate32 (x, 32 - t);
+}
+
+unsigned long long
+f8 (unsigned long long x, int t)
+{
+  if (t == 64)
+    __builtin_unreachable ();
+  return rrotate64 (x, 64 - t);
+}