cselib: For CALL_INSNs to const/pure fns invalidate memory below sp [PR117239]

Message ID Z6HLHwhxFVuIInCY@tucnak
State New
Series cselib: For CALL_INSNs to const/pure fns invalidate memory below sp [PR117239]

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed

Commit Message

Jakub Jelinek Feb. 4, 2025, 8:09 a.m. UTC
  Hi!

The following testcase is miscompiled on x86_64 during postreload.
After reload (with IPA-RA figuring out the calls don't modify any
registers but %rax for return value) postreload sees
(insn 14 12 15 2 (set (mem:DI (plus:DI (reg/f:DI 7 sp)
                (const_int 16 [0x10])) [0  S8 A64])
        (reg:DI 1 dx [orig:105 q+16 ] [105])) "pr117239.c":18:7 95 {*movdi_internal}
     (nil))
(call_insn/i 15 14 16 2 (set (reg:SI 0 ax)
        (call (mem:QI (symbol_ref:DI ("baz") [flags 0x3]  <function_decl 0x7ffb2e2bdf00 r>) [0 baz S1 A8])
            (const_int 24 [0x18]))) "pr117239.c":18:7 1476 {*call_value}
     (expr_list:REG_CALL_DECL (symbol_ref:DI ("baz") [flags 0x3]  <function_decl 0x7ffb2e2bdf00 baz>)
        (expr_list:REG_EH_REGION (const_int 0 [0])
            (nil)))
    (nil))
(insn 16 15 18 2 (parallel [
            (set (reg/f:DI 7 sp)
                (plus:DI (reg/f:DI 7 sp)
                    (const_int 24 [0x18])))
            (clobber (reg:CC 17 flags))
        ]) "pr117239.c":18:7 285 {*adddi_1}
     (expr_list:REG_ARGS_SIZE (const_int 0 [0])
        (nil)))
...
(call_insn/i 19 18 21 2 (set (reg:SI 0 ax)
        (call (mem:QI (symbol_ref:DI ("foo") [flags 0x3]  <function_decl 0x7ffb2e2bdb00 l>) [0 foo S1 A8])
            (const_int 0 [0]))) "pr117239.c":19:3 1476 {*call_value}
     (expr_list:REG_CALL_DECL (symbol_ref:DI ("foo") [flags 0x3]  <function_decl 0x7ffb2e2bdb00 foo>)
        (expr_list:REG_EH_REGION (const_int 0 [0])
            (nil)))
    (nil))
(insn 21 19 26 2 (parallel [
            (set (reg/f:DI 7 sp)
                (plus:DI (reg/f:DI 7 sp)
                    (const_int -24 [0xffffffffffffffe8])))
            (clobber (reg:CC 17 flags))
        ]) "pr117239.c":19:3 discrim 1 285 {*adddi_1}
     (expr_list:REG_ARGS_SIZE (const_int 24 [0x18])
        (nil)))
(insn 26 21 24 2 (set (mem:DI (plus:DI (reg/f:DI 7 sp)
                (const_int 16 [0x10])) [0  S8 A64])
        (reg:DI 1 dx [orig:105 q+16 ] [105])) "pr117239.c":19:3 discrim 1 95 {*movdi_internal}
     (nil))
i.e.
        movq    %rdx, 16(%rsp)
        call    baz
        addq    $24, %rsp
...
        call    foo
        subq    $24, %rsp
        movq    %rdx, 16(%rsp)
Now, postreload uses cselib and cselib remembered that the %rdx value has
been stored into 16(%rsp).  Both baz and foo are pure calls.  If they
weren't, when processing those CALL_INSNs cselib would invalidate all MEMs
      if (RTL_LOOPING_CONST_OR_PURE_CALL_P (insn)
          || !(RTL_CONST_OR_PURE_CALL_P (insn)))
        cselib_invalidate_mem (callmem);
where callmem is (mem:BLK (scratch)).  But they are pure, so instead the
code just invalidates the argument slots from CALL_INSN_FUNCTION_USAGE.
The calls actually clobber more than that: even const/pure calls clobber
all memory below the stack pointer.  And that is something that hasn't been
invalidated.  In this failing testcase, the call to baz is not a big deal,
as we don't have anything remembered in memory below %rsp at that call.
But then we increment %rsp by 24, so %rsp+16 is now 8 bytes below the stack
pointer, and do the call to foo.  And that call now actually, not just in
theory, clobbers the memory below the stack pointer (in particular
overwrites it with the return address).  But cselib does not invalidate it.
Then %rsp is decremented again (in preparation for another call, to bar)
and cselib processes the store of %rdx (which IPA-RA says has not been
modified by either the baz or foo call) to %rsp + 16; it sees the memory
already has that value, so the store seems useless and is removed.
But it is not useless: the call to foo has changed the memory, so it needs
to be stored again.
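
To spell the failure out on the assembly above (annotations are mine, not
part of the dump):
        movq    %rdx, 16(%rsp)  # cselib records 16(%rsp) == %rdx
        call    baz             # pure, memory below %rsp not invalidated
        addq    $24, %rsp       # the slot is now 8 bytes below the new %rsp
...
        call    foo             # pushes the return address over that slot
        subq    $24, %rsp       # 16(%rsp) names the clobbered slot again
        movq    %rdx, 16(%rsp)  # seems redundant to cselib, wrongly removed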

The following patch adds targeted invalidation of memory below the stack
pointer (or on SPARC memory below stack pointer + 2047 when stack bias is
used, or on PA memory above the stack pointer instead).
Now, memory below the stack pointer is special: except for functions using
alloca/VLAs I believe no addressable memory should be there, it should be
purely the outgoing function argument area, and if we take the address of
some automatic variable, it should live the whole time above the outgoing
function argument area.  So on top of just trying to flush memory below the
stack pointer (represented by %rsp - PTRDIFF_MAX with PTRDIFF_MAX size on
most arches), the patch tries to optimize and only invalidates memory whose
address is clearly derived from the stack pointer (memory with other bases
is not invalidated), and if we can prove it is above the current stack
pointer (we see the same SP_DERIVED_VALUE_P base in both VALUEs), it also
doesn't call canon_anti_dependence, which might just give up in certain
cases.
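
For illustration, on a 64-bit STACK_GROWS_DOWNWARD target without stack
bias the new callmem[1] is roughly (MEM attributes elided):
        (mem:BLK (plus:DI (reg/f:DI 7 sp)
                          (const_int -0x7fffffffffffffff)))
with MEM_SIZE 0x7fffffffffffffff, i.e. it covers everything from
%rsp - PTRDIFF_MAX up to, but not including, the current %rsp.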

I've gathered statistics from x86_64-linux and i686-linux
bootstraps/regtests.  During -m64 compilations from those, there were
3718396 + 42634 + 27761 cases of processing MEMs in cselib_invalidate_mem
(callmem[1]) calls; the first number is the number of MEMs not invalidated
because of the optimization, i.e.
+             if (sp_derived_base == NULL_RTX)
+               {
+                 has_mem = true;
+                 num_mems++;
+                 p = &(*p)->next;
+                 continue;
+               }
in the patch, the second number is the number of MEMs not invalidated
because canon_anti_dependence returned false, and the last number is the
number of MEMs actually invalidated (i.e. what hadn't been invalidated
before this patch).  During -m32 compilations the numbers were
1422412 + 39354 + 16509 with the same meaning.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Note, when there is no red zone, in theory even the sp = sp + incr
instruction invalidates memory below the new stack pointer, as a signal can
arrive and overwrite the memory.  So maybe we should be invalidating
something at those instructions as well.  But in leaf functions we
certainly can have even addressable automatic vars in the red zone (which
would make it harder to distinguish); on the other hand, we aren't normally
storing anything below the red zone, and in non-leaf functions it should
normally be just the outgoing arguments area.

2025-02-04  Jakub Jelinek  <jakub@redhat.com>

	PR rtl-optimization/117239
	* cselib.cc (callmem): Change type from rtx to rtx[2].
	(cselib_preserve_only_values): Use callmem[0] rather than callmem.
	(cselib_invalidate_mem): Optimize and don't try to invalidate
	for the mem_rtx == callmem[1] case MEMs which clearly can't be
	below the stack pointer.
	(cselib_process_insn): Likewise.  For const/pure calls also
	call cselib_invalidate_mem (callmem[1]).
	(cselib_init): Initialize callmem[0] rather than callmem and also
	initialize callmem[1].

	* gcc.dg/pr117239.c: New test.


	Jakub
  

Comments

Richard Biener Feb. 4, 2025, 10:53 a.m. UTC | #1
On Tue, 4 Feb 2025, Jakub Jelinek wrote:

> [... full patch description and ChangeLog quoted above snipped ...]
> --- gcc/cselib.cc.jj	2025-02-01 00:46:53.073275225 +0100
> +++ gcc/cselib.cc	2025-02-03 13:16:16.381772989 +0100
> @@ -248,8 +248,9 @@ static unsigned int *used_regs;
>  static unsigned int n_used_regs;
>  
>  /* We pass this to cselib_invalidate_mem to invalidate all of
> -   memory for a non-const call instruction.  */
> -static GTY(()) rtx callmem;
> +   memory for a non-const call instruction and memory below stack pointer
> +   for const/pure calls.  */
> +static GTY(()) rtx callmem[2];
>  
>  /* Set by discard_useless_locs if it deleted the last location of any
>     value.  */
> @@ -808,7 +809,7 @@ cselib_preserve_only_values (void)
>    for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
>      cselib_invalidate_regno (i, reg_raw_mode[i]);
>  
> -  cselib_invalidate_mem (callmem);
> +  cselib_invalidate_mem (callmem[0]);
>  
>    remove_useless_values ();
>  
> @@ -2644,6 +2645,97 @@ cselib_invalidate_mem (rtx mem_rtx)
>  	      p = &(*p)->next;
>  	      continue;
>  	    }
> +
> +	  /* When invalidating memory below the stack pointer for const/pure
> +	     calls and alloca/VLAs aren't used, attempt to optimize.  Values
> +	     stored into area below the stack pointer shouldn't be addressable
> +	     and should be stored just through stack pointer derived
> +	     expressions, so don't invalidate MEMs not using stack derived
> +	     addresses, or if the MEMs clearly aren't below the stack
> +	     pointer.  */
> +	  if (mem_rtx == callmem[1]
> +	      && num_mems < param_max_cselib_memory_locations
> +	      && GET_CODE (XEXP (x, 0)) == VALUE
> +	      && !cfun->calls_alloca)
> +	    {
> +	      cselib_val *v2 = CSELIB_VAL_PTR (XEXP (x, 0));
> +	      rtx sp_derived_base = NULL_RTX;
> +	      HOST_WIDE_INT sp_derived_off = 0;
> +	      if (SP_DERIVED_VALUE_P (v2->val_rtx))
> +		sp_derived_base = v2->val_rtx;
> +	      else
> +		for (struct elt_loc_list *l = v2->locs; l; l = l->next)
> +		  if (GET_CODE (l->loc) == PLUS
> +		      && GET_CODE (XEXP (l->loc, 0)) == VALUE
> +		      && SP_DERIVED_VALUE_P (XEXP (l->loc, 0))
> +		      && CONST_INT_P (XEXP (l->loc, 1)))
> +		    {
> +		      sp_derived_base = XEXP (l->loc, 0);
> +		      sp_derived_off = INTVAL (XEXP (l->loc, 1));
> +		      break;
> +		    }
> +	      if (sp_derived_base)
> +		if (cselib_val *v3
> +		    = cselib_lookup_1 (stack_pointer_rtx, Pmode, 0, VOIDmode))
> +		  {
> +		    bool ok = false;
> +		    HOST_WIDE_INT off = 0;
> +		    if (v3->val_rtx == sp_derived_base)
> +		      ok = true;
> +		    else
> +		      for (struct elt_loc_list *l = v3->locs; l; l = l->next)

isn't this invariant on the whole workset?  You are looking up
the CSELIB val for stack_pointer_rtx and traversing its locations.
It looks like you want to know whether there's an offsetted stack
location from sp_derived_base (a VALUE).  There might be multiple
ones, but you stop at the first?  But still use that random offset?

> +			if (GET_CODE (l->loc) == PLUS
> +			    && XEXP (l->loc, 0) == sp_derived_base
> +			    && CONST_INT_P (XEXP (l->loc, 1)))
> +			  {

so what did we actually do here?  This needs a comment, esp.
'off' vs. 'sp_derived_off'.

> +			    ok = true;
> +			    off = INTVAL (XEXP (l->loc, 1));
> +			    break;
> +			  }
> +		    if (ok)
> +		      {
> +			if (STACK_GROWS_DOWNWARD)
> +			  {
> +#ifdef STACK_ADDRESS_OFFSET
> +			    /* On SPARC take stack pointer bias into account as
> +			       well.  */
> +			    off
> +			      += (STACK_ADDRESS_OFFSET
> +				  - FIRST_PARM_OFFSET (current_function_decl));
> +#endif
> +			    if (sp_derived_off >= off)
> +			      /* x is at or above the current stack pointer,
> +				 no need to invalidate it.  */
> +			      sp_derived_base = NULL_RTX;
> +			  }
> +			else
> +			  {
> +			    HOST_WIDE_INT sz;
> +			    enum machine_mode mode = GET_MODE (x);
> +			    if ((MEM_SIZE_KNOWN_P (x)
> +				 && MEM_SIZE (x).is_constant (&sz))
> +				|| (mode != BLKmode
> +				    && GET_MODE_SIZE (mode).is_constant (&sz)))
> +			      if (sp_derived_off < off
> +				  && ((HOST_WIDE_INT)
> +				      ((unsigned HOST_WIDE_INT) sp_derived_off
> +				       + sz) <= off))
> +				/* x's end is below or at the current stack
> +				   pointer in !STACK_GROWS_DOWNWARD target,
> +				   no need to invalidate it.  */
> +				sp_derived_base = NULL_RTX;
> +			  }
> +		      }
> +		  }
> +	      if (sp_derived_base == NULL_RTX)

So, this seems to take SP_DERIVED_VALUE_P conservative in the wrong way?
It seems CSELIB sets this if it is sure the value is SP derived but
for correctness reasons you have to assume a value is SP derived if
you can't prove it is not?  Or is your argument that no such proof
is necessary because the stack slot would have its address taken and
would then necessarily be not a spill slot?  This is all about spill
slots, right?  IIRC we do have spill slot MEMs specially marked
with MEM_EXPR equal to get_spill_slot_decl, there's also may_be_sp_based_p
using the somewhat broken RTL alias analysis (and there's
static_reg_base_value with the various stack area kinds, not sure if
helpful here).

> +		{
> +		  has_mem = true;
> +		  num_mems++;
> +		  p = &(*p)->next;
> +		  continue;
> +		}
> +	    }
> [... remainder of the quoted patch snipped; see the Patch section below ...]
  
Jakub Jelinek Feb. 4, 2025, 11:59 a.m. UTC | #2
On Tue, Feb 04, 2025 at 11:53:21AM +0100, Richard Biener wrote:
> > +	      if (sp_derived_base)
> > +		if (cselib_val *v3
> > +		    = cselib_lookup_1 (stack_pointer_rtx, Pmode, 0, VOIDmode))
> > +		  {
> > +		    bool ok = false;
> > +		    HOST_WIDE_INT off = 0;
> > +		    if (v3->val_rtx == sp_derived_base)
> > +		      ok = true;
> > +		    else
> > +		      for (struct elt_loc_list *l = v3->locs; l; l = l->next)
> 
> isn't this invariant on the whole workset?  You are looking up
> the CSELIB val for stack_pointer_rtx and traversing it's locations.

The way it is written it isn't invariant, but you're right it might be a
good idea to tweak it so that it is invariant and just compute it lazily the
first time we need it during the single cselib_invalidate_mem call
(with SP_DERIVED_VALUE_P tests instead of == sp_derived_base) and then just
compare that base against sp_derived_base (perhaps with better names for the
2 pairs of variables, VALUE and corresponding offset).

> It looks like you want to know whether there's an offsetted stack
> location from sp_derived_base (a VALUE).  There might be multiple
> ones, but you stop at the first?  But still use that random offset?

There shouldn't be multiple ones by construction.
The first time we try to look up stack_pointer_rtx, we create a VALUE
with SP_DERIVED_VALUE_P.  If stack_pointer_rtx or (plus:P (sp) (const_int N))
is looked up later, with/without any stack_pointer_rtx adjustments by
constant in between, we get either that SP_DERIVED_VALUE_P VALUE or a VALUE
with that (plus:P SP_DERIVED_VALUE_P (const_int N)) as one of its locs.
This was done in r10-7665 to make sure we don't construct arbitrarily deep
chains of VALUEs when sp keeps changing a lot.
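
As a sketch of that representation (VALUE numbers are made up), assuming V1
is the SP_DERIVED_VALUE_P VALUE created when sp is first looked up:
  sp                    -> V1                     (SP_DERIVED_VALUE_P)
  sp + 16               -> V2, loc (plus:DI V1 (const_int 16))
  sp after addq $24     -> V3, loc (plus:DI V1 (const_int 24))
so every sp derived address hangs directly off V1 with a CONST_INT offset
instead of forming a chain through the intermediate VALUEs.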

The point of the patch is that (unless proven otherwise) I believe these
stores, sometimes above and sometimes below the stack pointer (with the
exception of the epilogue freeing the area completely), are solely for
outgoing argument slots, and that normally one should be able to track
those as stack pointer related (even on ia64 and the likes), so they should
have the SP_DERIVED_VALUE_P or PLUS SP_DERIVED_VALUE_P CONST_INT somewhere.

The code only attempts to optimize based on offsets if both the MEM's
address and the stack_pointer_rtx VALUE are derived from the same
SP_DERIVED_VALUE_P VALUE (the likely case; to get a different such VALUE I
think one either needs to not be in var-tracking and reset the table, but
then even the MEMs are invalidated, or be in var-tracking across the sp ->
fp setting), and in that case those aren't random offsets, they are 2
offsets from the same base.  So x_addr is known to be
sp_derived_base + sp_derived_off, and the current sp is known to be
sp_derived_base + off.  So one can just compare the offsets to know what is
below the stack pointer, or compare the offsets together with x's mem_size
for the PA case.
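
As a worked example on the insns above: say the SP_DERIVED_VALUE_P VALUE V
is sp after the addq $24, %rsp.  The slot written by the earlier
movq %rdx, 16(%rsp) then has x_addr == V + (-8), i.e. sp_derived_off == -8,
while at the call to foo the current sp is V + 0, i.e. off == 0.  Since
sp_derived_off < off, the MEM is below the stack pointer and does get
invalidated.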

> > +	      if (sp_derived_base == NULL_RTX)
> 
> So, this seems to take SP_DERIVED_VALUE_P conservative in the wrong way?
> It seems CSELIB sets this if it is sure the value is SP derived but
> for correctness reasons you have to assume a value is SP derived if
> you can't prove it is not?

I don't want to cause significant performance and debugging quality
regressions without a proof that it is really needed otherwise.  Don't
remember any PRs about this until this one, so it can't be that common.
If I needed to differentiate between any stack pointer based MEMs vs.
something clearly guaranteed to be outside of the current stack frame
(+ perhaps the incoming arguments area or the like), then sure, I need to be
conservative, any addressable automatic var in the current stack frame can
have its address taken, go through some pointer arithmetics and escape
somewhere etc.
But I can't imagine how addresses of outgoing argument slots could be
escaping somewhere or not have the stack pointer visible in the address
computation.
If one passes C++ non-PODs as arguments/return values, those are passed
by invisible reference, the objects are actually normal automatic VAR_DECLs,
so those aren't in the outgoing stack arguments area.  If I use say large
POD aggregate argument passed on the stack, even if what I pass there is
address taken and perhaps self-referential, when it is copied into the
stack slot it still shouldn't refer to addresses within the outgoing
argument slot, there is just a bitwise copy.
Even in the case where we, say, use memcpy to do the bitwise copy into the
outgoing argument slot and therefore need to have the address loaded in
an argument, I think cselib should see it was stack pointer based.  And if
say CSE sees the same is used in another memcpy, it still should see the
original was sp derived.
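
A trivial shape of that situation (hypothetical example, not from the PR):

struct A { long a[9]; } g;	/* too large to be passed in registers */
void callee (struct A);

void
caller (void)
{
  /* g is bitwise copied into the outgoing argument area; no address
     within that slot escapes anywhere.  */
  callee (g);
}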

>  Or is your argument that no such proof
> is necessary because the stack slot would have its address taken and
> would then necessarily be not a spill slot?  This is all about spill
> slots, right?  IIRC we do have spill slot MEMs specially marked
> with MEM_EXPR equal to get_spill_slot_decl, there's also may_be_sp_based_p
> using the somewhat broken RTL alias analysis (and there's
> static_reg_base_value with the various stack area kinds, not sure if
> helpful here).

Stack spills again go above the outgoing argument area, so should be in the
current frame from the prologue to the epilogue, not affected by volatility
of sp decrements and increments for particular calls when not really
accumulating outgoing args.

I'm a little bit worried about aarch64 SVE and RISC-V, dunno what happens
on poly-int sized outgoing argument pushes because right now the
SP_DERIVED_VALUE_P handling stuff in cselib.cc is about CONST_INT offsets.
But then, both those targets are unconditional ACCUMULATE_OUTGOING_ARGS.

And the testcase for the PR isn't miscompiled with additional
-maccumulate-outgoing-args or say -mtune=intel which implies that.
So, perhaps I could restrict that cselib_invalidate_mem (callmem[1]);
to if (!ACCUMULATE_OUTGOING_ARGS).

	Jakub
  
Richard Biener Feb. 4, 2025, 12:11 p.m. UTC | #3
On Tue, 4 Feb 2025, Jakub Jelinek wrote:

> On Tue, Feb 04, 2025 at 11:53:21AM +0100, Richard Biener wrote:
> [... context snipped ...]
> > > +	      if (sp_derived_base == NULL_RTX)
> > 
> > So, this seems to take SP_DERIVED_VALUE_P conservative in the wrong way?
> > It seems CSELIB sets this if it is sure the value is SP derived but
> > for correctness reasons you have to assume a value is SP derived if
> > you can't prove it is not?
> 
> I don't want to cause significant performance and debugging quality
> regressions without a proof that it is really needed otherwise.  Don't
> remember any PRs about this until this one, so it can't be that common.
> If I needed to differentiate between any stack pointer based MEMs vs.
> something clearly guaranteed to be outside of the current stack frame
> (+ perhaps the incoming arguments area or the like), then sure, I need to be
> conservative, any addressable automatic var in the current stack frame can
> have its address taken, go through some pointer arithmetics and escape
> somewhere etc.
> But I can't imagine how addresses of outgoing argument slots could be
> escaping somewhere or not have the stack pointer visible in the address
> computation.
> If one passes C++ non-PODs as arguments/return values, those are passed
> by invisible reference, the objects are actually normal automatic VAR_DECLs,
> so those aren't in the outgoing stack arguments area.  If I use say large
> POD aggregate argument passed on the stack, even if what I pass there is
> address taken and perhaps self-referential, when it is copied into the
> stack slot it still shouldn't refer to addresses within the outgoing
> argument slot, there is just a bitwise copy.
> Even in the case where we, say, use memcpy to do the bitwise copy into the
> outgoing argument slot and therefore need to have the address loaded in
> an argument, I think cselib should see it was stack pointer based.  And if
> say CSE sees the same is used in another memcpy, it still should see the
> original was sp derived.

OK, fair enough.  I think there should be a comment indicating this
isn't a full conservative approach but handling a certain class of
accesses only, with the hope that this is all that's actually needed.

> >  Or is your argument that no such proof
> > is necessary because the stack slot would have its address taken and
> > would then necessarily be not a spill slot?  This is all about spill
> > slots, right?  IIRC we do have spill slot MEMs specially marked
> > with MEM_EXPR equal to get_spill_slot_decl, there's also may_be_sp_based_p
> > using the somewhat broken RTL alias analysis (and there's
> > static_reg_base_value with the various stack area kinds, not sure if
> > helpful here).
> 
> Stack spills again go above the outgoing argument area, so should be in the
> current frame from the prologue to the epilogue, not affected by volatility
> of sp decrements and increments for particular calls when not really
> accumulating outgoing args.
> 
> I'm a little bit worried about aarch64 SVE and RISC-V, dunno what happens
> on poly-int sized outgoing argument pushes because right now the
> SP_DERIVED_VALUE_P handling stuff in cselib.cc is about CONST_INT offsets.
> But then, both those targets are unconditional ACCUMULATE_OUTGOING_ARGS.
> 
> And the testcase for the PR isn't miscompiled with additional
> -maccumulate-outgoing-args or say -mtune=intel which implies that.
> So, perhaps I could restrict that cselib_invalidate_mem (callmem[1]);
> to if (!ACCUMULATE_OUTGOING_ARGS).

I guess that then makes sense.

Richard.

> 
> 	Jakub
> 
>
  

Patch

--- gcc/cselib.cc.jj	2025-02-01 00:46:53.073275225 +0100
+++ gcc/cselib.cc	2025-02-03 13:16:16.381772989 +0100
@@ -248,8 +248,9 @@  static unsigned int *used_regs;
 static unsigned int n_used_regs;
 
 /* We pass this to cselib_invalidate_mem to invalidate all of
-   memory for a non-const call instruction.  */
-static GTY(()) rtx callmem;
+   memory for a non-const call instruction and memory below stack pointer
+   for const/pure calls.  */
+static GTY(()) rtx callmem[2];
 
 /* Set by discard_useless_locs if it deleted the last location of any
    value.  */
@@ -808,7 +809,7 @@  cselib_preserve_only_values (void)
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     cselib_invalidate_regno (i, reg_raw_mode[i]);
 
-  cselib_invalidate_mem (callmem);
+  cselib_invalidate_mem (callmem[0]);
 
   remove_useless_values ();
 
@@ -2644,6 +2645,97 @@  cselib_invalidate_mem (rtx mem_rtx)
 	      p = &(*p)->next;
 	      continue;
 	    }
+
+	  /* When invalidating memory below the stack pointer for const/pure
+	     calls and alloca/VLAs aren't used, attempt to optimize.  Values
+	     stored into area below the stack pointer shouldn't be addressable
+	     and should be stored just through stack pointer derived
+	     expressions, so don't invalidate MEMs not using stack derived
+	     addresses, or if the MEMs clearly aren't below the stack
+	     pointer.  */
+	  if (mem_rtx == callmem[1]
+	      && num_mems < param_max_cselib_memory_locations
+	      && GET_CODE (XEXP (x, 0)) == VALUE
+	      && !cfun->calls_alloca)
+	    {
+	      cselib_val *v2 = CSELIB_VAL_PTR (XEXP (x, 0));
+	      rtx sp_derived_base = NULL_RTX;
+	      HOST_WIDE_INT sp_derived_off = 0;
+	      if (SP_DERIVED_VALUE_P (v2->val_rtx))
+		sp_derived_base = v2->val_rtx;
+	      else
+		for (struct elt_loc_list *l = v2->locs; l; l = l->next)
+		  if (GET_CODE (l->loc) == PLUS
+		      && GET_CODE (XEXP (l->loc, 0)) == VALUE
+		      && SP_DERIVED_VALUE_P (XEXP (l->loc, 0))
+		      && CONST_INT_P (XEXP (l->loc, 1)))
+		    {
+		      sp_derived_base = XEXP (l->loc, 0);
+		      sp_derived_off = INTVAL (XEXP (l->loc, 1));
+		      break;
+		    }
+	      if (sp_derived_base)
+		if (cselib_val *v3
+		    = cselib_lookup_1 (stack_pointer_rtx, Pmode, 0, VOIDmode))
+		  {
+		    bool ok = false;
+		    HOST_WIDE_INT off = 0;
+		    if (v3->val_rtx == sp_derived_base)
+		      ok = true;
+		    else
+		      for (struct elt_loc_list *l = v3->locs; l; l = l->next)
+			if (GET_CODE (l->loc) == PLUS
+			    && XEXP (l->loc, 0) == sp_derived_base
+			    && CONST_INT_P (XEXP (l->loc, 1)))
+			  {
+			    ok = true;
+			    off = INTVAL (XEXP (l->loc, 1));
+			    break;
+			  }
+		    if (ok)
+		      {
+			if (STACK_GROWS_DOWNWARD)
+			  {
+#ifdef STACK_ADDRESS_OFFSET
+			    /* On SPARC take stack pointer bias into account as
+			       well.  */
+			    off
+			      += (STACK_ADDRESS_OFFSET
+				  - FIRST_PARM_OFFSET (current_function_decl));
+#endif
+			    if (sp_derived_off >= off)
+			      /* x is at or above the current stack pointer,
+				 no need to invalidate it.  */
+			      sp_derived_base = NULL_RTX;
+			  }
+			else
+			  {
+			    HOST_WIDE_INT sz;
+			    enum machine_mode mode = GET_MODE (x);
+			    if ((MEM_SIZE_KNOWN_P (x)
+				 && MEM_SIZE (x).is_constant (&sz))
+				|| (mode != BLKmode
+				    && GET_MODE_SIZE (mode).is_constant (&sz)))
+			      if (sp_derived_off < off
+				  && ((HOST_WIDE_INT)
+				      ((unsigned HOST_WIDE_INT) sp_derived_off
+				       + sz) <= off))
+				/* x's end is below or at the current stack
+				   pointer in !STACK_GROWS_DOWNWARD target,
+				   no need to invalidate it.  */
+				sp_derived_base = NULL_RTX;
+			  }
+		      }
+		  }
+	      if (sp_derived_base == NULL_RTX)
+		{
+		  has_mem = true;
+		  num_mems++;
+		  p = &(*p)->next;
+		  continue;
+		}
+	    }
+
 	  if (num_mems < param_max_cselib_memory_locations
 	      && ! canon_anti_dependence (x, false, mem_rtx,
 					  GET_MODE (mem_rtx), mem_addr))
@@ -3196,14 +3288,20 @@  cselib_process_insn (rtx_insn *insn)
 	 as if they were regular functions.  */
       if (RTL_LOOPING_CONST_OR_PURE_CALL_P (insn)
 	  || !(RTL_CONST_OR_PURE_CALL_P (insn)))
-	cselib_invalidate_mem (callmem);
+	cselib_invalidate_mem (callmem[0]);
       else
-	/* For const/pure calls, invalidate any argument slots because
-	   they are owned by the callee.  */
-	for (x = CALL_INSN_FUNCTION_USAGE (insn); x; x = XEXP (x, 1))
-	  if (GET_CODE (XEXP (x, 0)) == USE
-	      && MEM_P (XEXP (XEXP (x, 0), 0)))
-	    cselib_invalidate_mem (XEXP (XEXP (x, 0), 0));
+	{
+	  /* For const/pure calls, invalidate any argument slots because
+	     they are owned by the callee.  */
+	  for (x = CALL_INSN_FUNCTION_USAGE (insn); x; x = XEXP (x, 1))
+	    if (GET_CODE (XEXP (x, 0)) == USE
+		&& MEM_P (XEXP (XEXP (x, 0), 0)))
+	      cselib_invalidate_mem (XEXP (XEXP (x, 0), 0));
+	  /* And invalidate memory below the stack (or above for
+	     !STACK_GROWS_DOWNWARD), as even const/pure call can invalidate
+	     that.  */
+	  cselib_invalidate_mem (callmem[1]);
+	}
     }
 
   cselib_record_sets (insn);
@@ -3256,8 +3354,27 @@  cselib_init (int record_what)
 
   /* (mem:BLK (scratch)) is a special mechanism to conflict with everything,
      see canon_true_dependence.  This is only created once.  */
-  if (! callmem)
-    callmem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode));
+  if (! callmem[0])
+    {
+      callmem[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode));
+      /* Similarly create a MEM representing roughly everything below
+	 the stack for STACK_GROWS_DOWNWARD targets or everything above
+	 it otherwise.  */
+      if (STACK_GROWS_DOWNWARD)
+	{
+	  unsigned HOST_WIDE_INT off = -(GET_MODE_MASK (Pmode) >> 1);
+#ifdef STACK_ADDRESS_OFFSET
+	  /* On SPARC take stack pointer bias into account as well.  */
+	  off += (STACK_ADDRESS_OFFSET
+		  - FIRST_PARM_OFFSET (current_function_decl));
+#endif
+	  callmem[1] = plus_constant (Pmode, stack_pointer_rtx, off);
+	}
+      else
+	callmem[1] = stack_pointer_rtx;
+      callmem[1] = gen_rtx_MEM (BLKmode, callmem[1]);
+      set_mem_size (callmem[1], GET_MODE_MASK (Pmode) >> 1);
+    }
 
   cselib_nregs = max_reg_num ();
 
--- gcc/testsuite/gcc.dg/pr117239.c.jj	2025-02-03 11:13:59.399159640 +0100
+++ gcc/testsuite/gcc.dg/pr117239.c	2025-02-03 11:13:59.399159640 +0100
@@ -0,0 +1,42 @@ 
+/* PR rtl-optimization/117239 */
+/* { dg-do run } */
+/* { dg-options "-fno-inline -O2" } */
+/* { dg-additional-options "-fschedule-insns" { target i?86-*-* x86_64-*-* } } */
+
+int a, b, c = 1, d;
+
+int
+foo (void)
+{
+  return a;
+}
+
+struct A {
+  int e, f, g, h;
+  short i;
+  int j;
+};
+
+void
+bar (int x, struct A y)
+{
+  if (y.j == 1)
+    c = 0;
+}
+
+int
+baz (struct A x)
+{
+  return b;
+}
+
+int
+main ()
+{
+  struct A k = { 0, 0, 0, 0, 0, 1 };
+  d = baz (k);
+  bar (foo (), k);
+  if (c != 0)
+    __builtin_abort ();
+  return 0;
+}