Fix alignment of stack slots for overaligned types [PR103500]

Message ID 20211130164824.mmgllowvv5nsczf2@arm.com

Commit Message

Alex Coplan Nov. 30, 2021, 4:48 p.m. UTC
Hi,

This fixes PR103500 by ensuring that stack slots for
passed-by-reference overaligned types are appropriately aligned. For
the testcase:

typedef struct __attribute__((aligned(32))) {
  long x,y;
} S;
S x;
void f(S);
void g(void) { f(x); }

on AArch64, we currently generate (at -O2):

g:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        stp     x29, x30, [sp, -48]!
        mov     x29, sp
        ldp     q0, q1, [x1]
        add     x0, sp, 16
        stp     q0, q1, [sp, 16]
        bl      f
        ldp     x29, x30, [sp], 48
        ret

so the stack slot for the passed-by-reference copy of the structure is
at sp + 16, but sp is only guaranteed to be 16-byte aligned, so the
structure is only 16-byte aligned. The PCS requires the structure to be
32-byte aligned. After this patch, we generate:

g:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        stp     x29, x30, [sp, -64]!
        mov     x29, sp
        add     x0, sp, 47
        ldp     q0, q1, [x1]
        and     x0, x0, -32
        stp     q0, q1, [x0]
        bl      f
        ldp     x29, x30, [sp], 64
        ret

i.e. we ensure 32-byte alignment for the struct.

The approach taken here is similar to that in
function.c:assign_parm_setup_block where it handles the case for
DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
similar to the approach taken in cfgexpand.c:expand_stack_vars (where
the function calls get_dynamic_stack_size) which is the code that
handles the alignment for overaligned structures as addressable local
variables (see the related case discussed in the PR).

This patch also updates the aapcs64 test mentioned in the PR to
prevent the frontend from folding away the alignment check. I've
confirmed that the execution test fails on aarch64-linux-gnu before the
patch is applied and passes afterwards.

Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
arm-linux-gnueabihf: no regressions.

I'd appreciate any feedback. Is it OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

	PR middle-end/103500
	* function.c (get_stack_local_alignment): Align BLKmode overaligned
	types to the alignment required by the type.
	(assign_stack_temp_for_type): Handle BLKmode overaligned stack
	slots by allocating a larger-than-necessary buffer and aligning
	the address within appropriately.

gcc/testsuite/ChangeLog:

	PR middle-end/103500
	* gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
	Prevent the frontend from folding our alignment check away by
	using snprintf to store the pointer into a string and recovering
	it with sscanf.
  

Comments

Florian Weimer Nov. 30, 2021, 6:38 p.m. UTC | #1
* Alex Coplan via Gcc-patches:

> Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> arm-linux-gnueabihf: no regressions.
>
> I'd appreciate any feedback. Is it OK for trunk?

Does this need an ABI warning?

Thanks,
Florian
  
Alex Coplan Dec. 1, 2021, 10:35 a.m. UTC | #2
On 30/11/2021 19:38, Florian Weimer wrote:
> * Alex Coplan via Gcc-patches:
> 
> > Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> > arm-linux-gnueabihf: no regressions.
> >
> > I'd appreciate any feedback. Is it OK for trunk?
> 
> Does this need an ABI warning?

That's a good question. I'm not sure exactly what changes merit an ABI
warning in GCC.

In this case, we aren't changing how the parameter is passed, in the
sense that the callee can still locate it in the same way (by following
the pointer in the appropriate GPR); we're simply increasing the
alignment of the stack slot used for the indirected copy.

So on this basis I would guess that it doesn't need a warning, but I'm
happy to be corrected.

Thanks,
Alex

> 
> Thanks,
> Florian
>
  
Alex Coplan Dec. 14, 2021, 10:01 a.m. UTC | #3
Hi,

I'd just like to ping this for review:

https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585785.html

Thanks,
Alex

  
Richard Sandiford Dec. 20, 2021, 1:19 p.m. UTC | #4
Alex Coplan via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi,
>
> This fixes PR103500 i.e. ensuring that stack slots for
> passed-by-reference overaligned types are appropriately aligned. For the
> testcase:
>
> typedef struct __attribute__((aligned(32))) {
>   long x,y;
> } S;
> S x;
> void f(S);
> void g(void) { f(x); }
>
> on AArch64, we currently generate (at -O2):
>
> g:
>         adrp    x1, .LANCHOR0
>         add     x1, x1, :lo12:.LANCHOR0
>         stp     x29, x30, [sp, -48]!
>         mov     x29, sp
>         ldp     q0, q1, [x1]
>         add     x0, sp, 16
>         stp     q0, q1, [sp, 16]
>         bl      f
>         ldp     x29, x30, [sp], 48
>         ret
>
> so the stack slot for the passed-by-reference copy of the structure is
> at sp + 16, and the sp is only guaranteed to be 16-byte aligned, so the
> structure is only 16-byte aligned. The PCS requires the structure to be
> 32-byte aligned. After this patch, we generate:
>
> g:
>         adrp    x1, .LANCHOR0
>         add     x1, x1, :lo12:.LANCHOR0
>         stp     x29, x30, [sp, -64]!
>         mov     x29, sp
>         add     x0, sp, 47
>         ldp     q0, q1, [x1]
>         and     x0, x0, -32
>         stp     q0, q1, [x0]
>         bl      f
>         ldp     x29, x30, [sp], 64
>         ret
>
> i.e. we ensure 32-byte alignment for the struct.
>
> The approach taken here is similar to that in
> function.c:assign_parm_setup_block where it handles the case for
> DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
> similar to the approach taken in cfgexpand.c:expand_stack_vars (where
> the function calls get_dynamic_stack_size) which is the code that
> handles the alignment for overaligned structures as addressable local
> variables (see the related case discussed in the PR).

A difference with the latter is that cfgexpand (AFAICT) always
honours the DECL/TYPE_ALIGN, with LOCAL_DECL_ALIGNMENT supposedly
only increasing the alignment for efficiency reasons (rather than
decreasing it).

So…

> This patch also updates the aapcs64 test mentioned in the PR to avoid
> the frontend folding away the alignment check. I've confirmed that the
> execution test actually fails on aarch64-linux-gnu prior to the patch
> being applied and passes afterwards.
>
> Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> arm-linux-gnueabihf: no regressions.
>
> I'd appreciate any feedback. Is it OK for trunk?
>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
> 	PR middle-end/103500
> 	* function.c (get_stack_local_alignment): Align BLKmode overaligned
> 	types to the alignment required by the type.
> 	(assign_stack_temp_for_type): Handle BLKmode overaligned stack
> 	slots by allocating a larger-than-necessary buffer and aligning
> 	the address within appropriately.
>
> gcc/testsuite/ChangeLog:
>
> 	PR middle-end/103500
> 	* gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
> 	Prevent the frontend from folding our alignment check away by
> 	using snprintf to store the pointer into a string and recovering
> 	it with sscanf.
>
> diff --git a/gcc/function.c b/gcc/function.c
> index 61b3bd036b8..5ed722ab959 100644
> --- a/gcc/function.c
> +++ b/gcc/function.c
> @@ -278,7 +278,9 @@ get_stack_local_alignment (tree type, machine_mode mode)
>    unsigned int alignment;
>  
>    if (mode == BLKmode)
> -    alignment = BIGGEST_ALIGNMENT;
> +    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
> +      ? TYPE_ALIGN (type)
> +      : BIGGEST_ALIGNMENT;

…I'm not sure about this calculation.  Why do we only honour TYPE_ALIGN
if it's greater than MAX_SUPPORTED_STACK_ALIGNMENT, and fall all the
way back to BIGGEST_ALIGNMENT otherwise?  It looks like on nvptx
this would have the effect of honouring (say) 2048-byte alignment,
but not 32-byte alignment (which falls between BIGGEST_ALIGNMENT
and MAX_SUPPORTED_STACK_ALIGNMENT).

Also, what about the non-BLKmode case?  A type's mode cannot be more
aligned than the type, but a type is allowed to be more aligned than
its mode.  E.g.:

typedef unsigned int V __attribute__((vector_size(32)));
typedef struct __attribute__((aligned(32))) { V x; } S;
S x;
void f(S);
void g(void) { f(x); }

doesn't seem to be handled by the patch.  It's possible to create
variations on this on aarch64 up to 2048 bits using SVE and
-msve-vector-bits=N.

Perhaps we should treat the old alignment calculation as a minimum
and take the max of that and TYPE_ALIGN, where TYPE_ALIGN is given.

>    else
>      alignment = GET_MODE_ALIGNMENT (mode);
>  
> @@ -872,21 +874,35 @@ assign_stack_temp_for_type (machine_mode mode, poly_int64 size, tree type)
>  
>        p = ggc_alloc<temp_slot> ();
>  
> -      /* We are passing an explicit alignment request to assign_stack_local.
> -	 One side effect of that is assign_stack_local will not round SIZE
> -	 to ensure the frame offset remains suitably aligned.
> -
> -	 So for requests which depended on the rounding of SIZE, we go ahead
> -	 and round it now.  We also make sure ALIGNMENT is at least
> -	 BIGGEST_ALIGNMENT.  */
> -      gcc_assert (mode != BLKmode || align == BIGGEST_ALIGNMENT);
> -      p->slot = assign_stack_local_1 (mode,
> -				      (mode == BLKmode
> -				       ? aligned_upper_bound (size,
> -							      (int) align
> -							      / BITS_PER_UNIT)
> -				       : size),
> -				      align, 0);
> +      if (mode == BLKmode && align > MAX_SUPPORTED_STACK_ALIGNMENT)
> +	{
> +	  rtx allocsize = gen_int_mode (size, Pmode);
> +	  get_dynamic_stack_size (&allocsize, 0, align, NULL);
> +	  gcc_assert (CONST_INT_P (allocsize));
> +	  size = UINTVAL (allocsize);

Since the size coming in is a poly_int64, I think we should treat
the result in the same way, using rtx_to_poly_int64 rather than
requiring a CONST_INT.  We might not handle that case properly yet,
but rtx_to_poly_int64 would trap mistakes, and using it now would
mean fewer surprises later.

Thanks,
Richard

> +	  p->slot = assign_stack_local_1 (mode,
> +					  size,
> +					  BIGGEST_ALIGNMENT, 0);
> +	  rtx addr = align_dynamic_address (XEXP (p->slot, 0), align);
> +	  mark_reg_pointer (addr, align);
> +	  p->slot = gen_rtx_MEM (GET_MODE (p->slot), addr);
> +	  MEM_NOTRAP_P (p->slot) = 1;
> +	}
> +      else
> +	/* We are passing an explicit alignment request to assign_stack_local.
> +	   One side effect of that is assign_stack_local will not round SIZE
> +	   to ensure the frame offset remains suitably aligned.
> +
> +	   So for requests which depended on the rounding of SIZE, we go ahead
> +	   and round it now.  We also make sure ALIGNMENT is at least
> +	   BIGGEST_ALIGNMENT.  */
> +	p->slot = assign_stack_local_1 (mode,
> +					(mode == BLKmode
> +					 ? aligned_upper_bound (size,
> +								(int) align
> +								/ BITS_PER_UNIT)
> +					 : size),
> +					align, 0);
>  
>        p->align = align;
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> index c9351802191..0b0ac0b9b97 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> @@ -21,10 +21,18 @@ test_pass_by_ref (int x0, overaligned x1, int x2)
>      abort ();
>    if (memcmp ((void *) &x1, (void *)&a, sizeof (overaligned)))
>      abort ();
> -  long addr = ((long) &x1) & 31;
> +
> +  char buf[64];
> +  void *ptr;
> +
> +  __builtin_snprintf (buf, sizeof(buf), "%p", &x1);
> +  if (__builtin_sscanf(buf, "%p", &ptr) != 1)
> +    abort ();
> +
> +  long addr = ((long) ptr) & 31;
>    if (addr != 0)
>      {
> -      __builtin_printf ("Alignment was %d\n", addr);
> +      __builtin_printf ("Alignment was %ld\n", addr);
>        abort ();
>      }
>  }
  
Alex Coplan Jan. 7, 2022, 3:19 p.m. UTC | #5
Hi Richard,

Thanks for the review.

On 20/12/2021 13:19, Richard Sandiford wrote:
> Alex Coplan via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi,
> >
> > This fixes PR103500 i.e. ensuring that stack slots for
> > passed-by-reference overaligned types are appropriately aligned. For the
> > testcase:
> >
> > typedef struct __attribute__((aligned(32))) {
> >   long x,y;
> > } S;
> > S x;
> > void f(S);
> > void g(void) { f(x); }
> >
> > on AArch64, we currently generate (at -O2):
> >
> > g:
> >         adrp    x1, .LANCHOR0
> >         add     x1, x1, :lo12:.LANCHOR0
> >         stp     x29, x30, [sp, -48]!
> >         mov     x29, sp
> >         ldp     q0, q1, [x1]
> >         add     x0, sp, 16
> >         stp     q0, q1, [sp, 16]
> >         bl      f
> >         ldp     x29, x30, [sp], 48
> >         ret
> >
> > so the stack slot for the passed-by-reference copy of the structure is
> > at sp + 16, and the sp is only guaranteed to be 16-byte aligned, so the
> > structure is only 16-byte aligned. The PCS requires the structure to be
> > 32-byte aligned. After this patch, we generate:
> >
> > g:
> >         adrp    x1, .LANCHOR0
> >         add     x1, x1, :lo12:.LANCHOR0
> >         stp     x29, x30, [sp, -64]!
> >         mov     x29, sp
> >         add     x0, sp, 47
> >         ldp     q0, q1, [x1]
> >         and     x0, x0, -32
> >         stp     q0, q1, [x0]
> >         bl      f
> >         ldp     x29, x30, [sp], 64
> >         ret
> >
> > i.e. we ensure 32-byte alignment for the struct.
> >
> > The approach taken here is similar to that in
> > function.c:assign_parm_setup_block where it handles the case for
> > DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
> > similar to the approach taken in cfgexpand.c:expand_stack_vars (where
> > the function calls get_dynamic_stack_size) which is the code that
> > handles the alignment for overaligned structures as addressable local
> > variables (see the related case discussed in the PR).
> 
> A difference with the latter is that cfgexpand (AFAICT) always
> honours the DECL/TYPE_ALIGN, with LOCAL_DECL_ALIGNMENT supposedly
> only increasing the alignment for efficiency reasons (rather than
> decreasing it).
> 
> So…
> 
> > This patch also updates the aapcs64 test mentioned in the PR to avoid
> > the frontend folding away the alignment check. I've confirmed that the
> > execution test actually fails on aarch64-linux-gnu prior to the patch
> > being applied and passes afterwards.
> >
> > Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> > arm-linux-gnueabihf: no regressions.
> >
> > I'd appreciate any feedback. Is it OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> > 	PR middle-end/103500
> > 	* function.c (get_stack_local_alignment): Align BLKmode overaligned
> > 	types to the alignment required by the type.
> > 	(assign_stack_temp_for_type): Handle BLKmode overaligned stack
> > 	slots by allocating a larger-than-necessary buffer and aligning
> > 	the address within appropriately.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR middle-end/103500
> > 	* gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
> > 	Prevent the frontend from folding our alignment check away by
> > 	using snprintf to store the pointer into a string and recovering
> > 	it with sscanf.
> >
> > diff --git a/gcc/function.c b/gcc/function.c
> > index 61b3bd036b8..5ed722ab959 100644
> > --- a/gcc/function.c
> > +++ b/gcc/function.c
> > @@ -278,7 +278,9 @@ get_stack_local_alignment (tree type, machine_mode mode)
> >    unsigned int alignment;
> >  
> >    if (mode == BLKmode)
> > -    alignment = BIGGEST_ALIGNMENT;
> > +    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
> > +      ? TYPE_ALIGN (type)
> > +      : BIGGEST_ALIGNMENT;
> 
> …I'm not sure about this calculation.  Why do we only honour TYPE_ALIGN
> if it's greater than MAX_SUPPORTED_STACK_ALIGNMENT, and fall all the
> way back to BIGGEST_ALIGNMENT otherwise?  It looks like on nvptx
> this would have the effect of honouring (say) 2048-byte alignment,
> but not 32-byte alignment (which falls between BIGGEST_ALIGNMENT
> and MAX_SUPPORTED_STACK_ALIGNMENT).

So, to be honest, this was a bit of a bodge to try and work around
issues on x86. My original attempt at solving the PR used the more
obvious calculation you suggest below, i.e. the max of BIGGEST_ALIGNMENT
and TYPE_ALIGN (type). The problem with that is, on x86,
MAX_SUPPORTED_STACK_ALIGNMENT has a huge value (2^31).
explow.c:get_dynamic_stack_size has:

  if (size_align % MAX_SUPPORTED_STACK_ALIGNMENT != 0)
    {
      size = round_push (size);
      [...]
    }
  *psize = size;

so on x86 we inevitably end up calling round_push, which in turn goes
down the else branch:

  else
    {
      /* If crtl->preferred_stack_boundary might still grow, use
	 virtual_preferred_stack_boundary_rtx instead.  This will be
	 substituted by the right value in vregs pass and optimized
	 during combine.  */
      align_rtx = virtual_preferred_stack_boundary_rtx;
      alignm1_rtx = force_operand (plus_constant (Pmode, align_rtx, -1),
				   NULL_RTX);
    }

and we end up returning a size expression involving
virtual_preferred_stack_boundary_rtx, which clearly cannot be a
CONST_INT (required for the approach in the patch to work: we need a
constant to pass to assign_stack_local_1).

So the idea with that calculation was to only use the new, larger
alignment, if it was larger than MAX_SUPPORTED_STACK_ALIGNMENT (which
should never be the case on x86 for any reasonable value of TYPE_ALIGN).
Otherwise, it would fall back to the old behaviour of just using
BIGGEST_ALIGNMENT.

I realise that this is a rather inelegant hack, but I wasn't sure how
else to approach the problem.

> 
> Also, what about the non-BLKmode case?  A type's mode cannot be more
> aligned than the type, but a type is allowed to be more aligned than
> its mode.  E.g.:

Yes, we should probably try to handle the non-BLKmode case as well. I
was just trying to avoid changing too much at once, for fear of
introducing regressions.

> 
> typedef unsigned int V __attribute__((vector_size(32)));
> typedef struct __attribute__((aligned(32))) { V x; } S;
> S x;
> void f(S);
> void g(void) { f(x); }
> 
> doesn't seem to be handled by the patch.  It's possible to create
> variations on this on aarch64 up to 2048 bits using SVE and
> -msve-vector-bits=N.
> 
> Perhaps we should treat the old alignment calculation as a minimum
> and take the max of that and TYPE_ALIGN, where TYPE_ALIGN is given.

As discussed above, this doesn't work on x86, unfortunately. If you have
any alternative suggestions as to how to approach the problem, it would
be great to hear them.

Thanks,
Alex

  
H.J. Lu Jan. 7, 2022, 3:24 p.m. UTC | #6
On Fri, Jan 7, 2022 at 7:20 AM Alex Coplan via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi Richard,
>
> Thanks for the review.
>
> On 20/12/2021 13:19, Richard Sandiford wrote:
> > Alex Coplan via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > Hi,
> > >
> > > This fixes PR103500 i.e. ensuring that stack slots for
> > > passed-by-reference overaligned types are appropriately aligned. For the
> > > testcase:
> > >
> > > typedef struct __attribute__((aligned(32))) {
> > >   long x,y;
> > > } S;
> > > S x;
> > > void f(S);
> > > void g(void) { f(x); }
> > >
> > > on AArch64, we currently generate (at -O2):
> > >
> > > g:
> > >         adrp    x1, .LANCHOR0
> > >         add     x1, x1, :lo12:.LANCHOR0
> > >         stp     x29, x30, [sp, -48]!
> > >         mov     x29, sp
> > >         ldp     q0, q1, [x1]
> > >         add     x0, sp, 16
> > >         stp     q0, q1, [sp, 16]
> > >         bl      f
> > >         ldp     x29, x30, [sp], 48
> > >         ret
> > >
> > > so the stack slot for the passed-by-reference copy of the structure is
> > > at sp + 16, and the sp is only guaranteed to be 16-byte aligned, so the
> > > structure is only 16-byte aligned. The PCS requires the structure to be
> > > 32-byte aligned. After this patch, we generate:
> > >
> > > g:
> > >         adrp    x1, .LANCHOR0
> > >         add     x1, x1, :lo12:.LANCHOR0
> > >         stp     x29, x30, [sp, -64]!
> > >         mov     x29, sp
> > >         add     x0, sp, 47
> > >         ldp     q0, q1, [x1]
> > >         and     x0, x0, -32
> > >         stp     q0, q1, [x0]
> > >         bl      f
> > >         ldp     x29, x30, [sp], 64
> > >         ret
> > >
> > > i.e. we ensure 32-byte alignment for the struct.
> > >
> > > The approach taken here is similar to that in
> > > function.c:assign_parm_setup_block where it handles the case for
> > > DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
> > > similar to the approach taken in cfgexpand.c:expand_stack_vars (where
> > > the function calls get_dynamic_stack_size) which is the code that
> > > handles the alignment for overaligned structures as addressable local
> > > variables (see the related case discussed in the PR).
> >
> > A difference with the latter is that cfgexpand (AFAICT) always
> > honours the DECL/TYPE_ALIGN, with LOCAL_DECL_ALIGNMENT supposedly
> > only increasing the alignment for efficiency reasons (rather than
> > decreasing it).
> >
> > So…
> >
> > > This patch also updates the aapcs64 test mentioned in the PR to avoid
> > > the frontend folding away the alignment check. I've confirmed that the
> > > execution test actually fails on aarch64-linux-gnu prior to the patch
> > > being applied and passes afterwards.
> > >
> > > Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> > > arm-linux-gnueabihf: no regressions.
> > >
> > > I'd appreciate any feedback. Is it OK for trunk?
> > >
> > > Thanks,
> > > Alex
> > >
> > > gcc/ChangeLog:
> > >
> > >     PR middle-end/103500
> > >     * function.c (get_stack_local_alignment): Align BLKmode overaligned
> > >     types to the alignment required by the type.
> > >     (assign_stack_temp_for_type): Handle BLKmode overaligned stack
> > >     slots by allocating a larger-than-necessary buffer and aligning
> > >     the address within appropriately.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >     PR middle-end/103500
> > >     * gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
> > >     Prevent the frontend from folding our alignment check away by
> > >     using snprintf to store the pointer into a string and recovering
> > >     it with sscanf.
> > >
> > > diff --git a/gcc/function.c b/gcc/function.c
> > > index 61b3bd036b8..5ed722ab959 100644
> > > --- a/gcc/function.c
> > > +++ b/gcc/function.c
> > > @@ -278,7 +278,9 @@ get_stack_local_alignment (tree type, machine_mode mode)
> > >    unsigned int alignment;
> > >
> > >    if (mode == BLKmode)
> > > -    alignment = BIGGEST_ALIGNMENT;
> > > +    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
> > > +      ? TYPE_ALIGN (type)
> > > +      : BIGGEST_ALIGNMENT;
> >
> > …I'm not sure about this calculation.  Why do we only honour TYPE_ALIGN
> > if it's greater than MAX_SUPPORTED_STACK_ALIGNMENT, and fall all the
> > way back to BIGGEST_ALIGNMENT otherwise?  It looks like on nvptx
> > this would have the effect of honouring (say) 2048-byte alignment,
> > but not 32-byte alignment (which falls between BIGGEST_ALIGNMENT
> > and MAX_SUPPORTED_STACK_ALIGNMENT).
>
> So, to be honest, this was a bit of a bodge to try and work around
> issues on x86. My original attempt at solving the PR used the more
> obvious calculation you suggest below, i.e. the max of BIGGEST_ALIGNMENT
> and TYPE_ALIGN (type). The problem with that is, on x86,
> MAX_SUPPORTED_STACK_ALIGNMENT has a huge value (2^31).

On x86, the stack alignment limit is just the limit of the stack itself:
you can align the stack to any value as long as your stack allows it.

> explow.c:get_dynamic_stack_size has:
>
>   if (size_align % MAX_SUPPORTED_STACK_ALIGNMENT != 0)
>     {
>       size = round_push (size);
>       [...]
>     }
>   *psize = size;
>
> so inevitably we end up calling round_push on x86 which in turn ends up
> going down the else branch:
>
>   else
>     {
>       /* If crtl->preferred_stack_boundary might still grow, use
>          virtual_preferred_stack_boundary_rtx instead.  This will be
>          substituted by the right value in vregs pass and optimized
>          during combine.  */
>       align_rtx = virtual_preferred_stack_boundary_rtx;
>       alignm1_rtx = force_operand (plus_constant (Pmode, align_rtx, -1),
>                                    NULL_RTX);
>     }
>
> and we end up returning a size expression involving
> virtual_preferred_stack_boundary_rtx, which clearly cannot be a
> CONST_INT (required for the approach in the patch to work: we need a
> constant to pass to assign_stack_local_1).
>
> So the idea with that calculation was to only use the new, larger
> alignment, if it was larger than MAX_SUPPORTED_STACK_ALIGNMENT (which
> should never be the case on x86 for any reasonable value of TYPE_ALIGN).
> Otherwise, it would fall back to the old behaviour of just using
> BIGGEST_ALIGNMENT.
>
> I realise that this is a rather inelegant hack, but I wasn't sure how
> else to approach the problem.
>
> >
> > Also, what about the non-BLKmode case?  A type's mode cannot be more
> > aligned than the type, but a type is allowed to be more aligned than
> > its mode.  E.g.:
>
> Yes, we should probably try and handle the non-BLKmode case as well. I
> was just trying to avoid changing too much at once at the risk of
> introducing regressions.
>
> >
> > typedef unsigned int V __attribute__((vector_size(32)));
> > typedef struct __attribute__((aligned(32))) { V x; } S;
> > S x;
> > void f(S);
> > void g(void) { f(x); }
> >
> > doesn't seem to be handled by the patch.  It's possible to create
> > variations on this on aarch64 up to 2048 bits using SVE and
> > -msve-vector-bits=N.
> >
> > Perhaps we should treat the old alignment calculation as a minimum
> > and take the max of that and TYPE_ALIGN, where TYPE_ALIGN is given.
>
> As discussed above, this doesn't work on x86, unfortunately. If you have
> any alternative suggestions as to how to approach the problem, it would
> be great to hear them.
>
> Thanks,
> Alex
>
> >
> > >    else
> > >      alignment = GET_MODE_ALIGNMENT (mode);
> > >
> > > @@ -872,21 +874,35 @@ assign_stack_temp_for_type (machine_mode mode, poly_int64 size, tree type)
> > >
> > >        p = ggc_alloc<temp_slot> ();
> > >
> > > -      /* We are passing an explicit alignment request to assign_stack_local.
> > > -    One side effect of that is assign_stack_local will not round SIZE
> > > -    to ensure the frame offset remains suitably aligned.
> > > -
> > > -    So for requests which depended on the rounding of SIZE, we go ahead
> > > -    and round it now.  We also make sure ALIGNMENT is at least
> > > -    BIGGEST_ALIGNMENT.  */
> > > -      gcc_assert (mode != BLKmode || align == BIGGEST_ALIGNMENT);
> > > -      p->slot = assign_stack_local_1 (mode,
> > > -                                 (mode == BLKmode
> > > -                                  ? aligned_upper_bound (size,
> > > -                                                         (int) align
> > > -                                                         / BITS_PER_UNIT)
> > > -                                  : size),
> > > -                                 align, 0);
> > > +      if (mode == BLKmode && align > MAX_SUPPORTED_STACK_ALIGNMENT)
> > > +   {
> > > +     rtx allocsize = gen_int_mode (size, Pmode);
> > > +     get_dynamic_stack_size (&allocsize, 0, align, NULL);
> > > +     gcc_assert (CONST_INT_P (allocsize));
> > > +     size = UINTVAL (allocsize);
> >
> > Since the size coming in is a poly_int64, I think we should treat
> > the result in the same way, using rtx_to_poly_int64 rather than
> > requiring a CONST_INT.  We might not handle that case properly yet,
> > but rtx_to_poly_int64 would trap mistakes, and using it now would
> > avoid surprises later.
> >
> > Thanks,
> > Richard
> >
> > > +     p->slot = assign_stack_local_1 (mode,
> > > +                                     size,
> > > +                                     BIGGEST_ALIGNMENT, 0);
> > > +     rtx addr = align_dynamic_address (XEXP (p->slot, 0), align);
> > > +     mark_reg_pointer (addr, align);
> > > +     p->slot = gen_rtx_MEM (GET_MODE (p->slot), addr);
> > > +     MEM_NOTRAP_P (p->slot) = 1;
> > > +   }
> > > +      else
> > > +   /* We are passing an explicit alignment request to assign_stack_local.
> > > +      One side effect of that is assign_stack_local will not round SIZE
> > > +      to ensure the frame offset remains suitably aligned.
> > > +
> > > +      So for requests which depended on the rounding of SIZE, we go ahead
> > > +      and round it now.  We also make sure ALIGNMENT is at least
> > > +      BIGGEST_ALIGNMENT.  */
> > > +   p->slot = assign_stack_local_1 (mode,
> > > +                                   (mode == BLKmode
> > > +                                    ? aligned_upper_bound (size,
> > > +                                                           (int) align
> > > +                                                           / BITS_PER_UNIT)
> > > +                                    : size),
> > > +                                   align, 0);
> > >
> > >        p->align = align;
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > > index c9351802191..0b0ac0b9b97 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > > @@ -21,10 +21,18 @@ test_pass_by_ref (int x0, overaligned x1, int x2)
> > >      abort ();
> > >    if (memcmp ((void *) &x1, (void *)&a, sizeof (overaligned)))
> > >      abort ();
> > > -  long addr = ((long) &x1) & 31;
> > > +
> > > +  char buf[64];
> > > +  void *ptr;
> > > +
> > > +  __builtin_snprintf (buf, sizeof(buf), "%p", &x1);
> > > +  if (__builtin_sscanf(buf, "%p", &ptr) != 1)
> > > +    abort ();
> > > +
> > > +  long addr = ((long) ptr) & 31;
> > >    if (addr != 0)
> > >      {
> > > -      __builtin_printf ("Alignment was %d\n", addr);
> > > +      __builtin_printf ("Alignment was %ld\n", addr);
> > >        abort ();
> > >      }
> > >  }
  
Richard Sandiford Jan. 20, 2022, 5:36 p.m. UTC | #7
Sorry for the slow response.

Alex Coplan <alex.coplan@arm.com> writes:
> On 20/12/2021 13:19, Richard Sandiford wrote:
>> Alex Coplan via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Hi,
>> >
>> > This fixes PR103500 i.e. ensuring that stack slots for
>> > passed-by-reference overaligned types are appropriately aligned. For the
>> > testcase:
>> >
>> > typedef struct __attribute__((aligned(32))) {
>> >   long x,y;
>> > } S;
>> > S x;
>> > void f(S);
>> > void g(void) { f(x); }
>> >
>> > on AArch64, we currently generate (at -O2):
>> >
>> > g:
>> >         adrp    x1, .LANCHOR0
>> >         add     x1, x1, :lo12:.LANCHOR0
>> >         stp     x29, x30, [sp, -48]!
>> >         mov     x29, sp
>> >         ldp     q0, q1, [x1]
>> >         add     x0, sp, 16
>> >         stp     q0, q1, [sp, 16]
>> >         bl      f
>> >         ldp     x29, x30, [sp], 48
>> >         ret
>> >
>> > so the stack slot for the passed-by-reference copy of the structure is
>> > at sp + 16, and the sp is only guaranteed to be 16-byte aligned, so the
>> > structure is only 16-byte aligned. The PCS requires the structure to be
>> > 32-byte aligned. After this patch, we generate:
>> >
>> > g:
>> >         adrp    x1, .LANCHOR0
>> >         add     x1, x1, :lo12:.LANCHOR0
>> >         stp     x29, x30, [sp, -64]!
>> >         mov     x29, sp
>> >         add     x0, sp, 47
>> >         ldp     q0, q1, [x1]
>> >         and     x0, x0, -32
>> >         stp     q0, q1, [x0]
>> >         bl      f
>> >         ldp     x29, x30, [sp], 64
>> >         ret
>> >
>> > i.e. we ensure 32-byte alignment for the struct.
>> >
>> > The approach taken here is similar to that in
>> > function.c:assign_parm_setup_block where it handles the case for
>> > DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
>> > similar to the approach taken in cfgexpand.c:expand_stack_vars (where
>> > the function calls get_dynamic_stack_size) which is the code that
>> > handles the alignment for overaligned structures as addressable local
>> > variables (see the related case discussed in the PR).
>> 
>> A difference with the latter is that cfgexpand (AFAICT) always
>> honours the DECL/TYPE_ALIGN, with LOCAL_DECL_ALIGNMENT supposedly
>> only increasing the alignment for efficiency reasons (rather than
>> decreasing it).
>> 
>> So…
>> 
>> > This patch also updates the aapcs64 test mentioned in the PR to avoid
>> > the frontend folding away the alignment check. I've confirmed that the
>> > execution test actually fails on aarch64-linux-gnu prior to the patch
>> > being applied and passes afterwards.
>> >
>> > Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
>> > arm-linux-gnueabihf: no regressions.
>> >
>> > I'd appreciate any feedback. Is it OK for trunk?
>> >
>> > Thanks,
>> > Alex
>> >
>> > gcc/ChangeLog:
>> >
>> > 	PR middle-end/103500
>> > 	* function.c (get_stack_local_alignment): Align BLKmode overaligned
>> > 	types to the alignment required by the type.
>> > 	(assign_stack_temp_for_type): Handle BLKmode overaligned stack
>> > 	slots by allocating a larger-than-necessary buffer and aligning
>> > 	the address within appropriately.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 	PR middle-end/103500
>> > 	* gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
>> > 	Prevent the frontend from folding our alignment check away by
>> > 	using snprintf to store the pointer into a string and recovering
>> > 	it with sscanf.
>> >
>> > diff --git a/gcc/function.c b/gcc/function.c
>> > index 61b3bd036b8..5ed722ab959 100644
>> > --- a/gcc/function.c
>> > +++ b/gcc/function.c
>> > @@ -278,7 +278,9 @@ get_stack_local_alignment (tree type, machine_mode mode)
>> >    unsigned int alignment;
>> >  
>> >    if (mode == BLKmode)
>> > -    alignment = BIGGEST_ALIGNMENT;
>> > +    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
>> > +      ? TYPE_ALIGN (type)
>> > +      : BIGGEST_ALIGNMENT;
>> 
>> …I'm not sure about this calculation.  Why do we only honour TYPE_ALIGN
>> if it's greater than MAX_SUPPORTED_STACK_ALIGNMENT, and fall all the
>> way back to BIGGEST_ALIGNMENT otherwise?  It looks like on nvptx
>> this would have the effect of honouring (say) 2048-byte alignment,
>> but not 32-byte alignment (which falls between BIGGEST_ALIGNMENT
>> and MAX_SUPPORTED_STACK_ALIGNMENT).
>
> So, to be honest, this was a bit of a bodge to try and work around
> issues on x86. My original attempt at solving the PR used the more
> obvious calculation you suggest below, i.e. the max of BIGGEST_ALIGNMENT
> and TYPE_ALIGN (type). The problem with that is, on x86,
> MAX_SUPPORTED_STACK_ALIGNMENT has a huge value (2^31).
> explow.c:get_dynamic_stack_size has:
>
>   if (size_align % MAX_SUPPORTED_STACK_ALIGNMENT != 0)
>     {
>       size = round_push (size);
>       [...]
>     }
>   *psize = size;
>
> so inevitably we end up calling round_push on x86 which in turn ends up
> going down the else branch:
>
>   else
>     {
>       /* If crtl->preferred_stack_boundary might still grow, use
> 	 virtual_preferred_stack_boundary_rtx instead.  This will be
> 	 substituted by the right value in vregs pass and optimized
> 	 during combine.  */
>       align_rtx = virtual_preferred_stack_boundary_rtx;
>       alignm1_rtx = force_operand (plus_constant (Pmode, align_rtx, -1),
> 				   NULL_RTX);
>     }
>
> and we end up returning a size expression involving
> virtual_preferred_stack_boundary_rtx, which clearly cannot be a
> CONST_INT (required for the approach in the patch to work: we need a
> constant to pass to assign_stack_local_1).
>
> So the idea with that calculation was to only use the new, larger
> alignment, if it was larger than MAX_SUPPORTED_STACK_ALIGNMENT (which
> should never be the case on x86 for any reasonable value of TYPE_ALIGN).
> Otherwise, it would fall back to the old behaviour of just using
> BIGGEST_ALIGNMENT.
>
> I realise that this is a rather inelegant hack, but I wasn't sure how
> else to approach the problem.

Yeah, I think this shows that get_dynamic_stack_size probably
isn't the right interface to use here.  That's designed for cases
in which we're doing an independent alloca(), but here we want to do
a normal static allocation with an increased size.

>> Also, what about the non-BLKmode case?  A type's mode cannot be more
>> aligned than the type, but a type is allowed to be more aligned than
>> its mode.  E.g.:
>
> Yes, we should probably try and handle the non-BLKmode case as well. I
> was just trying to avoid changing too much at once at the risk of
> introducing regressions.

I think it'd be good to tackle both cases at the same time.

>> typedef unsigned int V __attribute__((vector_size(32)));
>> typedef struct __attribute__((aligned(32))) { V x; } S;
>> S x;
>> void f(S);
>> void g(void) { f(x); }
>> 
>> doesn't seem to be handled by the patch.  It's possible to create
>> variations on this on aarch64 up to 2048 bits using SVE and
>> -msve-vector-bits=N.
>> 
>> Perhaps we should treat the old alignment calculation as a minimum
>> and take the max of that and TYPE_ALIGN, where TYPE_ALIGN is given.
>
> As discussed above, this doesn't work on x86, unfortunately. If you have
> any alternative suggestions as to how to approach the problem, it would
> be great to hear them.

How about instead:

(1) Define a new ASLK_* flag for assign_stack_local_1.

(2) When the flag is set, make:

      if (alignment_in_bits > MAX_SUPPORTED_STACK_ALIGNMENT)
        {
          alignment_in_bits = MAX_SUPPORTED_STACK_ALIGNMENT;
          alignment = MAX_SUPPORTED_STACK_ALIGNMENT / BITS_PER_UNIT;
        }

    increase the size by (new_align - old_align).

(3) When the flag is set, and the new alignment is smaller than the
    original alignment, call align_dynamic_address before creating
    the MEM, as in your original patch.

There are probably other details, but it looks like that should cope
with the x86 dynamic stack realignment scheme even with the max-based
alignment calculation discussed above.

Thanks,
Richard
  
Richard Sandiford Jan. 21, 2022, 8:11 a.m. UTC | #8
Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> How about instead:
>
> (1) Define a new ASLK_* flag for assign_stack_local_1.
>
> (2) When the flag is set, make:
>
>       if (alignment_in_bits > MAX_SUPPORTED_STACK_ALIGNMENT)
>         {
>           alignment_in_bits = MAX_SUPPORTED_STACK_ALIGNMENT;
>           alignment = MAX_SUPPORTED_STACK_ALIGNMENT / BITS_PER_UNIT;
>         }
>
>     increase the size by (new_align - old_align).

Er, I meant old_align - new_align here :-)

>
> (3) When the flag is set, and the new alignment is smaller than the
>     original alignment, call align_dynamic_address before creating
>     the MEM, as in your original patch.
>
> There are probably other details, but it looks like that should cope
> with the x86 dynamic stack realignment scheme even with the max-based
> alignment calculation discussed above.
>
> Thanks,
> Richard
  

Patch

diff --git a/gcc/function.c b/gcc/function.c
index 61b3bd036b8..5ed722ab959 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -278,7 +278,9 @@  get_stack_local_alignment (tree type, machine_mode mode)
   unsigned int alignment;
 
   if (mode == BLKmode)
-    alignment = BIGGEST_ALIGNMENT;
+    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
+      ? TYPE_ALIGN (type)
+      : BIGGEST_ALIGNMENT;
   else
     alignment = GET_MODE_ALIGNMENT (mode);
 
@@ -872,21 +874,35 @@  assign_stack_temp_for_type (machine_mode mode, poly_int64 size, tree type)
 
       p = ggc_alloc<temp_slot> ();
 
-      /* We are passing an explicit alignment request to assign_stack_local.
-	 One side effect of that is assign_stack_local will not round SIZE
-	 to ensure the frame offset remains suitably aligned.
-
-	 So for requests which depended on the rounding of SIZE, we go ahead
-	 and round it now.  We also make sure ALIGNMENT is at least
-	 BIGGEST_ALIGNMENT.  */
-      gcc_assert (mode != BLKmode || align == BIGGEST_ALIGNMENT);
-      p->slot = assign_stack_local_1 (mode,
-				      (mode == BLKmode
-				       ? aligned_upper_bound (size,
-							      (int) align
-							      / BITS_PER_UNIT)
-				       : size),
-				      align, 0);
+      if (mode == BLKmode && align > MAX_SUPPORTED_STACK_ALIGNMENT)
+	{
+	  rtx allocsize = gen_int_mode (size, Pmode);
+	  get_dynamic_stack_size (&allocsize, 0, align, NULL);
+	  gcc_assert (CONST_INT_P (allocsize));
+	  size = UINTVAL (allocsize);
+	  p->slot = assign_stack_local_1 (mode,
+					  size,
+					  BIGGEST_ALIGNMENT, 0);
+	  rtx addr = align_dynamic_address (XEXP (p->slot, 0), align);
+	  mark_reg_pointer (addr, align);
+	  p->slot = gen_rtx_MEM (GET_MODE (p->slot), addr);
+	  MEM_NOTRAP_P (p->slot) = 1;
+	}
+      else
+	/* We are passing an explicit alignment request to assign_stack_local.
+	   One side effect of that is assign_stack_local will not round SIZE
+	   to ensure the frame offset remains suitably aligned.
+
+	   So for requests which depended on the rounding of SIZE, we go ahead
+	   and round it now.  We also make sure ALIGNMENT is at least
+	   BIGGEST_ALIGNMENT.  */
+	p->slot = assign_stack_local_1 (mode,
+					(mode == BLKmode
+					 ? aligned_upper_bound (size,
+								(int) align
+								/ BITS_PER_UNIT)
+					 : size),
+					align, 0);
 
       p->align = align;
 
diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
index c9351802191..0b0ac0b9b97 100644
--- a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
+++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
@@ -21,10 +21,18 @@  test_pass_by_ref (int x0, overaligned x1, int x2)
     abort ();
   if (memcmp ((void *) &x1, (void *)&a, sizeof (overaligned)))
     abort ();
-  long addr = ((long) &x1) & 31;
+
+  char buf[64];
+  void *ptr;
+
+  __builtin_snprintf (buf, sizeof(buf), "%p", &x1);
+  if (__builtin_sscanf(buf, "%p", &ptr) != 1)
+    abort ();
+
+  long addr = ((long) ptr) & 31;
   if (addr != 0)
     {
-      __builtin_printf ("Alignment was %d\n", addr);
+      __builtin_printf ("Alignment was %ld\n", addr);
       abort ();
     }
 }