[v3,AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers

Message ID 1643164088-18048-1-git-send-email-apinski@marvell.com
State New
Series [v3,AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers

Commit Message

Li, Pan2 via Gcc-patches Jan. 26, 2022, 2:28 a.m. UTC
  From: Andrew Pinski <apinski@marvell.com>

The problem here is that aarch64_expand_setmem does not take the alignment
of the destination into account in the strict-alignment case.  This is
version 3 of the patch; it is based on version 2, moves the check on the
number of instructions from the optimize-for-size case so that it is applied
unconditionally, and changes the libcall cost for the !size case to
max_size/16 + 1 (i.e. 17), which is the same as before when only max_size
was handled.  The main change is the handling of the strict-alignment case,
where we now inline at most 17 instructions; beyond that point a call to
memset will be faster and can handle the dynamic alignment rather than just
the static alignment.

Note that the reason for the +1 is to account for the instruction that sets
up the SIMD duplicate.
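
For reference, the decision logic after this patch boils down to roughly the
following (a simplified paraphrase of the hunks below, not the exact code):

  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);

  int copy_limit = (aarch64_tune_params.extra_tuning_flags
		    & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
		   ? GET_MODE_BITSIZE (TImode) : 256;
  if (STRICT_ALIGNMENT)
    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));

  /* ... emit the inline store sequence, counting inlined_ops ... */

  if (size_p && TARGET_MOPS
      && mops_cost <= libcall_cost
      && mops_cost <= inlined_ops)
    return aarch64_expand_setmem_mops (operands);
  else if (libcall_cost < inlined_ops)
    return false;	/* Fall back to a call to memset.  */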

OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.

	PR target/103100
gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_expand_setmem): Constrain
	copy_limit to the alignment of the mem if STRICT_ALIGNMENT is
	true.  Also constrain the number of instructions for the !size
	case to max_size/16 + 1.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
	Reduce the size down to 207 and make s1 global and aligned
	to 16 bytes.
	* gcc.target/aarch64/memset-strict-align-2.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 55 ++++++++++---------
 .../aarch64/memset-strict-align-1.c           | 20 +++----
 .../aarch64/memset-strict-align-2.c           | 14 +++++
 3 files changed, 53 insertions(+), 36 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
  

Comments

Richard Sandiford Feb. 1, 2022, 10:44 a.m. UTC | #1
apinski--- via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> From: Andrew Pinski <apinski@marvell.com>
>
> The problem here is that aarch64_expand_setmem does not take the alignment
> of the destination into account in the strict-alignment case.  This is
> version 3 of the patch; it is based on version 2, moves the check on the
> number of instructions from the optimize-for-size case so that it is applied
> unconditionally, and changes the libcall cost for the !size case to
> max_size/16 + 1 (i.e. 17), which is the same as before when only max_size
> was handled.

In this case, if we want the number to be 17 for strict alignment targets,
I think we should just pick 17.  It feels odd to be setting the libcall
cost based on AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.

I notice that you didn't go for the suggestion in the previous review
round:

| I think in practice the code has only been tuned on targets that
| support LDP/STP Q, so how about moving the copy_limit calculation
| further up and doing:
|
|   unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;
|
| ?

Does that not work?  One advantage of it is that the:

  if (len > max_set_size && !TARGET_MOPS)
    return false;

bail-out will kick in sooner, stopping us from generating (say)
255 garbage insns for a byte-aligned 255-byte copy.
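
For concreteness, an untested sketch of that reordering (compute copy_limit
first, then derive max_set_size from it) might look like:

  int copy_limit = (aarch64_tune_params.extra_tuning_flags
		    & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
		   ? GET_MODE_BITSIZE (TImode) : 256;
  if (STRICT_ALIGNMENT)
    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));

  /* Allow at most 8 chunks of copy_limit bits each before punting
     to memset/MOPS.  */
  unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;

  if (len > max_set_size && !TARGET_MOPS)
    return false;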

Thanks,
Richard

> The main change is the handling of the strict-alignment case, where we now
> inline at most 17 instructions; beyond that point a call to memset will be
> faster and can handle the dynamic alignment rather than just the static
> alignment.
>
> Note that the reason for the +1 is to account for the instruction that sets
> up the SIMD duplicate.
>
> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>
> 	PR target/103100
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_expand_setmem): Constrain
> 	copy_limit to the alignment of the mem if STRICT_ALIGNMENT is
> 	true.  Also constrain the number of instructions for the !size
> 	case to max_size/16 + 1.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
> 	Reduce the size down to 207 and make s1 global and aligned
> 	to 16 bytes.
> 	* gcc.target/aarch64/memset-strict-align-2.c: New test.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 55 ++++++++++---------
>  .../aarch64/memset-strict-align-1.c           | 20 +++----
>  .../aarch64/memset-strict-align-2.c           | 14 +++++
>  3 files changed, 53 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 296145e6008..02ecb2154ea 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23831,8 +23831,11 @@ aarch64_expand_setmem (rtx *operands)
>      (zero constants can use XZR directly).  */
>    unsigned mops_cost = 3 + 1 + cst_val;
>    /* A libcall to memset in the worst case takes 3 instructions to prepare
> -     the arguments + 1 for the call.  */
> -  unsigned libcall_cost = 4;
> +     the arguments + 1 for the call.
> +     In the case of not optimizing for size the cost of doing a libcall
> +     is the max_set_size / 16 + 1 or 17 instructions. The one instruction
> +     is for the vector dup which may or may not be used.  */
> +  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
>  
>    /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
>       when available.  */
> @@ -23842,12 +23845,12 @@ aarch64_expand_setmem (rtx *operands)
>  
>    /* Attempt a sequence with a vector broadcast followed by stores.
>       Count the number of operations involved to see if it's worth it
> -     against the alternatives.  A simple counter simd_ops on the
> +     against the alternatives.  A simple counter inlined_ops on the
>       algorithmically-relevant operations is used rather than an rtx_insn count
>       as all the pointer adjusmtents and mode reinterprets will be optimized
>       away later.  */
>    start_sequence ();
> -  unsigned simd_ops = 0;
> +  unsigned inlined_ops = 0;
>  
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23855,15 +23858,22 @@ aarch64_expand_setmem (rtx *operands)
>    /* Prepare the val using a DUP/MOVI v0.16B, val.  */
>    src = expand_vector_broadcast (V16QImode, val);
>    src = force_reg (V16QImode, src);
> -  simd_ops++;
> +  inlined_ops++;
>    /* Convert len to bits to make the rest of the code simpler.  */
>    n = len * BITS_PER_UNIT;
>  
>    /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
>       AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
> -  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
> -			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> -			  ? GET_MODE_BITSIZE (TImode) : 256;
> +  int copy_limit;
> +
> +  if (aarch64_tune_params.extra_tuning_flags
> +      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> +    copy_limit = GET_MODE_BITSIZE (TImode);
> +  else
> +    copy_limit = 256;
> +
> +  if (STRICT_ALIGNMENT)
> +    copy_limit = MIN (copy_limit, (int)MEM_ALIGN (dst));
>  
>    while (n > 0)
>      {
> @@ -23878,7 +23888,7 @@ aarch64_expand_setmem (rtx *operands)
>  
>        mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
>        aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> -      simd_ops++;
> +      inlined_ops++;
>        n -= mode_bits;
>  
>        /* Do certain trailing copies as overlapping if it's going to be
> @@ -23897,24 +23907,17 @@ aarch64_expand_setmem (rtx *operands)
>    rtx_insn *seq = get_insns ();
>    end_sequence ();
>  
> -  if (size_p)
> -    {
> -      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
> -	 call to memset or the MOPS expansion.  */
> -      if (TARGET_MOPS
> -	  && mops_cost <= libcall_cost
> -	  && mops_cost <= simd_ops)
> -	return aarch64_expand_setmem_mops (operands);
> -      /* If MOPS is not available or not shorter pick a libcall if the SIMD
> -	 sequence is too long.  */
> -      else if (libcall_cost < simd_ops)
> -	return false;
> -      emit_insn (seq);
> -      return true;
> -    }
> +  /* When optimizing for size we have 3 options: the inlined sequence,
> +     call to memset or the MOPS expansion.  */
> +  if (size_p
> +      && TARGET_MOPS
> +      && mops_cost <= libcall_cost
> +      && mops_cost <= inlined_ops)
> +    return aarch64_expand_setmem_mops (operands);
> +  /* Pick a libcall if the inlined sequence is too long.  */
> +  else if (libcall_cost < inlined_ops)
> +    return false;
>  
> -  /* At this point the SIMD broadcast sequence is the best choice when
> -     optimizing for speed.  */
>    emit_insn (seq);
>    return true;
>  }
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> index 664d43aee13..5a7acbf2fa9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> @@ -1,28 +1,28 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mstrict-align" } */
>  
> -struct s { char x[255]; };
> +struct s { char x[207]; };
> +struct s s1 __attribute__((aligned(16)));
>  void foo (struct s *);
> -void bar (void) { struct s s1 = {}; foo (&s1); }
> +void bar (void) { s1 = (struct s){}; foo (&s1); }
>  
> -/* memset (s1 = {}, sizeof = 255) should be expanded out
> +/* memset (s1 = {}, sizeof = 207) should be expanded out
>     such that there are no overlap stores when -mstrict-align
>     is in use.
> -   so 7 pairs of 16 bytes stores (224 bytes).
> -   1 16 byte stores
> +   so 5 pairs of 16 bytes stores (80 bytes).
> +   2 16 byte stores (FIXME: should be 0)
>     1 8 byte store
>     1 4 byte store
>     1 2 byte store
>     1 1 byte store
>     */
>  
> -/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
> -/* { dg-final { scan-assembler-times "str\tq" 1 } } */
> +/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
> +/* { dg-final { scan-assembler-times "str\tq" 2 } } */
>  /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
>  /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
>  
> -/* Also one store pair for the frame-pointer and the LR. */
> -/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
> -
> +/* No store pair with 8 byte words is needed as foo is called with a sibcall. */
> +/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> new file mode 100644
> index 00000000000..67d994fe39e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +struct s { char x[15]; };
> +void foo (struct s *);
> +void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
> +
> +/* memset (s1 = {}, sizeof = 15) should be expanded out
> +   such that there are no overlap stores when -mstrict-align
> +   is in use. As the alignment of s1 is unknown, byte stores are needed.
> +   so 15 byte stores
> +   */
> +
> +/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */
  

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 296145e6008..02ecb2154ea 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23831,8 +23831,11 @@  aarch64_expand_setmem (rtx *operands)
     (zero constants can use XZR directly).  */
   unsigned mops_cost = 3 + 1 + cst_val;
   /* A libcall to memset in the worst case takes 3 instructions to prepare
-     the arguments + 1 for the call.  */
-  unsigned libcall_cost = 4;
+     the arguments + 1 for the call.
+     In the case of not optimizing for size the cost of doing a libcall
+     is the max_set_size / 16 + 1 or 17 instructions. The one instruction
+     is for the vector dup which may or may not be used.  */
+  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
 
   /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
      when available.  */
@@ -23842,12 +23845,12 @@  aarch64_expand_setmem (rtx *operands)
 
   /* Attempt a sequence with a vector broadcast followed by stores.
      Count the number of operations involved to see if it's worth it
-     against the alternatives.  A simple counter simd_ops on the
+     against the alternatives.  A simple counter inlined_ops on the
      algorithmically-relevant operations is used rather than an rtx_insn count
      as all the pointer adjusmtents and mode reinterprets will be optimized
      away later.  */
   start_sequence ();
-  unsigned simd_ops = 0;
+  unsigned inlined_ops = 0;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23855,15 +23858,22 @@  aarch64_expand_setmem (rtx *operands)
   /* Prepare the val using a DUP/MOVI v0.16B, val.  */
   src = expand_vector_broadcast (V16QImode, val);
   src = force_reg (V16QImode, src);
-  simd_ops++;
+  inlined_ops++;
   /* Convert len to bits to make the rest of the code simpler.  */
   n = len * BITS_PER_UNIT;
 
   /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
      AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
-  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
-			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
-			  ? GET_MODE_BITSIZE (TImode) : 256;
+  int copy_limit;
+
+  if (aarch64_tune_params.extra_tuning_flags
+      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
+    copy_limit = GET_MODE_BITSIZE (TImode);
+  else
+    copy_limit = 256;
+
+  if (STRICT_ALIGNMENT)
+    copy_limit = MIN (copy_limit, (int)MEM_ALIGN (dst));
 
   while (n > 0)
     {
@@ -23878,7 +23888,7 @@  aarch64_expand_setmem (rtx *operands)
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
       aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
-      simd_ops++;
+      inlined_ops++;
       n -= mode_bits;
 
       /* Do certain trailing copies as overlapping if it's going to be
@@ -23897,24 +23907,17 @@  aarch64_expand_setmem (rtx *operands)
   rtx_insn *seq = get_insns ();
   end_sequence ();
 
-  if (size_p)
-    {
-      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
-	 call to memset or the MOPS expansion.  */
-      if (TARGET_MOPS
-	  && mops_cost <= libcall_cost
-	  && mops_cost <= simd_ops)
-	return aarch64_expand_setmem_mops (operands);
-      /* If MOPS is not available or not shorter pick a libcall if the SIMD
-	 sequence is too long.  */
-      else if (libcall_cost < simd_ops)
-	return false;
-      emit_insn (seq);
-      return true;
-    }
+  /* When optimizing for size we have 3 options: the inlined sequence,
+     call to memset or the MOPS expansion.  */
+  if (size_p
+      && TARGET_MOPS
+      && mops_cost <= libcall_cost
+      && mops_cost <= inlined_ops)
+    return aarch64_expand_setmem_mops (operands);
+  /* Pick a libcall if the inlined sequence is too long.  */
+  else if (libcall_cost < inlined_ops)
+    return false;
 
-  /* At this point the SIMD broadcast sequence is the best choice when
-     optimizing for speed.  */
   emit_insn (seq);
   return true;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
index 664d43aee13..5a7acbf2fa9 100644
--- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
@@ -1,28 +1,28 @@ 
 /* { dg-do compile } */
 /* { dg-options "-O2 -mstrict-align" } */
 
-struct s { char x[255]; };
+struct s { char x[207]; };
+struct s s1 __attribute__((aligned(16)));
 void foo (struct s *);
-void bar (void) { struct s s1 = {}; foo (&s1); }
+void bar (void) { s1 = (struct s){}; foo (&s1); }
 
-/* memset (s1 = {}, sizeof = 255) should be expanded out
+/* memset (s1 = {}, sizeof = 207) should be expanded out
    such that there are no overlap stores when -mstrict-align
    is in use.
-   so 7 pairs of 16 bytes stores (224 bytes).
-   1 16 byte stores
+   so 5 pairs of 16 bytes stores (80 bytes).
+   2 16 byte stores (FIXME: should be 0)
    1 8 byte store
    1 4 byte store
    1 2 byte store
    1 1 byte store
    */
 
-/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
-/* { dg-final { scan-assembler-times "str\tq" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
+/* { dg-final { scan-assembler-times "str\tq" 2 } } */
 /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
 /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
 
-/* Also one store pair for the frame-pointer and the LR. */
-/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
-
+/* No store pair with 8 byte words is needed as foo is called with a sibcall. */
+/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
new file mode 100644
index 00000000000..67d994fe39e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstrict-align" } */
+
+struct s { char x[15]; };
+void foo (struct s *);
+void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
+
+/* memset (s1 = {}, sizeof = 15) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use. As the alignment of s1 is unknown, byte stores are needed.
+   so 15 byte stores
+   */
+
+/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */