[v2] PR tree-optimization/98335: Improvements to DSE's compute_trims.
Commit Message
Hi Richard,
Many thanks. Yes, your proposed ao_ref_alignment is exactly what I was looking for.
Here's the second revision of my patch for PR tree-optimization/98335. It both
introduces/uses ao_ref_alignment and more intelligently aligns/trims both head
and tail, for example handling the case discussed by Richard and Jeff Law of a
16-byte-aligned object where we wish to avoid trimming (just) the last three
bytes. It exploits the useful property that writing N consecutive bytes
typically requires popcount(N) store instructions, so where possible we arrange
for the first/last store to cover N' bytes where popcount(N') is one, if that
isn't already the case.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
make -k check with no new failures. Is this revised version ok for mainline?
2022-03-09 Roger Sayle <roger@nextmovesoftware.com>
Richard Biener <rguenther@suse.de>
gcc/ChangeLog
PR tree-optimization/98335
* builtins.cc (get_object_alignment_2): Export.
* builtins.h (get_object_alignment_2): Likewise.
* tree-ssa-alias.cc (ao_ref_alignment): New.
* tree-ssa-alias.h (ao_ref_alignment): Declare.
* tree-ssa-dse.cc (compute_trims): Improve logic deciding whether
to align head/tail, writing more bytes but using fewer store insns.
(maybe_trim_memstar_call): Silence compiler warnings by using
memset to initialize lendata.
gcc/testsuite/ChangeLog
PR tree-optimization/98335
* g++.dg/pr98335.C: New test case.
* gcc.dg/pr86010.c: New test case.
* gcc.dg/pr86010-2.c: New test case.
Thanks again for your help.
Roger
--
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: 08 March 2022 10:44
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [PATCH] PR tree-optimization/98335: Improvements to DSE's
> compute_trims.
>
> On Tue, Mar 8, 2022 at 11:10 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> > >
> > >
> > > This patch is the main middle-end piece of a fix for PR
> > > tree-opt/98335, which is a code-quality regression affecting
> > > mainline. The issue occurs in DSE's (dead store elimination's)
> > > compute_trims function that determines where a store to memory can
> > > be trimmed. In the testcase given in the PR, this function notices
> > > that the first byte of a DImode store is dead, and replaces the
> > > 8-byte store at (aligned) offset zero, with a 7-byte store at
> > > (unaligned) offset one. Most architectures can store a power-of-two
> > > bytes (up to a maximum) in a single instruction, so writing 7 bytes
> > > requires more instructions than writing 8 bytes. This patch follows Jakub
> Jelinek's suggestion in comment 5, that compute_trims needs improved
> heuristics.
> > >
> > > In this patch, the decision of whether and how to align trim_head is
> > > based on the number of bytes being written, the alignment of the
> > > start of the object and where within the object the first byte is
> > > written. The first tests check whether we're already writing to the
> > > start of the object, and that we're writing three or more bytes. If
> > > we're only writing one or two bytes, there's no benefit from providing
> additional alignment.
> > > Then we determine the alignment of the object, which is either 1, 2,
> > > 4, 8 or 16 byte aligned (capping at 16 guarantees that we never
> > > write more than 7 bytes beyond the minimum required). If the buffer
> > > is only
> > > 1 or 2 byte aligned there's no benefit from additional alignment.
> > > For the remaining cases, alignment of trim_head is based upon where
> > > within each aligned block (word) the first byte is written. For
> > > example, storing the last byte (or last half-word) of a word can be
> > > performed with a single insn.
> > >
> > > On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from:
> > >
> > > movl $0, -24(%rsp)
> > > movabsq $72057594037927935, %rdx
> > > movl $0, -21(%rsp)
> > > andq -24(%rsp), %rdx
> > > movq %rdx, %rax
> > > salq $8, %rax
> > > movb c(%rip), %al
> > > ret
> > >
> > > to
> > >
> > > xorl %eax, %eax
> > > movb c(%rip), %al
> > > ret
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check with no new failures. I've also added
> > > new testcases for the original motivating PR
> > > tree-optimization/86010, to ensure that those remain optimized (in future).
> Ok for mainline?
> >
> > diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc index
> > 2b22a61..080e406 100644
> > --- a/gcc/tree-ssa-dse.cc
> > +++ b/gcc/tree-ssa-dse.cc
> > @@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int
> > *trim_head, int *trim_tail,
> > int first_live = bitmap_first_set_bit (live);
> > *trim_head = first_live - first_orig;
> >
> > - /* If more than a word remains, then make sure to keep the
> > - starting point at least word aligned. */
> > - if (last_live - first_live > UNITS_PER_WORD)
> > - *trim_head &= ~(UNITS_PER_WORD - 1);
> > + /* If REF is aligned, try to maintain this alignment if it reduces
> > + the number of (power-of-two sized aligned) writes to memory.
> > +       First check that we're writing >= 3 bytes at a non-zero offset.  */
> > +  if (first_live
> > + && last_live - first_live >= 2)
> > + {
> > + unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base));
> >
> > you can't simply use TYPE_ALIGN_* on ref->base. You can use
> > get_object_alignment on ref->ref, but ref->ref can be NULL in case the
> > ref was initialized from a builtin call like memcpy.
> >
> > Also ref->base is offsetted by ref->offset which you don't seem to
> > account for. In theory one could export get_object_alignment_2 and if
> > ref->ref is NULL, use that on ref->base, passing addr_p = true, and
> > then adjust the resulting bitpos by ref->offset and fix align
> > accordingly (trimming might also align an access if the original
> > access was offsetted from known alignment).
> >
> > That said, a helper like ao_ref_alignment () might be useful here.
>
> Like the attached - feel free to use that.
>
> Richard.
>
> >
> > I wonder if we can apply good heuristics to compute_trims without
> > taking into account context, like maybe_trim_complex_store is already
> > limiting itself to useful subsets and the constructor and memstar
> > cases will only benefit if they end up being expanded inline via
> > *_by_pieces, not if expanded as a call.
> >
> > You don't seem to adjust *trim_tail at all, if an aligned 16 byte
> > region is trimmed there by 3 that will result in two extra stores as well, no?
> >
Comments
On Wed, Mar 9, 2022 at 6:43 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
OK.
I wonder if we can also handle bitpos != 0. For example, if we have
an access 3 bytes into an 8-byte aligned object and want to trim
2 bytes at the start, it would be good to trim only 1 byte so the
access becomes 4-byte aligned (4 bytes into the 8-byte aligned
object). Similarly, trimming 6 bytes should be reduced to trimming
5 bytes.
Richard.
@@ -233,7 +233,7 @@ called_as_built_in (tree node)
If ADDR_P is true we are taking the address of the memory reference EXP
and thus cannot rely on the access taking place. */
-static bool
+bool
get_object_alignment_2 (tree exp, unsigned int *alignp,
unsigned HOST_WIDE_INT *bitposp, bool addr_p)
{
@@ -52,6 +52,8 @@ extern bool force_folding_builtin_constant_p;
extern bool called_as_built_in (tree);
extern bool get_object_alignment_1 (tree, unsigned int *,
unsigned HOST_WIDE_INT *);
+extern bool get_object_alignment_2 (tree, unsigned int *,
+ unsigned HOST_WIDE_INT *, bool);
extern unsigned int get_object_alignment (tree);
extern bool get_pointer_alignment_1 (tree, unsigned int *,
unsigned HOST_WIDE_INT *);
@@ -790,6 +790,29 @@ ao_ref_alias_ptr_type (ao_ref *ref)
return ret;
}
+/* Return the alignment of the access *REF and store it in the *ALIGN
+ and *BITPOS pairs. Returns false if no alignment could be determined.
+ See get_object_alignment_2 for details. */
+
+bool
+ao_ref_alignment (ao_ref *ref, unsigned int *align,
+ unsigned HOST_WIDE_INT *bitpos)
+{
+ if (ref->ref)
+ return get_object_alignment_1 (ref->ref, align, bitpos);
+
+ /* When we just have ref->base we cannot use get_object_alignment since
+   that will eventually use the type of the apparent access while for
+ example ao_ref_init_from_ptr_and_range is not careful to adjust that. */
+ *align = BITS_PER_UNIT;
+ HOST_WIDE_INT offset;
+ if (!ref->offset.is_constant (&offset)
+ || !get_object_alignment_2 (ref->base, align, bitpos, true))
+ return false;
+ *bitpos += (unsigned HOST_WIDE_INT)offset * BITS_PER_UNIT;
+ *bitpos = *bitpos & (*align - 1);
+ return true;
+}
/* Init an alias-oracle reference representation from a gimple pointer
PTR a range specified by OFFSET, SIZE and MAX_SIZE under the assumption
@@ -119,6 +119,8 @@ extern alias_set_type ao_ref_alias_set (ao_ref *);
extern alias_set_type ao_ref_base_alias_set (ao_ref *);
extern tree ao_ref_alias_ptr_type (ao_ref *);
extern tree ao_ref_base_alias_ptr_type (ao_ref *);
+extern bool ao_ref_alignment (ao_ref *, unsigned int *,
+ unsigned HOST_WIDE_INT *);
extern bool ptr_deref_may_alias_global_p (tree);
extern bool ptr_derefs_may_alias_p (tree, tree);
extern bool ptrs_compare_unequal (tree, tree);
@@ -405,10 +405,56 @@ compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail,
int first_live = bitmap_first_set_bit (live);
*trim_head = first_live - first_orig;
- /* If more than a word remains, then make sure to keep the
- starting point at least word aligned. */
- if (last_live - first_live > UNITS_PER_WORD)
- *trim_head &= ~(UNITS_PER_WORD - 1);
+ /* If REF is aligned, try to maintain this alignment if it reduces
+ the number of (power-of-two sized aligned) writes to memory. */
+ unsigned int align_bits;
+ unsigned HOST_WIDE_INT bitpos;
+ if ((*trim_head || *trim_tail)
+ && last_live - first_live >= 2
+ && ao_ref_alignment (ref, &align_bits, &bitpos)
+ && align_bits >= 32
+ && bitpos == 0
+ && align_bits % BITS_PER_UNIT == 0)
+ {
+ unsigned int align_units = align_bits / BITS_PER_UNIT;
+ if (align_units > 16)
+ align_units = 16;
+ while ((first_live | (align_units - 1)) > (unsigned int)last_live)
+ align_units >>= 1;
+
+ if (*trim_head)
+ {
+ unsigned int pos = first_live & (align_units - 1);
+ for (unsigned int i = 1; i <= align_units; i <<= 1)
+ {
+ unsigned int mask = ~(i - 1);
+ unsigned int bytes = align_units - (pos & mask);
+ if (wi::popcount (bytes) <= 1)
+ {
+ *trim_head &= mask;
+ break;
+ }
+ }
+ }
+
+ if (*trim_tail)
+ {
+ unsigned int pos = last_live & (align_units - 1);
+ for (unsigned int i = 1; i <= align_units; i <<= 1)
+ {
+ int mask = i - 1;
+ unsigned int bytes = (pos | mask) + 1;
+ if ((last_live | mask) > (last_live + *trim_tail))
+ break;
+ if (wi::popcount (bytes) <= 1)
+ {
+ unsigned int extra = (last_live | mask) - last_live;
+ *trim_tail -= extra;
+ break;
+ }
+ }
+ }
+ }
if ((*trim_head || *trim_tail)
&& dump_file && (dump_flags & TDF_DETAILS))
@@ -590,7 +636,8 @@ maybe_trim_memstar_call (ao_ref *ref, sbitmap live, gimple *stmt)
about those bytes, the presence or absence of '\0' bytes
in there will affect whether it acts for the non-trimmed
bytes as memset or memcpy/strncpy. */
- c_strlen_data lendata = { };
+ c_strlen_data lendata;
+ memset (&lendata, 0, sizeof (lendata));
int orig_head_trim = head_trim;
tree srcstr = gimple_call_arg (stmt, 1);
if (!get_range_strlen (srcstr, &lendata, /*eltsize=*/1)
new file mode 100644
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+struct Data {
+ char a;
+ int b;
+};
+
+char c;
+
+Data val(int idx) {
+ return { c }; // { dg-warning "extended initializer" "c++ 98" { target { c++98_only } } }
+}
+
+/* { dg-final { scan-tree-dump-not " + 1B] = {}" "optimized" } } */
new file mode 100644
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f (void*);
+
+void g (char *a)
+{
+ __builtin_memset (a, 0, 8);
+ __builtin_memset (a, 0, 8);
+
+ f (a);
+}
+
+void h (char *a)
+{
+ __builtin_memset (a, 0, 8);
+ __builtin_memset (a, 0, 7);
+
+ f (a);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_memset" 2 "optimized" } } */
new file mode 100644
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f (void*);
+
+void g (void)
+{
+ char a[8];
+ __builtin_memset (a, 0, 8);
+ __builtin_memset (a, 0, 8);
+
+ f (a);
+}
+
+void h (void)
+{
+ char a[8];
+ __builtin_memset (a, 0, 8);
+ __builtin_memset (a, 0, 7);
+
+ f (a);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_memset" 2 "optimized" } } */