From patchwork Wed Mar 9 17:43:42 2022
X-Patchwork-Submitter: Roger Sayle
X-Patchwork-Id: 51824
From: "Roger Sayle"
To: "'Richard Biener'"
Cc: 'GCC Patches'
Subject: [PATCH v2] PR tree-optimization/98335: Improvements to DSE's compute_trims.
Date: Wed, 9 Mar 2022 17:43:42 -0000

Hi Richard,

Many thanks.  Yes, your proposed ao_ref_alignment is exactly what I was
looking for.  Here is the second revision of my patch for PR
tree-optimization/98335, which both introduces/uses ao_ref_alignment and
more intelligently aligns/trims both head and tail, for example handling
the case discussed by Richard and Jeff Law of a 16-byte-aligned object
where we wish to avoid trimming (just) the last three bytes.
It uses the useful property that writing N consecutive bytes typically
requires popcount(N) store instructions, so we wish to arrange (if we can)
that the trimmed store begins/ends with a chunk of N' bytes where
popcount(N') is one, if that isn't already the case.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
make -k check with no new failures.  Is this revised version ok for
mainline?

2022-03-09  Roger Sayle
	    Richard Biener

gcc/ChangeLog
	PR tree-optimization/98335
	* builtins.cc (get_object_alignment_2): Export.
	* builtins.h (get_object_alignment_2): Likewise.
	* tree-ssa-alias.cc (ao_ref_alignment): New.
	* tree-ssa-alias.h (ao_ref_alignment): Declare.
	* tree-ssa-dse.cc (compute_trims): Improve logic deciding whether
	to align head/tail, writing more bytes but using fewer store insns.
	(maybe_trim_memstar_call): Silence compiler warnings by using
	memset to initialize lendata.

gcc/testsuite/ChangeLog
	PR tree-optimization/98335
	* g++.dg/pr98335.C: New test case.
	* gcc.dg/pr86010.c: New test case.
	* gcc.dg/pr86010-2.c: New test case.

Thanks again for your help.
Roger
--

> -----Original Message-----
> From: Richard Biener
> Sent: 08 March 2022 10:44
> To: Roger Sayle
> Cc: GCC Patches
> Subject: Re: [PATCH] PR tree-optimization/98335: Improvements to DSE's
> compute_trims.
>
> On Tue, Mar 8, 2022 at 11:10 AM Richard Biener wrote:
> >
> > On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle wrote:
> > >
> > > This patch is the main middle-end piece of a fix for PR
> > > tree-opt/98335, which is a code-quality regression affecting
> > > mainline.  The issue occurs in DSE's (dead store elimination's)
> > > compute_trims function that determines where a store to memory can
> > > be trimmed.  In the testcase given in the PR, this function notices
> > > that the first byte of a DImode store is dead, and replaces the
> > > 8-byte store at (aligned) offset zero with a 7-byte store at
> > > (unaligned) offset one.
> > > Most architectures can store a power-of-two number of bytes (up to
> > > a maximum) in a single instruction, so writing 7 bytes requires
> > > more instructions than writing 8 bytes.  This patch follows Jakub
> > > Jelinek's suggestion in comment 5, that compute_trims needs
> > > improved heuristics.
> > >
> > > In this patch, the decision of whether and how to align trim_head
> > > is based on the number of bytes being written, the alignment of the
> > > start of the object, and where within the object the first byte is
> > > written.  The first tests check whether we're already writing to
> > > the start of the object, and that we're writing three or more
> > > bytes.  If we're only writing one or two bytes, there's no benefit
> > > from providing additional alignment.
> > > Then we determine the alignment of the object, which is either 1,
> > > 2, 4, 8 or 16 byte aligned (capping at 16 guarantees that we never
> > > write more than 7 bytes beyond the minimum required).  If the
> > > buffer is only 1 or 2 byte aligned, there's no benefit from
> > > additional alignment.  For the remaining cases, alignment of
> > > trim_head is based upon where within each aligned block (word) the
> > > first byte is written.  For example, storing the last byte (or last
> > > half-word) of a word can be performed with a single insn.
> > >
> > > On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from:
> > >
> > >         movl    $0, -24(%rsp)
> > >         movabsq $72057594037927935, %rdx
> > >         movl    $0, -21(%rsp)
> > >         andq    -24(%rsp), %rdx
> > >         movq    %rdx, %rax
> > >         salq    $8, %rax
> > >         movb    c(%rip), %al
> > >         ret
> > >
> > > to
> > >
> > >         xorl    %eax, %eax
> > >         movb    c(%rip), %al
> > >         ret
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check with no new failures.  I've also added
> > > new testcases for the original motivating PR
> > > tree-optimization/86010, to ensure that those remain optimized (in
> > > future).  Ok for mainline?
> >
> > diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc
> > index 2b22a61..080e406 100644
> > --- a/gcc/tree-ssa-dse.cc
> > +++ b/gcc/tree-ssa-dse.cc
> > @@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail,
> >    int first_live = bitmap_first_set_bit (live);
> >    *trim_head = first_live - first_orig;
> >
> > -  /* If more than a word remains, then make sure to keep the
> > -     starting point at least word aligned.  */
> > -  if (last_live - first_live > UNITS_PER_WORD)
> > -    *trim_head &= ~(UNITS_PER_WORD - 1);
> > +  /* If REF is aligned, try to maintain this alignment if it reduces
> > +     the number of (power-of-two sized aligned) writes to memory.
> > +     First check that we're writing >= 3 bytes at a non-zero offset.  */
> > +  if (first_live
> > +      && last_live - first_live >= 2)
> > +    {
> > +      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base));
> >
> > you can't simply use TYPE_ALIGN_* on ref->base.  You can use
> > get_object_alignment on ref->ref, but ref->ref can be NULL in case the
> > ref was initialized from a builtin call like memcpy.
> >
> > Also ref->base is offset by ref->offset, which you don't seem to
> > account for.  In theory one could export get_object_alignment_2 and,
> > if ref->ref is NULL, use that on ref->base, passing addr_p = true, and
> > then adjust the resulting bitpos by ref->offset and fix align
> > accordingly (trimming might also align an access if the original
> > access was offset from known alignment).
> >
> > That said, a helper like ao_ref_alignment () might be useful here.
>
> Like the attached - feel free to use that.
>
> Richard.
>
> > I wonder if we can apply good heuristics to compute_trims without
> > taking into account context, like maybe_trim_complex_store is already
> > limiting itself to useful subsets, and the constructor and memstar
> > cases will only benefit if they end up being expanded inline via
> > *_by_pieces, not if expanded as a call.
> >
> > You don't seem to adjust *trim_tail at all; if an aligned 16 byte
> > region is trimmed there by 3, that will result in two extra stores as
> > well, no?

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index d784a57..4c6c293 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -233,7 +233,7 @@ called_as_built_in (tree node)
    If ADDR_P is true we are taking the address of the memory reference EXP
    and thus cannot rely on the access taking place.  */
 
-static bool
+bool
 get_object_alignment_2 (tree exp, unsigned int *alignp,
 			unsigned HOST_WIDE_INT *bitposp, bool addr_p)
 {
diff --git a/gcc/builtins.h b/gcc/builtins.h
index ea10b4b..5ad830c 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -52,6 +52,8 @@ extern bool force_folding_builtin_constant_p;
 extern bool called_as_built_in (tree);
 extern bool get_object_alignment_1 (tree, unsigned int *,
 				    unsigned HOST_WIDE_INT *);
+extern bool get_object_alignment_2 (tree, unsigned int *,
+				    unsigned HOST_WIDE_INT *, bool);
 extern unsigned int get_object_alignment (tree);
 extern bool get_pointer_alignment_1 (tree, unsigned int *,
 				     unsigned HOST_WIDE_INT *);
diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index 3e8d245..50bd47b 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -790,6 +790,29 @@ ao_ref_alias_ptr_type (ao_ref *ref)
   return ret;
 }
 
+/* Return the alignment of the access *REF and store it in the *ALIGN
+   and *BITPOS pair.  Returns false if no alignment could be determined.
+   See get_object_alignment_2 for details.  */
+
+bool
+ao_ref_alignment (ao_ref *ref, unsigned int *align,
+		  unsigned HOST_WIDE_INT *bitpos)
+{
+  if (ref->ref)
+    return get_object_alignment_1 (ref->ref, align, bitpos);
+
+  /* When we just have ref->base we cannot use get_object_alignment since
+     that will eventually use the type of the apparent access while for
+     example ao_ref_init_from_ptr_and_range is not careful to adjust that.
+   */
+  *align = BITS_PER_UNIT;
+  HOST_WIDE_INT offset;
+  if (!ref->offset.is_constant (&offset)
+      || !get_object_alignment_2 (ref->base, align, bitpos, true))
+    return false;
+  *bitpos += (unsigned HOST_WIDE_INT) offset * BITS_PER_UNIT;
+  *bitpos = *bitpos & (*align - 1);
+  return true;
+}
 
 /* Init an alias-oracle reference representation from a gimple pointer
    PTR a range specified by OFFSET, SIZE and MAX_SIZE under the assumption
diff --git a/gcc/tree-ssa-alias.h b/gcc/tree-ssa-alias.h
index 6ae1a65..dfb2127 100644
--- a/gcc/tree-ssa-alias.h
+++ b/gcc/tree-ssa-alias.h
@@ -119,6 +119,8 @@ extern alias_set_type ao_ref_alias_set (ao_ref *);
 extern alias_set_type ao_ref_base_alias_set (ao_ref *);
 extern tree ao_ref_alias_ptr_type (ao_ref *);
 extern tree ao_ref_base_alias_ptr_type (ao_ref *);
+extern bool ao_ref_alignment (ao_ref *, unsigned int *,
+			      unsigned HOST_WIDE_INT *);
 extern bool ptr_deref_may_alias_global_p (tree);
 extern bool ptr_derefs_may_alias_p (tree, tree);
 extern bool ptrs_compare_unequal (tree, tree);
diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc
index 2b22a61..49f91e2 100644
--- a/gcc/tree-ssa-dse.cc
+++ b/gcc/tree-ssa-dse.cc
@@ -405,10 +405,56 @@ compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail,
   int first_live = bitmap_first_set_bit (live);
   *trim_head = first_live - first_orig;
 
-  /* If more than a word remains, then make sure to keep the
-     starting point at least word aligned.  */
-  if (last_live - first_live > UNITS_PER_WORD)
-    *trim_head &= ~(UNITS_PER_WORD - 1);
+  /* If REF is aligned, try to maintain this alignment if it reduces
+     the number of (power-of-two sized aligned) writes to memory.
+     */
+  unsigned int align_bits;
+  unsigned HOST_WIDE_INT bitpos;
+  if ((*trim_head || *trim_tail)
+      && last_live - first_live >= 2
+      && ao_ref_alignment (ref, &align_bits, &bitpos)
+      && align_bits >= 32
+      && bitpos == 0
+      && align_bits % BITS_PER_UNIT == 0)
+    {
+      unsigned int align_units = align_bits / BITS_PER_UNIT;
+      if (align_units > 16)
+	align_units = 16;
+      while ((first_live | (align_units - 1)) > (unsigned int) last_live)
+	align_units >>= 1;
+
+      if (*trim_head)
+	{
+	  unsigned int pos = first_live & (align_units - 1);
+	  for (unsigned int i = 1; i <= align_units; i <<= 1)
+	    {
+	      unsigned int mask = ~(i - 1);
+	      unsigned int bytes = align_units - (pos & mask);
+	      if (wi::popcount (bytes) <= 1)
+		{
+		  *trim_head &= mask;
+		  break;
+		}
+	    }
+	}
+
+      if (*trim_tail)
+	{
+	  unsigned int pos = last_live & (align_units - 1);
+	  for (unsigned int i = 1; i <= align_units; i <<= 1)
+	    {
+	      int mask = i - 1;
+	      unsigned int bytes = (pos | mask) + 1;
+	      if ((last_live | mask) > (last_live + *trim_tail))
+		break;
+	      if (wi::popcount (bytes) <= 1)
+		{
+		  unsigned int extra = (last_live | mask) - last_live;
+		  *trim_tail -= extra;
+		  break;
+		}
+	    }
+	}
+    }
 
   if ((*trim_head || *trim_tail)
       && dump_file && (dump_flags & TDF_DETAILS))
@@ -590,7 +636,8 @@ maybe_trim_memstar_call (ao_ref *ref, sbitmap live, gimple *stmt)
      about those bytes, the presence or absence of '\0' bytes in there
      will affect whether it acts for the non-trimmed bytes as memset or
      memcpy/strncpy.
     */
-  c_strlen_data lendata = { };
+  c_strlen_data lendata;
+  memset (&lendata, 0, sizeof (lendata));
   int orig_head_trim = head_trim;
   tree srcstr = gimple_call_arg (stmt, 1);
   if (!get_range_strlen (srcstr, &lendata, /*eltsize=*/1)
diff --git a/gcc/testsuite/g++.dg/pr98335.C b/gcc/testsuite/g++.dg/pr98335.C
new file mode 100644
index 0000000..c54f4d9
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr98335.C
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+struct Data {
+  char a;
+  int b;
+};
+
+char c;
+
+Data val(int idx) {
+  return { c };  // { dg-warning "extended initializer" "c++ 98" { target { c++98_only } } }
+}
+
+/* { dg-final { scan-tree-dump-not " + 1B] = {}" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/pr86010-2.c b/gcc/testsuite/gcc.dg/pr86010-2.c
new file mode 100644
index 0000000..4c82e65
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr86010-2.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f (void*);
+
+void g (char *a)
+{
+  __builtin_memset (a, 0, 8);
+  __builtin_memset (a, 0, 8);
+
+  f (a);
+}
+
+void h (char *a)
+{
+  __builtin_memset (a, 0, 8);
+  __builtin_memset (a, 0, 7);
+
+  f (a);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_memset" 2 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/pr86010.c b/gcc/testsuite/gcc.dg/pr86010.c
new file mode 100644
index 0000000..ac27989
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr86010.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f (void*);
+
+void g (void)
+{
+  char a[8];
+  __builtin_memset (a, 0, 8);
+  __builtin_memset (a, 0, 8);
+
+  f (a);
+}
+
+void h (void)
+{
+  char a[8];
+  __builtin_memset (a, 0, 8);
+  __builtin_memset (a, 0, 7);
+
+  f (a);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_memset" 2 "optimized" } } */