Message ID | patch-15007-tamar@arm.com |
---|---|
State | Committed |
Headers |
Date: Tue, 2 Nov 2021 13:50:08 +0000
To: gcc-patches@gcc.gnu.org
Subject: [PATCH]middle-end Add an RPO pass after successful vectorization
Message-ID: <patch-15007-tamar@arm.com>
From: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
Reply-To: Tamar Christina <tamar.christina@arm.com>
Cc: nd@arm.com, rguenther@suse.de
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> |
Series |
middle-end Add an RPO pass after successful vectorization
|
|
Commit Message
Tamar Christina
Nov. 2, 2021, 1:50 p.m. UTC
Hi All,

Following my current SVE predicate optimization series a problem has presented itself in that the way vector masks are generated for masked operations relies on CSE to share masks efficiently.

The issue however is that masking is done using the & operand, and & is associative, so reassoc decides to reassociate the masked operations. This makes CSE then unable to CSE an unmasked and a masked operation, leading to duplicate operations being performed.

To counter this we want to add an RPO pass over the vectorized loop body when vectorization succeeds. This makes it no longer reliant on the RTL-level CSE.

I have not added a testcase for this as it requires the changes in my patch series; however, the entire series relies on this patch to work, so all the tests there cover it.

Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
	successful vectorization.

--- inline copy of patch --
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
--
Comments
On Tue, 2 Nov 2021, Tamar Christina wrote:

> Hi All,
>
> Following my current SVE predicate optimization series a problem has presented
> itself in that the way vector masks are generated for masked operations relies
> on CSE to share masks efficiently.
>
> The issue however is that masking is done using the & operand and & is
> associative and so reassoc decides to reassociate the masked operations.

But it does this for the purpose of canonicalization and thus CSE.

> This makes CSE then unable to CSE an unmasked and a masked operation leading to
> duplicate operations being performed.
>
> To counter this we want to add an RPO pass over the vectorized loop body when
> vectorization succeeds. This makes it then no longer reliant on the RTL level
> CSE.
>
> I have not added a testcase for this as it requires the changes in my patch
> series, however the entire series relies on this patch to work so all the
> tests there cover it.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> x86_64-linux-gnu and no issues.
>
> Ok for master?

You are running VN over _all_ loop bodies rather only those vectorized.
We loop over vectorized loops earlier for optimizing masked store
sequences.  I suppose you could hook in there.  I'll also notice that
we have pass_pre_slp_scalar_cleanup which eventually runs plus we have
a late FRE.  So I don't understand why it doesn't work to CSE later.

      for (i = 1; i < number_of_loops (cfun); i++)
	{
	  loop_vec_info loop_vinfo;
	  bool has_mask_store;

	  loop = get_loop (cfun, i);
	  if (!loop || !loop->aux)
	    continue;
	  loop_vinfo = (loop_vec_info) loop->aux;
	  has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
	  delete loop_vinfo;
	  if (has_mask_store
	      && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
	    optimize_mask_stores (loop);
	  loop->aux = NULL;
	}

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
> 	successful vectorization.
>
> --- inline copy of patch --
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "opt-problem.h"
>  #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>
>  /* Loop or bb location, with hotness information.  */
>  dump_user_location_t vect_location;
> @@ -1323,6 +1323,27 @@ vectorize_loops (void)
>       ??? Also while we try hard to update loop-closed SSA form we fail
>       to properly do this in some corner-cases (see PR56286).  */
>    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
> +
> +  for (i = 1; i < number_of_loops (cfun); i++)
> +    {
> +      loop = get_loop (cfun, i);
> +      if (!loop || !single_exit (loop))
> +	continue;
> +
> +      bitmap exit_bbs;
> +      /* Perform local CSE, this esp. helps because we emit code for
> +	 predicates that need to be shared for optimal predicate usage.
> +	 However reassoc will re-order them and prevent CSE from working
> +	 as it should.  CSE only the loop body, not the entry.  */
> +      exit_bbs = BITMAP_ALLOC (NULL);
> +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +      bitmap_set_bit (exit_bbs, loop->latch->index);
> +
> +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> +
> +      BITMAP_FREE (exit_bbs);
> +    }
> +
>    return TODO_cleanup_cfg;
>  }
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, November 2, 2021 2:24 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> vectorization
>
> On Tue, 2 Nov 2021, Tamar Christina wrote:
>
> > Hi All,
> >
> > Following my current SVE predicate optimization series a problem has
> > presented itself in that the way vector masks are generated for masked
> > operations relies on CSE to share masks efficiently.
> >
> > The issue however is that masking is done using the & operand and & is
> > associative and so reassoc decides to reassociate the masked operations.
>
> But it does this for the purpose of canonicalization and thus CSE.

Yes, but it turns something like

  (a & b) & mask into a & (b & mask).

When (a & b) is used somewhere else you now lose the CSE.  So it's
actually hurting in this case.

> > This makes CSE then unable to CSE an unmasked and a masked operation
> > leading to duplicate operations being performed.
> >
> > To counter this we want to add an RPO pass over the vectorized loop
> > body when vectorization succeeds. This makes it then no longer
> > reliant on the RTL level CSE.
> >
> > I have not added a testcase for this as it requires the changes in my
> > patch series, however the entire series relies on this patch to work
> > so all the tests there cover it.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
> > no issues.
> >
> > Ok for master?
>
> You are running VN over _all_ loop bodies rather only those vectorized.
> We loop over vectorized loops earlier for optimizing masked store sequences.
> I suppose you could hook in there. I'll also notice that we have
> pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
> So I don't understand why it doesn't work to CSE later.
>

Atm, say you have the conditions a > b, and a > b & a > c

We generate

  mask1 = (a > b) & loop_mask
  mask2 = (a > b & a > c) & loop_mask

with the intention that mask1 can be re-used in mask2.

Reassoc changes this to mask2 = a > b & (a > c & loop_mask)

Which has now unmasked (a > b) in mask2, which leaves us unable to combine
the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.

>       for (i = 1; i < number_of_loops (cfun); i++)
> 	{
> 	  loop_vec_info loop_vinfo;
> 	  bool has_mask_store;
>
> 	  loop = get_loop (cfun, i);
> 	  if (!loop || !loop->aux)
> 	    continue;
> 	  loop_vinfo = (loop_vec_info) loop->aux;
> 	  has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> 	  delete loop_vinfo;
> 	  if (has_mask_store
> 	      && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> 	    optimize_mask_stores (loop);
> 	  loop->aux = NULL;
> 	}
>

Ah thanks, I'll make the changes.

Thanks,
Tamar

> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
> > 	successful vectorization.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "gimple-pretty-print.h"
> >  #include "opt-problem.h"
> >  #include "internal-fn.h"
> > -
> > +#include "tree-ssa-sccvn.h"
> >
> >  /* Loop or bb location, with hotness information.  */
> >  dump_user_location_t vect_location;
> > @@ -1323,6 +1323,27 @@ vectorize_loops (void)
> >       ??? Also while we try hard to update loop-closed SSA form we fail
> >       to properly do this in some corner-cases (see PR56286).  */
> >    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
> > +
> > +  for (i = 1; i < number_of_loops (cfun); i++)
> > +    {
> > +      loop = get_loop (cfun, i);
> > +      if (!loop || !single_exit (loop))
> > +	continue;
> > +
> > +      bitmap exit_bbs;
> > +      /* Perform local CSE, this esp. helps because we emit code for
> > +	 predicates that need to be shared for optimal predicate usage.
> > +	 However reassoc will re-order them and prevent CSE from working
> > +	 as it should.  CSE only the loop body, not the entry.  */
> > +      exit_bbs = BITMAP_ALLOC (NULL);
> > +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > +      bitmap_set_bit (exit_bbs, loop->latch->index);
> > +
> > +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > +
> > +      BITMAP_FREE (exit_bbs);
> > +    }
> > +
> >    return TODO_cleanup_cfg;
> >  }
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
On Tue, 2 Nov 2021, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, November 2, 2021 2:24 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > vectorization
> >
> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > Following my current SVE predicate optimization series a problem has
> > > presented itself in that the way vector masks are generated for masked
> > > operations relies on CSE to share masks efficiently.
> > >
> > > The issue however is that masking is done using the & operand and & is
> > > associative and so reassoc decides to reassociate the masked operations.
> >
> > But it does this for the purpose of canonicalization and thus CSE.
>
> Yes, but it turns something like
>
> (a & b) & mask into a & (b & mask).
>
> When (a & b) is used somewhere else you now lose the CSE.  So it's
> actually hurting in this case.

OK, so that's a known "issue" with reassoc, it doesn't consider global
CSE opportunities and I guess it pushes 'mask' to leaf if it is loop
carried.

> > > This makes CSE then unable to CSE an unmasked and a masked operation
> > > leading to duplicate operations being performed.
> > >
> > > To counter this we want to add an RPO pass over the vectorized loop
> > > body when vectorization succeeds. This makes it then no longer
> > > reliant on the RTL level CSE.
> > >
> > > I have not added a testcase for this as it requires the changes in my
> > > patch series, however the entire series relies on this patch to work
> > > so all the tests there cover it.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
> > > no issues.
> > >
> > > Ok for master?
> >
> > You are running VN over _all_ loop bodies rather only those vectorized.
> > We loop over vectorized loops earlier for optimizing masked store sequences.
> > I suppose you could hook in there. I'll also notice that we have
> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
> > So I don't understand why it doesn't work to CSE later.
>
> Atm, say you have the conditions a > b, and a > b & a > c
>
> We generate
>
> mask1 = (a > b) & loop_mask
> mask2 = (a > b & a > c) & loop_mask
>
> with the intention that mask1 can be re-used in mask2.
>
> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
>
> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
> the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.
>
> >       for (i = 1; i < number_of_loops (cfun); i++)
> > 	{
> > 	  loop_vec_info loop_vinfo;
> > 	  bool has_mask_store;
> >
> > 	  loop = get_loop (cfun, i);
> > 	  if (!loop || !loop->aux)
> > 	    continue;
> > 	  loop_vinfo = (loop_vec_info) loop->aux;
> > 	  has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > 	  delete loop_vinfo;
> > 	  if (has_mask_store
> > 	      && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> > 	    optimize_mask_stores (loop);
> > 	  loop->aux = NULL;
> > 	}
>
> Ah thanks, I'll make the changes.

Note I think that full-blown CSE is a bit overkill just to counter
a deficient reassoc (or VN).  At least it is supposed to be "cheap"
and can be conditionalized on loop masks being used as well.

> Thanks,
> Tamar
>
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
> > > 	successful vectorization.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
> > > --- a/gcc/tree-vectorizer.c
> > > +++ b/gcc/tree-vectorizer.c
> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
> > >  #include "gimple-pretty-print.h"
> > >  #include "opt-problem.h"
> > >  #include "internal-fn.h"
> > > -
> > > +#include "tree-ssa-sccvn.h"
> > >
> > >  /* Loop or bb location, with hotness information.  */
> > >  dump_user_location_t vect_location;
> > > @@ -1323,6 +1323,27 @@ vectorize_loops (void)
> > >       ??? Also while we try hard to update loop-closed SSA form we fail
> > >       to properly do this in some corner-cases (see PR56286).  */
> > >    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
> > > +
> > > +  for (i = 1; i < number_of_loops (cfun); i++)
> > > +    {
> > > +      loop = get_loop (cfun, i);
> > > +      if (!loop || !single_exit (loop))
> > > +	continue;
> > > +
> > > +      bitmap exit_bbs;
> > > +      /* Perform local CSE, this esp. helps because we emit code for
> > > +	 predicates that need to be shared for optimal predicate usage.
> > > +	 However reassoc will re-order them and prevent CSE from working
> > > +	 as it should.  CSE only the loop body, not the entry.  */
> > > +      exit_bbs = BITMAP_ALLOC (NULL);
> > > +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > +      bitmap_set_bit (exit_bbs, loop->latch->index);
> > > +
> > > +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > > +
> > > +      BITMAP_FREE (exit_bbs);
> > > +    }
> > > +
> > >    return TODO_cleanup_cfg;
> > >  }
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
>
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Tue, 2 Nov 2021, Tamar Christina wrote:
>
>> > -----Original Message-----
>> > From: Richard Biener <rguenther@suse.de>
>> > Sent: Tuesday, November 2, 2021 2:24 PM
>> > To: Tamar Christina <Tamar.Christina@arm.com>
>> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
>> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
>> > vectorization
>> >
>> > On Tue, 2 Nov 2021, Tamar Christina wrote:
>> >
>> > > Hi All,
>> > >
>> > > Following my current SVE predicate optimization series a problem has
>> > > presented itself in that the way vector masks are generated for masked
>> > > operations relies on CSE to share masks efficiently.
>> > >
>> > > The issue however is that masking is done using the & operand and & is
>> > > associative and so reassoc decides to reassociate the masked operations.
>> >
>> > But it does this for the purpose of canonicalization and thus CSE.
>>
>> Yes, but it turns something like
>>
>> (a & b) & mask into a & (b & mask).
>>
>> When (a & b) is used somewhere else you now lose the CSE.  So it's
>> actually hurting in this case.
>
> OK, so that's a known "issue" with reassoc, it doesn't consider global
> CSE opportunities and I guess it pushes 'mask' to leaf if it is loop
> carried.
>
>> >
>> > > This makes CSE then unable to CSE an unmasked and a masked operation
>> > > leading to duplicate operations being performed.
>> > >
>> > > To counter this we want to add an RPO pass over the vectorized loop
>> > > body when vectorization succeeds. This makes it then no longer
>> > > reliant on the RTL level CSE.
>> > >
>> > > I have not added a testcase for this as it requires the changes in my
>> > > patch series, however the entire series relies on this patch to work
>> > > so all the tests there cover it.
>> > >
>> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
>> > > no issues.
>> > >
>> > > Ok for master?
>> >
>> > You are running VN over _all_ loop bodies rather only those vectorized.
>> > We loop over vectorized loops earlier for optimizing masked store sequences.
>> > I suppose you could hook in there. I'll also notice that we have
>> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
>> > So I don't understand why it doesn't work to CSE later.
>> >
>>
>> Atm, say you have the conditions a > b, and a > b & a > c
>>
>> We generate
>>
>> mask1 = (a > b) & loop_mask
>> mask2 = (a > b & a > c) & loop_mask
>>
>> with the intention that mask1 can be re-used in mask2.
>>
>> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
>>
>> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
>> the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.
>>
>> >       for (i = 1; i < number_of_loops (cfun); i++)
>> > 	{
>> > 	  loop_vec_info loop_vinfo;
>> > 	  bool has_mask_store;
>> >
>> > 	  loop = get_loop (cfun, i);
>> > 	  if (!loop || !loop->aux)
>> > 	    continue;
>> > 	  loop_vinfo = (loop_vec_info) loop->aux;
>> > 	  has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
>> > 	  delete loop_vinfo;
>> > 	  if (has_mask_store
>> > 	      && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>> > 	    optimize_mask_stores (loop);
>> > 	  loop->aux = NULL;
>> > 	}
>> >
>>
>> Ah thanks, I'll make the changes.
>
> Note I think that full-blown CSE is a bit overkill just to counter
> a deficient reassoc (or VN).  At least it is supposed to be "cheap"
> and can be conditionalized on loop masks being used as well.

Not sure we should make this conditional on loop masks being used.
It seems either that:

(a) the vectoriser is supposed to avoid creating code that has folding
    or VN opportunities, in which case we need to generate the
    vectorised code in a smarter way

or

(b) the vectoriser is allowed to create code that has folding or VN
    opportunities, in which case it would be good to have a defined
    place to get rid of them.

I'm just worried that if we make it conditional on loop masks, we could
see cases in which non-loop-mask stuff is optimised differently based on
whether the loop has masks or not.  E.g. we might get worse code with an
unpredicated main loop and a predicated epilogue compared to a
predicated main loop.

Thanks,
Richard

>
>> Thanks,
>> Tamar
>>
>> >
>> > > Thanks,
>> > > Tamar
>> > >
>> > > gcc/ChangeLog:
>> > >
>> > > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
>> > > 	successful vectorization.
>> > >
>> > > --- inline copy of patch --
>> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> > > index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
>> > > --- a/gcc/tree-vectorizer.c
>> > > +++ b/gcc/tree-vectorizer.c
>> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>> > >  #include "gimple-pretty-print.h"
>> > >  #include "opt-problem.h"
>> > >  #include "internal-fn.h"
>> > > -
>> > > +#include "tree-ssa-sccvn.h"
>> > >
>> > >  /* Loop or bb location, with hotness information.  */
>> > >  dump_user_location_t vect_location;
>> > > @@ -1323,6 +1323,27 @@ vectorize_loops (void)
>> > >       ??? Also while we try hard to update loop-closed SSA form we fail
>> > >       to properly do this in some corner-cases (see PR56286).  */
>> > >    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
>> > > +
>> > > +  for (i = 1; i < number_of_loops (cfun); i++)
>> > > +    {
>> > > +      loop = get_loop (cfun, i);
>> > > +      if (!loop || !single_exit (loop))
>> > > +	continue;
>> > > +
>> > > +      bitmap exit_bbs;
>> > > +      /* Perform local CSE, this esp. helps because we emit code for
>> > > +	 predicates that need to be shared for optimal predicate usage.
>> > > +	 However reassoc will re-order them and prevent CSE from working
>> > > +	 as it should.  CSE only the loop body, not the entry.  */
>> > > +      exit_bbs = BITMAP_ALLOC (NULL);
>> > > +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
>> > > +      bitmap_set_bit (exit_bbs, loop->latch->index);
>> > > +
>> > > +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
>> > > +
>> > > +      BITMAP_FREE (exit_bbs);
>> > > +    }
>> > > +
>> > >    return TODO_cleanup_cfg;
>> > >  }
>> >
>> > --
>> > Richard Biener <rguenther@suse.de>
>> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
>>
On Tue, 2 Nov 2021, Richard Sandiford wrote: > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Tue, 2 Nov 2021, Tamar Christina wrote: > > > >> > -----Original Message----- > >> > From: Richard Biener <rguenther@suse.de> > >> > Sent: Tuesday, November 2, 2021 2:24 PM > >> > To: Tamar Christina <Tamar.Christina@arm.com> > >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > >> > vectorization > >> > > >> > On Tue, 2 Nov 2021, Tamar Christina wrote: > >> > > >> > > Hi All, > >> > > > >> > > Following my current SVE predicate optimization series a problem has > >> > > presented itself in that the way vector masks are generated for masked > >> > > operations relies on CSE to share masks efficiently. > >> > > > >> > > The issue however is that masking is done using the & operand and & is > >> > > associative and so reassoc decides to reassociate the masked operations. > >> > > >> > But it does this for the purpose of canonicalization and thus CSE. > >> > >> Yes, but it turns something like > >> > >> (a & b) & mask into a & (b & mask). > >> > >> When (a & b) is used somewhere else you now lose the CSE. So it's actually hurting > >> In this case. > > > > OK, so that's a known "issue" with reassoc, it doesn't consider global > > CSE opportunities and I guess it pushes 'mask' to leaf if it is loop > > carried. > > > >> > > >> > > This makes CSE then unable to CSE an unmasked and a masked operation > >> > > leading to duplicate operations being performed. > >> > > > >> > > To counter this we want to add an RPO pass over the vectorized loop > >> > > body when vectorization succeeds. This makes it then no longer > >> > > reliant on the RTL level CSE. > >> > > > >> > > I have not added a testcase for this as it requires the changes in my > >> > > patch series, however the entire series relies on this patch to work > >> > > so all the tests there cover it. 
> >> > > > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and > >> > > no issues. > >> > > > >> > > Ok for master? > >> > > >> > You are running VN over _all_ loop bodies rather only those vectorized. > >> > We loop over vectorized loops earlier for optimizing masked store sequences. > >> > I suppose you could hook in there. I'll also notice that we have > >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE. > >> > So I don't understand why it doesn't work to CSE later. > >> > > >> > >> Atm, say you have the conditions a > b, and a > b & a > c > >> > >> We generate > >> > >> mask1 = (a > b) & loop_mask > >> mask2 = (a > b & a > c) & loop_mask > >> > >> with the intention that mask1 can be re-used in mask2. > >> > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask) > >> > >> Which has now unmasked (a > b) in mask2, which leaves us unable to combine > >> the mask1 and mask2. It doesn't generate incorrect code, just inefficient. > >> > >> > for (i = 1; i < number_of_loops (cfun); i++) > >> > { > >> > loop_vec_info loop_vinfo; > >> > bool has_mask_store; > >> > > >> > loop = get_loop (cfun, i); > >> > if (!loop || !loop->aux) > >> > continue; > >> > loop_vinfo = (loop_vec_info) loop->aux; > >> > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo); > >> > delete loop_vinfo; > >> > if (has_mask_store > >> > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) > >> > optimize_mask_stores (loop); > >> > loop->aux = NULL; > >> > } > >> > > >> > >> Ah thanks, I'll make the changes. > > > > Note I think that full-blown CSE is a bit overkill just to counter > > a deficient reassoc (or VN). At least it is supposed to be "cheap" > > and can be conditionalized on loop masks being used as well. > > Not sure we should make this conditional on loop masks being used. 
> It seems either that: > > (a) the vectoriser is supposed to avoid creating code that has folding > or VN opportunities, in which case we need to generate the vectorised > code in a smarter way or > > (b) the vectoriser is allowed to create code that has folding or VN > opportunities, in which case it would be good to have a defined > place to get rid of them. It's certainly (b), and the definitive place to get rid of those is the post-loop optimizer FRE pass. That just happens to be after a reassoc pass which makes FRE run into the pre-existing issue that we fail to capture all (or the best) possible CSE opportunity from separate associatable chains. > I'm just worried that if we make it conditional on loop masks, > we could see cases that in which non-loop-mask stuff is optimised > differently based on whether the loop has masks or not. E.g. > we might get worse code with an unpredicated main loop and > a predicated epilogue compared to a predicated main loop. Sure. Note for loop vectorization we can indeed reasonably easy CSE the main body and RPO VN should be O(region size) and cheap enough for this case (we could even add an extra cheap entry for single basic-blocks). For BB vectorization we have to rely on the full function FRE pass though. Richard. > Thanks, > Richard > > > >> Thanks, > >> Tamar > >> > >> > > >> > > Thanks, > >> > > Tamar > >> > > > >> > > gcc/ChangeLog: > >> > > > >> > > * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN > >> > upon > >> > > successful vectorization. > >> > > > >> > > --- inline copy of patch -- > >> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index > >> > > > >> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c > >> > 660 > >> > > af4b6d32dc51 100644 > >> > > --- a/gcc/tree-vectorizer.c > >> > > +++ b/gcc/tree-vectorizer.c > >> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. 
If not see > >> > > #include "gimple-pretty-print.h" > >> > > #include "opt-problem.h" > >> > > #include "internal-fn.h" > >> > > - > >> > > +#include "tree-ssa-sccvn.h" > >> > > > >> > > /* Loop or bb location, with hotness information. */ > >> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@ > >> > > vectorize_loops (void) > >> > > ??? Also while we try hard to update loop-closed SSA form we fail > >> > > to properly do this in some corner-cases (see PR56286). */ > >> > > rewrite_into_loop_closed_ssa (NULL, > >> > > TODO_update_ssa_only_virtuals); > >> > > + > >> > > + for (i = 1; i < number_of_loops (cfun); i++) > >> > > + { > >> > > + loop = get_loop (cfun, i); > >> > > + if (!loop || !single_exit (loop)) > >> > > + continue; > >> > > + > >> > > + bitmap exit_bbs; > >> > > + /* Perform local CSE, this esp. helps because we emit code for > >> > > + predicates that need to be shared for optimal predicate usage. > >> > > + However reassoc will re-order them and prevent CSE from working > >> > > + as it should. CSE only the loop body, not the entry. */ > >> > > + exit_bbs = BITMAP_ALLOC (NULL); > >> > > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); > >> > > + bitmap_set_bit (exit_bbs, loop->latch->index); > >> > > + > >> > > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); > >> > > + > >> > > + BITMAP_FREE (exit_bbs); > >> > > + } > >> > > + > >> > > return TODO_cleanup_cfg; > >> > > } > >> > > > >> > > > >> > > > >> > > > >> > > >> > -- > >> > Richard Biener <rguenther@suse.de> > >> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > >> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg) > >> >
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Tuesday, November 2, 2021 6:22 PM > To: Richard Sandiford <Richard.Sandiford@arm.com> > Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar > Christina <Tamar.Christina@arm.com>; nd <nd@arm.com> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > vectorization > > On Tue, 2 Nov 2021, Richard Sandiford wrote: > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > > On Tue, 2 Nov 2021, Tamar Christina wrote: > > > > > >> > -----Original Message----- > > >> > From: Richard Biener <rguenther@suse.de> > > >> > Sent: Tuesday, November 2, 2021 2:24 PM > > >> > To: Tamar Christina <Tamar.Christina@arm.com> > > >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > > >> > vectorization > > >> > > > >> > On Tue, 2 Nov 2021, Tamar Christina wrote: > > >> > > > >> > > Hi All, > > >> > > > > >> > > Following my current SVE predicate optimization series a > > >> > > problem has presented itself in that the way vector masks are > > >> > > generated for masked operations relies on CSE to share masks > efficiently. > > >> > > > > >> > > The issue however is that masking is done using the & operand > > >> > > and & is associative and so reassoc decides to reassociate the > masked operations. > > >> > > > >> > But it does this for the purpose of canonicalization and thus CSE. > > >> > > >> Yes, but it turns something like > > >> > > >> (a & b) & mask into a & (b & mask). > > >> > > >> When (a & b) is used somewhere else you now lose the CSE. So it's > > >> actually hurting In this case. > > > > > > OK, so that's a known "issue" with reassoc, it doesn't consider > > > global CSE opportunities and I guess it pushes 'mask' to leaf if it > > > is loop carried. 
> > > > > >> > > > >> > > This makes CSE then unable to CSE an unmasked and a masked > > >> > > operation leading to duplicate operations being performed. > > >> > > > > >> > > To counter this we want to add an RPO pass over the vectorized > > >> > > loop body when vectorization succeeds. This makes it then no > > >> > > longer reliant on the RTL level CSE. > > >> > > > > >> > > I have not added a testcase for this as it requires the changes > > >> > > in my patch series, however the entire series relies on this > > >> > > patch to work so all the tests there cover it. > > >> > > > > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > >> > > x86_64-linux-gnu and no issues. > > >> > > > > >> > > Ok for master? > > >> > > > >> > You are running VN over _all_ loop bodies rather only those > vectorized. > > >> > We loop over vectorized loops earlier for optimizing masked store > sequences. > > >> > I suppose you could hook in there. I'll also notice that we have > > >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a > late FRE. > > >> > So I don't understand why it doesn't work to CSE later. > > >> > > > >> > > >> Atm, say you have the conditions a > b, and a > b & a > c > > >> > > >> We generate > > >> > > >> mask1 = (a > b) & loop_mask > > >> mask2 = (a > b & a > c) & loop_mask > > >> > > >> with the intention that mask1 can be re-used in mask2. > > >> > > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask) > > >> > > >> Which has now unmasked (a > b) in mask2, which leaves us unable to > > >> combine the mask1 and mask2. It doesn't generate incorrect code, just > inefficient. 
> > >> > > >> > for (i = 1; i < number_of_loops (cfun); i++) > > >> > { > > >> > loop_vec_info loop_vinfo; > > >> > bool has_mask_store; > > >> > > > >> > loop = get_loop (cfun, i); > > >> > if (!loop || !loop->aux) > > >> > continue; > > >> > loop_vinfo = (loop_vec_info) loop->aux; > > >> > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo); > > >> > delete loop_vinfo; > > >> > if (has_mask_store > > >> > && targetm.vectorize.empty_mask_is_expensive > (IFN_MASK_STORE)) > > >> > optimize_mask_stores (loop); > > >> > loop->aux = NULL; > > >> > } > > >> > > > >> > > >> Ah thanks, I'll make the changes. > > > > > > Note I think that full-blown CSE is a bit overkill just to counter a > > > deficient reassoc (or VN). At least it is supposed to be "cheap" > > > and can be conditionalized on loop masks being used as well. > > > > Not sure we should make this conditional on loop masks being used. > > It seems either that: > > > > (a) the vectoriser is supposed to avoid creating code that has folding > > or VN opportunities, in which case we need to generate the vectorised > > code in a smarter way or > > > > (b) the vectoriser is allowed to create code that has folding or VN > > opportunities, in which case it would be good to have a defined > > place to get rid of them. > > It's certainly (b), and the definitive place to get rid of those is the post-loop > optimizer FRE pass. That just happens to be after a reassoc pass which > makes FRE run into the pre-existing issue that we fail to capture all (or the > best) possible CSE opportunity from separate associatable chains. > > > I'm just worried that if we make it conditional on loop masks, we > > could see cases that in which non-loop-mask stuff is optimised > > differently based on whether the loop has masks or not. E.g. > > we might get worse code with an unpredicated main loop and a > > predicated epilogue compared to a predicated main loop. > > Sure. 
Note for loop vectorization we can indeed reasonably easy CSE the
> main body and RPO VN should be O(region size) and cheap enough for this
> case (we could even add an extra cheap entry for single basic-blocks). For BB
> vectorization we have to rely on the full function FRE pass though.
>

I've moved the code so it works only on the vector loops. I don't think there needs
to be any code change to handle a single BB efficiently, no? The walk stops at the loop
exit block, so it's already optimal in that case, unless I misunderstood?

Bootstrapped Regtested on aarch64-none-linux-gnu,
x86_64-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
successful vectorization.

--- inline copy of patch ---

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see
#include "gimple-pretty-print.h"
#include "opt-problem.h"
#include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"

/* Loop or bb location, with hotness information. */
dump_user_location_t vect_location;
@@ -1298,6 +1298,20 @@ vectorize_loops (void)
if (has_mask_store
&& targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
optimize_mask_stores (loop);
+
+ bitmap exit_bbs;
+ /* Perform local CSE, this esp. helps because we emit code for
+ predicates that need to be shared for optimal predicate usage.
+ However reassoc will re-order them and prevent CSE from working
+ as it should. CSE only the loop body, not the entry. */
+ exit_bbs = BITMAP_ALLOC (NULL);
+ bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+ bitmap_set_bit (exit_bbs, loop->latch->index);
+
+ do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
+
+ BITMAP_FREE (exit_bbs);
+
loop->aux = NULL;
}
On Fri, 5 Nov 2021, Tamar Christina wrote: > > > > -----Original Message----- > > From: Richard Biener <rguenther@suse.de> > > Sent: Tuesday, November 2, 2021 6:22 PM > > To: Richard Sandiford <Richard.Sandiford@arm.com> > > Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar > > Christina <Tamar.Christina@arm.com>; nd <nd@arm.com> > > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > > vectorization > > > > On Tue, 2 Nov 2021, Richard Sandiford wrote: > > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > > > On Tue, 2 Nov 2021, Tamar Christina wrote: > > > > > > > >> > -----Original Message----- > > > >> > From: Richard Biener <rguenther@suse.de> > > > >> > Sent: Tuesday, November 2, 2021 2:24 PM > > > >> > To: Tamar Christina <Tamar.Christina@arm.com> > > > >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > > > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > > > >> > vectorization > > > >> > > > > >> > On Tue, 2 Nov 2021, Tamar Christina wrote: > > > >> > > > > >> > > Hi All, > > > >> > > > > > >> > > Following my current SVE predicate optimization series a > > > >> > > problem has presented itself in that the way vector masks are > > > >> > > generated for masked operations relies on CSE to share masks > > efficiently. > > > >> > > > > > >> > > The issue however is that masking is done using the & operand > > > >> > > and & is associative and so reassoc decides to reassociate the > > masked operations. > > > >> > > > > >> > But it does this for the purpose of canonicalization and thus CSE. > > > >> > > > >> Yes, but it turns something like > > > >> > > > >> (a & b) & mask into a & (b & mask). > > > >> > > > >> When (a & b) is used somewhere else you now lose the CSE. So it's > > > >> actually hurting In this case. 
> > > > > > > > OK, so that's a known "issue" with reassoc, it doesn't consider > > > > global CSE opportunities and I guess it pushes 'mask' to leaf if it > > > > is loop carried. > > > > > > > >> > > > > >> > > This makes CSE then unable to CSE an unmasked and a masked > > > >> > > operation leading to duplicate operations being performed. > > > >> > > > > > >> > > To counter this we want to add an RPO pass over the vectorized > > > >> > > loop body when vectorization succeeds. This makes it then no > > > >> > > longer reliant on the RTL level CSE. > > > >> > > > > > >> > > I have not added a testcase for this as it requires the changes > > > >> > > in my patch series, however the entire series relies on this > > > >> > > patch to work so all the tests there cover it. > > > >> > > > > > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > >> > > x86_64-linux-gnu and no issues. > > > >> > > > > > >> > > Ok for master? > > > >> > > > > >> > You are running VN over _all_ loop bodies rather only those > > vectorized. > > > >> > We loop over vectorized loops earlier for optimizing masked store > > sequences. > > > >> > I suppose you could hook in there. I'll also notice that we have > > > >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a > > late FRE. > > > >> > So I don't understand why it doesn't work to CSE later. > > > >> > > > > >> > > > >> Atm, say you have the conditions a > b, and a > b & a > c > > > >> > > > >> We generate > > > >> > > > >> mask1 = (a > b) & loop_mask > > > >> mask2 = (a > b & a > c) & loop_mask > > > >> > > > >> with the intention that mask1 can be re-used in mask2. > > > >> > > > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask) > > > >> > > > >> Which has now unmasked (a > b) in mask2, which leaves us unable to > > > >> combine the mask1 and mask2. It doesn't generate incorrect code, just > > inefficient. 
> > > >> > > > >> > for (i = 1; i < number_of_loops (cfun); i++) > > > >> > { > > > >> > loop_vec_info loop_vinfo; > > > >> > bool has_mask_store; > > > >> > > > > >> > loop = get_loop (cfun, i); > > > >> > if (!loop || !loop->aux) > > > >> > continue; > > > >> > loop_vinfo = (loop_vec_info) loop->aux; > > > >> > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo); > > > >> > delete loop_vinfo; > > > >> > if (has_mask_store > > > >> > && targetm.vectorize.empty_mask_is_expensive > > (IFN_MASK_STORE)) > > > >> > optimize_mask_stores (loop); > > > >> > loop->aux = NULL; > > > >> > } > > > >> > > > > >> > > > >> Ah thanks, I'll make the changes. > > > > > > > > Note I think that full-blown CSE is a bit overkill just to counter a > > > > deficient reassoc (or VN). At least it is supposed to be "cheap" > > > > and can be conditionalized on loop masks being used as well. > > > > > > Not sure we should make this conditional on loop masks being used. > > > It seems either that: > > > > > > (a) the vectoriser is supposed to avoid creating code that has folding > > > or VN opportunities, in which case we need to generate the vectorised > > > code in a smarter way or > > > > > > (b) the vectoriser is allowed to create code that has folding or VN > > > opportunities, in which case it would be good to have a defined > > > place to get rid of them. > > > > It's certainly (b), and the definitive place to get rid of those is the post-loop > > optimizer FRE pass. That just happens to be after a reassoc pass which > > makes FRE run into the pre-existing issue that we fail to capture all (or the > > best) possible CSE opportunity from separate associatable chains. > > > > > I'm just worried that if we make it conditional on loop masks, we > > > could see cases that in which non-loop-mask stuff is optimised > > > differently based on whether the loop has masks or not. E.g. 
> > > we might get worse code with an unpredicated main loop and a > > > predicated epilogue compared to a predicated main loop. > > > > Sure. Note for loop vectorization we can indeed reasonably easy CSE the > > main body and RPO VN should be O(region size) and cheap enough for this > > case (we could even add an extra cheap entry for single basic-blocks). For BB > > vectorization we have to rely on the full function FRE pass though. > > > > I've moved the code so it works only on the vector loops. I don't think there needs > to be any code change to handle single BB efficiently no? the walk stop at the loop > exit block so it's already optimal in the case, unless I misunderstood? Yeah, it's at most some constant time work we could avoid. > Bootstrapped Regtested on aarch64-none-linux-gnu, > x86_64-linux-gnu and no issues. > > Ok for master? See below... > Thanks, > Tamar > > gcc/ChangeLog: > > * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon > successful vectorization. > > --- inline copy of patch --- > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c > index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644 > --- a/gcc/tree-vectorizer.c > +++ b/gcc/tree-vectorizer.c > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see > #include "gimple-pretty-print.h" > #include "opt-problem.h" > #include "internal-fn.h" > - > +#include "tree-ssa-sccvn.h" > > /* Loop or bb location, with hotness information. */ > dump_user_location_t vect_location; > @@ -1298,6 +1298,20 @@ vectorize_loops (void) > if (has_mask_store > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) > optimize_mask_stores (loop); > + > + bitmap exit_bbs; > + /* Perform local CSE, this esp. helps because we emit code for > + predicates that need to be shared for optimal predicate usage. > + However reassoc will re-order them and prevent CSE from working > + as it should. CSE only the loop body, not the entry. 
*/ > + exit_bbs = BITMAP_ALLOC (NULL); use auto_bitmap exit_bbs; and the allocation and ... > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); > + bitmap_set_bit (exit_bbs, loop->latch->index); treating the latch as exit is probably premature optimization (yes, it's empty). > + > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); > + > + BITMAP_FREE (exit_bbs); ... deallocation can go. Note I wonder whether, if we are already spinning up VN, we should include the preheader in the operation? We regularly end up emitting redundant vector initializers that could be cleaned up earlier this way. Otherwise the change looks OK. Thanks, Richard.
> > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > + bitmap_set_bit (exit_bbs, loop->latch->index);
>
> treating the latch as exit is probably premature optimization (yes, it's empty).
>
> > +
> > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > +
> > + BITMAP_FREE (exit_bbs);
>
> ... deallocation can go. Note I wonder whether, if we are already spinning up
> VN, we should include the preheader in the operation?
> We regularly end up emitting redundant vector initializers that could be
> cleaned up earlier this way.

I've changed it to include the preheader, but it looks like this breaks bootstrap on both
x86 and AArch64.

On x86 the following testcase

double matmul_c8_vanilla_bbase_0;
double *matmul_c8_vanilla_dest;
matmul_c8_vanilla_x;
matmul_c8_vanilla() {
for (; matmul_c8_vanilla_x; matmul_c8_vanilla_x++)
matmul_c8_vanilla_dest[matmul_c8_vanilla_x] += matmul_c8_vanilla_bbase_0;
}

ICEs with -std=gnu11 -ffast-math -ftree-vectorize -O2 with:

internal compiler error: tree check: expected ssa_name, have var_decl in SSA_VAL, at tree-ssa-sccvn.c:535
0x80731c tree_check_failed(tree_node const*, char const*, int, char const*, ...)
../gcc-dsg/gcc/tree.c:8689 0x7ebda2 tree_check(tree_node*, char const*, int, char const*, tree_code) ../gcc-dsg/gcc/tree.h:3433 0x7ebda2 SSA_VAL(tree_node*, bool*) ../gcc-dsg/gcc/tree-ssa-sccvn.c:535 0x7ebda2 vuse_ssa_val ../gcc-dsg/gcc/tree-ssa-sccvn.c:553 0x7ebda2 vn_reference_lookup(tree_node*, tree_node*, vn_lookup_kind, vn_reference_s**, bool, tree_node**, tree_node*) ../gcc-dsg/gcc/tree-ssa-sccvn.c:3664 0x10d8ca5 visit_reference_op_load ../gcc-dsg/gcc/tree-ssa-sccvn.c:5166 0x10d8ca5 visit_stmt ../gcc-dsg/gcc/tree-ssa-sccvn.c:5615 0x10d976c process_bb ../gcc-dsg/gcc/tree-ssa-sccvn.c:7344 0x10dafe5 do_rpo_vn ../gcc-dsg/gcc/tree-ssa-sccvn.c:7942 0x10dc828 do_rpo_vn(function*, edge_def*, bitmap_head*) ../gcc-dsg/gcc/tree-ssa-sccvn.c:8039 0x119c39c vectorize_loops() ../gcc-dsg/gcc/tree-vectorizer.c:1304 on AArch64 this one ICEs with -ffast-math -ftree-vectorize -O2 _Complex *a; _Complex b; c, d; fn1() { _Complex e; for (; c; ++c) e = d * a[c]; b = e; } With the message internal compiler error: tree check: expected ssa_name, have var_decl in VN_INFO, at tree-ssa-sccvn.c:451 0x734073 tree_check_failed(tree_node const*, char const*, int, char const*, ...) ../../gcc-fsf/gcc/tree.c:8691 0x10e2e2f tree_check(tree_node*, char const*, int, char const*, tree_code) ../../gcc-fsf/gcc/tree.h:3433 0x10e2e2f VN_INFO(tree_node*) ../../gcc-fsf/gcc/tree-ssa-sccvn.c:451 0x10ed223 process_bb ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7331 0x10eea43 do_rpo_vn ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7944 0x10efe2b do_rpo_vn(function*, edge_def*, bitmap_head*) ../../gcc-fsf/gcc/tree-ssa-sccvn.c:8039 0x11c436b vectorize_loops() ../../gcc-fsf/gcc/tree-vectorizer.c:1304 Any ideas? Thanks, Tamar > > Otherwise the change looks OK. 
> --- inline copy of patch --- diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index edb7538a67f00cd80a608ee82510cf437fe88083..029d59016c9652f87d80fc5500f89532c79a66d0 100644 --- a/gcc/tree-vectorizer.c +++ b/gcc/tree-vectorizer.c @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-pretty-print.h" #include "opt-problem.h" #include "internal-fn.h" - +#include "tree-ssa-sccvn.h" /* Loop or bb location, with hotness information. */ dump_user_location_t vect_location; @@ -1298,6 +1298,17 @@ vectorize_loops (void) if (has_mask_store && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) optimize_mask_stores (loop); + + auto_bitmap exit_bbs; + /* Perform local CSE, this esp. helps because we emit code for + predicates that need to be shared for optimal predicate usage. + However reassoc will re-order them and prevent CSE from working + as it should. CSE only the loop body, not the entry. */ + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); + + edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0); + do_rpo_vn (cfun, entry, exit_bbs); + loop->aux = NULL; }
On Tue, 9 Nov 2021, Tamar Christina wrote: > > > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); > > > + bitmap_set_bit (exit_bbs, loop->latch->index); > > > > treating the latch as exit is probably premature optimization (yes, it's empty). > > > > > + > > > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); > > > + > > > + BITMAP_FREE (exit_bbs); > > > > ... deallocation can go. Note I wonder whether, if we are already spinning up > > VN, we should include the preheader in the operation? > > We regularly end up emitting redundant vector initializers that could be > > cleaned up earlier this way. > > I've change it to include the preheader but it looks like this breaks bootstrap on both > x86 and AArch64. > > On x86 the following testcase > > double matmul_c8_vanilla_bbase_0; > double *matmul_c8_vanilla_dest; > matmul_c8_vanilla_x; > matmul_c8_vanilla() { > for (; matmul_c8_vanilla_x; matmul_c8_vanilla_x++) > matmul_c8_vanilla_dest[matmul_c8_vanilla_x] += matmul_c8_vanilla_bbase_0; > } > > ICEs with -std=gnu11 -ffast-math -ftree-vectorize -O2 with: > > internal compiler error: tree check: expected ssa_name, have var_decl in SSA_VAL, at tree-ssa-sccvn.c:535 > 0x80731c tree_check_failed(tree_node const*, char const*, int, char const*, ...) 
> ../gcc-dsg/gcc/tree.c:8689 > 0x7ebda2 tree_check(tree_node*, char const*, int, char const*, tree_code) > ../gcc-dsg/gcc/tree.h:3433 > 0x7ebda2 SSA_VAL(tree_node*, bool*) > ../gcc-dsg/gcc/tree-ssa-sccvn.c:535 > 0x7ebda2 vuse_ssa_val > ../gcc-dsg/gcc/tree-ssa-sccvn.c:553 > 0x7ebda2 vn_reference_lookup(tree_node*, tree_node*, vn_lookup_kind, vn_reference_s**, bool, tree_node**, tree_node*) > ../gcc-dsg/gcc/tree-ssa-sccvn.c:3664 > 0x10d8ca5 visit_reference_op_load > ../gcc-dsg/gcc/tree-ssa-sccvn.c:5166 > 0x10d8ca5 visit_stmt > ../gcc-dsg/gcc/tree-ssa-sccvn.c:5615 > 0x10d976c process_bb > ../gcc-dsg/gcc/tree-ssa-sccvn.c:7344 > 0x10dafe5 do_rpo_vn > ../gcc-dsg/gcc/tree-ssa-sccvn.c:7942 > 0x10dc828 do_rpo_vn(function*, edge_def*, bitmap_head*) > ../gcc-dsg/gcc/tree-ssa-sccvn.c:8039 > 0x119c39c vectorize_loops() > ../gcc-dsg/gcc/tree-vectorizer.c:1304 OK, as I thought this is .MEMs not in SSA form. We're later fixing that up so maybe try the attached which re-orders the postprocessing after vectorization to do that earlier. Richard. > on AArch64 this one ICEs with -ffast-math -ftree-vectorize -O2 > > _Complex *a; > _Complex b; > c, d; > fn1() { > _Complex e; > for (; c; ++c) > e = d * a[c]; > b = e; > } > > With the message > > internal compiler error: tree check: expected ssa_name, have var_decl in VN_INFO, at tree-ssa-sccvn.c:451 > 0x734073 tree_check_failed(tree_node const*, char const*, int, char const*, ...) > ../../gcc-fsf/gcc/tree.c:8691 > 0x10e2e2f tree_check(tree_node*, char const*, int, char const*, tree_code) > ../../gcc-fsf/gcc/tree.h:3433 > 0x10e2e2f VN_INFO(tree_node*) > ../../gcc-fsf/gcc/tree-ssa-sccvn.c:451 > 0x10ed223 process_bb > ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7331 > 0x10eea43 do_rpo_vn > ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7944 > 0x10efe2b do_rpo_vn(function*, edge_def*, bitmap_head*) > ../../gcc-fsf/gcc/tree-ssa-sccvn.c:8039 > 0x11c436b vectorize_loops() > ../../gcc-fsf/gcc/tree-vectorizer.c:1304 > > Any ideas? 
> > Thanks, > Tamar > > > > > Otherwise the change looks OK. > > > > --- inline copy of patch --- > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c > index edb7538a67f00cd80a608ee82510cf437fe88083..029d59016c9652f87d80fc5500f89532c79a66d0 100644 > --- a/gcc/tree-vectorizer.c > +++ b/gcc/tree-vectorizer.c > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see > #include "gimple-pretty-print.h" > #include "opt-problem.h" > #include "internal-fn.h" > - > +#include "tree-ssa-sccvn.h" > > /* Loop or bb location, with hotness information. */ > dump_user_location_t vect_location; > @@ -1298,6 +1298,17 @@ vectorize_loops (void) > if (has_mask_store > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) > optimize_mask_stores (loop); > + > + auto_bitmap exit_bbs; > + /* Perform local CSE, this esp. helps because we emit code for > + predicates that need to be shared for optimal predicate usage. > + However reassoc will re-order them and prevent CSE from working > + as it should. CSE only the loop body, not the entry. */ > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); > + > + edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0); > + do_rpo_vn (cfun, entry, exit_bbs); > + > loop->aux = NULL; > } > >
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644 --- a/gcc/tree-vectorizer.c +++ b/gcc/tree-vectorizer.c @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-pretty-print.h" #include "opt-problem.h" #include "internal-fn.h" - +#include "tree-ssa-sccvn.h" /* Loop or bb location, with hotness information. */ dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@ vectorize_loops (void) ??? Also while we try hard to update loop-closed SSA form we fail to properly do this in some corner-cases (see PR56286). */ rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals); + + for (i = 1; i < number_of_loops (cfun); i++) + { + loop = get_loop (cfun, i); + if (!loop || !single_exit (loop)) + continue; + + bitmap exit_bbs; + /* Perform local CSE, this esp. helps because we emit code for + predicates that need to be shared for optimal predicate usage. + However reassoc will re-order them and prevent CSE from working + as it should. CSE only the loop body, not the entry. */ + exit_bbs = BITMAP_ALLOC (NULL); + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); + bitmap_set_bit (exit_bbs, loop->latch->index); + + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); + + BITMAP_FREE (exit_bbs); + } + return TODO_cleanup_cfg; }