Message ID | patch-16240-tamar@arm.com |
---|---|
State | Dropped |
Series | [1/8] middle-end: Recognize scalar reductions from bitfields and array_refs |
Commit Message
Tamar Christina
Oct. 31, 2022, 11:56 a.m. UTC
Hi All,

This patch series adds recognition of pairwise operations (reductions)
in match.pd such that we can benefit from them even at -O1 when the
vectorizer isn't enabled.

The use of these allows for much simpler codegen on AArch64 and allows us
to avoid quite a lot of codegen warts.

As an example, a simple:

typedef float v4sf __attribute__((vector_size (16)));

float
foo3 (v4sf x)
{
  return x[1] + x[2];
}

currently generates:

foo3:
        dup     s1, v0.s[1]
        dup     s0, v0.s[2]
        fadd    s0, s1, s0
        ret

while with this patch series it now generates:

foo3:
        ext     v0.16b, v0.16b, v0.16b, #4
        faddp   s0, v0.2s
        ret

This patch will not perform the operation if the source is not a gimple
register and leaves memory sources to the vectorizer as it's able to deal
correctly with clobbers.

The use of these instructions makes a significant difference in codegen
quality for AArch64 and Arm.

NOTE: The last entry in the series contains tests for all of the previous
patches as it's a bit of an all or nothing thing.

Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* match.pd (adjacent_data_access_p): Import.
	Add new pattern for bitwise plus, min, max, fmax, fmin.
	* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
	* tree.cc (adjacent_data_access_p): New.
	* tree.h (adjacent_data_access_p): New.

--- inline copy of patch --
diff --git a/gcc/match.pd b/gcc/match.pd
index 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10eb18db23c0cd9f7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
    HONOR_NANS
    uniform_vector_p
    expand_vec_cmp_expr_p
-   bitmask_inv_cst_vector_p)
+   bitmask_inv_cst_vector_p
+   adjacent_data_access_p)
 
 /* Operator lists.  */
 (define_operator_list tcc_comparison
@@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 
 /* Canonicalizations of BIT_FIELD_REFs.  */
 
+/* Canonicalize BIT_FIELD_REFS to pairwise operations.  */
+(for op (plus min max FMIN_ALL FMAX_ALL)
+     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
+          IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (op @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+    (with { poly_uint64 nloc = 0;
+            tree src = adjacent_data_access_p (@0, @1, &nloc, true);
+            tree ntype = build_vector_type (type, 2);
+            tree size = TYPE_SIZE (ntype);
+            tree pos = build_int_cst (TREE_TYPE (size), nloc);
+            poly_uint64 _sz;
+            poly_uint64 _total; }
+     (if (src && is_gimple_reg (src) && ntype
+          && poly_int_tree_p (size, &_sz)
+          && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+          && known_ge (_total, _sz + nloc))
+      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))
+
+(for op (lt gt)
+     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
+     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (cond (op @0 @1) @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+    (with { poly_uint64 nloc = 0;
+            tree src = adjacent_data_access_p (@0, @1, &nloc, false);
+            tree ntype = build_vector_type (type, 2);
+            tree size = TYPE_SIZE (ntype);
+            tree pos = build_int_cst (TREE_TYPE (size), nloc);
+            poly_uint64 _sz;
+            poly_uint64 _total; }
+     (if (src && is_gimple_reg (src) && ntype
+          && poly_int_tree_p (size, &_sz)
+          && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+          && known_ge (_total, _sz + nloc))
+      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
+       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
+       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006603ae1e3deb6b76 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
     {
       tree arg = gimple_call_arg (stmt, i);
       if ((is_gimple_reg_type (TREE_TYPE (arg))
+	   && !is_gimple_variable (arg)
	   && !is_gimple_val (arg))
	  || (!is_gimple_reg_type (TREE_TYPE (arg))
	      && !is_gimple_lvalue (arg)))
diff --git a/gcc/tree.h b/gcc/tree.h
index e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194640b8c1b531b798 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
 
 extern tree bitmask_inv_cst_vector_p (tree);
 
+/* TRUE if the two operands represent adjacent access of data such that a
+   pairwise operation can be used.  */
+
+extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool);
+
 /* integer_nonzerop (tree x) is nonzero if X is an integer constant
    with a nonzero value.  */
 
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
   return builder.build ();
 }
 
+/* Returns base address if the two operands represent adjacent access of data
+   such that a pairwise operation can be used.  OP1 must be a lower subpart
+   than OP2.  If POS is not NULL then on return if a value is returned POS
+   will indicate the position of the lower address.  If COMMUTATIVE_P then
+   the operation is also tried by flipping op1 and op2.  */
+
+tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
+			     bool commutative_p)
+{
+  gcc_assert (op1);
+  gcc_assert (op2);
+  if (TREE_CODE (op1) != TREE_CODE (op2)
+      || TREE_TYPE (op1) != TREE_TYPE (op2))
+    return NULL;
+
+  tree type = TREE_TYPE (op1);
+  gimple *stmt1 = NULL, *stmt2 = NULL;
+  unsigned int bits = GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
+
+  if (TREE_CODE (op1) == BIT_FIELD_REF
+      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2, 0), 0)
+      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2, 1), 0)
+      && known_eq (bit_field_size (op1), bits))
+    {
+      poly_uint64 offset1 = bit_field_offset (op1);
+      poly_uint64 offset2 = bit_field_offset (op2);
+      if (known_eq (offset2 - offset1, bits))
+	{
+	  if (pos)
+	    *pos = offset1;
+	  return TREE_OPERAND (op1, 0);
+	}
+      else if (commutative_p && known_eq (offset1 - offset2, bits))
+	{
+	  if (pos)
+	    *pos = offset2;
+	  return TREE_OPERAND (op1, 0);
+	}
+    }
+  else if (TREE_CODE (op1) == ARRAY_REF
+	   && operand_equal_p (get_base_address (op1), get_base_address (op2)))
+    {
+      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
+      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
+      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
+	  || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
+	  || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
+	return NULL;
+
+      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1));
+      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1));
+      if (known_eq (offset2 - offset1, 1UL))
+	{
+	  if (pos)
+	    *pos = offset1 * bits;
+	  return TREE_OPERAND (op1, 0);
+	}
+      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
+	{
+	  if (pos)
+	    *pos = offset2 * bits;
+	  return TREE_OPERAND (op1, 0);
+	}
+    }
+  else if (TREE_CODE (op1) == SSA_NAME
+	   && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
+	   && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
+	   && is_gimple_assign (stmt1)
+	   && is_gimple_assign (stmt2))
+    {
+      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
+	  && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
+	  && gimple_assign_rhs_code (stmt2) != ARRAY_REF
+	  && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
+	return NULL;
+
+      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
+				     gimple_assign_rhs1 (stmt2), pos,
+				     commutative_p);
+    }
+
+  return NULL;
+}
+
 /* If VECTOR_CST T has a single nonzero element, return the index of that
    element, otherwise return -1.  */
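To make the intent of the two new match.pd rules concrete, here is a small, self-contained GNU C sketch. The transformed GIMPLE shown in the comments is an assumption derived from the patterns above (a BIT_FIELD_REF of the two adjacent lanes fed into the matching IFN_REDUC_* internal function), not a dump taken from a patched compiler.

/* Sketch of the source forms the new rules are meant to catch.  The GIMPLE in
   the comments is inferred from the match.pd rules above, not actual
   -fdump-tree output.  */
typedef float v4sf __attribute__ ((vector_size (16)));
typedef unsigned int v4si __attribute__ ((vector_size (16)));

float
sum_adjacent (v4sf x)
{
  /* Two adjacent lanes of a gimple register; expected to become roughly
       _tmp = BIT_FIELD_REF <x, 64, 32>;   // lanes 1 and 2 as a 2-lane vector
       _res = .REDUC_PLUS (_tmp);          // IFN_REDUC_PLUS
     which AArch64 can expand to ext + faddp as in the cover letter.  */
  return x[1] + x[2];
}

unsigned int
max_adjacent (v4si x)
{
  /* The (cond (gt @0 @1) @0 @1) form is intended to map to IFN_REDUC_MAX in
     the same way, using lanes 0 and 1 here.  */
  return x[0] > x[1] ? x[0] : x[1];
}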
Comments
On 10/31/22 05:56, Tamar Christina wrote:
> Hi All,
>
> This patch series adds recognition of pairwise operations (reductions)
> in match.pd such that we can benefit from them even at -O1 when the
> vectorizer isn't enabled.
>
> [...]
>
> gcc/ChangeLog:
>
> 	* match.pd (adjacent_data_access_p): Import.
> 	Add new pattern for bitwise plus, min, max, fmax, fmin.
> 	* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> 	* tree.cc (adjacent_data_access_p): New.
> 	* tree.h (adjacent_data_access_p): New.

Nice stuff.  I'd pondered some similar stuff at Tachyum, but got dragged
away before it could be implemented.

> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
>    return builder.build ();
>  }
>
> +/* Returns base address if the two operands represent adjacent access of data
> +   such that a pairwise operation can be used.  OP1 must be a lower subpart
> +   than OP2.  If POS is not NULL then on return if a value is returned POS
> +   will indicate the position of the lower address.  If COMMUTATIVE_P then
> +   the operation is also tried by flipping op1 and op2.  */
> +
> +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> +			     bool commutative_p)

Formatting nit.  Return type on a different line.

OK with that fixed.

jeff
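For readers unfamiliar with the convention Jeff is referring to: the GNU coding style puts the return type of a function definition on a line of its own, so the function name starts in column 0. A minimal, generic illustration (not GCC source code):

/* GNU style for function definitions: return type on its own line.  */
static int
sum_adjacent_pair (const int *v)
{
  return v[0] + v[1];
}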
On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi All,
>
> This patch series adds recognition of pairwise operations (reductions)
> in match.pd such that we can benefit from them even at -O1 when the
> vectorizer isn't enabled.
>
> [...]
>
> This patch will not perform the operation if the source is not a gimple
> register and leaves memory sources to the vectorizer as it's able to deal
> correctly with clobbers.

But the vectorizer should also be able to cope with the above.  I don't
think we want to do this as part of general folding.  Iff, then this
belongs in specific points of the pass pipeline, no?

> [...]
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Saturday, November 5, 2022 11:33 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de
> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
>
> On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > [...]
> >
> > This patch will not perform the operation if the source is not a gimple
> > register and leaves memory sources to the vectorizer as it's able to deal
> > correctly with clobbers.
>
> But the vectorizer should also be able to cope with the above.

There are several problems with leaving it up to the vectorizer to do this:

1. We only get it at -O2 and higher.
2. The way the vectorizer costs the reduction makes the resulting cost always
   too high for AArch64.

As an example the following:

typedef unsigned int u32v4 __attribute__((vector_size(16)));
unsigned int f (u32v4 a, u32v4 b)
{
    return a[0] + a[1];
}

Doesn't get SLP'ed because the vectorizer costs it as:

node 0x485eb30 0 times vec_perm costs 0 in body
_1 + _2 1 times vector_stmt costs 1 in body
_1 + _2 1 times vec_perm costs 2 in body
_1 + _2 1 times vec_to_scalar costs 2 in body

And so ultimately you fail because:

/app/example.c:8:17: note: Cost model analysis for part in loop 0:
  Vector cost: 5
  Scalar cost: 3

This looks like it's because the vectorizer costs the operation that creates
the BIT_FIELD_REF <a_3(D), 64, 0> for the reduction as requiring two scalar
extracts and a permute.  While it ultimately does produce a
BIT_FIELD_REF <a_3(D), 64, 0>, that's not what it costs.

This causes the reduction to almost always be more expensive, so unless the
rest of the SLP tree amortizes the cost we never generate them.

3. SLP only happens on operations that are SLP-shaped and where SLP discovery
   didn't fail.

As a simple example, the vectorizer can't SLP the following:

unsigned int f (u32v4 a, u32v4 b)
{
    a[0] += b[0];
    return a[0] + a[1];
}

Because there's not enough VF here and it can't unroll.  This and many others
fail because they're not an SLP-able operation, or SLP build fails.

This causes us to generate for e.g. this example:

f:
        dup     s2, v0.s[1]
        fmov    w1, s1
        add     v0.2s, v2.2s, v0.2s
        fmov    w0, s0
        add     w0, w0, w1
        ret

instead of with my patch:

f:
        addp    v0.2s, v0.2s, v0.2s
        add     v0.2s, v0.2s, v1.2s
        fmov    w0, s0
        ret

which is significantly better code.  So I don't think the vectorizer is the
right solution for this.

> I don't think we want to do this as part of general folding.  Iff, then
> this belongs in specific points of the pass pipeline, no?

The reason I currently have it as such is because in general the compiler
doesn't really deal with horizontal reductions at all.  Also since the
vectorizer itself can introduce reductions I figured it's better to have one
representation for this.  So admittedly perhaps this should only be done
after vectorization, as that's when today we expect reductions to be in
GIMPLE.

As for having it in a specific point in the pass pipeline, I have it as a
general one since a number of passes could create the form for the reduction;
for instance vec_lower could break up an operation to allow this to match.
The bigger BIT_FIELD_EXPR it creates could also lead to other optimizations.

Additionally you had mentioned last time that Andrew was trying to move
min/max detection to match.pd, so I had figured this was the correct place
for it.

That said, I have no intuition for what would be better here, since the check
is quite cheap.  But do you have a particular place you want this moved to
then?  Ideally I'd want it before the last FRE pass, but perhaps isel?

Thanks,
Tamar

> [...]
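For anyone wanting to reproduce the two observations above, a minimal sketch follows. The compiler invocation and the dump option are assumptions about the usual GCC workflow (a cross compiler for the aarch64-none-linux-gnu triple mentioned earlier, and the per-pass dump switch for the basic-block SLP pass); they are not taken from this thread.

/* reduc.c -- sketch for reproducing the two cases discussed above.
   Assumed commands (not from the thread):
     aarch64-none-linux-gnu-gcc -O1 -S reduc.c
     aarch64-none-linux-gnu-gcc -O2 -S -fdump-tree-slp2-details reduc.c
   At -O1 BB SLP does not run at all; at -O2 the dump shows why SLP
   discovery or its cost model rejects the reductions.  */
typedef unsigned int u32v4 __attribute__ ((vector_size (16)));

unsigned int
cost_case (u32v4 a)
{
  /* Rejected by the SLP cost model in the discussion above
     (vector cost 5 vs. scalar cost 3).  */
  return a[0] + a[1];
}

unsigned int
shape_case (u32v4 a, u32v4 b)
{
  /* Not SLP-shaped: the extra use of a[0] breaks SLP discovery.  */
  a[0] += b[0];
  return a[0] + a[1];
}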
On Mon, 7 Nov 2022, Tamar Christina wrote:

> > But the vectorizer should also be able to cope with the above.
>
> There are several problems with leaving it up to the vectorizer to do this:
>
> [...]
>
> This causes the reduction to almost always be more expensive, so unless the
> rest of the SLP tree amortizes the cost we never generate them.

On x86 for example the hadds are prohibitively expensive here.  Are you sure
the horizontal add is actually profitable on arm?  Your pattern-matching has
no cost modeling at all?

> 3. SLP only happens on operations that are SLP-shaped and where SLP
>    discovery didn't fail.
>
> [...]
>
> Because there's not enough VF here and it can't unroll.  This and many
> others fail because they're not an SLP-able operation, or SLP build fails.

That's of course because the pattern matching for reductions is too simple
here, getting us a group size of three.  Bad association would make your
simple pattern matching fail as well.

> [...]
>
> which is significantly better code.  So I don't think the vectorizer is the
> right solution for this.

Simple pattern matching isn't either.  In fact basic-block SLP is supposed to
be the advanced pattern matching including a cost model.  IMHO the correct
approach is to improve that, vect_slp_check_for_constructors plus how we
handle/recover from SLP discovery fails as in your second example above.

> [...]
>
> Additionally you had mentioned last time that Andrew was trying to move
> min/max detection to match.pd, so I had figured this was the correct place
> for it.

That's mostly because we have fold-const patterns for ?: min/max and CFG
patterns for min/max in phiopt and it's possible to unify both.

> That said, I have no intuition for what would be better here, since the
> check is quite cheap.  But do you have a particular place you want this
> moved to then?  Ideally I'd want it before the last FRE pass, but perhaps
> isel?

As said, I think it belongs where we can do costing which means the
vectorizer.  Iff there are two/three instruction sequences that can be
peepholed do it in the target's machine description instead.

Richard.

> [...]
*/ > > > (define_operator_list tcc_comparison > > > @@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > > > /* Canonicalizations of BIT_FIELD_REFs. */ > > > > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ (for op > > > +(plus min max FMIN_ALL FMAX_ALL) > > > + ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX > > > + IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > + (op @0 @1) > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > + (with { poly_uint64 nloc = 0; > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, true); > > > + tree ntype = build_vector_type (type, 2); > > > + tree size = TYPE_SIZE (ntype); > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > + poly_uint64 _sz; > > > + poly_uint64 _total; } > > > + (if (src && is_gimple_reg (src) && ntype > > > + && poly_int_tree_p (size, &_sz) > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > + && known_ge (_total, _sz + nloc)) > > > + (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))) > > > + > > > +(for op (lt gt) > > > + ifni (IFN_REDUC_MIN IFN_REDUC_MAX) > > > + ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > + (cond (op @0 @1) @0 @1) > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > + (with { poly_uint64 nloc = 0; > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, false); > > > + tree ntype = build_vector_type (type, 2); > > > + tree size = TYPE_SIZE (ntype); > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > + poly_uint64 _sz; > > > + poly_uint64 _total; } > > > + (if (src && is_gimple_reg (src) && ntype > > > + && poly_int_tree_p (size, &_sz) > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > + && known_ge (_total, _sz + nloc)) > > > + (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type))) > > > + (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; })) > > > + (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))) > > > + > > > (simplify > > > (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4) > > > (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index > > > > > 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006 > > 603 > > > ae1e3deb6b76 100644 > > > --- a/gcc/tree-cfg.cc > > > +++ b/gcc/tree-cfg.cc > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt) > > > { > > > tree arg = gimple_call_arg (stmt, i); > > > if ((is_gimple_reg_type (TREE_TYPE (arg)) > > > + && !is_gimple_variable (arg) > > > && !is_gimple_val (arg)) > > > || (!is_gimple_reg_type (TREE_TYPE (arg)) > > > && !is_gimple_lvalue (arg))) diff --git a/gcc/tree.h > > > b/gcc/tree.h index > > > > > e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194 > > 640 > > > b8c1b531b798 100644 > > > --- a/gcc/tree.h > > > +++ b/gcc/tree.h > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree); > > > > > > extern tree bitmask_inv_cst_vector_p (tree); > > > > > > +/* TRUE if the two operands represent adjacent access of data such that a > > > + pairwise operation can be used. */ > > > + > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool); > > > + > > > /* integer_nonzerop (tree x) is nonzero if X is an integer constant > > > with a nonzero value. 
*/ > > > > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index > > > > > 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19 > > d > > > e9841e29d311 100644 > > > --- a/gcc/tree.cc > > > +++ b/gcc/tree.cc > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t) > > > return builder.build (); > > > } > > > > > > +/* Returns base address if the two operands represent adjacent access of > > data > > > + such that a pairwise operation can be used. OP1 must be a lower > > subpart > > > + than OP2. If POS is not NULL then on return if a value is returned POS > > > + will indicate the position of the lower address. If COMMUTATIVE_P > > then > > > + the operation is also tried by flipping op1 and op2. */ > > > + > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos, > > > + bool commutative_p) { > > > + gcc_assert (op1); > > > + gcc_assert (op2); > > > + if (TREE_CODE (op1) != TREE_CODE (op2) > > > + || TREE_TYPE (op1) != TREE_TYPE (op2)) > > > + return NULL; > > > + > > > + tree type = TREE_TYPE (op1); > > > + gimple *stmt1 = NULL, *stmt2 = NULL; unsigned int bits = > > > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type))); > > > + > > > + if (TREE_CODE (op1) == BIT_FIELD_REF > > > + && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2, > > 0), 0) > > > + && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2, > > 1), 0) > > > + && known_eq (bit_field_size (op1), bits)) > > > + { > > > + poly_uint64 offset1 = bit_field_offset (op1); > > > + poly_uint64 offset2 = bit_field_offset (op2); > > > + if (known_eq (offset2 - offset1, bits)) > > > + { > > > + if (pos) > > > + *pos = offset1; > > > + return TREE_OPERAND (op1, 0); > > > + } > > > + else if (commutative_p && known_eq (offset1 - offset2, bits)) > > > + { > > > + if (pos) > > > + *pos = offset2; > > > + return TREE_OPERAND (op1, 0); > > > + } > > > + } > > > + else if (TREE_CODE (op1) == ARRAY_REF > > > + && operand_equal_p (get_base_address (op1), get_base_address > > (op2))) > > > + { > > > + wide_int size1 = wi::to_wide (array_ref_element_size (op1)); > > > + wide_int size2 = wi::to_wide (array_ref_element_size (op2)); > > > + if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8) > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1)) > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1))) > > > + return NULL; > > > + > > > + poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1)); > > > + poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1)); > > > + if (known_eq (offset2 - offset1, 1UL)) > > > + { > > > + if (pos) > > > + *pos = offset1 * bits; > > > + return TREE_OPERAND (op1, 0); > > > + } > > > + else if (commutative_p && known_eq (offset1 - offset2, 1UL)) > > > + { > > > + if (pos) > > > + *pos = offset2 * bits; > > > + return TREE_OPERAND (op1, 0); > > > + } > > > + } > > > + else if (TREE_CODE (op1) == SSA_NAME > > > + && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL > > > + && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL > > > + && is_gimple_assign (stmt1) > > > + && is_gimple_assign (stmt2)) > > > + { > > > + if (gimple_assign_rhs_code (stmt1) != ARRAY_REF > > > + && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF > > > + && gimple_assign_rhs_code (stmt2) != ARRAY_REF > > > + && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF) > > > + return NULL; > > > + > > > + return adjacent_data_access_p (gimple_assign_rhs1 (stmt1), > > > + gimple_assign_rhs1 (stmt2), pos, > > > + commutative_p); > > > + } > > > + > > > + return NULL; > > > +} > > > + > > > 
/* If VECTOR_CST T has a single nonzero element, return the index of that > > > element, otherwise return -1. */ > > > > > > > > > > > > > > > > > > -- >
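For readers working through the helper quoted above, here is a small standalone illustration of the case it is meant to recognize. It is not part of the submission, the function name is made up, and the GIMPLE in the comment is paraphrased from the discussion rather than copied from a compiler dump:

/* Adjacent lanes of the same vector register.  */
typedef float v4sf __attribute__((vector_size (16)));

float
sum_low_pair (v4sf x)
{
  /* x[0] and x[1] gimplify to BIT_FIELD_REF <x, 32, 0> and
     BIT_FIELD_REF <x, 32, 32>: same base, same width, offsets one element
     apart.  adjacent_data_access_p would therefore return the base x with
     *pos == 0, and the match.pd rule can rewrite the addition as roughly
       .REDUC_PLUS (BIT_FIELD_REF <x, 64, 0>)  */
  return x[0] + x[1];
}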
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Monday, November 7, 2022 10:18 AM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: Richard Biener <richard.guenther@gmail.com>; gcc- > patches@gcc.gnu.org; nd <nd@arm.com> > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from > bitfields and array_refs > > On Mon, 7 Nov 2022, Tamar Christina wrote: > > > > -----Original Message----- > > > From: Richard Biener <richard.guenther@gmail.com> > > > Sent: Saturday, November 5, 2022 11:33 AM > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from > > > bitfields and array_refs > > > > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches > > > <gcc- patches@gcc.gnu.org> wrote: > > > > > > > > Hi All, > > > > > > > > This patch series is to add recognition of pairwise operations > > > > (reductions) in match.pd such that we can benefit from them even > > > > at > > > > -O1 when the vectorizer isn't enabled. > > > > > > > > Ths use of these allow for a lot simpler codegen in AArch64 and > > > > allows us to avoid quite a lot of codegen warts. > > > > > > > > As an example a simple: > > > > > > > > typedef float v4sf __attribute__((vector_size (16))); > > > > > > > > float > > > > foo3 (v4sf x) > > > > { > > > > return x[1] + x[2]; > > > > } > > > > > > > > currently generates: > > > > > > > > foo3: > > > > dup s1, v0.s[1] > > > > dup s0, v0.s[2] > > > > fadd s0, s1, s0 > > > > ret > > > > > > > > while with this patch series now generates: > > > > > > > > foo3: > > > > ext v0.16b, v0.16b, v0.16b, #4 > > > > faddp s0, v0.2s > > > > ret > > > > > > > > This patch will not perform the operation if the source is not a > > > > gimple register and leaves memory sources to the vectorizer as > > > > it's able to deal correctly with clobbers. > > > > > > But the vectorizer should also be able to cope with the above. > > > > There are several problems with leaving it up to the vectorizer to do: > > > > 1. We only get it at -O2 and higher. > > 2. The way the vectorizer costs the reduction makes the resulting cost > always too high for AArch64. > > > > As an example the following: > > > > typedef unsigned int u32v4 __attribute__((vector_size(16))); unsigned > > int f (u32v4 a, u32v4 b) { > > return a[0] + a[1]; > > } > > > > Doesn't get SLP'ed because the vectorizer costs it as: > > > > node 0x485eb30 0 times vec_perm costs 0 in body > > _1 + _2 1 times vector_stmt costs 1 in body > > _1 + _2 1 times vec_perm costs 2 in body > > _1 + _2 1 times vec_to_scalar costs 2 in body > > > > And so ultimately you fail because: > > > > /app/example.c:8:17: note: Cost model analysis for part in loop 0: > > Vector cost: 5 > > Scalar cost: 3 > > > > This looks like it's because the vectorizer costs the operation to > > create the BIT_FIELD_REF <a_3(D), 64, 0>; For the reduction as > > requiring two scalar extracts and a permute. While it ultimately does > produce a BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs. > > > > This causes the reduction to almost always be more expensive, so > > unless the rest of the SLP tree amortizes the cost we never generate them. > > On x86 for example the hadds are prohibitly expensive here. Are you sure > the horizontal add is actually profitable on arm? Your pattern-matching has > no cost modeling at all? 
Yes, they are dirt cheap, that's why we use them for a lot of our codegen for e.g. compressing values. > > > 3. The SLP only happens on operation that are SLP shaped and where SLP > didn't fail. > > > > As a simple example, the vectorizer can't SLP the following: > > > > unsigned int f (u32v4 a, u32v4 b) > > { > > a[0] += b[0]; > > return a[0] + a[1]; > > } > > > > Because there's not enough VF here and it can't unroll. This and many > > others fail because they're not an SLP-able operation, or SLP build fails. > > That's of course because the pattern matching for reductions is too simple > here, getting us a group size of three. Bad association would make your > simple pattern matching fail as well. > > > This causes us to generate for e.g. this example: > > > > f: > > dup s2, v0.s[1] > > fmov w1, s1 > > add v0.2s, v2.2s, v0.2s > > fmov w0, s0 > > add w0, w0, w1 > > ret > > > > instead of with my patch: > > > > f: > > addp v0.2s, v0.2s, v0.2s > > add v0.2s, v0.2s, v1.2s > > fmov w0, s0 > > ret > > > > which is significantly better code. So I don't think the vectorizer is the right > solution for this. > > Simple pattern matching isn't either. In fact basic-block SLP is supposed to be > the advanced pattern matching including a cost model. The cost model seems a bit moot here, at least on AArch64. There is no sequence of events that would make these pairwise operations more expensive then the alternative, which is to do vector extraction and crossing a register file to do simple addition. And in fact the ISA classifies these instructions as scalar not vector, and it doesn't seem right to need the vectorizer for something that's basic codegen. It seems like the problem here is that the current reductions are designed around x86 specific limitations. So perhaps the solution here is to just have an AArch64 specific Gimple pass or gate this transform on a target hook, or new cheap reduction codes. > > IMHO the correct approach is to improve that, > vect_slp_check_for_constructors plus how we handle/recover from SLP > discovery fails as in your second example above. Is this feasible in the general sense? SLP tree decomposition would then require you to cost every sub tree possible that gets built. That seems quite expensive... The bigger the tree the longer the more you have to decompose.. > > > > I don't think > > > we want to do this as part of general folding. Iff, then this > > > belongs in specific points of the pass pipeline, no? > > > > The reason I currently have it as such is because in general the > > compiler doesn't really deal with horizontal reductions at all. Also > > since the vectorizer itself can introduce reductions I figured it's > > better to have one representation for this. So admittedly perhaps this > should only be done after vectorization as that's when today we expect > reductions to be in Gimple. > > > > As for having it in a specific point in the pass pipeline, I have it > > as a general one since a number of passes could create the form for > > the reduction, for instance vec_lower could break up an operation to allow > this to match. The bigger BIT_FIELD_EXPR it creates could also lead to other > optimizations. > > > > Additionally you had mentioned last time that Andrew was trying to > > move min/max detection to match.pd So I had figured this was the correct > place for it. > > That's mostly because we have fold-const patterns for ?: min/max and CFG > patterns for min/max in phiopt and it's possible to unify both. 
> > > That said I have no intuition for what would be better here. Since the > > check is quite cheap. But do you have a particular place you want > > this move to then? Ideally I'd want it before the last FRE pass, but perhaps > isel? > > As said, I think it belongs where we can do costing which means the > vectorizer. Iff there are two/three instruction sequences that can be > peepholed do it in the targets machine description instead. We can't do it in RTL, because we don't know when things are sequentially are after register allocator, for integer mode these would have then been assigned to scalar hard register. And so this is unrecoverable. So quite literally, you cannot peephole this. You also cannot use combine because there's no way to ensure that reload generates the same register from subregs. Thanks, Tamar > Richard. > > > Thanks, > > Tamar > > > > > > > > > The use of these instruction makes a significant difference in > > > > codegen quality for AArch64 and Arm. > > > > > > > > NOTE: The last entry in the series contains tests for all of the > > > > previous patches as it's a bit of an all or nothing thing. > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > > x86_64-pc-linux-gnu and no issues. > > > > > > > > Ok for master? > > > > > > > > Thanks, > > > > Tamar > > > > > > > > gcc/ChangeLog: > > > > > > > > * match.pd (adjacent_data_access_p): Import. > > > > Add new pattern for bitwise plus, min, max, fmax, fmin. > > > > * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs. > > > > * tree.cc (adjacent_data_access_p): New. > > > > * tree.h (adjacent_data_access_p): New. > > > > > > > > --- inline copy of patch -- > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > > > > > 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10 > > > eb1 > > > > 8db23c0cd9f7 100644 > > > > --- a/gcc/match.pd > > > > +++ b/gcc/match.pd > > > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3. If not see > > > > HONOR_NANS > > > > uniform_vector_p > > > > expand_vec_cmp_expr_p > > > > - bitmask_inv_cst_vector_p) > > > > + bitmask_inv_cst_vector_p > > > > + adjacent_data_access_p) > > > > > > > > /* Operator lists. */ > > > > (define_operator_list tcc_comparison @@ -7195,6 +7196,47 @@ > > > > DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > > > > > /* Canonicalizations of BIT_FIELD_REFs. */ > > > > > > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. 
*/ (for op > > > > +(plus min max FMIN_ALL FMAX_ALL) > > > > + ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX > > > > + IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > > + (op @0 @1) > > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > > + (with { poly_uint64 nloc = 0; > > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, true); > > > > + tree ntype = build_vector_type (type, 2); > > > > + tree size = TYPE_SIZE (ntype); > > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > > + poly_uint64 _sz; > > > > + poly_uint64 _total; } > > > > + (if (src && is_gimple_reg (src) && ntype > > > > + && poly_int_tree_p (size, &_sz) > > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > > + && known_ge (_total, _sz + nloc)) > > > > + (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; > > > > +}))))))) > > > > + > > > > +(for op (lt gt) > > > > + ifni (IFN_REDUC_MIN IFN_REDUC_MAX) > > > > + ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > > + (cond (op @0 @1) @0 @1) > > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > > + (with { poly_uint64 nloc = 0; > > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, false); > > > > + tree ntype = build_vector_type (type, 2); > > > > + tree size = TYPE_SIZE (ntype); > > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > > + poly_uint64 _sz; > > > > + poly_uint64 _total; } > > > > + (if (src && is_gimple_reg (src) && ntype > > > > + && poly_int_tree_p (size, &_sz) > > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > > + && known_ge (_total, _sz + nloc)) > > > > + (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type))) > > > > + (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; })) > > > > + (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; > > > > +})))))))) > > > > + > > > > (simplify > > > > (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4) > > > > (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, > > > > @4); > > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index > > > > > > > > 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006 > > > 603 > > > > ae1e3deb6b76 100644 > > > > --- a/gcc/tree-cfg.cc > > > > +++ b/gcc/tree-cfg.cc > > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt) > > > > { > > > > tree arg = gimple_call_arg (stmt, i); > > > > if ((is_gimple_reg_type (TREE_TYPE (arg)) > > > > + && !is_gimple_variable (arg) > > > > && !is_gimple_val (arg)) > > > > || (!is_gimple_reg_type (TREE_TYPE (arg)) > > > > && !is_gimple_lvalue (arg))) diff --git a/gcc/tree.h > > > > b/gcc/tree.h index > > > > > > > > e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194 > > > 640 > > > > b8c1b531b798 100644 > > > > --- a/gcc/tree.h > > > > +++ b/gcc/tree.h > > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree); > > > > > > > > extern tree bitmask_inv_cst_vector_p (tree); > > > > > > > > +/* TRUE if the two operands represent adjacent access of data such > that a > > > > + pairwise operation can be used. */ > > > > + > > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, > > > > +bool); > > > > + > > > > /* integer_nonzerop (tree x) is nonzero if X is an integer constant > > > > with a nonzero value. 
*/ > > > > > > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index > > > > > > > > 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b1 > > > 9 > > > d > > > > e9841e29d311 100644 > > > > --- a/gcc/tree.cc > > > > +++ b/gcc/tree.cc > > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t) > > > > return builder.build (); > > > > } > > > > > > > > +/* Returns base address if the two operands represent adjacent > > > > +access of > > > data > > > > + such that a pairwise operation can be used. OP1 must be a > > > > + lower > > > subpart > > > > + than OP2. If POS is not NULL then on return if a value is returned > POS > > > > + will indicate the position of the lower address. If > > > > + COMMUTATIVE_P > > > then > > > > + the operation is also tried by flipping op1 and op2. */ > > > > + > > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos, > > > > + bool commutative_p) { > > > > + gcc_assert (op1); > > > > + gcc_assert (op2); > > > > + if (TREE_CODE (op1) != TREE_CODE (op2) > > > > + || TREE_TYPE (op1) != TREE_TYPE (op2)) > > > > + return NULL; > > > > + > > > > + tree type = TREE_TYPE (op1); > > > > + gimple *stmt1 = NULL, *stmt2 = NULL; unsigned int bits = > > > > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type))); > > > > + > > > > + if (TREE_CODE (op1) == BIT_FIELD_REF > > > > + && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND > > > > + (op2, > > > 0), 0) > > > > + && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND > > > > + (op2, > > > 1), 0) > > > > + && known_eq (bit_field_size (op1), bits)) > > > > + { > > > > + poly_uint64 offset1 = bit_field_offset (op1); > > > > + poly_uint64 offset2 = bit_field_offset (op2); > > > > + if (known_eq (offset2 - offset1, bits)) > > > > + { > > > > + if (pos) > > > > + *pos = offset1; > > > > + return TREE_OPERAND (op1, 0); > > > > + } > > > > + else if (commutative_p && known_eq (offset1 - offset2, bits)) > > > > + { > > > > + if (pos) > > > > + *pos = offset2; > > > > + return TREE_OPERAND (op1, 0); > > > > + } > > > > + } > > > > + else if (TREE_CODE (op1) == ARRAY_REF > > > > + && operand_equal_p (get_base_address (op1), > > > > + get_base_address > > > (op2))) > > > > + { > > > > + wide_int size1 = wi::to_wide (array_ref_element_size (op1)); > > > > + wide_int size2 = wi::to_wide (array_ref_element_size (op2)); > > > > + if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8) > > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1)) > > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1))) > > > > + return NULL; > > > > + > > > > + poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, > 1)); > > > > + poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, > 1)); > > > > + if (known_eq (offset2 - offset1, 1UL)) > > > > + { > > > > + if (pos) > > > > + *pos = offset1 * bits; > > > > + return TREE_OPERAND (op1, 0); > > > > + } > > > > + else if (commutative_p && known_eq (offset1 - offset2, 1UL)) > > > > + { > > > > + if (pos) > > > > + *pos = offset2 * bits; > > > > + return TREE_OPERAND (op1, 0); > > > > + } > > > > + } > > > > + else if (TREE_CODE (op1) == SSA_NAME > > > > + && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL > > > > + && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL > > > > + && is_gimple_assign (stmt1) > > > > + && is_gimple_assign (stmt2)) > > > > + { > > > > + if (gimple_assign_rhs_code (stmt1) != ARRAY_REF > > > > + && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF > > > > + && gimple_assign_rhs_code (stmt2) != ARRAY_REF > > > > + && 
gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF) > > > > + return NULL; > > > > + > > > > + return adjacent_data_access_p (gimple_assign_rhs1 (stmt1), > > > > + gimple_assign_rhs1 (stmt2), pos, > > > > + commutative_p); > > > > + } > > > > + > > > > + return NULL; > > > > +} > > > > + > > > > /* If VECTOR_CST T has a single nonzero element, return the index of > that > > > > element, otherwise return -1. */ > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, > Boudien Moerman; HRB 36809 (AG Nuernberg)
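For anyone wanting to reproduce the SLP costing behaviour discussed above, one possible recipe follows. The exact flags and dump-file names are a best guess for a GCC of this vintage rather than something taken from the thread:

/* Compile with, e.g.:
     gcc -O2 -S -fdump-tree-slp-details example.c
   and look for the "Cost model analysis" note (quoted above) in the
   generated .slp1/.slp2 dump files.  */
typedef unsigned int u32v4 __attribute__((vector_size(16)));

unsigned int
f (u32v4 a, u32v4 b)
{
  return a[0] + a[1];
}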
On Mon, 7 Nov 2022, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguenther@suse.de> > > Sent: Monday, November 7, 2022 10:18 AM > > To: Tamar Christina <Tamar.Christina@arm.com> > > Cc: Richard Biener <richard.guenther@gmail.com>; gcc- > > patches@gcc.gnu.org; nd <nd@arm.com> > > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from > > bitfields and array_refs > > > > On Mon, 7 Nov 2022, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <richard.guenther@gmail.com> > > > > Sent: Saturday, November 5, 2022 11:33 AM > > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de > > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from > > > > bitfields and array_refs > > > > > > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches > > > > <gcc- patches@gcc.gnu.org> wrote: > > > > > > > > > > Hi All, > > > > > > > > > > This patch series is to add recognition of pairwise operations > > > > > (reductions) in match.pd such that we can benefit from them even > > > > > at > > > > > -O1 when the vectorizer isn't enabled. > > > > > > > > > > Ths use of these allow for a lot simpler codegen in AArch64 and > > > > > allows us to avoid quite a lot of codegen warts. > > > > > > > > > > As an example a simple: > > > > > > > > > > typedef float v4sf __attribute__((vector_size (16))); > > > > > > > > > > float > > > > > foo3 (v4sf x) > > > > > { > > > > > return x[1] + x[2]; > > > > > } > > > > > > > > > > currently generates: > > > > > > > > > > foo3: > > > > > dup s1, v0.s[1] > > > > > dup s0, v0.s[2] > > > > > fadd s0, s1, s0 > > > > > ret > > > > > > > > > > while with this patch series now generates: > > > > > > > > > > foo3: > > > > > ext v0.16b, v0.16b, v0.16b, #4 > > > > > faddp s0, v0.2s > > > > > ret > > > > > > > > > > This patch will not perform the operation if the source is not a > > > > > gimple register and leaves memory sources to the vectorizer as > > > > > it's able to deal correctly with clobbers. > > > > > > > > But the vectorizer should also be able to cope with the above. > > > > > > There are several problems with leaving it up to the vectorizer to do: > > > > > > 1. We only get it at -O2 and higher. > > > 2. The way the vectorizer costs the reduction makes the resulting cost > > always too high for AArch64. > > > > > > As an example the following: > > > > > > typedef unsigned int u32v4 __attribute__((vector_size(16))); unsigned > > > int f (u32v4 a, u32v4 b) { > > > return a[0] + a[1]; > > > } > > > > > > Doesn't get SLP'ed because the vectorizer costs it as: > > > > > > node 0x485eb30 0 times vec_perm costs 0 in body > > > _1 + _2 1 times vector_stmt costs 1 in body > > > _1 + _2 1 times vec_perm costs 2 in body > > > _1 + _2 1 times vec_to_scalar costs 2 in body > > > > > > And so ultimately you fail because: > > > > > > /app/example.c:8:17: note: Cost model analysis for part in loop 0: > > > Vector cost: 5 > > > Scalar cost: 3 > > > > > > This looks like it's because the vectorizer costs the operation to > > > create the BIT_FIELD_REF <a_3(D), 64, 0>; For the reduction as > > > requiring two scalar extracts and a permute. While it ultimately does > > produce a BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs. > > > > > > This causes the reduction to almost always be more expensive, so > > > unless the rest of the SLP tree amortizes the cost we never generate them. 
> > > > On x86 for example the hadds are prohibitly expensive here. Are you sure > > the horizontal add is actually profitable on arm? Your pattern-matching has > > no cost modeling at all? > > Yes, they are dirt cheap, that's why we use them for a lot of our codegen for > e.g. compressing values. > > > > > > 3. The SLP only happens on operation that are SLP shaped and where SLP > > didn't fail. > > > > > > As a simple example, the vectorizer can't SLP the following: > > > > > > unsigned int f (u32v4 a, u32v4 b) > > > { > > > a[0] += b[0]; > > > return a[0] + a[1]; > > > } > > > > > > Because there's not enough VF here and it can't unroll. This and many > > > others fail because they're not an SLP-able operation, or SLP build fails. > > > > That's of course because the pattern matching for reductions is too simple > > here, getting us a group size of three. Bad association would make your > > simple pattern matching fail as well. > > > > > This causes us to generate for e.g. this example: > > > > > > f: > > > dup s2, v0.s[1] > > > fmov w1, s1 > > > add v0.2s, v2.2s, v0.2s > > > fmov w0, s0 > > > add w0, w0, w1 > > > ret > > > > > > instead of with my patch: > > > > > > f: > > > addp v0.2s, v0.2s, v0.2s > > > add v0.2s, v0.2s, v1.2s > > > fmov w0, s0 > > > ret > > > > > > which is significantly better code. So I don't think the vectorizer is the right > > solution for this. > > > > Simple pattern matching isn't either. In fact basic-block SLP is supposed to be > > the advanced pattern matching including a cost model. > > The cost model seems a bit moot here, at least on AArch64. There is no sequence > of events that would make these pairwise operations more expensive then the > alternative, which is to do vector extraction and crossing a register file to do simple > addition. > > And in fact the ISA classifies these instructions as scalar not vector, and it doesn't > seem right to need the vectorizer for something that's basic codegen. I probably fail to decipher the asm, 'addp v0.2s, v0.2s, v0.2s' either hides the fact that the output is scalar or that the input is vector. > It seems like the problem here is that the current reductions are designed around > x86 specific limitations. So perhaps the solution here is to just have an AArch64 > specific Gimple pass or gate this transform on a target hook, or new cheap reduction > codes. No, the current reductions are designed to be used by the vectorizer - you are using them for GIMPLE peepholing. x86 assumes cost modeling gets applied before using them (but IIRC x86 doesn't have integer horizontal reductions, but the backend has patterns to create optimal sequences for them. The problem with doing the pattern matching too early is that .REDUC_PLUS isn't recognized widely so any followup simplifications are unlikely (like reassociating after inlining, etc.). > > > > IMHO the correct approach is to improve that, > > vect_slp_check_for_constructors plus how we handle/recover from SLP > > discovery fails as in your second example above. > > Is this feasible in the general sense? SLP tree decomposition would then require > you to cost every sub tree possible that gets built. That seems quite expensive... Well, that's kind-of what we do. But how do you figure the "optimal" way to match a[0] + a[1] + a[0] + a[1]? Your match.pd pattern will apply on each and every add in a chain, the SLP pattern matching is careful to only start matching from the _last_ element of a chain so is actually cheaper. 
It just isn't very clever in pruning a non-power-of-two chain or in splitting a chain at points where different sources come in. > The bigger the tree the longer the more you have to decompose.. And the more (random) places you have where eventually your pattern matches. > > > > > > I don't think > > > > we want to do this as part of general folding. Iff, then this > > > > belongs in specific points of the pass pipeline, no? > > > > > > The reason I currently have it as such is because in general the > > > compiler doesn't really deal with horizontal reductions at all. Also > > > since the vectorizer itself can introduce reductions I figured it's > > > better to have one representation for this. So admittedly perhaps this > > should only be done after vectorization as that's when today we expect > > reductions to be in Gimple. > > > > > > As for having it in a specific point in the pass pipeline, I have it > > > as a general one since a number of passes could create the form for > > > the reduction, for instance vec_lower could break up an operation to allow > > this to match. The bigger BIT_FIELD_EXPR it creates could also lead to other > > optimizations. > > > > > > Additionally you had mentioned last time that Andrew was trying to > > > move min/max detection to match.pd So I had figured this was the correct > > place for it. > > > > That's mostly because we have fold-const patterns for ?: min/max and CFG > > patterns for min/max in phiopt and it's possible to unify both. > > > > > That said I have no intuition for what would be better here. Since the > > > check is quite cheap. But do you have a particular place you want > > > this move to then? Ideally I'd want it before the last FRE pass, but perhaps > > isel? > > > > As said, I think it belongs where we can do costing which means the > > vectorizer. Iff there are two/three instruction sequences that can be > > peepholed do it in the targets machine description instead. > > We can't do it in RTL, because we don't know when things are sequentially > are after register allocator, for integer mode these would have then been > assigned to scalar hard register. And so this is unrecoverable. > > So quite literally, you cannot peephole this. You also cannot use combine > because there's no way to ensure that reload generates the same register > from subregs. So it's not easily possible the within current infrastructure. But it does look like ARM might eventually benefit from something like STV on x86? Richard. > Thanks, > Tamar > > > Richard. > > > > > Thanks, > > > Tamar > > > > > > > > > > > > The use of these instruction makes a significant difference in > > > > > codegen quality for AArch64 and Arm. > > > > > > > > > > NOTE: The last entry in the series contains tests for all of the > > > > > previous patches as it's a bit of an all or nothing thing. > > > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > > > x86_64-pc-linux-gnu and no issues. > > > > > > > > > > Ok for master? > > > > > > > > > > Thanks, > > > > > Tamar > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > * match.pd (adjacent_data_access_p): Import. > > > > > Add new pattern for bitwise plus, min, max, fmax, fmin. > > > > > * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs. > > > > > * tree.cc (adjacent_data_access_p): New. > > > > > * tree.h (adjacent_data_access_p): New. 
> > > > > > > > > > --- inline copy of patch -- > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > > > > > > > > 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10 > > > > eb1 > > > > > 8db23c0cd9f7 100644 > > > > > --- a/gcc/match.pd > > > > > +++ b/gcc/match.pd > > > > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3. If not see > > > > > HONOR_NANS > > > > > uniform_vector_p > > > > > expand_vec_cmp_expr_p > > > > > - bitmask_inv_cst_vector_p) > > > > > + bitmask_inv_cst_vector_p > > > > > + adjacent_data_access_p) > > > > > > > > > > /* Operator lists. */ > > > > > (define_operator_list tcc_comparison @@ -7195,6 +7196,47 @@ > > > > > DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > > > > > > > /* Canonicalizations of BIT_FIELD_REFs. */ > > > > > > > > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ (for op > > > > > +(plus min max FMIN_ALL FMAX_ALL) > > > > > + ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX > > > > > + IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > > > + (op @0 @1) > > > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > > > + (with { poly_uint64 nloc = 0; > > > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, true); > > > > > + tree ntype = build_vector_type (type, 2); > > > > > + tree size = TYPE_SIZE (ntype); > > > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > > > + poly_uint64 _sz; > > > > > + poly_uint64 _total; } > > > > > + (if (src && is_gimple_reg (src) && ntype > > > > > + && poly_int_tree_p (size, &_sz) > > > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > > > + && known_ge (_total, _sz + nloc)) > > > > > + (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; > > > > > +}))))))) > > > > > + > > > > > +(for op (lt gt) > > > > > + ifni (IFN_REDUC_MIN IFN_REDUC_MAX) > > > > > + ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > > > + (cond (op @0 @1) @0 @1) > > > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > > > + (with { poly_uint64 nloc = 0; > > > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, false); > > > > > + tree ntype = build_vector_type (type, 2); > > > > > + tree size = TYPE_SIZE (ntype); > > > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > > > + poly_uint64 _sz; > > > > > + poly_uint64 _total; } > > > > > + (if (src && is_gimple_reg (src) && ntype > > > > > + && poly_int_tree_p (size, &_sz) > > > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > > > + && known_ge (_total, _sz + nloc)) > > > > > + (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type))) > > > > > + (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; })) > > > > > + (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; > > > > > +})))))))) > > > > > + > > > > > (simplify > > > > > (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4) > > > > > (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, > > > > > @4); > > > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index > > > > > > > > > > > 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006 > > > > 603 > > > > > ae1e3deb6b76 100644 > > > > > --- a/gcc/tree-cfg.cc > > > > > +++ b/gcc/tree-cfg.cc > > > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt) > > > > > { > > > > > tree arg = gimple_call_arg (stmt, i); > > > > > if ((is_gimple_reg_type (TREE_TYPE (arg)) > > > > > + && !is_gimple_variable (arg) > > > > > && !is_gimple_val (arg)) > > > > > || (!is_gimple_reg_type (TREE_TYPE (arg)) > > > > > && !is_gimple_lvalue (arg))) 
diff --git a/gcc/tree.h > > > > > b/gcc/tree.h index > > > > > > > > > > > e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194 > > > > 640 > > > > > b8c1b531b798 100644 > > > > > --- a/gcc/tree.h > > > > > +++ b/gcc/tree.h > > > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree); > > > > > > > > > > extern tree bitmask_inv_cst_vector_p (tree); > > > > > > > > > > +/* TRUE if the two operands represent adjacent access of data such > > that a > > > > > + pairwise operation can be used. */ > > > > > + > > > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, > > > > > +bool); > > > > > + > > > > > /* integer_nonzerop (tree x) is nonzero if X is an integer constant > > > > > with a nonzero value. */ > > > > > > > > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index > > > > > > > > > > > 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b1 > > > > 9 > > > > d > > > > > e9841e29d311 100644 > > > > > --- a/gcc/tree.cc > > > > > +++ b/gcc/tree.cc > > > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t) > > > > > return builder.build (); > > > > > } > > > > > > > > > > +/* Returns base address if the two operands represent adjacent > > > > > +access of > > > > data > > > > > + such that a pairwise operation can be used. OP1 must be a > > > > > + lower > > > > subpart > > > > > + than OP2. If POS is not NULL then on return if a value is returned > > POS > > > > > + will indicate the position of the lower address. If > > > > > + COMMUTATIVE_P > > > > then > > > > > + the operation is also tried by flipping op1 and op2. */ > > > > > + > > > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos, > > > > > + bool commutative_p) { > > > > > + gcc_assert (op1); > > > > > + gcc_assert (op2); > > > > > + if (TREE_CODE (op1) != TREE_CODE (op2) > > > > > + || TREE_TYPE (op1) != TREE_TYPE (op2)) > > > > > + return NULL; > > > > > + > > > > > + tree type = TREE_TYPE (op1); > > > > > + gimple *stmt1 = NULL, *stmt2 = NULL; unsigned int bits = > > > > > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type))); > > > > > + > > > > > + if (TREE_CODE (op1) == BIT_FIELD_REF > > > > > + && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND > > > > > + (op2, > > > > 0), 0) > > > > > + && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND > > > > > + (op2, > > > > 1), 0) > > > > > + && known_eq (bit_field_size (op1), bits)) > > > > > + { > > > > > + poly_uint64 offset1 = bit_field_offset (op1); > > > > > + poly_uint64 offset2 = bit_field_offset (op2); > > > > > + if (known_eq (offset2 - offset1, bits)) > > > > > + { > > > > > + if (pos) > > > > > + *pos = offset1; > > > > > + return TREE_OPERAND (op1, 0); > > > > > + } > > > > > + else if (commutative_p && known_eq (offset1 - offset2, bits)) > > > > > + { > > > > > + if (pos) > > > > > + *pos = offset2; > > > > > + return TREE_OPERAND (op1, 0); > > > > > + } > > > > > + } > > > > > + else if (TREE_CODE (op1) == ARRAY_REF > > > > > + && operand_equal_p (get_base_address (op1), > > > > > + get_base_address > > > > (op2))) > > > > > + { > > > > > + wide_int size1 = wi::to_wide (array_ref_element_size (op1)); > > > > > + wide_int size2 = wi::to_wide (array_ref_element_size (op2)); > > > > > + if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8) > > > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1)) > > > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1))) > > > > > + return NULL; > > > > > + > > > > > + poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 
> > 1)); > > > > > + poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, > > 1)); > > > > > + if (known_eq (offset2 - offset1, 1UL)) > > > > > + { > > > > > + if (pos) > > > > > + *pos = offset1 * bits; > > > > > + return TREE_OPERAND (op1, 0); > > > > > + } > > > > > + else if (commutative_p && known_eq (offset1 - offset2, 1UL)) > > > > > + { > > > > > + if (pos) > > > > > + *pos = offset2 * bits; > > > > > + return TREE_OPERAND (op1, 0); > > > > > + } > > > > > + } > > > > > + else if (TREE_CODE (op1) == SSA_NAME > > > > > + && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL > > > > > + && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL > > > > > + && is_gimple_assign (stmt1) > > > > > + && is_gimple_assign (stmt2)) > > > > > + { > > > > > + if (gimple_assign_rhs_code (stmt1) != ARRAY_REF > > > > > + && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF > > > > > + && gimple_assign_rhs_code (stmt2) != ARRAY_REF > > > > > + && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF) > > > > > + return NULL; > > > > > + > > > > > + return adjacent_data_access_p (gimple_assign_rhs1 (stmt1), > > > > > + gimple_assign_rhs1 (stmt2), pos, > > > > > + commutative_p); > > > > > + } > > > > > + > > > > > + return NULL; > > > > > +} > > > > > + > > > > > /* If VECTOR_CST T has a single nonzero element, return the index of > > that > > > > > element, otherwise return -1. */ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, > > Boudien Moerman; HRB 36809 (AG Nuernberg) >
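To make the chain concern above concrete, a small illustration (not from the thread; the function name is invented):

typedef unsigned int u32v4 __attribute__((vector_size(16)));

unsigned int
chain (u32v4 a)
{
  /* The proposed match.pd rule can already fire on the innermost
     a[0] + a[1] when that statement is folded, without seeing the rest of
     the expression, whereas BB SLP only starts matching from the last
     element of the chain and so can consider the whole sum at once.  */
  return a[0] + a[1] + a[0] + a[1];
}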
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Monday, November 7, 2022 11:23 AM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: Richard Biener <richard.guenther@gmail.com>; gcc- > patches@gcc.gnu.org; nd <nd@arm.com> > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from > bitfields and array_refs > > On Mon, 7 Nov 2022, Tamar Christina wrote: > > > > -----Original Message----- > > > From: Richard Biener <rguenther@suse.de> > > > Sent: Monday, November 7, 2022 10:18 AM > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > Cc: Richard Biener <richard.guenther@gmail.com>; gcc- > > > patches@gcc.gnu.org; nd <nd@arm.com> > > > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from > > > bitfields and array_refs > > > > > > On Mon, 7 Nov 2022, Tamar Christina wrote: > > > > > > > > -----Original Message----- > > > > > From: Richard Biener <richard.guenther@gmail.com> > > > > > Sent: Saturday, November 5, 2022 11:33 AM > > > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; > rguenther@suse.de > > > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions > > > > > from bitfields and array_refs > > > > > > > > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches > > > > > <gcc- patches@gcc.gnu.org> wrote: > > > > > > > > > > > > Hi All, > > > > > > > > > > > > This patch series is to add recognition of pairwise operations > > > > > > (reductions) in match.pd such that we can benefit from them > > > > > > even at > > > > > > -O1 when the vectorizer isn't enabled. > > > > > > > > > > > > Ths use of these allow for a lot simpler codegen in AArch64 > > > > > > and allows us to avoid quite a lot of codegen warts. > > > > > > > > > > > > As an example a simple: > > > > > > > > > > > > typedef float v4sf __attribute__((vector_size (16))); > > > > > > > > > > > > float > > > > > > foo3 (v4sf x) > > > > > > { > > > > > > return x[1] + x[2]; > > > > > > } > > > > > > > > > > > > currently generates: > > > > > > > > > > > > foo3: > > > > > > dup s1, v0.s[1] > > > > > > dup s0, v0.s[2] > > > > > > fadd s0, s1, s0 > > > > > > ret > > > > > > > > > > > > while with this patch series now generates: > > > > > > > > > > > > foo3: > > > > > > ext v0.16b, v0.16b, v0.16b, #4 > > > > > > faddp s0, v0.2s > > > > > > ret > > > > > > > > > > > > This patch will not perform the operation if the source is not > > > > > > a gimple register and leaves memory sources to the vectorizer > > > > > > as it's able to deal correctly with clobbers. > > > > > > > > > > But the vectorizer should also be able to cope with the above. > > > > > > > > There are several problems with leaving it up to the vectorizer to do: > > > > > > > > 1. We only get it at -O2 and higher. > > > > 2. The way the vectorizer costs the reduction makes the resulting > > > > cost > > > always too high for AArch64. 
> > > > > > > > As an example the following: > > > > > > > > typedef unsigned int u32v4 __attribute__((vector_size(16))); > > > > unsigned int f (u32v4 a, u32v4 b) { > > > > return a[0] + a[1]; > > > > } > > > > > > > > Doesn't get SLP'ed because the vectorizer costs it as: > > > > > > > > node 0x485eb30 0 times vec_perm costs 0 in body > > > > _1 + _2 1 times vector_stmt costs 1 in body > > > > _1 + _2 1 times vec_perm costs 2 in body > > > > _1 + _2 1 times vec_to_scalar costs 2 in body > > > > > > > > And so ultimately you fail because: > > > > > > > > /app/example.c:8:17: note: Cost model analysis for part in loop 0: > > > > Vector cost: 5 > > > > Scalar cost: 3 > > > > > > > > This looks like it's because the vectorizer costs the operation to > > > > create the BIT_FIELD_REF <a_3(D), 64, 0>; For the reduction as > > > > requiring two scalar extracts and a permute. While it ultimately > > > > does > > > produce a BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs. > > > > > > > > This causes the reduction to almost always be more expensive, so > > > > unless the rest of the SLP tree amortizes the cost we never generate > them. > > > > > > On x86 for example the hadds are prohibitly expensive here. Are you > > > sure the horizontal add is actually profitable on arm? Your > > > pattern-matching has no cost modeling at all? > > > > Yes, they are dirt cheap, that's why we use them for a lot of our > > codegen for e.g. compressing values. > > > > > > > > > 3. The SLP only happens on operation that are SLP shaped and where > > > > SLP > > > didn't fail. > > > > > > > > As a simple example, the vectorizer can't SLP the following: > > > > > > > > unsigned int f (u32v4 a, u32v4 b) > > > > { > > > > a[0] += b[0]; > > > > return a[0] + a[1]; > > > > } > > > > > > > > Because there's not enough VF here and it can't unroll. This and > > > > many others fail because they're not an SLP-able operation, or SLP build > fails. > > > > > > That's of course because the pattern matching for reductions is too > > > simple here, getting us a group size of three. Bad association > > > would make your simple pattern matching fail as well. > > > > > > > This causes us to generate for e.g. this example: > > > > > > > > f: > > > > dup s2, v0.s[1] > > > > fmov w1, s1 > > > > add v0.2s, v2.2s, v0.2s > > > > fmov w0, s0 > > > > add w0, w0, w1 > > > > ret > > > > > > > > instead of with my patch: > > > > > > > > f: > > > > addp v0.2s, v0.2s, v0.2s > > > > add v0.2s, v0.2s, v1.2s > > > > fmov w0, s0 > > > > ret > > > > > > > > which is significantly better code. So I don't think the > > > > vectorizer is the right > > > solution for this. > > > > > > Simple pattern matching isn't either. In fact basic-block SLP is > > > supposed to be the advanced pattern matching including a cost model. > > > > The cost model seems a bit moot here, at least on AArch64. There is > > no sequence of events that would make these pairwise operations more > > expensive then the alternative, which is to do vector extraction and > > crossing a register file to do simple addition. > > > > And in fact the ISA classifies these instructions as scalar not > > vector, and it doesn't seem right to need the vectorizer for something > that's basic codegen. > > I probably fail to decipher the asm, 'addp v0.2s, v0.2s, v0.2s' > either hides the fact that the output is scalar or that the input is vector. That's because of the codegen trick we use to get it for integers as well. e.g. 
https://developer.arm.com/documentation/dui0801/h/A64-SIMD-Scalar-Instructions/FADDP--scalar- is the reduction for floats. The addp v0.2s is just because there wasn't a point in having both a three operand and two operand version of the instruction. > > > It seems like the problem here is that the current reductions are > > designed around > > x86 specific limitations. So perhaps the solution here is to just > > have an AArch64 specific Gimple pass or gate this transform on a > > target hook, or new cheap reduction codes. > > No, the current reductions are designed to be used by the vectorizer - you > are using them for GIMPLE peepholing. x86 assumes cost modeling gets > applied before using them (but IIRC x86 doesn't have integer horizontal > reductions, but the backend has patterns to create optimal sequences for > them. > > The problem with doing the pattern matching too early is that .REDUC_PLUS > isn't recognized widely so any followup simplifications are unlikely (like > reassociating after inlining, etc.). Agreed, but that can be solved by doing the replacement late per the previous emails. > > > > > > IMHO the correct approach is to improve that, > > > vect_slp_check_for_constructors plus how we handle/recover from SLP > > > discovery fails as in your second example above. > > > > Is this feasible in the general sense? SLP tree decomposition would > > then require you to cost every sub tree possible that gets built. That seems > quite expensive... > > Well, that's kind-of what we do. But how do you figure the "optimal" > way to match a[0] + a[1] + a[0] + a[1]? > I'd expect either pre or post order would both end up doing the optimal thing, So either match the first two, or second two first. If it decided to match the middle two that's also fine but requires re-association to get to the final version. Sequence. > Your match.pd pattern will apply on each and every add in a chain, the SLP > pattern matching is careful to only start matching from the _last_ element of > a chain so is actually cheaper. It just isn't very clever in pruning a non-power- > of-two chain or in splitting a chain at points where different sources come in. > > > The bigger the tree the longer the more you have to decompose.. > > And the more (random) places you have where eventually your pattern > matches. > Yes true, but isn't the point of match.pd to amortize the costs of doing this by grouping the same class of operations together? > > > > > > > > I don't think > > > > > we want to do this as part of general folding. Iff, then this > > > > > belongs in specific points of the pass pipeline, no? > > > > > > > > The reason I currently have it as such is because in general the > > > > compiler doesn't really deal with horizontal reductions at all. > > > > Also since the vectorizer itself can introduce reductions I > > > > figured it's better to have one representation for this. So > > > > admittedly perhaps this > > > should only be done after vectorization as that's when today we > > > expect reductions to be in Gimple. > > > > > > > > As for having it in a specific point in the pass pipeline, I have > > > > it as a general one since a number of passes could create the form > > > > for the reduction, for instance vec_lower could break up an > > > > operation to allow > > > this to match. The bigger BIT_FIELD_EXPR it creates could also lead > > > to other optimizations. 
> > > > > > > > Additionally you had mentioned last time that Andrew was trying to > > > > move min/max detection to match.pd So I had figured this was the > > > > correct > > > place for it. > > > > > > That's mostly because we have fold-const patterns for ?: min/max and > > > CFG patterns for min/max in phiopt and it's possible to unify both. > > > > > > > That said I have no intuition for what would be better here. Since > > > > the check is quite cheap. But do you have a particular place you > > > > want this move to then? Ideally I'd want it before the last FRE > > > > pass, but perhaps > > > isel? > > > > > > As said, I think it belongs where we can do costing which means the > > > vectorizer. Iff there are two/three instruction sequences that can > > > be peepholed do it in the targets machine description instead. > > > > We can't do it in RTL, because we don't know when things are > > sequentially are after register allocator, for integer mode these > > would have then been assigned to scalar hard register. And so this is > unrecoverable. > > > > So quite literally, you cannot peephole this. You also cannot use > > combine because there's no way to ensure that reload generates the > > same register from subregs. > > So it's not easily possible the within current infrastructure. But it does look > like ARM might eventually benefit from something like STV on x86? > I'm not sure. The problem with trying to do this in RTL is that you'd have to be able to decide from two psuedos whether they come from extracts that are sequential. When coming in from a hard register that's easy yes. When coming in from a load, or any other operation that produces psuedos that becomes harder. But ok, I guess from this thread I can see the patch is dead so I'll drop it. Thanks, Tamar > Richard. > > > Thanks, > > Tamar > > > > > Richard. > > > > > > > Thanks, > > > > Tamar > > > > > > > > > > > > > > > The use of these instruction makes a significant difference in > > > > > > codegen quality for AArch64 and Arm. > > > > > > > > > > > > NOTE: The last entry in the series contains tests for all of > > > > > > the previous patches as it's a bit of an all or nothing thing. > > > > > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > > > > x86_64-pc-linux-gnu and no issues. > > > > > > > > > > > > Ok for master? > > > > > > > > > > > > Thanks, > > > > > > Tamar > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > * match.pd (adjacent_data_access_p): Import. > > > > > > Add new pattern for bitwise plus, min, max, fmax, fmin. > > > > > > * tree-cfg.cc (verify_gimple_call): Allow function arguments in > IFNs. > > > > > > * tree.cc (adjacent_data_access_p): New. > > > > > > * tree.h (adjacent_data_access_p): New. > > > > > > > > > > > > --- inline copy of patch -- > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > > > > > > > > > > > > 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10 > > > > > eb1 > > > > > > 8db23c0cd9f7 100644 > > > > > > --- a/gcc/match.pd > > > > > > +++ b/gcc/match.pd > > > > > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3. If not > see > > > > > > HONOR_NANS > > > > > > uniform_vector_p > > > > > > expand_vec_cmp_expr_p > > > > > > - bitmask_inv_cst_vector_p) > > > > > > + bitmask_inv_cst_vector_p > > > > > > + adjacent_data_access_p) > > > > > > > > > > > > /* Operator lists. 
*/ > > > > > > (define_operator_list tcc_comparison @@ -7195,6 +7196,47 @@ > > > > > > DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > > > > > > > > > /* Canonicalizations of BIT_FIELD_REFs. */ > > > > > > > > > > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ > > > > > > +(for op (plus min max FMIN_ALL FMAX_ALL) > > > > > > + ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX > > > > > > + IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > > > > + (op @0 @1) > > > > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > > > > + (with { poly_uint64 nloc = 0; > > > > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, true); > > > > > > + tree ntype = build_vector_type (type, 2); > > > > > > + tree size = TYPE_SIZE (ntype); > > > > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > > > > + poly_uint64 _sz; > > > > > > + poly_uint64 _total; } > > > > > > + (if (src && is_gimple_reg (src) && ntype > > > > > > + && poly_int_tree_p (size, &_sz) > > > > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > > > > + && known_ge (_total, _sz + nloc)) > > > > > > + (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; > > > > > > +}))))))) > > > > > > + > > > > > > +(for op (lt gt) > > > > > > + ifni (IFN_REDUC_MIN IFN_REDUC_MAX) > > > > > > + ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX) (simplify > > > > > > + (cond (op @0 @1) @0 @1) > > > > > > + (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type)) > > > > > > + (with { poly_uint64 nloc = 0; > > > > > > + tree src = adjacent_data_access_p (@0, @1, &nloc, false); > > > > > > + tree ntype = build_vector_type (type, 2); > > > > > > + tree size = TYPE_SIZE (ntype); > > > > > > + tree pos = build_int_cst (TREE_TYPE (size), nloc); > > > > > > + poly_uint64 _sz; > > > > > > + poly_uint64 _total; } > > > > > > + (if (src && is_gimple_reg (src) && ntype > > > > > > + && poly_int_tree_p (size, &_sz) > > > > > > + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total) > > > > > > + && known_ge (_total, _sz + nloc)) > > > > > > + (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type))) > > > > > > + (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; })) > > > > > > + (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; > > > > > > +})))))))) > > > > > > + > > > > > > (simplify > > > > > > (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4) > > > > > > (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, > > > > > > @2, @4); > > > > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index > > > > > > > > > > > > > > > 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006 > > > > > 603 > > > > > > ae1e3deb6b76 100644 > > > > > > --- a/gcc/tree-cfg.cc > > > > > > +++ b/gcc/tree-cfg.cc > > > > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt) > > > > > > { > > > > > > tree arg = gimple_call_arg (stmt, i); > > > > > > if ((is_gimple_reg_type (TREE_TYPE (arg)) > > > > > > + && !is_gimple_variable (arg) > > > > > > && !is_gimple_val (arg)) > > > > > > || (!is_gimple_reg_type (TREE_TYPE (arg)) > > > > > > && !is_gimple_lvalue (arg))) diff --git > > > > > > a/gcc/tree.h b/gcc/tree.h index > > > > > > > > > > > > > > > e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194 > > > > > 640 > > > > > > b8c1b531b798 100644 > > > > > > --- a/gcc/tree.h > > > > > > +++ b/gcc/tree.h > > > > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree); > > > > > > > > > > > > extern tree bitmask_inv_cst_vector_p (tree); > > > > > > > > > > > > +/* TRUE if the two operands represent 
adjacent access of data > > > > > > +such > > > that a > > > > > > + pairwise operation can be used. */ > > > > > > + > > > > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, > > > > > > +bool); > > > > > > + > > > > > > /* integer_nonzerop (tree x) is nonzero if X is an integer constant > > > > > > with a nonzero value. */ > > > > > > > > > > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index > > > > > > > > > > > > > > > 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b1 > > > > > 9 > > > > > d > > > > > > e9841e29d311 100644 > > > > > > --- a/gcc/tree.cc > > > > > > +++ b/gcc/tree.cc > > > > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t) > > > > > > return builder.build (); > > > > > > } > > > > > > > > > > > > +/* Returns base address if the two operands represent > > > > > > +adjacent access of > > > > > data > > > > > > + such that a pairwise operation can be used. OP1 must be a > > > > > > + lower > > > > > subpart > > > > > > + than OP2. If POS is not NULL then on return if a value is > > > > > > + returned > > > POS > > > > > > + will indicate the position of the lower address. If > > > > > > + COMMUTATIVE_P > > > > > then > > > > > > + the operation is also tried by flipping op1 and op2. */ > > > > > > + > > > > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 > *pos, > > > > > > + bool commutative_p) { > > > > > > + gcc_assert (op1); > > > > > > + gcc_assert (op2); > > > > > > + if (TREE_CODE (op1) != TREE_CODE (op2) > > > > > > + || TREE_TYPE (op1) != TREE_TYPE (op2)) > > > > > > + return NULL; > > > > > > + > > > > > > + tree type = TREE_TYPE (op1); gimple *stmt1 = NULL, *stmt2 > > > > > > + = NULL; unsigned int bits = GET_MODE_BITSIZE > > > > > > + (GET_MODE_INNER (TYPE_MODE (type))); > > > > > > + > > > > > > + if (TREE_CODE (op1) == BIT_FIELD_REF > > > > > > + && operand_equal_p (TREE_OPERAND (op1, 0), > TREE_OPERAND > > > > > > + (op2, > > > > > 0), 0) > > > > > > + && operand_equal_p (TREE_OPERAND (op1, 1), > TREE_OPERAND > > > > > > + (op2, > > > > > 1), 0) > > > > > > + && known_eq (bit_field_size (op1), bits)) > > > > > > + { > > > > > > + poly_uint64 offset1 = bit_field_offset (op1); > > > > > > + poly_uint64 offset2 = bit_field_offset (op2); > > > > > > + if (known_eq (offset2 - offset1, bits)) > > > > > > + { > > > > > > + if (pos) > > > > > > + *pos = offset1; > > > > > > + return TREE_OPERAND (op1, 0); > > > > > > + } > > > > > > + else if (commutative_p && known_eq (offset1 - offset2, bits)) > > > > > > + { > > > > > > + if (pos) > > > > > > + *pos = offset2; > > > > > > + return TREE_OPERAND (op1, 0); > > > > > > + } > > > > > > + } > > > > > > + else if (TREE_CODE (op1) == ARRAY_REF > > > > > > + && operand_equal_p (get_base_address (op1), > > > > > > + get_base_address > > > > > (op2))) > > > > > > + { > > > > > > + wide_int size1 = wi::to_wide (array_ref_element_size (op1)); > > > > > > + wide_int size2 = wi::to_wide (array_ref_element_size (op2)); > > > > > > + if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8) > > > > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1)) > > > > > > + || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1))) > > > > > > + return NULL; > > > > > > + > > > > > > + poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND > > > > > > + (op1, > > > 1)); > > > > > > + poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND > > > > > > + (op2, > > > 1)); > > > > > > + if (known_eq (offset2 - offset1, 1UL)) > > > > > > + { > > > > > > + if (pos) > > > > > > + 
*pos = offset1 * bits; > > > > > > + return TREE_OPERAND (op1, 0); > > > > > > + } > > > > > > + else if (commutative_p && known_eq (offset1 - offset2, 1UL)) > > > > > > + { > > > > > > + if (pos) > > > > > > + *pos = offset2 * bits; > > > > > > + return TREE_OPERAND (op1, 0); > > > > > > + } > > > > > > + } > > > > > > + else if (TREE_CODE (op1) == SSA_NAME > > > > > > + && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL > > > > > > + && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL > > > > > > + && is_gimple_assign (stmt1) > > > > > > + && is_gimple_assign (stmt2)) > > > > > > + { > > > > > > + if (gimple_assign_rhs_code (stmt1) != ARRAY_REF > > > > > > + && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF > > > > > > + && gimple_assign_rhs_code (stmt2) != ARRAY_REF > > > > > > + && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF) > > > > > > + return NULL; > > > > > > + > > > > > > + return adjacent_data_access_p (gimple_assign_rhs1 (stmt1), > > > > > > + gimple_assign_rhs1 (stmt2), pos, > > > > > > + commutative_p); > > > > > > + } > > > > > > + > > > > > > + return NULL; > > > > > > +} > > > > > > + > > > > > > /* If VECTOR_CST T has a single nonzero element, return the > > > > > > index of > > > that > > > > > > element, otherwise return -1. */ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > -- > > > Richard Biener <rguenther@suse.de> > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 > > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, > > > Boudien Moerman; HRB 36809 (AG Nuernberg) > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, > Boudien Moerman; HRB 36809 (AG Nuernberg)
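As an aside, the integer form of the "codegen trick" mentioned above can also be seen with ACLE intrinsics. The snippet below is only an illustration (not part of the patch); the function name is made up:

#include <arm_neon.h>

unsigned int
reduce_pair (uint32x2_t v)
{
  /* ADDP only has the three-operand vector form for 32-bit elements, so the
     pairwise reduction reuses the same source register for both inputs:
       addp v0.2s, v0.2s, v0.2s
     after which each lane holds v[0] + v[1]; the scalar result is then just
     lane 0.  */
  return vget_lane_u32 (vpadd_u32 (v, v), 0);
}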
Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> So it's not easily possible within the current infrastructure.  But it does look
>> like ARM might eventually benefit from something like STV on x86?
>>
>
> I'm not sure.  The problem with trying to do this in RTL is that you'd have to be
> able to decide from two pseudos whether they come from extracts that are
> sequential.  When coming in from a hard register that's easy yes.  When coming in
> from a load, or any other operation that produces pseudos, that becomes harder.

Yeah.

Just in case anyone reading the above is tempted to implement STV for
AArch64: I think it would set a bad precedent if we had a paste-&-adjust
version of the x86 pass.  AFAIK, the target capabilities and constraints
are mostly modelled correctly using existing mechanisms, so I don't
think there's anything particularly target-specific about the process
of forcing things to be on the general or SIMD/FP side.

So if we did have an STV-ish thing for AArch64, I think it should be
a target-independent pass that uses hooks and recog, even if the pass
is initially enabled for AArch64 only.

(FWIW, on the patch itself, I tend to agree that this is really an
SLP optimisation.  If the vectoriser fails to see the benefit, or if
it fails to handle more complex cases, then it would be good to try
to fix that.)

Thanks,
Richard
On Tue, 22 Nov 2022, Richard Sandiford wrote:

> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> So it's not easily possible within the current infrastructure.  But it does look
> >> like ARM might eventually benefit from something like STV on x86?
> >>
> >
> > I'm not sure.  The problem with trying to do this in RTL is that you'd have to be
> > able to decide from two pseudos whether they come from extracts that are
> > sequential.  When coming in from a hard register that's easy yes.  When coming in
> > from a load, or any other operation that produces pseudos, that becomes harder.
>
> Yeah.
>
> Just in case anyone reading the above is tempted to implement STV for
> AArch64: I think it would set a bad precedent if we had a paste-&-adjust
> version of the x86 pass.  AFAIK, the target capabilities and constraints
> are mostly modelled correctly using existing mechanisms, so I don't
> think there's anything particularly target-specific about the process
> of forcing things to be on the general or SIMD/FP side.
>
> So if we did have an STV-ish thing for AArch64, I think it should be
> a target-independent pass that uses hooks and recog, even if the pass
> is initially enabled for AArch64 only.

Agreed - maybe some of the x86 code can be leveraged, but of course
the cost modeling is the most difficult to get right - IIRC the x86
backend resorts to backend specific tuning flags rather than trying
to get rtx_cost or insn_cost "correct" here.

> (FWIW, on the patch itself, I tend to agree that this is really an
> SLP optimisation.  If the vectoriser fails to see the benefit, or if
> it fails to handle more complex cases, then it would be good to try
> to fix that.)

Also agreed - but costing is hard ;)

Richard.
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, November 22, 2022 10:59 AM
> To: Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> Christina <Tamar.Christina@arm.com>; Richard Biener
> <richard.guenther@gmail.com>; nd <nd@arm.com>
> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
>
> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>
> > Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> So it's not easily possible within the current infrastructure.  But
> > >> it does look like ARM might eventually benefit from something like STV
> on x86?
> > >>
> > >
> > > I'm not sure.  The problem with trying to do this in RTL is that
> > > you'd have to be able to decide from two pseudos whether they come
> > > from extracts that are sequential.  When coming in from a hard
> > > register that's easy yes.  When coming in from a load, or any other
> operation that produces pseudos, that becomes harder.
> >
> > Yeah.
> >
> > Just in case anyone reading the above is tempted to implement STV for
> > AArch64: I think it would set a bad precedent if we had a
> > paste-&-adjust version of the x86 pass.  AFAIK, the target
> > capabilities and constraints are mostly modelled correctly using
> > existing mechanisms, so I don't think there's anything particularly
> > target-specific about the process of forcing things to be on the general or
> SIMD/FP side.
> >
> > So if we did have an STV-ish thing for AArch64, I think it should be a
> > target-independent pass that uses hooks and recog, even if the pass is
> > initially enabled for AArch64 only.
>
> Agreed - maybe some of the x86 code can be leveraged, but of course the
> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
> "correct" here.
>
> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
> > optimisation.  If the vectoriser fails to see the benefit, or if it
> > fails to handle more complex cases, then it would be good to try to
> > fix that.)
>
> Also agreed - but costing is hard ;)

I guess I still disagree here, but I've clearly been out-Richard'd.  The problem is still
that this is just basic codegen.  I still don't think it requires -O2 to be usable.

So I guess the only correct implementation is to use an STV-like patch.  But given
that this is already the second attempt, the first RTL one was rejected by Richard,
and the second GIMPLE one was rejected by Richi, I'd like to get an agreement on this STV
thing before I waste months more..

Thanks,
Tamar

>
> Richard.
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguenther@suse.de>
>> Sent: Tuesday, November 22, 2022 10:59 AM
>> To: Richard Sandiford <Richard.Sandiford@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
>> Christina <Tamar.Christina@arm.com>; Richard Biener
>> <richard.guenther@gmail.com>; nd <nd@arm.com>
>> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
>> bitfields and array_refs
>>
>> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>>
>> > Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > >> So it's not easily possible within the current infrastructure.  But
>> > >> it does look like ARM might eventually benefit from something like STV
>> on x86?
>> > >>
>> > >
>> > > I'm not sure.  The problem with trying to do this in RTL is that
>> > > you'd have to be able to decide from two pseudos whether they come
>> > > from extracts that are sequential.  When coming in from a hard
>> > > register that's easy yes.  When coming in from a load, or any other
>> operation that produces pseudos, that becomes harder.
>> >
>> > Yeah.
>> >
>> > Just in case anyone reading the above is tempted to implement STV for
>> > AArch64: I think it would set a bad precedent if we had a
>> > paste-&-adjust version of the x86 pass.  AFAIK, the target
>> > capabilities and constraints are mostly modelled correctly using
>> > existing mechanisms, so I don't think there's anything particularly
>> > target-specific about the process of forcing things to be on the general or
>> SIMD/FP side.
>> >
>> > So if we did have an STV-ish thing for AArch64, I think it should be a
>> > target-independent pass that uses hooks and recog, even if the pass is
>> > initially enabled for AArch64 only.
>>
>> Agreed - maybe some of the x86 code can be leveraged, but of course the
>> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
>> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
>> "correct" here.
>>
>> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
>> > optimisation.  If the vectoriser fails to see the benefit, or if it
>> > fails to handle more complex cases, then it would be good to try to
>> > fix that.)
>>
>> Also agreed - but costing is hard ;)
>
> I guess I still disagree here, but I've clearly been out-Richard'd.  The problem is still
> that this is just basic codegen.  I still don't think it requires -O2 to be usable.
>
> So I guess the only correct implementation is to use an STV-like patch.  But given
> that this is already the second attempt, the first RTL one was rejected by Richard,
> and the second GIMPLE one was rejected by Richi, I'd like to get an agreement on this STV
> thing before I waste months more..

I don't think this in itself is a good motivation for STV.  My comment
above was more about the idea of STV for AArch64 in general (since it
had been raised).

Personally I still think the reduction should be generated in gimple.

Thanks,
Richard
On Tue, 22 Nov 2022, Richard Sandiford wrote:

> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Biener <rguenther@suse.de>
> >> Sent: Tuesday, November 22, 2022 10:59 AM
> >> To: Richard Sandiford <Richard.Sandiford@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> >> Christina <Tamar.Christina@arm.com>; Richard Biener
> >> <richard.guenther@gmail.com>; nd <nd@arm.com>
> >> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> >> bitfields and array_refs
> >>
> >> On Tue, 22 Nov 2022, Richard Sandiford wrote:
> >>
> >> > Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > >> So it's not easily possible within the current infrastructure.  But
> >> > >> it does look like ARM might eventually benefit from something like STV
> >> on x86?
> >> > >>
> >> > >
> >> > > I'm not sure.  The problem with trying to do this in RTL is that
> >> > > you'd have to be able to decide from two pseudos whether they come
> >> > > from extracts that are sequential.  When coming in from a hard
> >> > > register that's easy yes.  When coming in from a load, or any other
> >> operation that produces pseudos, that becomes harder.
> >> >
> >> > Yeah.
> >> >
> >> > Just in case anyone reading the above is tempted to implement STV for
> >> > AArch64: I think it would set a bad precedent if we had a
> >> > paste-&-adjust version of the x86 pass.  AFAIK, the target
> >> > capabilities and constraints are mostly modelled correctly using
> >> > existing mechanisms, so I don't think there's anything particularly
> >> > target-specific about the process of forcing things to be on the general or
> >> SIMD/FP side.
> >> >
> >> > So if we did have an STV-ish thing for AArch64, I think it should be a
> >> > target-independent pass that uses hooks and recog, even if the pass is
> >> > initially enabled for AArch64 only.
> >>
> >> Agreed - maybe some of the x86 code can be leveraged, but of course the
> >> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
> >> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
> >> "correct" here.
> >>
> >> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
> >> > optimisation.  If the vectoriser fails to see the benefit, or if it
> >> > fails to handle more complex cases, then it would be good to try to
> >> > fix that.)
> >>
> >> Also agreed - but costing is hard ;)
> >
> > I guess I still disagree here, but I've clearly been out-Richard'd.  The problem is still
> > that this is just basic codegen.  I still don't think it requires -O2 to be usable.
> >
> > So I guess the only correct implementation is to use an STV-like patch.  But given
> > that this is already the second attempt, the first RTL one was rejected by Richard,
> > and the second GIMPLE one was rejected by Richi, I'd like to get an agreement on this STV
> > thing before I waste months more..
>
> I don't think this in itself is a good motivation for STV.  My comment
> above was more about the idea of STV for AArch64 in general (since it
> had been raised).
>
> Personally I still think the reduction should be generated in gimple.

I agree, and the proper place to generate the reduction is in SLP.

Richard.
On 11/22/22 04:08, Richard Biener via Gcc-patches wrote:
> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>> -----Original Message-----
>>>> From: Richard Biener <rguenther@suse.de>
>>>> Sent: Tuesday, November 22, 2022 10:59 AM
>>>> To: Richard Sandiford <Richard.Sandiford@arm.com>
>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
>>>> Christina <Tamar.Christina@arm.com>; Richard Biener
>>>> <richard.guenther@gmail.com>; nd <nd@arm.com>
>>>> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
>>>> bitfields and array_refs
>>>>
>>>> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>>>>
>>>>> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>>>>>>> So it's not easily possible within the current infrastructure.  But
>>>>>>> it does look like ARM might eventually benefit from something like STV
>>>> on x86?
>>>>>> I'm not sure.  The problem with trying to do this in RTL is that
>>>>>> you'd have to be able to decide from two pseudos whether they come
>>>>>> from extracts that are sequential.  When coming in from a hard
>>>>>> register that's easy yes.  When coming in from a load, or any other
>>>> operation that produces pseudos, that becomes harder.
>>>>> Yeah.
>>>>>
>>>>> Just in case anyone reading the above is tempted to implement STV for
>>>>> AArch64: I think it would set a bad precedent if we had a
>>>>> paste-&-adjust version of the x86 pass.  AFAIK, the target
>>>>> capabilities and constraints are mostly modelled correctly using
>>>>> existing mechanisms, so I don't think there's anything particularly
>>>>> target-specific about the process of forcing things to be on the general or
>>>> SIMD/FP side.
>>>>> So if we did have an STV-ish thing for AArch64, I think it should be a
>>>>> target-independent pass that uses hooks and recog, even if the pass is
>>>>> initially enabled for AArch64 only.
>>>> Agreed - maybe some of the x86 code can be leveraged, but of course the
>>>> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
>>>> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
>>>> "correct" here.
>>>>
>>>>> (FWIW, on the patch itself, I tend to agree that this is really an SLP
>>>>> optimisation.  If the vectoriser fails to see the benefit, or if it
>>>>> fails to handle more complex cases, then it would be good to try to
>>>>> fix that.)
>>>> Also agreed - but costing is hard ;)
>>> I guess I still disagree here, but I've clearly been out-Richard'd.  The problem is still
>>> that this is just basic codegen.  I still don't think it requires -O2 to be usable.
>>>
>>> So I guess the only correct implementation is to use an STV-like patch.  But given
>>> that this is already the second attempt, the first RTL one was rejected by Richard,
>>> and the second GIMPLE one was rejected by Richi, I'd like to get an agreement on this STV
>>> thing before I waste months more..
>> I don't think this in itself is a good motivation for STV.  My comment
>> above was more about the idea of STV for AArch64 in general (since it
>> had been raised).
>>
>> Personally I still think the reduction should be generated in gimple.
> I agree, and the proper place to generate the reduction is in SLP.

Sorry to have sent things astray with my earlier ACK.  It looked
reasonable to me.

jeff
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
    HONOR_NANS
    uniform_vector_p
    expand_vec_cmp_expr_p
-   bitmask_inv_cst_vector_p)
+   bitmask_inv_cst_vector_p
+   adjacent_data_access_p)
 
 /* Operator lists.  */
 (define_operator_list tcc_comparison
@@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 
 /* Canonicalizations of BIT_FIELD_REFs.  */
 
+/* Canonicalize BIT_FIELD_REFS to pairwise operations. */
+(for op (plus min max FMIN_ALL FMAX_ALL)
+     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
+          IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (op @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+    (with { poly_uint64 nloc = 0;
+            tree src = adjacent_data_access_p (@0, @1, &nloc, true);
+            tree ntype = build_vector_type (type, 2);
+            tree size = TYPE_SIZE (ntype);
+            tree pos = build_int_cst (TREE_TYPE (size), nloc);
+            poly_uint64 _sz;
+            poly_uint64 _total; }
+     (if (src && is_gimple_reg (src) && ntype
+          && poly_int_tree_p (size, &_sz)
+          && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+          && known_ge (_total, _sz + nloc))
+      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))
+
+(for op (lt gt)
+     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
+     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (cond (op @0 @1) @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+    (with { poly_uint64 nloc = 0;
+            tree src = adjacent_data_access_p (@0, @1, &nloc, false);
+            tree ntype = build_vector_type (type, 2);
+            tree size = TYPE_SIZE (ntype);
+            tree pos = build_int_cst (TREE_TYPE (size), nloc);
+            poly_uint64 _sz;
+            poly_uint64 _total; }
+     (if (src && is_gimple_reg (src) && ntype
+          && poly_int_tree_p (size, &_sz)
+          && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+          && known_ge (_total, _sz + nloc))
+      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
+       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
+       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006603ae1e3deb6b76 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
     {
       tree arg = gimple_call_arg (stmt, i);
       if ((is_gimple_reg_type (TREE_TYPE (arg))
+           && !is_gimple_variable (arg)
            && !is_gimple_val (arg))
           || (!is_gimple_reg_type (TREE_TYPE (arg))
               && !is_gimple_lvalue (arg)))
diff --git a/gcc/tree.h b/gcc/tree.h
index e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194640b8c1b531b798 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
 
 extern tree bitmask_inv_cst_vector_p (tree);
 
+/* TRUE if the two operands represent adjacent access of data such that a
+   pairwise operation can be used.  */
+
+extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool);
+
 /* integer_nonzerop (tree x) is nonzero if X is an integer constant
    with a nonzero value.  */
 
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
 
   return builder.build ();
 }
 
+/* Returns base address if the two operands represent adjacent access of data
+   such that a pairwise operation can be used.  OP1 must be a lower subpart
+   than OP2.  If POS is not NULL then on return if a value is returned POS
+   will indicate the position of the lower address.  If COMMUTATIVE_P then
+   the operation is also tried by flipping op1 and op2.  */
+
+tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
+                             bool commutative_p)
+{
+  gcc_assert (op1);
+  gcc_assert (op2);
+  if (TREE_CODE (op1) != TREE_CODE (op2)
+      || TREE_TYPE (op1) != TREE_TYPE (op2))
+    return NULL;
+
+  tree type = TREE_TYPE (op1);
+  gimple *stmt1 = NULL, *stmt2 = NULL;
+  unsigned int bits = GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
+
+  if (TREE_CODE (op1) == BIT_FIELD_REF
+      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2, 0), 0)
+      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2, 1), 0)
+      && known_eq (bit_field_size (op1), bits))
+    {
+      poly_uint64 offset1 = bit_field_offset (op1);
+      poly_uint64 offset2 = bit_field_offset (op2);
+      if (known_eq (offset2 - offset1, bits))
+        {
+          if (pos)
+            *pos = offset1;
+          return TREE_OPERAND (op1, 0);
+        }
+      else if (commutative_p && known_eq (offset1 - offset2, bits))
+        {
+          if (pos)
+            *pos = offset2;
+          return TREE_OPERAND (op1, 0);
+        }
+    }
+  else if (TREE_CODE (op1) == ARRAY_REF
+           && operand_equal_p (get_base_address (op1), get_base_address (op2)))
+    {
+      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
+      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
+      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
+          || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
+          || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
+        return NULL;
+
+      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1));
+      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1));
+      if (known_eq (offset2 - offset1, 1UL))
+        {
+          if (pos)
+            *pos = offset1 * bits;
+          return TREE_OPERAND (op1, 0);
+        }
+      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
+        {
+          if (pos)
+            *pos = offset2 * bits;
+          return TREE_OPERAND (op1, 0);
+        }
+    }
+  else if (TREE_CODE (op1) == SSA_NAME
+           && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
+           && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
+           && is_gimple_assign (stmt1)
+           && is_gimple_assign (stmt2))
+    {
+      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
+          && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
+          && gimple_assign_rhs_code (stmt2) != ARRAY_REF
+          && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
+        return NULL;
+
+      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
+                                     gimple_assign_rhs1 (stmt2), pos,
+                                     commutative_p);
+    }
+
+  return NULL;
+}
+
 /* If VECTOR_CST T has a single nonzero element, return the index of that
    element, otherwise return -1.  */
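For context, here is a minimal sketch of the kind of scalar source the new match.pd patterns and adjacent_data_access_p are meant to recognize, assuming GNU C vector extensions; the type and function names are illustrative only and are not taken from the patch or its testsuite:

```c
/* Illustrative only.  x[0] + x[1] reads two adjacent lanes of the same
   vector, so in GIMPLE it becomes a plus of two adjacent BIT_FIELD_REFs;
   the first pattern would rewrite that to IFN_REDUC_PLUS on a two-element
   BIT_FIELD_REF of x (a pairwise add, subject to target support).  */
typedef float v4sf __attribute__ ((vector_size (16)));
typedef int v4si __attribute__ ((vector_size (16)));

float
sum_lo_pair (v4sf x)
{
  return x[0] + x[1];
}

/* The (cond (op @0 @1) @0 @1) pattern covers the conditional form,
   turning the minimum of the low pair into IFN_REDUC_MIN.  */
int
min_lo_pair (v4si x)
{
  return x[0] < x[1] ? x[0] : x[1];
}
```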