From patchwork Sat Jul 13 15:46:03 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Feng Xue OS
X-Patchwork-Id: 93895
From: Feng Xue OS
To: Richard Biener, "gcc-patches@gcc.gnu.org"
Subject: [PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp
Date: Sat, 13 Jul 2024 15:46:03 +0000
List-Id: Gcc-patches mailing list

Extend the original vect_get_num_copies (purely loop-based) so that it can
also calculate the number of vector stmts for an slp node in a generic vect
region.

Thanks,
Feng
---
gcc/
        * tree-vectorizer.h (vect_get_num_copies): New overload function.
        (vect_get_slp_num_vectors): New function.
        * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate
        number of vector stmts for slp node with vect_get_num_copies.
        (vect_slp_analyze_node_operations): Calculate number of vector
        elements for constant/external slp node with vect_get_num_copies.
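
For readers who want the arithmetic without the GCC types, the calculation
done by the new overload can be sketched as below. This is only a model under
stated assumptions: SlpNodeModel, num_vectors and num_copies are hypothetical
names, plain integers stand in for poly_uint64 and tree, and vf is passed in
directly (the loop vectorization factor, or 1 for a non-loop region).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an slp node: only the two fields the count
// depends on.  GCC's real slp_tree carries far more state.
struct SlpNodeModel {
  uint64_t lanes;         // models SLP_TREE_LANES (node)
  uint64_t vec_subparts;  // models TYPE_VECTOR_SUBPARTS of the node's vectype
};

// Models vect_get_num_vectors: exact division of the scalar unit count by
// the number of elements per vector.
inline uint64_t num_vectors(uint64_t nunits, uint64_t vec_subparts) {
  assert(vec_subparts != 0 && nunits % vec_subparts == 0);
  return nunits / vec_subparts;
}

// Models the unified overload.  VF is the vectorization factor of a loop
// region, or 1 otherwise.  NODE may be null, in which case an explicit
// VEC_SUBPARTS (standing in for VECTYPE) is required.  A VEC_SUBPARTS of 0
// plays the role of a NULL vectype.
inline uint64_t num_copies(uint64_t vf, const SlpNodeModel *node,
                           uint64_t vec_subparts = 0) {
  uint64_t nunits = vf;
  if (node) {
    nunits *= node->lanes;
    if (vec_subparts == 0)
      vec_subparts = node->vec_subparts;
  } else {
    assert(vec_subparts != 0);
  }
  return num_vectors(nunits, vec_subparts);
}
```

With vf = 4 and a node of 2 lanes on 4-element vectors, the model yields
4 * 2 / 4 = 2 vector stmts; with no node, vf = 16 and 4-element vectors it
yields 4, which matches the old loop-only behaviour.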
---
 gcc/tree-vect-slp.cc  | 19 +++----------------
 gcc/tree-vectorizer.h | 29 ++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index d0a8531fd3b..4dadbc6854d 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
         }
     }
   else
-    {
-      poly_uint64 vf;
-      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-        vf = loop_vinfo->vectorization_factor;
-      else
-        vf = 1;
-      unsigned int group_size = SLP_TREE_LANES (node);
-      tree vectype = SLP_TREE_VECTYPE (node);
-      SLP_TREE_NUMBER_OF_VEC_STMTS (node)
-        = vect_get_num_vectors (vf * group_size, vectype);
-    }
+    SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
 
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
@@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
                           && j == 1);
               continue;
             }
-          unsigned group_size = SLP_TREE_LANES (child);
-          poly_uint64 vf = 1;
-          if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-            vf = loop_vinfo->vectorization_factor;
+
           SLP_TREE_NUMBER_OF_VEC_STMTS (child)
-            = vect_get_num_vectors (vf * group_size, vector_type);
+            = vect_get_num_copies (vinfo, child);
           /* And cost them.  */
           vect_prologue_cost_for_slp (child, cost_vec);
         }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..09923b9b440 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2080,6 +2080,33 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
   return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
 }
 
+/* Return the number of vectors in the context of vectorization region VINFO,
+   needed for a group of total SIZE statements that are supposed to be
+   interleaved together with no gap, and all operate on vectors of type
+   VECTYPE.  If NULL, SLP_TREE_VECTYPE of NODE is used.  */
+
+inline unsigned int
+vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL)
+{
+  poly_uint64 vf;
+
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  else
+    vf = 1;
+
+  if (node)
+    {
+      vf *= SLP_TREE_LANES (node);
+      if (!vectype)
+        vectype = SLP_TREE_VECTYPE (node);
+    }
+  else
+    gcc_checking_assert (vectype);
+
+  return vect_get_num_vectors (vf, vectype);
+}
+
 /* Return the number of copies needed for loop vectorization when
    a statement operates on vectors of type VECTYPE.  This is the
    vectorization factor divided by the number of elements in
@@ -2088,7 +2115,7 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
 inline unsigned int
 vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype)
 {
-  return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype);
+  return vect_get_num_copies (loop_vinfo, NULL, vectype);
 }
 
 /* Update maximum unit count *MAX_NUNITS so that it accounts for

From patchwork Sat Jul 13 15:47:14 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Feng Xue OS
X-Patchwork-Id: 93896
From: Feng Xue OS
To: Richard Biener, "gcc-patches@gcc.gnu.org"
Subject: [PATCH 2/4] vect: Refit lane-reducing to be normal operation
Date: Sat, 13 Jul 2024 15:47:14 +0000
List-Id: Gcc-patches mailing list

The number of vector stmts of an operation is calculated based on its output
vectype. This is an over-estimate for a lane-reducing operation, and it would
cause a vector def/use mismatch when we want to support loop reductions that
mix lane-reducing and normal operations. One solution is to refit the
lane-reducing operation so that it behaves like a normal one, by adding new
pass-through copies to fix the possible def/use gap.
The resulting superfluous statements can be optimized away after
vectorization.

For example:

  int sum = 1;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod
    }

  The vector size is 128-bit, and the vectorization factor is 16. The
  reduction statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

Thanks,
Feng
---
gcc/
        * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
        Calculate effective vector stmts number with generic
        vect_get_num_copies.
        (vect_transform_reduction): Insert copies for lane-reducing so as to
        fix over-estimated vector stmts number.
        (vect_transform_cycle_phi): Calculate vector PHI number only based on
        output vectype.
        * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
        adjustment on vector stmts number specific to slp reduction.
---
 gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++-------
 gcc/tree-vect-slp.cc  |  27 +++------
 2 files changed, 121 insertions(+), 40 deletions(-)

From 2b9b22f7f1a19816a17086c79e7ec5f7d0298af6 Mon Sep 17 00:00:00 2001
From: Feng Xue
Date: Tue, 2 Jul 2024 17:12:00 +0800
Subject: [PATCH 2/4] vect: Refit lane-reducing to be normal operation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The number of vector stmts of an operation is calculated based on its output
vectype. This is an over-estimate for a lane-reducing operation, and it would
cause a vector def/use mismatch when we want to support loop reductions that
mix lane-reducing and normal operations. One solution is to refit the
lane-reducing operation so that it behaves like a normal one, by adding new
pass-through copies to fix the possible def/use gap. The resulting
superfluous statements can be optimized away after vectorization.

For example:

  int sum = 1;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod
    }

  The vector size is 128-bit, and the vectorization factor is 16. The
  reduction statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

2024-07-02 Feng Xue

gcc/
        * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
        Calculate effective vector stmts number with generic
        vect_get_num_copies.
        (vect_transform_reduction): Insert copies for lane-reducing so as to
        fix over-estimated vector stmts number.
        (vect_transform_cycle_phi): Calculate vector PHI number only based on
        output vectype.
        * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
        adjustment on vector stmts number specific to slp reduction.
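
The dot-product example in the message can be checked with a small
stand-alone model. It is a sketch under assumptions, not the vectorizer's
output: Vec4, dot_prod16, pass_through, scalar_sum and vectorized_sum are
hypothetical names, the inputs are taken to be 8-bit, and the lane mapping
inside dot_prod16 is chosen for illustration only (real targets may fold
products into lanes differently). The point it shows is that three of the
four accumulators are only carried through by trivial copies, so the final
sum equals the scalar one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec4 = std::array<int32_t, 4>;  // models "vector<4> int"

// Scalar reference: sum = init; sum += d0[i] * d1[i].
inline int32_t scalar_sum(const std::vector<int8_t> &d0,
                          const std::vector<int8_t> &d1, int32_t init) {
  assert(d0.size() == d1.size());
  int32_t sum = init;
  for (std::size_t i = 0; i < d0.size(); ++i)
    sum += int32_t(d0[i]) * int32_t(d1[i]);
  return sum;
}

// Assumed DOT_PROD model: 16 narrow products folded into 4 int lanes, with
// lane k accumulating elements 4k .. 4k+3.  Only the total matters here.
inline Vec4 dot_prod16(const int8_t *a, const int8_t *b, Vec4 acc) {
  for (int i = 0; i < 16; ++i)
    acc[i / 4] += int32_t(a[i]) * int32_t(b[i]);
  return acc;
}

// Trivial pass-through copy, standing in for "sum_v1 = sum_v1".
inline Vec4 pass_through(const Vec4 &v) { return v; }

// Vectorization factor 16: one effective DOT_PROD plus three copies per
// iteration, followed by the epilogue that adds all four accumulators.
inline int32_t vectorized_sum(const std::vector<int8_t> &d0,
                              const std::vector<int8_t> &d1, int32_t init) {
  assert(d0.size() == d1.size() && d0.size() % 16 == 0);
  Vec4 sum_v[4] = {};  // four reduction PHIs, all lanes zero
  sum_v[0][3] = init;  // initial value placed as in { 0, 0, 0, 1 }
  for (std::size_t i = 0; i < d0.size(); i += 16) {
    sum_v[0] = dot_prod16(&d0[i], &d1[i], sum_v[0]);
    for (int k = 1; k < 4; ++k)
      sum_v[k] = pass_through(sum_v[k]);
  }
  int32_t sum = 0;
  for (const Vec4 &v : sum_v)
    for (int32_t lane : v)
      sum += lane;
  return sum;
}
```

Calling vectorized_sum and scalar_sum on the same inputs (with a length that
is a multiple of 16) should give equal results, since sum_v1 to sum_v3 never
leave their zero initial value.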
---
 gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++-------
 gcc/tree-vect-slp.cc  |  27 +++------
 2 files changed, 121 insertions(+), 40 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a64b5082bd1..5ac83e76975 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
         = get_masked_reduction_fn (reduc_fn, vectype_in);
       vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
-      unsigned nvectors;
-
-      if (slp_node)
-        nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-      else
-        nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+      unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node,
+                                               vectype_in);
 
       if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
         vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1);
@@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
   gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
   int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+
+  if (!vectype_in)
+    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
 
   if (slp_node)
     {
       ncopies = 1;
-      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+      vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in);
     }
   else
     {
@@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);
 
+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
 
+  if (lane_reducing && !slp_node && !single_defuse_cycle)
+    {
+      /* Note: there are still vectorizable cases that can not be handled by
+         single-lane slp.  Probably it would take some time to evolve the
+         feature to a mature state.  So we have to keep the below non-slp code
+         path as failsafe for lane-reducing support.  */
+      gcc_assert (op.num_ops <= 3);
+      for (unsigned i = 0; i < op.num_ops; i++)
+        {
+          unsigned oprnd_ncopies = ncopies;
+
+          if ((int) i == reduc_index)
+            {
+              tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+              oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype);
+            }
+
+          vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies,
+                                         op.ops[i], &vec_oprnds[i]);
+        }
+    }
   /* Get NCOPIES vector definitions for all operands except the reduction
      definition.  */
-  if (!cond_fn_p)
+  else if (!cond_fn_p)
     {
       gcc_assert (reduc_index >= 0 && reduc_index <= 2);
       vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
@@ -8702,6 +8728,61 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
                          &vec_oprnds[2]);
     }
+  else if (lane_reducing)
+    {
+      /* For normal reduction, consistency between vectorized def/use is
+         naturally ensured when mapping from scalar statement.  But if lane-
+         reducing op is involved in reduction, thing would become somewhat
+         complicated in that the op's result and operand for accumulation are
+         limited to less lanes than other operands, which certainly causes
+         def/use mismatch on adjacent statements around the op if do not have
+         any kind of specific adjustment.  One approach is to refit lane-
+         reducing op in the way of introducing new trivial pass-through copies
+         to fix possible def/use gap, so as to make it behave like a normal op.
+         And vector reduction PHIs are always generated to the full extent, no
+         matter lane-reducing op exists or not.  If some copies or PHIs are
+         actually superfluous, they would be cleaned up by passes after
+         vectorization.  An example for single-lane slp is given as below.
+         Similarly, this handling is applicable for multiple-lane slp as well.
+
+           int sum = 1;
+           for (i)
+             {
+               sum += d0[i] * d1[i];      // dot-prod
+             }
+
+         The vector size is 128-bit,vectorization factor is 16.  Reduction
+         statements would be transformed as:
+
+           vector<4> int sum_v0 = { 0, 0, 0, 1 };
+           vector<4> int sum_v1 = { 0, 0, 0, 0 };
+           vector<4> int sum_v2 = { 0, 0, 0, 0 };
+           vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+           for (i / 16)
+             {
+               sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+               sum_v1 = sum_v1;  // copy
+               sum_v2 = sum_v2;  // copy
+               sum_v3 = sum_v3;  // copy
+             }
+
+           sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
+       */
+      unsigned effec_ncopies = vec_oprnds[0].length ();
+      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
+
+      gcc_assert (effec_ncopies <= total_ncopies);
+
+      if (effec_ncopies < total_ncopies)
+        {
+          for (unsigned i = 0; i < op.num_ops - 1; i++)
+            {
+              gcc_assert (vec_oprnds[i].length () == effec_ncopies);
+              vec_oprnds[i].safe_grow_cleared (total_ncopies);
+            }
+        }
+    }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+      if (!vop[0] || !vop[1])
+        {
+          tree reduc_vop = vec_oprnds[reduc_index][i];
+
+          /* If could not generate an effective vector statement for current
+             portion of reduction operand, insert a trivial copy to simply
+             handle over the operand to other dependent statements.  */
+          gcc_assert (reduc_vop);
+
+          if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
+              && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
+            new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
+          else
+            {
+              new_temp = make_ssa_name (vec_dest);
+              new_stmt = gimple_build_assign (new_temp, reduc_vop);
+              vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
+                                           gsi);
+            }
+        }
+      else if (masked_loop_p && !mask_by_cond_expr)
         {
           /* No conditional ifns have been defined for lane-reducing op
              yet.  */
@@ -8810,23 +8911,16 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
     /* Leave the scalar phi in place.  */
     return true;
 
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
-  /* For a nested cycle we do not fill the above.  */
-  if (!vectype_in)
-    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
-  gcc_assert (vectype_in);
-
   if (slp_node)
     {
-      /* The size vect_schedule_slp_instance computes is off for us.  */
-      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-                                      * SLP_TREE_LANES (slp_node), vectype_in);
+      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       ncopies = 1;
     }
   else
     {
       vec_num = 1;
-      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+      ncopies = vect_get_num_copies (loop_vinfo,
+                                     STMT_VINFO_VECTYPE (stmt_info));
     }
 
   /* Check whether we should use a single PHI node and accumulate
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 4dadbc6854d..55ae496cbb2 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6554,26 +6554,13 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
 {
   stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
 
-  /* Calculate the number of vector statements to be created for the
-     scalar stmts in this node.  For SLP reductions it is equal to the
-     number of vector statements in the children (which has already been
-     calculated by the recursive call).  Otherwise it is the number of
-     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
-     VF divided by the number of elements in a vector.  */
-  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
-      && !STMT_VINFO_DATA_REF (stmt_info)
-      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-    {
-      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
-        if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
-          {
-            SLP_TREE_NUMBER_OF_VEC_STMTS (node)
-              = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
-            break;
-          }
-    }
-  else
-    SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
+  /* Calculate the number of vector statements to be created for the scalar
+     stmts in this node.  It is the number of scalar elements in one scalar
+     iteration (DR_GROUP_SIZE) multiplied by VF divided by the number of
+     elements in a vector.  For single-defuse-cycle, lane-reducing op, and
+     PHI statement that starts reduction comprised of only lane-reducing ops,
+     the number is more than effective vector statements actually required.  */
+  SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
 
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
-- 
2.17.1

From patchwork Sat Jul 13 15:48:42 2024
X-Patchwork-Submitter: Feng Xue OS
X-Patchwork-Id: 93897
From: Feng Xue OS
To: Richard Biener, "gcc-patches@gcc.gnu.org"
Subject: [PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
Date: Sat, 13 Jul 2024 15:48:42 +0000

For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
the current vectorizer can only handle the pattern if the reduction chain
contains no other operation, whether that operation is normal or
lane-reducing.

This patch removes some constraints in reduction analysis to allow multiple
arbitrary lane-reducing operations with mixed input vectypes in a loop
reduction chain.  For example:

   int sum = 1;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod
       sum += w[i];               // widen-sum
       sum += abs(s0[i] - s1[i]); // sad
     }

The vector size is 128-bit, and the vectorization factor is 16.  Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 1 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1

Thanks,
Feng
---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (vectorizable_lane_reducing): New function
	declaration.
	* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
	vectorizable_lane_reducing to analyze lane-reducing operation.
	* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
	code related to emulated_mixed_dot_prod.
	(vectorizable_lane_reducing): New function.
	(vectorizable_reduction): Allow multiple lane-reducing operations in
	loop reduction.  Move some original lane-reducing related code to
	vectorizable_lane_reducing.
	(vect_transform_reduction): Adjust comments with updated example.

gcc/testsuite/
	PR tree-optimization/114440
	* gcc.dg/vect/vect-reduc-chain-1.c
	* gcc.dg/vect/vect-reduc-chain-2.c
	* gcc.dg/vect/vect-reduc-chain-3.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
	* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  64 +++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  79 ++++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  68 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 ++++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 +++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 +++++
 gcc/tree-vect-loop.cc                         | 240 +++++++++++++-----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 750 insertions(+), 69 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

From 0889639114ecb8ab2e46dc4effe8f114f5ab8ad6 Mon Sep 17 00:00:00 2001
From: Feng Xue
Date: Wed, 29 May 2024 17:22:36 +0800
Subject: [PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-03-22  Feng Xue

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..80b0089ea0f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,64 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+
+  #pragma GCC novector
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..5bc2686fc9d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+    SIGNEDNESS_2 char *restrict a,
+    SIGNEDNESS_2 char *restrict b,
+    SIGNEDNESS_3 char *restrict c,
+    SIGNEDNESS_3 char *restrict d,
+    SIGNEDNESS_4 short *restrict e,
+    SIGNEDNESS_4 short *restrict f,
+    SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+
+#pragma GCC novector
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..6a733fbac53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,68 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+
+#pragma GCC novector
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..72a370ab3c0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+#pragma GCC novector
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+    }
+
+#pragma GCC novector
+  for (int i = 0; i < n; i++)
+    {
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..aab86ee2f1c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+#pragma GCC novector
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+    }
+
+#pragma GCC novector
+  for (int i = 0; i < n; i++)
+    {
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..9f1d2136ab6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues
until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 short *a, + SIGNEDNESS_2 short *b, + int step, int n) +{ + for (int i = 0; i < n; i++) + { + res += a[0] * b[0]; + res += a[1] * b[1]; + res += a[2] * b[2]; + res += a[3] * b[3]; + res += a[4] * b[4]; + res += a[5] * b[5]; + res += a[6] * b[6]; + res += a[7] * b[7]; + + a += step; + b += step; + } + + return res; +} + +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373) +#define OFFSET 20 + +int +main (void) +{ + check_vect (); + + SIGNEDNESS_2 short a[100], b[100]; + int expected = 0x12345; + int step = 8; + int n = 2; + int t = 0; + +#pragma GCC novector + for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i) + { + a[i] = BASE + i * 5; + b[i] = BASE + OFFSET + i * 4; + } + +#pragma GCC novector + for (int i = 0; i < n; i++) + { + expected += a[t + 0] * b[t + 0]; + expected += a[t + 1] * b[t + 1]; + expected += a[t + 2] * b[t + 2]; + expected += a[t + 3] * b[t + 3]; + expected += a[t + 4] * b[t + 4]; + expected += a[t + 5] * b[t + 5]; + expected += a[t + 6] * b[t + 6]; + expected += a[t + 7] * b[t + 7]; + t += step; + } + + if (f (0x12345, a, b, step, n) != expected) + __builtin_abort (); +} + +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect" { target vect_sdot_hi } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c new file mode 100644 index 00000000000..f4dcebdfa10 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c @@ -0,0 +1,63 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 short *a, + SIGNEDNESS_2 short *b, + int n) +{ + for (int i = 0; i < n; i++) + { + res += a[3 * i + 0] * b[3 * i + 0]; + res += a[3 * i + 1] * b[3 * i + 1]; + res += a[3 * i + 2] * b[3 * i + 2]; + } + + return res; +} + +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-1026 : 373) +#define OFFSET 20 + +int +main (void) +{ + check_vect (); + + SIGNEDNESS_2 short a[100], b[100]; + int expected = 0x12345; + int n = 18; + +#pragma GCC novector + for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i) + { + a[i] = BASE + i * 5; + b[i] = BASE + OFFSET + i * 4; + } + +#pragma GCC novector + for (int i = 0; i < n; i++) + { + expected += a[3 * i + 0] * b[3 * i + 0]; + expected += a[3 * i + 1] * b[3 * i + 1]; + expected += a[3 * i + 2] * b[3 * i + 2]; + } + + if (f (0x12345, a, b, n) != expected) + __builtin_abort (); +} + +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect" { target vect_sdot_hi } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c new file mode 100644 index 00000000000..84c82b023d4 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c @@ -0,0 +1,60 @@ +/* Disabling epilogues until we find a better way to deal with scans. 
*/ +/* { dg-do compile } */ +/* { dg-additional-options "--param vect-epilogues-nomask=0 -fdump-tree-optimized" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res0, + SIGNEDNESS_1 int res1, + SIGNEDNESS_1 int res2, + SIGNEDNESS_1 int res3, + SIGNEDNESS_1 int res4, + SIGNEDNESS_1 int res5, + SIGNEDNESS_1 int res6, + SIGNEDNESS_1 int res7, + SIGNEDNESS_1 int res8, + SIGNEDNESS_1 int res9, + SIGNEDNESS_1 int resA, + SIGNEDNESS_1 int resB, + SIGNEDNESS_1 int resC, + SIGNEDNESS_1 int resD, + SIGNEDNESS_1 int resE, + SIGNEDNESS_1 int resF, + SIGNEDNESS_2 char *a, + SIGNEDNESS_2 char *b) +{ + for (int i = 0; i < 64; i += 16) + { + res0 += a[i + 0x00] * b[i + 0x00]; + res1 += a[i + 0x01] * b[i + 0x01]; + res2 += a[i + 0x02] * b[i + 0x02]; + res3 += a[i + 0x03] * b[i + 0x03]; + res4 += a[i + 0x04] * b[i + 0x04]; + res5 += a[i + 0x05] * b[i + 0x05]; + res6 += a[i + 0x06] * b[i + 0x06]; + res7 += a[i + 0x07] * b[i + 0x07]; + res8 += a[i + 0x08] * b[i + 0x08]; + res9 += a[i + 0x09] * b[i + 0x09]; + resA += a[i + 0x0A] * b[i + 0x0A]; + resB += a[i + 0x0B] * b[i + 0x0B]; + resC += a[i + 0x0C] * b[i + 0x0C]; + resD += a[i + 0x0D] * b[i + 0x0D]; + resE += a[i + 0x0E] * b[i + 0x0E]; + resF += a[i + 0x0F] * b[i + 0x0F]; + } + + return res0 ^ res1 ^ res2 ^ res3 ^ res4 ^ res5 ^ res6 ^ res7 ^ + res8 ^ res9 ^ resA ^ resB ^ resC ^ resD ^ resE ^ resF; +} + +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ +/* { dg-final { scan-tree-dump-not "DOT_PROD_EXPR" "optimized" } } */ diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 5ac83e76975..e72d692ffa3 100644 --- a/gcc/tree-vect-loop.cc +++ 
b/gcc/tree-vect-loop.cc @@ -5328,8 +5328,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, if (!gimple_extract_op (orig_stmt_info->stmt, &op)) gcc_unreachable (); - bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); - if (reduction_type == EXTRACT_LAST_REDUCTION) /* No extra instructions are needed in the prologue. The loop body operations are costed in vectorizable_condition. */ @@ -5364,12 +5362,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, initial result of the data reduction, initial value of the index reduction. */ prologue_stmts = 4; - else if (emulated_mixed_dot_prod) - /* We need the initial reduction value and two invariants: - one that contains the minimum signed value and one that - contains half of its negative. */ - prologue_stmts = 3; else + /* We need the initial reduction value. */ prologue_stmts = 1; prologue_cost += record_stmt_cost (cost_vec, prologue_stmts, scalar_to_vec, stmt_info, 0, @@ -7478,6 +7472,143 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, } } +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in + the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC. + Now there are three such kinds of operations: dot-prod/widen-sum/sad + (sum-of-absolute-differences). + + For a lane-reducing operation, the loop reduction path that it lies in, + may contain normal operation, or other lane-reducing operation of different + input type size, an example as: + + int sum = 0; + for (i) + { + ... + sum += d0[i] * d1[i]; // dot-prod + sum += w[i]; // widen-sum + sum += abs(s0[i] - s1[i]); // sad + sum += n[i]; // normal + ... + } + + Vectorization factor is essentially determined by operation whose input + vectype has the most lanes ("vector(16) char" in the example), while we + need to choose input vectype with the least lanes ("vector(4) int" in the + example) for the reduction PHI statement. 
*/ + +bool +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, + slp_tree slp_node, stmt_vector_for_cost *cost_vec) +{ + gimple *stmt = stmt_info->stmt; + + if (!lane_reducing_stmt_p (stmt)) + return false; + + tree type = TREE_TYPE (gimple_assign_lhs (stmt)); + + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) + return false; + + /* Do not try to vectorize bit-precision reductions. */ + if (!type_has_mode_precision_p (type)) + return false; + + for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++) + { + stmt_vec_info def_stmt_info; + slp_tree slp_op; + tree op; + tree vectype; + enum vect_def_type dt; + + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op, + &slp_op, &dt, &vectype, &def_stmt_info)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "use not simple.\n"); + return false; + } + + if (!vectype) + { + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op), + slp_op); + if (!vectype) + return false; + } + + if (slp_node && !vect_maybe_update_slp_op_vectype (slp_op, vectype)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "incompatible vector types for invariants\n"); + return false; + } + + if (i == STMT_VINFO_REDUC_IDX (stmt_info)) + continue; + + /* There should be at most one cycle def in the stmt. */ + if (VECTORIZABLE_CYCLE_DEF (dt)) + return false; + } + + stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); + + /* TODO: Support lane-reducing operation that does not directly participate + in loop reduction. */ + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0) + return false; + + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not + recognized.
*/ + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def); + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION); + + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); + + gcc_assert (vectype_in); + + /* Compute number of effective vector statements for costing. */ + unsigned int ncopies_for_cost = vect_get_num_copies (loop_vinfo, slp_node, + vectype_in); + gcc_assert (ncopies_for_cost >= 1); + + if (vect_is_emulated_mixed_dot_prod (stmt_info)) + { + /* We need extra two invariants: one that contains the minimum signed + value and one that contains half of its negative. */ + int prologue_stmts = 2; + unsigned cost = record_stmt_cost (cost_vec, prologue_stmts, + scalar_to_vec, stmt_info, 0, + vect_prologue); + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "vectorizable_lane_reducing: " + "extra prologue_cost = %d .\n", cost); + + /* Three dot-products and a subtraction. */ + ncopies_for_cost *= 4; + } + + record_stmt_cost (cost_vec, (int) ncopies_for_cost, vector_stmt, stmt_info, + 0, vect_body); + + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) + { + enum tree_code code = gimple_assign_rhs_code (stmt); + vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info, + slp_node, code, type, + vectype_in); + } + + /* Transform via vect_transform_reduction. */ + STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type; + return true; +} + /* Function vectorizable_reduction. Check if STMT_INFO performs a reduction operation that can be vectorized. @@ -7811,18 +7942,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (!type_has_mode_precision_p (op.type)) return false; - /* For lane-reducing ops we're reducing the number of reduction PHIs - which means the only use of that may be in the lane-reducing operation. 
*/ - if (lane_reducing - && reduc_chain_length != 1 - && !only_slp_reduc_chain) - { - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "lane-reducing reduction with extra stmts.\n"); - return false; - } - /* Lane-reducing ops also never can be used in a SLP reduction group since we'll mix lanes belonging to different reductions. But it's OK to use them in a reduction chain or when the reduction group @@ -8362,14 +8481,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, && loop_vinfo->suggested_unroll_factor == 1) single_defuse_cycle = true; - if (single_defuse_cycle || lane_reducing) + if (single_defuse_cycle && !lane_reducing) { gcc_assert (op.code != COND_EXPR); - /* 4. Supportable by target? */ - bool ok = true; - - /* 4.1. check support for the operation in the loop + /* 4. check support for the operation in the loop This isn't necessary for the lane reduction codes, since they can only be produced by pattern matching, and it's up to the @@ -8378,14 +8494,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo, mixed-sign dot-products can be implemented using signed dot-products. */ machine_mode vec_mode = TYPE_MODE (vectype_in); - if (!lane_reducing - && !directly_supported_p (op.code, vectype_in, optab_vector)) + if (!directly_supported_p (op.code, vectype_in, optab_vector)) { if (dump_enabled_p ()) dump_printf (MSG_NOTE, "op not supported by target.\n"); if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD) || !vect_can_vectorize_without_simd_p (op.code)) - ok = false; + single_defuse_cycle = false; else if (dump_enabled_p ()) dump_printf (MSG_NOTE, "proceeding using word mode.\n"); @@ -8398,16 +8513,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, dump_printf (MSG_NOTE, "using word mode not possible.\n"); return false; } - - /* lane-reducing operations have to go through vect_transform_reduction. - For the other cases try without the single cycle optimization. 
*/ - if (!ok) - { - if (lane_reducing) - return false; - else - single_defuse_cycle = false; - } } if (dump_enabled_p () && single_defuse_cycle) dump_printf_loc (MSG_NOTE, vect_location, @@ -8415,22 +8520,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo, "multiple vectors to one in the loop body\n"); STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle; - /* If the reduction stmt is one of the patterns that have lane - reduction embedded we cannot handle the case of ! single_defuse_cycle. */ - if ((ncopies > 1 && ! single_defuse_cycle) - && lane_reducing) - { - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "multi def-use cycle not possible for lane-reducing " - "reduction operation\n"); - return false; - } + /* For lane-reducing operation, the below processing related to single + defuse-cycle will be done in its own vectorizable function. One more + thing to note is that the operation must not be involved in fold-left + reduction. */ + single_defuse_cycle &= !lane_reducing; if (slp_node - && !(!single_defuse_cycle - && !lane_reducing - && reduction_type != FOLD_LEFT_REDUCTION)) + && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION)) for (i = 0; i < (int) op.num_ops; i++) if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i])) { @@ -8443,28 +8540,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo, vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn, reduction_type, ncopies, cost_vec); /* Cost the reduction op inside the loop if transformed via - vect_transform_reduction. Otherwise this is costed by the - separate vectorizable_* routines. */ - if (single_defuse_cycle || lane_reducing) - { - int factor = 1; - if (vect_is_emulated_mixed_dot_prod (stmt_info)) - /* Three dot-products and a subtraction. */ - factor = 4; - record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, - stmt_info, 0, vect_body); - } + vect_transform_reduction for non-lane-reducing operation. 
Otherwise + this is costed by the separate vectorizable_* routines. */ + if (single_defuse_cycle) + record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body); if (dump_enabled_p () && reduction_type == FOLD_LEFT_REDUCTION) dump_printf_loc (MSG_NOTE, vect_location, "using an in-order (fold-left) reduction.\n"); STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type; - /* All but single defuse-cycle optimized, lane-reducing and fold-left - reductions go through their own vectorizable_* routines. */ - if (!single_defuse_cycle - && !lane_reducing - && reduction_type != FOLD_LEFT_REDUCTION) + + /* All but single defuse-cycle optimized and fold-left reductions go + through their own vectorizable_* routines. */ + if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION) { stmt_vec_info tem = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info)); @@ -8742,13 +8831,16 @@ vect_transform_reduction (loop_vec_info loop_vinfo, And vector reduction PHIs are always generated to the full extent, no matter lane-reducing op exists or not. If some copies or PHIs are actually superfluous, they would be cleaned up by passes after - vectorization. An example for single-lane slp is given as below. + vectorization. An example for single-lane slp, lane-reducing ops + with mixed input vectypes in a reduction chain, is given as below. Similarly, this handling is applicable for multiple-lane slp as well. int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod + sum += w[i]; // widen-sum + sum += abs(s0[i] - s1[i]); // sad } The vector size is 128-bit,vectorization factor is 16. 
Reduction @@ -8765,9 +8857,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy + + sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); + sum_v1 = sum_v1; // copy + sum_v2 = sum_v2; // copy + sum_v3 = sum_v3; // copy + + sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); + sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); + sum_v2 = sum_v2; // copy + sum_v3 = sum_v3; // copy } - sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 */ unsigned effec_ncopies = vec_oprnds[0].length (); unsigned total_ncopies = vec_oprnds[reduc_index].length (); diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index fdcda0d2aba..135580d25d7 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -13286,6 +13286,8 @@ vect_analyze_stmt (vec_info *vinfo, NULL, NULL, node, cost_vec) || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec) || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec) + || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo), + stmt_info, node, cost_vec) || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info, node, node_instance, cost_vec) || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info, diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 09923b9b440..62121f63f18 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2486,6 +2486,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *, extern bool vectorizable_live_operation (vec_info *, stmt_vec_info, slp_tree, slp_instance, int, bool, stmt_vector_for_cost *); +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info, + slp_tree, stmt_vector_for_cost *); extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info, slp_tree, slp_instance, stmt_vector_for_cost *); -- 2.17.1 From patchwork Sat Jul 13 15:49:49 2024 Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Feng Xue OS
X-Patchwork-Id: 93898
From: Feng Xue OS
To: Richard Biener, "gcc-patches@gcc.gnu.org"
Subject: [PATCH 4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles
Date: Sat, 13 Jul 2024 15:49:49 +0000
When transforming multiple lane-reducing operations in a loop reduction chain,
the corresponding vectorized statements were originally generated into def-use
cycles starting from 0.  A def-use cycle with a smaller index would therefore
contain more statements, which means more instruction dependency.  For example:

   int sum = 1;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod
       sum += w[i];               // widen-sum
       sum += abs(s0[i] - s1[i]); // sad
       sum += n[i];               // normal
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       ...
     }

For higher instruction parallelism in the final vectorized loop, a better
approach is to distribute the effective vector lane-reducing ops evenly among
all def-use cycles.  Transformed as below, DOT_PROD, WIDEN_SUM and SADs are
generated into separate cycles, so the instruction dependency among them is
eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

       ...
     }

Thanks,
Feng
---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 64 ++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-vectorizer.h |  6 ++++
 2 files changed, 63 insertions(+), 7 deletions(-)

From f3d2bff96f8e29f775e2cb12ef43ad464b819fcf Mon Sep 17 00:00:00 2001
From: Feng Xue
Date: Wed, 29 May 2024 17:28:14 +0800
Subject: [PATCH 4/4] vect: Optimize order of lane-reducing statements in loop
 def-use cycles

2024-03-22  Feng Xue

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e72d692ffa3..5bc6e526d43 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,6 +8841,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum += d0[i] * d1[i];      // dot-prod
               sum += w[i];               // widen-sum
               sum += abs(s0[i] - s1[i]); // sad
+              sum += n[i];               // normal
             }
 
         The vector size is 128-bit,vectorization factor is 16.  Reduction
@@ -8858,19 +8859,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 = sum_v2;  // copy
               sum_v3 = sum_v3;  // copy
 
-              sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
-              sum_v1 = sum_v1;  // copy
+              sum_v0 = sum_v0;  // copy
+              sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
               sum_v2 = sum_v2;  // copy
               sum_v3 = sum_v3;  // copy
 
-              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-              sum_v2 = sum_v2;  // copy
+              sum_v0 = sum_v0;  // copy
+              sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+              sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
               sum_v3 = sum_v3;  // copy
+
+              sum_v0 += n_v0[i: 0  ~ 3 ];
+              sum_v1 += n_v1[i: 4  ~ 7 ];
+              sum_v2 += n_v2[i: 8  ~ 11];
+              sum_v3 += n_v3[i: 12 ~ 15];
             }
 
-        sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
-       */
+        Moreover, for a higher instruction parallelism in final vectorized
+        loop, it is considered to make those effective vector lane-reducing
+        ops be distributed evenly among all def-use cycles.  In the above
+        example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate
+        cycles, instruction dependency among them could be eliminated.  */
 
       unsigned effec_ncopies = vec_oprnds[0].length ();
       unsigned total_ncopies = vec_oprnds[reduc_index].length ();
@@ -8884,6 +8893,47 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               vec_oprnds[i].safe_grow_cleared (total_ncopies);
             }
         }
+
+      tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+      gcc_assert (reduc_vectype_in);
+
+      unsigned effec_reduc_ncopies
+        = vect_get_num_copies (loop_vinfo, slp_node, reduc_vectype_in);
+
+      gcc_assert (effec_ncopies <= effec_reduc_ncopies);
+
+      if (effec_ncopies < effec_reduc_ncopies)
+        {
+          /* Find suitable def-use cycles to generate vectorized statements
+             into, and reorder operands based on the selection.  */
+          unsigned curr_pos = reduc_info->reduc_result_pos;
+          unsigned next_pos = (curr_pos + effec_ncopies) % effec_reduc_ncopies;
+
+          gcc_assert (curr_pos < effec_reduc_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+          if (curr_pos)
+            {
+              unsigned count = effec_reduc_ncopies - effec_ncopies;
+              unsigned start = curr_pos - count;
+
+              if ((int) start < 0)
+                {
+                  count = curr_pos;
+                  start = 0;
+                }
+
+              for (unsigned i = 0; i < op.num_ops - 1; i++)
+                {
+                  for (unsigned j = effec_ncopies; j > start; j--)
+                    {
+                      unsigned k = j - 1;
+                      std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+                      gcc_assert (!vec_oprnds[i][k]);
+                    }
+                }
+            }
+        }
     }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 62121f63f18..b6fdbc651d6 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results, this field specifies result index starting from which any
+     following lane-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing N
      scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
-- 
2.17.1
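
For readers who want to try the placement logic outside the vectorizer, here
is a minimal standalone sketch of the operand reordering added to
vect_transform_reduction.  It is not GCC code: the `Slots` alias and the
`place` function are hypothetical names for illustration, plain ints stand in
for vectorized operand trees (0 plays the role of a null tree), and `total`
corresponds to effec_reduc_ncopies in the patch.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for one operand's vector of vectorized defs.
// A value of 0 marks an empty slot (a null tree in the real code).
using Slots = std::vector<int>;

// OPS initially holds the effective operands in slots [0, ops.size()).
// Move them to the def-use cycles starting at POS, then advance POS
// round-robin so that the next lane-reducing op lands in other cycles.
// TOTAL is the number of def-use cycles of the reduction.
Slots place(Slots ops, unsigned total, unsigned &pos)
{
  unsigned effec_ncopies = ops.size();
  ops.resize(total, 0);

  // Nothing to redistribute when every cycle is already used.
  if (effec_ncopies >= total)
    return ops;

  unsigned curr_pos = pos;
  pos = (curr_pos + effec_ncopies) % total;

  if (curr_pos == 0)
    return ops;

  // Same arithmetic as the patch, written without the signed cast.
  unsigned count = total - effec_ncopies;
  unsigned start = 0;
  if (curr_pos >= count)
    start = curr_pos - count;
  else
    count = curr_pos;

  // Walk backwards so that a slot is vacated before it is reused.
  for (unsigned j = effec_ncopies; j > start; j--)
    {
      unsigned k = j - 1;
      std::swap(ops[k], ops[k + count]);
      assert(ops[k] == 0);
    }
  return ops;
}
```

With four def-use cycles and a starting position of 0, placing a one-copy
DOT_PROD, a one-copy WIDEN_SUM and a two-copy SAD in turn puts them in slot 0,
slot 1 and slots 2-3 respectively, which matches the transformed loop in the
description above.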