From patchwork Sat Jul 13 15:48:42 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Feng Xue OS <fxue@os.amperecomputing.com>
X-Patchwork-Id: 93897
Return-Path: <gcc-patches-bounces~patchwork=sourceware.org@gcc.gnu.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id B3A82386102B
	for <patchwork@sourceware.org>; Sat, 13 Jul 2024 15:49:23 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from DM1PR04CU001.outbound.protection.outlook.com
 (mail-centralusazlp170100005.outbound.protection.outlook.com
 [IPv6:2a01:111:f403:c111::5])
 by sourceware.org (Postfix) with ESMTPS id 9FF2E385DDCC
 for <gcc-patches@gcc.gnu.org>; Sat, 13 Jul 2024 15:48:46 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 9FF2E385DDCC
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=os.amperecomputing.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=os.amperecomputing.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 9FF2E385DDCC
Authentication-Results: server2.sourceware.org;
 arc=pass smtp.remote-ip=2a01:111:f403:c111::5
ARC-Seal: i=2; a=rsa-sha256; d=sourceware.org; s=key; t=1720885734; cv=pass;
 b=WL7bYtmySRqBT9O29z2FbnCzrBbhjzbkz6/H7bbYhoEe3VmArVCsgjAoG9CdlVRas9we81kW/pnJoLGvFSZXLZZaTlCccHrBuz9ilg7yDcJpj6PjHU/Qpfdu0aoeFW6yS++fKpa2FtnAJPMvu6g46i3Xz5bhE/0lpc62UIQ2cKg=
ARC-Message-Signature: i=2; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720885734; c=relaxed/simple;
 bh=zgurLu2vmmp/orVkY4ZDxBLV3L6AvOct4dk+9EoaL+I=;
 h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version;
 b=ObVZKekK7OZ77h9bfsXKgoTct6+tDXmEGpbvzv/6E4PoMxriR772Lu5aEyewE61PTgsqqXYnDKna8AavfwtnplWVA6Ndc01WV+t+ulH4rCt77ep0OI4aBJJLNGk3YElORah44Jq3uR3iWObs+pCE+JDHi2zEwdywsUVSCLF0cDg=
ARC-Authentication-Results: i=2; server2.sourceware.org
ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none;
 b=Df8Lu89g/F6aiWoNDYRdd2eVWlEq/5QBtGg7pT4xeJ0qZjXKb+V2YyjKnNXBd5bmHydKvVIcYXnRxZuK7LTLvfmLefo0i2PqR1Gn143PEkU8jIgGr62ha+QZYYsAPj7iCDnspGOmSZmnebTkr+rg3IuoAgpcnoVWtwddHQUkKS73SNyef41q6cdyskXb8QtZDb5Sn3ntQ180PRsvEp5GkRDIn9wuaAk6Tpz4/1zPI6rHeoT9dfbqhxI+5cYMD0hKjtKyJE2AZT2O8rU3bSFJotKuj/OoE4YGR2V5R436s/Md/2kiYdf9MuKQo1szdkXDU9Yrs5HB2nICzJQuexgiHQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;
 s=arcselector10001;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=/4YqrttfWskNEXvgWWpQSJbjPh7mcWW1LawOVvVyjoo=;
 b=OYNlWryUKEV24OeBUcFlvCVKmjBxShW3uYMLENgkO9Zz2DHtFTRgGKTnr32cwrO2KGuK1yBXlMSQkhyP26PlepHS8X6ompDHNDIn7JmI1SD4qofkv9aaXorxhh7fEJZ3oARI+bg1fT/57c5rty+OuJvAAx5tkLs8YBDOZUHRa0LaN1b1oPwWiLcbgioehd5u2LCgiXdkXc+ILrgPJsngU1JlCaBeGuU6wuaBPI1Kq8WghW5QjAw/FF85Hura1v1DPUH8uXsVWwy5BSV8WBIRtmiYNGBY9XbCrgt+41gw806RpjLjpuOWfp/U5HKi9kNAtktYYyP3XCpVIEi8LQMGGA==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass
 smtp.mailfrom=os.amperecomputing.com; dmarc=pass action=none
 header.from=os.amperecomputing.com; dkim=pass
 header.d=os.amperecomputing.com; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=os.amperecomputing.com; s=selector2;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=/4YqrttfWskNEXvgWWpQSJbjPh7mcWW1LawOVvVyjoo=;
 b=jkzhkdDc3PrZRsh4rFBiPkPt09c0HDoHTM9EuCwWSxy2ijPJ+em8iUDI3UIR8wobEt3hRTNaBpq6gxVLXyxCIiJGJ3OodPSnuI/lk5LyYFm/wcniCnYxZ4Ti8rftcOohvOweOqddl1Oavrk97/kaGDzBpmIaPdoQ7TteY/XfZ/c=
Received: from LV2PR01MB7839.prod.exchangelabs.com (2603:10b6:408:14f::13) by
 PH0PR01MB7334.prod.exchangelabs.com (2603:10b6:510:10d::22) with
 Microsoft
 SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.20.7762.24; Sat, 13 Jul 2024 15:48:42 +0000
Received: from LV2PR01MB7839.prod.exchangelabs.com
 ([fe80::2ac3:5a77:36fd:9c63]) by LV2PR01MB7839.prod.exchangelabs.com
 ([fe80::2ac3:5a77:36fd:9c63%4]) with mapi id 15.20.7762.020; Sat, 13 Jul 2024
 15:48:42 +0000
From: Feng Xue OS <fxue@os.amperecomputing.com>
To: Richard Biener <richard.guenther@gmail.com>, "gcc-patches@gcc.gnu.org"
 <gcc-patches@gcc.gnu.org>
Subject: [PATCH 3/4] vect: Support multiple lane-reducing operations for loop
 reduction [PR114440]
Thread-Topic: [PATCH 3/4] vect: Support multiple lane-reducing operations for
 loop  reduction [PR114440]
Thread-Index: AQHa1TwUuukY1odfEEyRRSuRlwLpMQ==
Date: Sat, 13 Jul 2024 15:48:42 +0000
Message-ID: 
 <LV2PR01MB7839A4422945D5C7DA87909DF7A72@LV2PR01MB7839.prod.exchangelabs.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator: 
msip_labels: MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_Enabled=True;
 MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_SiteId=3bc2b170-fd94-476d-b0ce-4229bdc904a7;
 MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_SetDate=2024-07-13T15:48:42.517Z;
 MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_Name=Confidential;
 MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_ContentBits=0;
 MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_Method=Standard;
authentication-results: dkim=none (message not signed)
 header.d=none;dmarc=none action=none header.from=os.amperecomputing.com;
x-ms-publictraffictype: Email
x-ms-traffictypediagnostic: LV2PR01MB7839:EE_|PH0PR01MB7334:EE_
x-ms-office365-filtering-correlation-id: f7a715fd-080e-4ccb-4e11-08dca35345d6
x-ms-exchange-senderadcheck: 1
x-ms-exchange-antispam-relay: 0
x-microsoft-antispam: BCL:0;
 ARA:13230040|1800799024|376014|366016|38070700018;
x-microsoft-antispam-message-info: =?iso-8859-1?q?se2/ft0NLNWddiZF3SPVBQ1hkq?=
	=?iso-8859-1?q?IGyl0ukKZy0rXQnwEThC2ZM8WFWVJfUl1tALd7HAJwEmDck6CguDqY6QaCaJ?=
	=?iso-8859-1?q?/X7mvuBgZWAVzIg7Z/QREtODxJcC5Jlz1uLd2Vto38o75ENRf7ZZjumpX37i?=
	=?iso-8859-1?q?Wpvgumr5oIhuTpV0BHHVkv9POlvdJy5wJVJwG0n55P7DjuxFVxwP7hDwl98C?=
	=?iso-8859-1?q?14iiJK4I38mWNAxvduOe86MWEUdo2EvCZhMymjAxRBaphHQKOKxNYwctoO/E?=
	=?iso-8859-1?q?9MBH5VDyJzPOMwOkKyg3Q2ED92rf+C3+Ueu9C12BuWwbrQpZ+uplthVKnEb1?=
	=?iso-8859-1?q?URTpVJpzyz94Mn6CD+SzB4LupTS6i60twVwBwZ+ePeR6+j46Uu6xN09PjI+R?=
	=?iso-8859-1?q?+wvbprLxPNoNh+Vu4u5fossqr70ACZSNCVf5NNNRmpDBwUJbRtqQ9gM71NlM?=
	=?iso-8859-1?q?Fga+193mo6Vh2d3x5C5r9Ssw20HNly7fwEaH1ElCSgVY9HHiz/47ZBG10zjf?=
	=?iso-8859-1?q?PBHKh18XBjI3n070dWMhozaeqOlcRvpecrHpozIWqX3N2rEkzzKypettj3Ch?=
	=?iso-8859-1?q?4cUXuYaWAEQS4wI43N+UVdYLaXVMIyqec8Dc01iZ/D47vAFSPqTOH5CDHMd9?=
	=?iso-8859-1?q?zNDe/oOWgYesBwMHIfmky+ldpcdg/dSB/YrBmLL1jWTSSCesa/aTx8ZNvYSZ?=
	=?iso-8859-1?q?EEsF3u5bY+V/7xDwtmfldSoYetVBFgSwEi9rddOwjQo/H4lONTJFADJFJOQX?=
	=?iso-8859-1?q?TJllaxdpdLXmtIkcWZpz14MfqiWwpQCQhMe4tw/0I9UW0rrMqkRTQZzNzySR?=
	=?iso-8859-1?q?+gFLwpPMjrKH4789MZ8dOc16bnsfms7O+c6jTEOSIlMNZLRDwrnMBok8KvQ7?=
	=?iso-8859-1?q?WSlcDC8pfj9heTgu9SXQf+5iZ+LXPB5m29IS4y4XAtwu4/0SNJ1OsX/XqtJc?=
	=?iso-8859-1?q?YBgHOah1FgtnQTdeHYtlSMtCIsjbhjKMEWEiSZFePyCt931HLDzz9F+LAIEY?=
	=?iso-8859-1?q?XBF5iEpzM10Swb8sSdOEJwzArXYjfH8A3onNGcn69OLqeVdtE4eWvNLFWbH2?=
	=?iso-8859-1?q?XPTZ1p7ZYtC/2bH3u41b0eyplmyiodSzWzLnEoMqi8Z3R98C7YgrFtfAXCqH?=
	=?iso-8859-1?q?0CCDYLxgd4b0XRsOCBVf4XplPr/ClwOZTj/UyO24M4eR5MY20FexJBjgUAjH?=
	=?iso-8859-1?q?uu055vZ0nZMqS5J3kWhL6x+qxL7T+k4eocMbgV9ZT862WLw9IUXVlJ/DOT30?=
	=?iso-8859-1?q?Hu0jcd0+joLLaXFdKukOSvs0opXzzgTbIgWLM9XSeoX1F5JdXVIROuvpMdu7?=
	=?iso-8859-1?q?6Cy5VRQ8m10z48kCHa2j13B/4vg8tb+EUcjcnOAor0+ahYFluhoM0h5uT+Yw?=
	=?iso-8859-1?q?e9oG/w7pbCEO9wGy8UG5xlHUSp1TulaMzQaoOQXhY=3D?=
x-forefront-antispam-report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:;
 IPV:NLI; SFV:NSPM; H:LV2PR01MB7839.prod.exchangelabs.com; PTR:; CAT:NONE;
 SFS:(13230040)(1800799024)(376014)(366016)(38070700018); DIR:OUT; SFP:1102;
x-ms-exchange-antispam-messagedata-chunkcount: 1
x-ms-exchange-antispam-messagedata-0: =?iso-8859-1?q?hFjtK769CtBeU1jeGCitodS?=
	=?iso-8859-1?q?C7Q6DfbAjdaTyNjDoyacoKiPYIY3kqC0uzBk0YfE487CtmEemGw1el76EWuE?=
	=?iso-8859-1?q?SYaJNcv59eOS6Je1qe6d+0sVRY7rCwbSG+/KOitO52jqu0q3wQv5H1xDVHk9?=
	=?iso-8859-1?q?FtWZjQ8qOPvJrh1toAusFxr+A3N4rsb6Gmgb0SwTRRZkYRq4O788GExuqsCB?=
	=?iso-8859-1?q?2Xj8Y37dzlEJm9t/Y5ofHfDNlUHLD/Nf57k72vhPujXOH181EC6IKba8Diyu?=
	=?iso-8859-1?q?+XoGmN/kWwICRIyNKYl1ME4LTZuXMiivOpxG+33wBQ+fkoNuH9XbNzAWxWqY?=
	=?iso-8859-1?q?sPGBiQzVW1W8WGw/WZOi6f0McVmy2ik6OZ1s2/7fUSYrv6RhQOH6vVc6zyPl?=
	=?iso-8859-1?q?aYzxcfxD8nDWXPzOa5FqI/THp4AdzAPL1IXGUqy53vfri+B2lw4zxnnUYdn/?=
	=?iso-8859-1?q?JtBGMlg/cpCz29zFGhArG3BTCVgU8e8HlJXDfbQqoneiz0lSN5KCfjvqp4Ky?=
	=?iso-8859-1?q?Q9dSssPlDMq/qpCF+upSXjTb0mXVi4NjJR8Le3el4e8H9BZjlwG2NqU3gxn6?=
	=?iso-8859-1?q?FwgGcrteEZa+BHc2l6Cl3jiSjd0cKyJA/QFYExuHhVcDZ5SboksXfD5u61bM?=
	=?iso-8859-1?q?9VUgb6dbauGcjYlGRlV+YsGwC1zFOUVkcT1MnYaumXTlTsinkI66q7FNmQvP?=
	=?iso-8859-1?q?14BMVmOgD6BKQuvC6IZHvJRiUYqqspyKu6wJcekhD4OOD/ZcROrWSuOHcLqY?=
	=?iso-8859-1?q?sIK7uPjYrBZbNqMMQjodiDAnvhYRRPvuPykY4HhqhvZQ8U80S8RseLF6Zk5G?=
	=?iso-8859-1?q?QcW3SuDZc7jdpkSDBx/notvm5DAlcEgQWU720v7ZvTVN7oURrIwnV4/TLBgl?=
	=?iso-8859-1?q?X+Vb/6IurKxFkNxPRSHXH9nnkwAIEHxrMGszbv+aOMdaFoon+XswcKsIL95D?=
	=?iso-8859-1?q?1Jj2Wec+PawHThVxL14tQlqr2dlhzAPbwX53OalG/+Y5Ts6gumjMlNvG0yCa?=
	=?iso-8859-1?q?sIJN0nKGUVQ6/b9J5k4kejK3NDogYeOeSIDpu9oWWNh5sqSweRiVV7VIE1bp?=
	=?iso-8859-1?q?5+w6bC7SO+P+dXF0MBleS+dze+a7yTRZ0C1iku6IihNVwk/TmxNXEVlJz+Z4?=
	=?iso-8859-1?q?RNQMCV8Vdp+NojLxzFiqqwGnlE86VTU4rfXS8p0PraUl27oOIZwq6SW9cchk?=
	=?iso-8859-1?q?MQjjZXHHVVraHdtpu6MBW6BKHpkouf5f8UkX0SySnuqcFnq5oijCtti8s+1Y?=
	=?iso-8859-1?q?RIYMoWqqJOXQl/DEhuYeu7cCcrqxK3nVhag1nykZQaJ0brEks+ehM8V6lugv?=
	=?iso-8859-1?q?emjvUB2C/KEmMgCeXXOu3f9PmzZg3bS1WOAoIxY0zmH3NHimseZ5F476Oa9A?=
	=?iso-8859-1?q?BvPDCEYwOb/y5rMojCHvlYnFgneJBio2dvcTexQ8wvz6Dsfa2i6K0F6waEU8?=
	=?iso-8859-1?q?pAvgx8t3e1hu9enKLbdbdblK+84utC9Wz16fWYceoylakxeJL5oVb8tONKpg?=
	=?iso-8859-1?q?asM29Bm/5dWG5xFhMVpdgSc8gcqxrr6wgIrsl2uZwYeGX6vCanTOMoTBhzkH?=
	=?iso-8859-1?q?gqCwlYKxZX3o47dYjOnis/GXGMD2HZ1ZNUIxos6ATKUEWdchtAeup4i33YEg?=
	=?iso-8859-1?q?/lXoH5jH7y9QWC/8X?=
MIME-Version: 1.0
X-OriginatorOrg: os.amperecomputing.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-AuthSource: LV2PR01MB7839.prod.exchangelabs.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 f7a715fd-080e-4ccb-4e11-08dca35345d6
X-MS-Exchange-CrossTenant-originalarrivaltime: 13 Jul 2024 15:48:42.8865 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: 3bc2b170-fd94-476d-b0ce-4229bdc904a7
X-MS-Exchange-CrossTenant-mailboxtype: HOSTED
X-MS-Exchange-CrossTenant-userprincipalname: 
 zcCdBcuo6RAnfiqFquMxAp7ry9gbGrtTvsH/z9YXTLiBMNqIsWMj+94VKm6zXkwkdCPSL/NUfXdDMzpUqFul4Xk7tFOB89S/k2GKIu2CBDE=
X-MS-Exchange-Transport-CrossTenantHeadersStamped: PH0PR01MB7334
X-Spam-Status: No, score=-12.4 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, GIT_PATCH_0, SPF_HELO_PASS, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~patchwork=sourceware.org@gcc.gnu.org

For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current
vectorizer could only handle the pattern if the reduction chain does not
contain other operation, no matter the other is normal or lane-reducing.

This patches removes some constraints in reduction analysis to allow multiple
arbitrary lane-reducing operations with mixed input vectypes in a loop
reduction chain. For example:

   int sum = 1;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

The vector size is 128-bit vectorization factor is 16. Reduction statements
would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 1 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

    sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1

Thanks,
Feng
---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (vectorizable_lane_reducing): New function
	declaration.
	* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
	vectorizable_lane_reducing to analyze lane-reducing operation.
	* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
	code related to	emulated_mixed_dot_prod.
	(vectorizable_lane_reducing): New function.
	(vectorizable_reduction): Allow multiple lane-reducing operations in
	loop reduction. Move some original lane-reducing related code to
	vectorizable_lane_reducing.
	(vect_transform_reduction): Adjust comments with updated example.

gcc/testsuite/
	PR tree-optimization/114440
	* gcc.dg/vect/vect-reduc-chain-1.c
	* gcc.dg/vect/vect-reduc-chain-2.c
	* gcc.dg/vect/vect-reduc-chain-3.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
	* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  64 +++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  79 ++++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  68 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 ++++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 +++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 +++++
 gcc/tree-vect-loop.cc                         | 240 +++++++++++++-----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 750 insertions(+), 69 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

From 0889639114ecb8ab2e46dc4effe8f114f5ab8ad6 Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Wed, 29 May 2024 17:22:36 +0800
Subject: [PATCH 3/4] vect: Support multiple lane-reducing operations for loop
 reduction [PR114440]

For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current
vectorizer could only handle the pattern if the reduction chain does not
contain other operation, no matter the other is normal or lane-reducing.

This patches removes some constraints in reduction analysis to allow multiple
arbitrary lane-reducing operations with mixed input vectypes in a loop
reduction chain. For example:

   int sum = 1;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

The vector size is 128-bit vectorization factor is 16. Reduction statements
would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 1 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

    sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1

2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (vectorizable_lane_reducing): New function
	declaration.
	* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
	vectorizable_lane_reducing to analyze lane-reducing operation.
	* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
	code related to	emulated_mixed_dot_prod.
	(vectorizable_lane_reducing): New function.
	(vectorizable_reduction): Allow multiple lane-reducing operations in
	loop reduction. Move some original lane-reducing related code to
	vectorizable_lane_reducing.
	(vect_transform_reduction): Adjust comments with updated example.

gcc/testsuite/
	PR tree-optimization/114440
	* gcc.dg/vect/vect-reduc-chain-1.c
	* gcc.dg/vect/vect-reduc-chain-2.c
	* gcc.dg/vect/vect-reduc-chain-3.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
	* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
	* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  64 +++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  79 ++++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  68 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 ++++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 +++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 +++++
 gcc/tree-vect-loop.cc                         | 240 +++++++++++++-----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 750 insertions(+), 69 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..80b0089ea0f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,64 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+
+  #pragma GCC novector
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..5bc2686fc9d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 char *restrict c,
+   SIGNEDNESS_3 char *restrict d,
+   SIGNEDNESS_4 short *restrict e,
+   SIGNEDNESS_4 short *restrict f,
+   SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+
+#pragma GCC novector
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..6a733fbac53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,68 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+
+#pragma GCC novector
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..72a370ab3c0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+#pragma GCC novector
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+    }
+
+#pragma GCC novector
+  for (int i = 0; i < n; i++)
+    {
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..aab86ee2f1c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+#pragma GCC novector
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+    }
+
+#pragma GCC novector
+  for (int i = 0; i < n; i++)
+    {
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..9f1d2136ab6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int step = 8;
+  int n = 2;
+  int t = 0;
+
+#pragma GCC novector
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+    }
+
+#pragma GCC novector
+  for (int i = 0; i < n; i++)
+    {
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
new file mode 100644
index 00000000000..f4dcebdfa10
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
@@ -0,0 +1,63 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[3 * i + 0] * b[3 * i + 0];
+      res += a[3 * i + 1] * b[3 * i + 1];
+      res += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+#pragma GCC novector
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+    }
+
+#pragma GCC novector
+  for (int i = 0; i < n; i++)
+    {
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..84c82b023d4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,60 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0 -fdump-tree-optimized" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_1 int res4,
+   SIGNEDNESS_1 int res5,
+   SIGNEDNESS_1 int res6,
+   SIGNEDNESS_1 int res7,
+   SIGNEDNESS_1 int res8,
+   SIGNEDNESS_1 int res9,
+   SIGNEDNESS_1 int resA,
+   SIGNEDNESS_1 int resB,
+   SIGNEDNESS_1 int resC,
+   SIGNEDNESS_1 int resD,
+   SIGNEDNESS_1 int resE,
+   SIGNEDNESS_1 int resF,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b)
+{
+  for (int i = 0; i < 64; i += 16)
+    {
+      res0 += a[i + 0x00] * b[i + 0x00];
+      res1 += a[i + 0x01] * b[i + 0x01];
+      res2 += a[i + 0x02] * b[i + 0x02];
+      res3 += a[i + 0x03] * b[i + 0x03];
+      res4 += a[i + 0x04] * b[i + 0x04];
+      res5 += a[i + 0x05] * b[i + 0x05];
+      res6 += a[i + 0x06] * b[i + 0x06];
+      res7 += a[i + 0x07] * b[i + 0x07];
+      res8 += a[i + 0x08] * b[i + 0x08];
+      res9 += a[i + 0x09] * b[i + 0x09];
+      resA += a[i + 0x0A] * b[i + 0x0A];
+      resB += a[i + 0x0B] * b[i + 0x0B];
+      resC += a[i + 0x0C] * b[i + 0x0C];
+      resD += a[i + 0x0D] * b[i + 0x0D];
+      resE += a[i + 0x0E] * b[i + 0x0E];
+      resF += a[i + 0x0F] * b[i + 0x0F];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3 ^ res4 ^ res5 ^ res6 ^ res7 ^
+         res8 ^ res9 ^ resA ^ resB ^ resC ^ resD ^ resE ^ resF;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "DOT_PROD_EXPR" "optimized" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5ac83e76975..e72d692ffa3 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5328,8 +5328,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();
 
-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5364,12 +5362,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
 	   initial result of the data reduction, initial value of the index
 	   reduction.  */
 	prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-	/* We need the initial reduction value and two invariants:
-	   one that contains the minimum signed value and one that
-	   contains half of its negative.  */
-	prologue_stmts = 3;
       else
+	/* We need the initial reduction value.  */
 	prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
 					 scalar_to_vec, stmt_info, 0,
@@ -7478,6 +7472,143 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }
 
+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
+   Now there are three such kinds of operations: dot-prod/widen-sum/sad
+   (sum-of-absolute-differences).
+
+   For a lane-reducing operation, the loop reduction path that it lies in,
+   may contain normal operation, or other lane-reducing operation of different
+   input type size, an example as:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   Vectorization factor is essentially determined by operation whose input
+   vectype has the most lanes ("vector(16) char" in the example), while we
+   need to choose input vectype with the least lanes ("vector(4) int" in the
+   example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+			    slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+			       &slp_op, &dt, &vectype, &def_stmt_info))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "use not simple.\n");
+	  return false;
+	}
+
+      if (!vectype)
+	{
+	  vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+						 slp_op);
+	  if (!vectype)
+	    return false;
+	}
+
+      if (slp_node && !vect_maybe_update_slp_op_vectype (slp_op, vectype))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "incompatible vector types for invariants\n");
+	  return false;
+	}
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+	continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+	return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operation that does not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recoginized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+
+  gcc_assert (vectype_in);
+
+  /* Compute number of effective vector statements for costing.  */
+  unsigned int ncopies_for_cost = vect_get_num_copies (loop_vinfo, slp_node,
+						       vectype_in);
+  gcc_assert (ncopies_for_cost >= 1);
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need extra two invariants: one that contains the minimum signed
+	 value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+					scalar_to_vec, stmt_info, 0,
+					vect_prologue);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+		     "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, (int) ncopies_for_cost, vector_stmt, stmt_info,
+		    0, vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+						  slp_node, code, type,
+						  vectype_in);
+    }
+
+  /* Transform via vect_transform_reduction.  */
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.
 
    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7811,18 +7942,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;
 
-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8362,14 +8481,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);
 
-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop
 
 	 This isn't necessary for the lane reduction codes, since they
 	 can only be produced by pattern matching, and it's up to the
@@ -8378,14 +8494,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	 mixed-sign dot-products can be implemented using signed
 	 dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-	  && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
 	  if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
 	      || !vect_can_vectorize_without_simd_p (op.code))
-	    ok = false;
+	    single_defuse_cycle = false;
 	  else
 	    if (dump_enabled_p ())
 	      dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8398,16 +8513,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	    dump_printf (MSG_NOTE, "using word mode not possible.\n");
 	  return false;
 	}
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-	{
-	  if (lane_reducing)
-	    return false;
-	  else
-	    single_defuse_cycle = false;
-	}
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8415,22 +8520,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 		     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
 
-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "multi def-use cycle not possible for lane-reducing "
-			 "reduction operation\n");
-      return false;
-    }
+  /* For lane-reducing operation, the below processing related to single
+     defuse-cycle will be done in its own vectorizable function.  One more
+     thing to note is that the operation must not be involved in fold-left
+     reduction.  */
+  single_defuse_cycle &= !lane_reducing;
 
   if (slp_node
-      && !(!single_defuse_cycle
-	   && !lane_reducing
-	   && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
 	{
@@ -8443,28 +8540,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
 			     reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-	/* Three dot-products and a subtraction.  */
-	factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-			stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for non-lane-reducing operation.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
 
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
 	= vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8742,13 +8831,16 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	 And vector reduction PHIs are always generated to the full extent, no
 	 matter lane-reducing op exists or not.  If some copies or PHIs are
 	 actually superfluous, they would be cleaned up by passes after
-	 vectorization.  An example for single-lane slp is given as below.
+	 vectorization.  An example for single-lane slp, lane-reducing ops
+	 with mixed input vectypes in a reduction chain, is given as below.
 	 Similarly, this handling is applicable for multiple-lane slp as well.
 
 	   int sum = 1;
 	   for (i)
 	     {
 	       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+	       sum += w[i];               // widen-sum <vector(16) char>
+	       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
 	     }
 
 	 The vector size is 128-bit，vectorization factor is 16.  Reduction
@@ -8765,9 +8857,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v1 = sum_v1;  // copy
 	       sum_v2 = sum_v2;  // copy
 	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
+	       sum_v1 = sum_v1;  // copy
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
 	     }
 
-	   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
+	   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
 	*/
       unsigned effec_ncopies = vec_oprnds[0].length ();
       unsigned total_ncopies = vec_oprnds[reduc_index].length ();
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..135580d25d7 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13286,6 +13286,8 @@ vect_analyze_stmt (vec_info *vinfo,
 				      NULL, NULL, node, cost_vec)
 	  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
 	  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+	  || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+					 stmt_info, node, cost_vec)
 	  || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
 				     node, node_instance, cost_vec)
 	  || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 09923b9b440..62121f63f18 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2486,6 +2486,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
 					 slp_tree, slp_instance, int,
 					 bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+					slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
 				    slp_tree, slp_instance,
 				    stmt_vector_for_cost *);
-- 
2.17.1