From patchwork Tue Feb 27 13:56:42 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Andre Vieira (lists)"
X-Patchwork-Id: 56734
From: Andre Vieira
To: gcc-patches@gcc.gnu.org
Cc: stam.markianos-wright@arm.com, richard.earnshaw@arm.com, Andre Vieira
Subject: [PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Date: Tue, 27 Feb 2024 13:56:42 +0000
Message-Id: <20240227135647.30404-1-andre.simoesdiasvieira@arm.com>
X-Mailer: git-send-email 2.17.1

Hi,

I have re-ordered the patches. Our latest plan is to commit only patches 1-3 and leave 4-5 for GCC 15, as we believe it is too late in Stage 4 to be making changes to target-agnostic parts, especially since these affect so many ports that we cannot easily test.

[1/5] arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns
[2/5] arm: Annotate instructions with mve_safe_imp_xlane_pred
[3/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators
[4/5] doloop: Add support for predicated vectorized loops
[5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Original cover letter:

This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop feature.
The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature we are adding support for with this patch (although we only add codegen for DLSTP/LETP instruction loops).

Previously, with commit d2ed233cb94, we added support for non-MVE DLS/LE loops through the loop-doloop pass, which, given a standard MVE loop like:

```
#include <arm_mve.h>

void __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c+=8;
      a+=8;
      b+=8;
      n-=8;
    }
}
```

.. would output:

```
        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr    P0, ip  @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and branch if needed.

(There are also other inefficiencies in the above code, like the pointless vmrs/sxth/vmsr on the VPR, the adds not being merged into the vldrht/vstrht as #16 offsets, and some random movs! But those are different problems...)

The MVE tail-predicated version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the loop, we place the number of elements to be processed by the loop into LR.
* Instead of decrementing LR by one, LETP decrements it by the number of elements processed in each iteration, as selected by FPSCR.LTPSIZE: 16 for 8-bit elements, 8 for 16-bit elements, etc.
* On the final iteration, automatic Loop Tail Predication is performed, as if the instructions within the loop had been VPT predicated with a VCTP generating the VPR predicate in every loop iteration.
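To make the difference in how LR is used concrete, here is a minimal scalar sketch in plain C of the two counting schemes described above. It is only an illustration under stated assumptions: the helper names (`lanes_per_vector`, `dls_le_iterations`, `dlstp_letp_add`) are hypothetical and not part of the patch series, the vector length is taken to be the 128 bits of an MVE Q register, and the LETP exit condition is simplified to "stop when no elements remain".

```c
#include <stdint.h>

/* Hypothetical helper: lanes in one 128-bit MVE vector for a given
   element width in bits (16 lanes for 8-bit, 8 lanes for 16-bit).  */
static unsigned lanes_per_vector(unsigned elem_bits)
{
    return 128u / elem_bits;
}

/* DLS/LE model: LR is loaded with a precomputed iteration count and
   LE decrements it by one per iteration.  */
static unsigned dls_le_iterations(unsigned n, unsigned elem_bits)
{
    unsigned lanes = lanes_per_vector(elem_bits);
    unsigned lr = (n + lanes - 1) / lanes;  /* computed before the loop */
    unsigned iters = 0;

    while (lr != 0) {
        iters++;
        lr -= 1;                            /* what LE does to LR */
    }
    return iters;
}

/* DLSTP/LETP model for the 16-bit add loop: LR is loaded with the
   element count itself. Lanes at or beyond the remaining count are
   inactive, which is the effect of the implicit tail predicate.
   Simplifying assumption: LR is decremented by the number of active
   lanes, so it reaches exactly zero on the final iteration.  */
static unsigned dlstp_letp_add(const int16_t *a, const int16_t *b,
                               int16_t *c, unsigned n)
{
    const unsigned lanes = lanes_per_vector(16);
    unsigned lr = n;                        /* elements, not iterations */
    unsigned base = 0;
    unsigned iters = 0;

    while (lr != 0) {
        unsigned active = lr < lanes ? lr : lanes;

        for (unsigned i = 0; i < active; i++)
            c[base + i] = (int16_t)(a[base + i] + b[base + i]);

        base += lanes;
        lr -= active;
        iters++;
    }
    return iters;
}
```

For 11 elements of 16 bits, both models run two iterations; in the second iteration of the DLSTP/LETP model only three lanes are active, which is the role the user-written VCTP plays in the intrinsics loop.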
The dlstp/letp loop now looks like:

```
        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16        q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14
```

Since the loop tail predication is automatic, we have eliminated the VCTP that had been specified by the user in the intrinsic and converted the VPT-predicated instructions into their unpredicated equivalents (which also saves us from VPST insns).

The LETP instruction here decrements LR by 8 in each iteration.

Stam Markianos-Wright (1):
  arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns

Andre Vieira (4):
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  doloop: Add support for predicated vectorized loops
  arm: Add support for MVE Tail-Predicated Low Overhead Loops