From patchwork Wed Jun 22 11:39:26 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roger Sayle X-Patchwork-Id: 55271 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 42CED38344D3 for ; Wed, 22 Jun 2022 11:39:47 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from server.nextmovesoftware.com (server.nextmovesoftware.com [162.254.253.69]) by sourceware.org (Postfix) with ESMTPS id 8D42E3858002 for ; Wed, 22 Jun 2022 11:39:30 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8D42E3858002 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=nextmovesoftware.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nextmovesoftware.com DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=nextmovesoftware.com; s=default; h=Content-Type:MIME-Version:Message-ID: Date:Subject:Cc:To:From:Sender:Reply-To:Content-Transfer-Encoding:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=QFM71AYBSxOHrGGOlCghJ6nc1FMo5o+2+XRzcnESTNQ=; b=cxiT41ZIRy4KgCYU6z7SLkoHiG NEtyEGmQNQJCNstWeon75dl9CNf0yOWakoCf/2nV+HgQuZkEuvk8Ij96mf6MBq/MbuSm5+9scONJP 1OWeIOsGHfxsG9ahAv4qAxHUZXuYNuq5i8u8QO8CAQLhe+xuXz6PR6cjaJIIc9J4UOC2JVKFZQ818 F1lwn1xfuGonnIFWo94vwo066pJLl11v+n7mX0v8jEgux8gtN31WuU6UTu5haqxml2xmX6WeTTtCH ntrhijGhjLWULRcY7qmOvv8CeRWgtxmDlrgGomRwPfAHNFCePkflTtmvusd9q9o/afkn/fIEdZcrl 2xKT4tjg==; Received: from host86-130-134-60.range86-130.btcentralplus.com ([86.130.134.60]:56716 helo=Dell) by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1o3yhl-00014a-Jg; Wed, 22 Jun 2022 07:39:29 -0400 From: "Roger Sayle" To: Subject: [x86 PATCH] PR target/105930: Split *xordi3_doubleword after reload. Date: Wed, 22 Jun 2022 12:39:26 +0100 Message-ID: <00e701d8862c$bb2d2290$318767b0$@nextmovesoftware.com> MIME-Version: 1.0 X-Mailer: Microsoft Outlook 16.0 Thread-Index: AdiGKz9GOoSFJoIUQ8Saw5CTcNFCnw== Content-Language: en-gb X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com X-AntiAbuse: Original Domain - gcc.gnu.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - nextmovesoftware.com X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id: roger@nextmovesoftware.com X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com X-Source: X-Source-Args: X-Source-Dir: X-Spam-Status: No, score=-12.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: 'Jakub Jelinek' Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" This patch addresses PR target/105930 which is an ia32 stack frame size regression in high-register pressure XOR-rich cryptography functions reported by Linus Torvalds. The underlying problem is once the limited number of registers on the x86 are exhausted, the register allocator has to decide which to spill, where some eviction choices lead to much poorer code, but these consequences are difficult to predict in advance. The patch below, which splits xordi3_doubleword and iordi3_doubleword after reload (instead of before), significantly reduces the amount of spill code and stack frame size, in what might appear to be an arbitrary choice. My explanation of this behaviour is that the mixing of pre-reload split SImode instructions and post-reload split DImode instructions is confusing some of the heuristics used by reload. One might think that splitting early gives the register allocator more freedom to use available registers, but in practice the constraint that double word values occupy consecutive registers (when ultimately used as a DImode value) is the greater constraint. Instead, I believe in this case, the pseudo registers used in memory addressing, appear to be double counted for split SImode instructions compared to unsplit DImode instructions. For the reduced test case in comment #13, this leads to %eax being used to hold the long-lived argument pointer "v", blocking the use of the ax:dx pair for processing double word values. The important lines are at the very top of the assembly output: GCC 11 [use %ecx to address memory, require a 24-byte stack frame] sub esp, 24 mov ecx, DWORD PTR [esp+40] GCC 12 [use %eax to address memory, require a 44-byte stack frame] sub esp, 44 mov eax, DWORD PTR [esp+64] Jakub's alternative proposed patch in comment #17 is to improve consistency by splitting more instructions (rotates) before reload, which shows a small improvement, but not to GCC v11 levels. I have some follow-up patches (based on other approaches I've tried), including splitting rotations by 32 after reload, and supporting TImode operations via , notice this same pattern is mentioned in https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html but this patch below is the core minimal fix that's hopefully suitable for benchmarking and possibly backporting to the 12 branch. I believe that changes to the register allocator itself, to tweak how stack slots are assigned and which values can be cheaply materialized are out-of-scope for the release branches. I'm curious what folks (especially Uros and Jakub) think? This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. CSiBE also shows a (minor) code size reduction. It would be great if someone could benchmark this patch, or alternatively it can be baked on mainline to let the automatic benchmarking evaluate it, then revert the patch if there are any observed performance issues. Thoughts? 2022-06-22 Roger Sayle gcc/ChangeLog PR target/105930 * config/i386/i386.md (*di3_doubleword): Split after reload. Use rtx_equal_p to avoid creating memory-to-memory moves, and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0). (define_insn _1): Renamed from *mode_1 to provide gen_si_1 functions. Roger diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 3093cb5..537fba21 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10539,48 +10539,61 @@ "ix86_expand_binary_operator (, mode, operands); DONE;") (define_insn_and_split "*di3_doubleword" - [(set (match_operand:DI 0 "nonimmediate_operand") + [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r") (any_or:DI - (match_operand:DI 1 "nonimmediate_operand") - (match_operand:DI 2 "x86_64_szext_general_operand"))) + (match_operand:DI 1 "nonimmediate_operand" "0,0") + (match_operand:DI 2 "x86_64_szext_general_operand" "re,o"))) (clobber (reg:CC FLAGS_REG))] "!TARGET_64BIT - && ix86_binary_operator_ok (, DImode, operands) - && ix86_pre_reload_split ()" + && ix86_binary_operator_ok (, DImode, operands)" "#" - "&& 1" + "&& reload_completed" [(const_int 0)] { + /* This insn may disappear completely when operands[2] == const0_rtx + and operands[0] == operands[1], which requires a NOTE_INSN_DELETED. */ + bool emit_insn_deleted_note_p = false; + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); if (operands[2] == const0_rtx) - emit_move_insn (operands[0], operands[1]); + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + else + emit_insn_deleted_note_p = true; + } else if (operands[2] == constm1_rtx) { if ( == IOR) emit_move_insn (operands[0], constm1_rtx); else - ix86_expand_unary_operator (NOT, SImode, &operands[0]); + emit_insn (gen_one_cmplsi2 (operands[0], operands[1])); } else - ix86_expand_binary_operator (, SImode, &operands[0]); + emit_insn (gen_si_1 (operands[0], operands[1], operands[2])); if (operands[5] == const0_rtx) - emit_move_insn (operands[3], operands[4]); + { + if (!rtx_equal_p (operands[3], operands[4])) + emit_move_insn (operands[3], operands[4]); + else if (emit_insn_deleted_note_p) + emit_note (NOTE_INSN_DELETED); + } else if (operands[5] == constm1_rtx) { if ( == IOR) emit_move_insn (operands[3], constm1_rtx); else - ix86_expand_unary_operator (NOT, SImode, &operands[3]); + emit_insn (gen_one_cmplsi2 (operands[3], operands[4])); } else - ix86_expand_binary_operator (, SImode, &operands[3]); + emit_insn (gen_si_1 (operands[3], operands[4], operands[5])); DONE; }) -(define_insn "*_1" +(define_insn "_1" [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k") (any_or:SWI248 (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")