Message ID | 032501d87651$43cf0960$cb6d1c20$@nextmovesoftware.com |
---|---|
State | New |
Headers | From: "Roger Sayle" <roger@nextmovesoftware.com> |
| To: "'GCC Patches'" <gcc-patches@gcc.gnu.org> |
| Subject: [x86 PATCH] Add peephole2 to reduce double word register shuffling. |
| Date: Thu, 2 Jun 2022 08:20:39 +0100 |
Series | [x86] Add peephole2 to reduce double word register shuffling. |
Commit Message
Roger Sayle
June 2, 2022, 7:20 a.m. UTC
The simple test case below demonstrates an interesting register
allocation challenge facing x86_64, imposed by ABI requirements
on __int128.

__int128 foo(__int128 x, __int128 y)
{
  return x+y;
}

For which GCC currently generates the unusual sequence:

        movq    %rsi, %rax
        movq    %rdi, %r8
        movq    %rax, %rdi
        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %r8, %rax
        adcq    %rdi, %rdx
        ret

The challenge is that the x86_64 ABI requires passing the first
__int128, x, in %rsi:%rdi (highpart in %rsi, lowpart in %rdi), where
internally GCC prefers TI mode (double word) integers to be register
allocated as %rdi:%rsi (highpart in %rdi, lowpart in %rsi).  So after
reload, we have four mov instructions: two to move the double word to
temporary registers and then two to move it back.

This patch adds a peephole2 to spot this register shuffling, and with
-Os generates an xchg instruction, to produce:

        xchgq   %rsi, %rdi
        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %rsi, %rax
        adcq    %rdi, %rdx
        ret

or, when optimizing for speed, a three-mov sequence using just one of
the temporary registers, which ultimately results in the improved:

        movq    %rdi, %r8
        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %r8, %rax
        adcq    %rsi, %rdx
        ret

I have a follow-up patch which improves things further, and with the
output in flux, I'd like to add the new testcase with part 2, once
we're back down to requiring only two movq instructions.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?

2022-06-02  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/i386.md (define_peephole2): Recognize double word
	swap sequences, and replace them with more efficient idioms,
	including using xchg when optimizing for size.

Thanks in advance,
Roger
--
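As background for the discussion below, here is a minimal sketch
(illustrative only; the accessor functions are invented for this note
and are not part of the patch) of how the two halves of the first
__int128 argument map onto hard registers under the x86_64 SysV ABI:

#include <stdint.h>

/* Each accessor compiles to a single register move plus ret, making
   the ABI assignment described above visible: the low 64 bits of the
   first __int128 argument arrive in %rdi, the high 64 bits in %rsi.  */

uint64_t lowpart (unsigned __int128 v)
{
  return (uint64_t) v;            /* arrives in %rdi */
}

uint64_t highpart (unsigned __int128 v)
{
  return (uint64_t) (v >> 64);    /* arrives in %rsi */
}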
Comments
On Thu, Jun 2, 2022 at 9:21 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> [...]
>
> The challenge is that the x86_64 ABI requires passing the first __int128,
> x, in %rsi:%rdi (highpart in %rsi, lowpart in %rdi), where internally
> GCC prefers TI mode (double word) integers to be register allocated as
> %rdi:%rsi (highpart in %rdi, lowpart in %rsi).

Do you know if this is a hard limitation?  I guess reg:TI 2 will cover
hardregs 2 and 3, and the overlap is always implicit adjacent hardregs?
I suspect that in other places we prefer the current hardreg ordering,
so altering it to make it match the __int128 register passing
convention is not an option.

Alternatively, TImode ops should be split before RA, and for register
passing (concat:TI ...) could be allowed?

Fixing up after the fact is of course possible, but it looks awkward
that there's no good way for the RA and the backend to communicate
better here?

> [...]
On Thu, Jun 2, 2022 at 9:20 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> [...]
>
> I've a follow-up patch which improves things further, and with the
> output in flux, I'd like to add the new testcase with part 2, once
> we're back down to requiring only two movq instructions.

Shouldn't we rather do something about:

(insn 2 9 3 2 (set (reg:DI 85)
        (reg:DI 5 di [ x ])) "dword-2.c":2:1 82 {*movdi_internal}
     (nil))
(insn 3 2 4 2 (set (reg:DI 86)
        (reg:DI 4 si [ x+8 ])) "dword-2.c":2:1 82 {*movdi_internal}
     (nil))
(insn 4 3 5 2 (set (reg:TI 84)
        (subreg:TI (reg:DI 85) 0)) "dword-2.c":2:1 81 {*movti_internal}
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 84) 8)
        (reg:DI 86)) "dword-2.c":2:1 82 {*movdi_internal}
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 83 [ x ])
        (reg:TI 84)) "dword-2.c":2:1 81 {*movti_internal}
     (nil))

The above is how the function's TImode argument is constructed.

The other problem is that double-word addition gets split only after
reload, mostly due to RA reasons.  In the past it was determined that
RA creates better code when registers are split late (this reason
probably does not hold anymore), but nowadays the limitation remains
only for arithmetic and shifts.

Attached to this message, please find the patch that performs double
word mode arithmetic splitting before reload.  It improves generated
code somewhat, but due to the above argument construction sequence,
the bulk of the moves remain.  Unfortunately, when under register
pressure (e.g. 32-bit targets), the peephole approach gets ineffective
due to register spilling, so IMO the root of the problem should be
fixed.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 050dee7d43a..876220d57f9 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -5730,9 +5730,10 @@
 	    (match_operand:<DWI> 1 "nonimmediate_operand" "%0,0")
 	    (match_operand:<DWI> 2 "x86_64_hilo_general_operand" "r<di>,o")))
    (clobber (reg:CC FLAGS_REG))]
-  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
+  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -5750,6 +5751,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       ix86_expand_binary_operator (PLUS, <MODE>mode, &operands[3]);
       DONE;
     }
@@ -6539,9 +6541,10 @@
 	       (plus:<DWI> (match_dup 1) (match_dup 2)))))
    (set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r")
 	(plus:<DWI> (match_dup 1) (match_dup 2)))]
-  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
+  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -6567,9 +6570,7 @@
 		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
 		       (match_dup 4))
 		     (match_dup 5)))])]
-{
-  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
-})
+  "split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);")
 
 (define_insn_and_split "*addv<dwi>4_doubleword_1"
   [(set (reg:CCO FLAGS_REG)
@@ -6586,9 +6587,10 @@
 	(plus:<DWI> (match_dup 1) (match_dup 2)))]
   "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
    && CONST_SCALAR_INT_P (operands[2])
-   && rtx_equal_p (operands[2], operands[3])"
+   && rtx_equal_p (operands[2], operands[3])
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -6618,6 +6620,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       emit_insn (gen_addv<mode>4_1 (operands[3], operands[4],
 				    operands[5], operands[5]));
       DONE;
@@ -6880,9 +6883,10 @@
 	    (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")
 	    (match_operand:<DWI> 2 "x86_64_hilo_general_operand" "r<di>,o")))
    (clobber (reg:CC FLAGS_REG))]
-  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)"
+  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0)
@@ -6898,6 +6902,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       ix86_expand_binary_operator (MINUS, <MODE>mode, &operands[3]);
       DONE;
     }
@@ -7080,9 +7085,10 @@
 	       (minus:<DWI> (match_dup 1) (match_dup 2)))))
    (set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r")
 	(minus:<DWI> (match_dup 1) (match_dup 2)))]
-  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)"
+  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0)
@@ -7106,9 +7112,7 @@
 			  (match_dup 4)
 			  (ltu:DWIH (reg:CC FLAGS_REG)
 				    (const_int 0)))
 			(match_dup 5)))])]
-{
-  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
-})
+  "split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);")
 
 (define_insn_and_split "*subv<dwi>4_doubleword_1"
   [(set (reg:CCO FLAGS_REG)
@@ -7125,9 +7129,10 @@
 	(minus:<DWI> (match_dup 1) (match_dup 2)))]
   "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)
    && CONST_SCALAR_INT_P (operands[2])
-   && rtx_equal_p (operands[2], operands[3])"
+   && rtx_equal_p (operands[2], operands[3])
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0)
@@ -7155,6 +7160,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       emit_insn (gen_subv<mode>4_1 (operands[3], operands[4],
 				    operands[5], operands[5]));
       DONE;
@@ -7805,9 +7811,10 @@
 		 (match_dup 1)))
    (set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r")
 	(plus:<DWI> (match_dup 1) (match_dup 2)))]
-  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
+  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -7834,6 +7841,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       emit_insn (gen_addcarry<mode>_0 (operands[3], operands[4], operands[5]));
       DONE;
     }
@@ -10977,9 +10985,10 @@
   [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro")
 	(neg:<DWI> (match_operand:<DWI> 1 "nonimmediate_operand" "0")))
    (clobber (reg:CC FLAGS_REG))]
-  "ix86_unary_operator_ok (NEG, <DWI>mode, operands)"
+  "ix86_unary_operator_ok (NEG, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel
     [(set (reg:CCC FLAGS_REG)
 	   (ne:CCC (match_dup 1) (const_int 0)))
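To make the double word splitting discussed above concrete: on a
32-bit x86 target, a plain 64-bit addition is a double word (DImode)
operation that the backend lowers to an addl/adcl pair over the two
32-bit halves.  A minimal sketch, assuming -m32 (editorial, not from
the thread):

#include <stdint.h>

/* On x86 with -m32, this DImode addition is split by the i386 backend
   into an addl on the low 32-bit halves followed by an adcl on the
   high halves; the thread debates whether that split should happen
   before or after register allocation.  */
uint64_t add64 (uint64_t a, uint64_t b)
{
  return a + b;
}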
On Thu, Jun 2, 2022 at 11:32 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, Jun 2, 2022 at 9:20 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
> > [...]
>
> The other problem is that double-word addition gets split only after
> reload, mostly due to RA reasons.  In the past it was determined that
> RA creates better code when registers are split late (this reason
> probably does not hold anymore), but nowadays the limitation remains
> only for arithmetic and shifts.

FYI, the effect of the patch can be seen with the following testcase:

--cut here--
#include <stdint.h>

void test (int64_t n)
{
  while (1)
    {
      n++;
      asm volatile ("#" :: "b" ((int32_t) n), "c" ((int32_t) (n >> 32)));
    }
}
--cut here--

Please compile this with -O2 -m32 with patched and unpatched compilers.

Uros.
On Thu, Jun 2, 2022 at 11:48 AM Uros Bizjak via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jun 2, 2022 at 9:20 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
> > [...]
>
> Shouldn't we rather do something about:
>
> (insn 2 9 3 2 (set (reg:DI 85)
>         (reg:DI 5 di [ x ])) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 3 2 4 2 (set (reg:DI 86)
>         (reg:DI 4 si [ x+8 ])) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 4 3 5 2 (set (reg:TI 84)
>         (subreg:TI (reg:DI 85) 0)) "dword-2.c":2:1 81 {*movti_internal}
>      (nil))
> (insn 5 4 6 2 (set (subreg:DI (reg:TI 84) 8)
>         (reg:DI 86)) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 6 5 7 2 (set (reg/v:TI 83 [ x ])
>         (reg:TI 84)) "dword-2.c":2:1 81 {*movti_internal}
>      (nil))
>
> The above is how the function's TImode argument is constructed.
>
> The other problem is that double-word addition gets split only after
> reload, mostly due to RA reasons.  In the past it was determined that
> RA creates better code when registers are split late (this reason
> probably does not hold anymore), but nowadays the limitation remains
> only for arithmetic and shifts.

Hmm.  Presumably the lower-subreg pass doesn't split the above after
the double-word adds are split?  Or maybe we simply do it too late.

> Attached to this message, please find the patch that performs double
> word mode arithmetic splitting before reload.  It improves generated
> code somewhat, but due to the above argument construction sequence,
> the bulk of the moves remain.  Unfortunately, when under register
> pressure (e.g. 32-bit targets), the peephole approach gets ineffective
> due to register spilling, so IMO the root of the problem should be
> fixed.
>
> Uros.
>
> > [...]
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2b1d65b..f3cf6e2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3016,6 +3016,36 @@
   [(parallel [(set (match_dup 1) (match_dup 2))
 	      (set (match_dup 2) (match_dup 1))])])
 
+;; Replace a double word swap that requires 4 mov insns with a
+;; 3 mov insn implementation (or an xchg when optimizing for size).
+(define_peephole2
+  [(set (match_operand:DWIH 0 "general_reg_operand")
+	(match_operand:DWIH 1 "general_reg_operand"))
+   (set (match_operand:DWIH 2 "general_reg_operand")
+	(match_operand:DWIH 3 "general_reg_operand"))
+   (clobber (match_operand:<DWI> 4 "general_reg_operand"))
+   (set (match_dup 3) (match_dup 0))
+   (set (match_dup 1) (match_dup 2))]
+  "REGNO (operands[0]) != REGNO (operands[3])
+   && REGNO (operands[1]) != REGNO (operands[2])
+   && REGNO (operands[1]) != REGNO (operands[3])
+   && REGNO (operands[3]) == REGNO (operands[4])
+   && peep2_reg_dead_p (4, operands[0])
+   && peep2_reg_dead_p (5, operands[2])"
+  [(parallel [(set (match_dup 1) (match_dup 3))
+	      (set (match_dup 3) (match_dup 1))])]
+{
+  if (!optimize_insn_for_size_p ())
+    {
+      rtx tmp = REGNO (operands[0]) > REGNO (operands[2]) ? operands[0]
+							   : operands[2];
+      emit_move_insn (tmp, operands[1]);
+      emit_move_insn (operands[1], operands[3]);
+      emit_move_insn (operands[3], tmp);
+      DONE;
+    }
+})
+
 (define_expand "movstrict<mode>"
   [(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
 	(match_operand:SWI12 1 "general_operand"))]
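A quick way to observe the peephole's effect (an editorial sketch
assuming the patch is applied; the author's real testcase is deferred
to part 2, and exact register choices may vary by GCC version):

/* Compile with -Os -S on x86_64: the four-mov argument shuffle
   described in the submission should collapse to a single
   xchgq %rsi, %rdi.  With -O2, expect the three-mov sequence
   instead.  */
__int128 add128 (__int128 x, __int128 y)
{
  return x + y;
}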