From patchwork Sun Sep 11 23:40:31 2022
X-Patchwork-Submitter: Roger Sayle
X-Patchwork-Id: 57542
From: "Roger Sayle"
To: "'GCC Patches'"
Cc: 'Tamar Christina'
Subject: [PATCH] PR rtl-optimization/106594: Preserve zero_extend when cheap.
Date: Mon, 12 Sep 2022 00:40:31 +0100
Message-ID: <007201d8c637$e2c3bfd0$a84b3f70$@nextmovesoftware.com>

This patch addresses PR rtl-optimization/106594, a significant performance
regression affecting aarch64 that was recently introduced (exposed) by one
of my recent RTL simplification improvements.  Firstly, many thanks to
Tamar Christina for confirming that the core of this patch provides a ~5%
performance improvement on some of his benchmarks.
GCC's combine pass uses the function expand_compound_operation to
conceptually simplify/canonicalize SIGN_EXTEND and ZERO_EXTEND as a pair
of shift operations, as not all targets support extension instructions
[technically, ZERO_EXTEND may potentially be simplified/canonicalized to
an AND operation, but the theory remains the same].

In that function, around line 7226 of combine.cc, there's an optimization
that's remarkably similar to part of my recent simplify-rtx patch posted at
https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599196.html
The comment above this code reads:

  /* Convert sign extension to zero extension, if we know that the high
     bit is not set, as this is easier to optimize.  It will be converted
     back to cheaper alternative in make_extraction.  */

which is exactly the SIGN_EXTEND to ZERO_EXTEND canonicalization that we
now perform in simplify-rtx.  The significant difference is that this code
checks the backend's RTL costs, via set_src_cost, and selects either the
SIGN_EXTEND, the ZERO_EXTEND or the pair of SHIFTs, depending on which is
cheaper.

The problem is that now that SIGN_EXTEND is converted to ZERO_EXTEND
earlier, this transform/check is no longer triggered, and as a result the
incoming ZERO_EXTEND is always converted to a pair of shifts, irrespective
of the backend's costs.  The latent bug, revealed by benchmarking, is that
we should avoid converting SIGN_EXTEND or ZERO_EXTEND into (slower) shifts
on targets where extensions are cheap, i.e. a single instruction that's
cheaper than two shift instructions, or as cheap as an AND.

This core fix (and performance improvement) is the first chunk of the
attached patch.  Unfortunately (as is often the case), as soon as you tweak
the RTL stream/canonical forms of instructions, you expose a number of
missed-optimization issues, in both the middle-end and backends, that were
expecting one pattern but now see a (cheaper) equivalent pattern...
The remaining chunks affecting expand_compound_operation prevent combine
from generating SUBREGs of RTX other than REG or MEM (considered invalid
RTL), where previously gen_lowpart would have generated a CLOBBER.

In simplify_unary_operation_1, the middle-end could create rtx for FFS,
PARITY and POPCOUNT where the operand mode didn't match the result mode
[which is no longer supported, according to the RTL documentation].

In i386.md, I needed to add variations of the define_insn_and_split
patterns for *clzsi2_lzcnt_zext, *clzsi2_lzcnt_zext_falsedep,
*bmi2_bzhi_zero_extendsidi_4, *popcountsi2_zext,
*popcountsi2_zext_falsedep and *popcounthi2 to recognize ZERO_EXTEND in
addition to the previous AND forms, and I added a variation of a
popcount-related peephole2, now that we generate one less instruction in
the input sequence that it's expecting to match.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
make -k check, both with and without --target_board=unix{-m32}, with no
new failures.  I'm happy to help fix any turbulence encountered by targets
with cheap sign/zero extension operations, but most should see a
performance improvement, hopefully to better than before the identified
performance regression.  Ok for mainline?

Fingers crossed, Uros can review the x86 backend changes, which are
potentially independent (fixing regressions caused by the middle-end
changes), but included in this post to provide better context.  TIA.

2022-09-12  Roger Sayle

gcc/ChangeLog
	PR rtl-optimization/106594
	* gcc/combine.cc (expand_compound_operation): Don't expand/transform
	ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
	cheap.  If gen_lowpart returns a SUBREG of something other than a
	REG or a MEM, i.e. invalid RTL, return the original expression.
	* gcc/simplify-rtx.cc (simplify_unary_operation_1) <FFS>: Avoid
	generating FFS with mismatched operand and result modes, by using
	an explicit SIGN_EXTEND/ZERO_EXTEND instead.
	<POPCOUNT>: Likewise, for POPCOUNT of ZERO_EXTEND.
	<PARITY>: Likewise, for PARITY of {ZERO,SIGN}_EXTEND.
	* config/i386/i386.md (*clzsi2_lzcnt_zext_2): New
	define_insn_and_split to match ZERO_EXTEND form of
	*clzsi2_lzcnt_zext.
	(*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
	ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
	(*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
	ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
	(*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
	ZERO_EXTEND form of *popcountsi2_zext.
	(*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
	ZERO_EXTEND form of *popcountsi2_zext_falsedep.
	(*popcounthi2_2): Likewise, new define_insn_and_split to match
	ZERO_EXTEND form of *popcounthi2.
	(define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
	parity flag peephole2.

Thanks in advance,
Roger

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a5fabf3..9412a9c 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7288,7 +7288,17 @@ expand_compound_operation (rtx x)
 	  && (STORE_FLAG_VALUE & ~GET_MODE_MASK (inner_mode)) == 0)
 	return SUBREG_REG (XEXP (x, 0));
+      /* If ZERO_EXTEND is cheap on this target, do nothing,
+	 i.e. don't attempt to convert it to a pair of shifts.  */
+      if (set_src_cost (x, mode, optimize_this_for_speed_p)
+	  <= COSTS_N_INSNS (1))
+	return x;
     }
+  /* Likewise, if SIGN_EXTEND is cheap, do nothing.  */
+  else if (GET_CODE (x) == SIGN_EXTEND
+	   && set_src_cost (x, mode, optimize_this_for_speed_p)
+	      <= COSTS_N_INSNS (1))
+    return x;
 
   /* If we reach here, we want to return a pair of shifts.  The inner
      shift is a left shift of BITSIZE - POS - LEN bits.  The outer
@@ -7309,7 +7319,11 @@ expand_compound_operation (rtx x)
   if (modewidth >= pos + len)
     {
       tem = gen_lowpart (mode, XEXP (x, 0));
-      if (!tem || GET_CODE (tem) == CLOBBER)
+      if (!tem
+	  || GET_CODE (tem) == CLOBBER
+	  || (GET_CODE (tem) == SUBREG
+	      && !REG_P (SUBREG_REG (tem))
+	      && !MEM_P (SUBREG_REG (tem))))
 	return x;
       tem = simplify_shift_const (NULL_RTX, ASHIFT, mode,
 				  tem, modewidth - pos - len);
@@ -7321,7 +7335,11 @@ expand_compound_operation (rtx x)
       tem = simplify_shift_const (NULL_RTX, LSHIFTRT, inner_mode,
 				  XEXP (x, 0), pos);
       tem = gen_lowpart (mode, tem);
-      if (!tem || GET_CODE (tem) == CLOBBER)
+      if (!tem
+	  || GET_CODE (tem) == CLOBBER
+	  || (GET_CODE (tem) == SUBREG
+	      && !REG_P (SUBREG_REG (tem))
+	      && !MEM_P (SUBREG_REG (tem))))
 	return x;
       tem = simplify_and_const_int (NULL_RTX, mode, tem,
 				    (HOST_WIDE_INT_1U << len) - 1);
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index fc0d6c3..698ca6e 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -1404,22 +1404,32 @@ simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode,
       break;
 
     case FFS:
-      /* (ffs (*_extend <X>)) = (ffs <X>) */
+      /* (ffs (*_extend <X>)) = (*_extend (ffs <X>)).  */
       if (GET_CODE (op) == SIGN_EXTEND
 	  || GET_CODE (op) == ZERO_EXTEND)
-	return simplify_gen_unary (FFS, mode, XEXP (op, 0),
-				   GET_MODE (XEXP (op, 0)));
+	{
+	  temp = simplify_gen_unary (FFS, GET_MODE (XEXP (op, 0)),
+				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
+	  return simplify_gen_unary (GET_CODE (op), mode, temp,
+				     GET_MODE (temp));
+	}
       break;
 
     case POPCOUNT:
       switch (GET_CODE (op))
 	{
 	case BSWAP:
-	case ZERO_EXTEND:
-	  /* (popcount (zero_extend <X>)) = (popcount <X>) */
+	  /* (popcount (bswap <X>)) = (popcount <X>).  */
 	  return simplify_gen_unary (POPCOUNT, mode, XEXP (op, 0),
 				     GET_MODE (XEXP (op, 0)));
 
+	case ZERO_EXTEND:
+	  /* (popcount (zero_extend <X>)) = (zero_extend (popcount <X>)).  */
+	  temp = simplify_gen_unary (POPCOUNT, GET_MODE (XEXP (op, 0)),
+				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
+	  return simplify_gen_unary (ZERO_EXTEND, mode, temp,
+				     GET_MODE (temp));
+
 	case ROTATE:
 	case ROTATERT:
 	  /* Rotations don't affect popcount.  */
@@ -1438,11 +1448,16 @@ simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode,
 	{
 	case NOT:
 	case BSWAP:
-	case ZERO_EXTEND:
-	case SIGN_EXTEND:
 	  return simplify_gen_unary (PARITY, mode, XEXP (op, 0),
 				     GET_MODE (XEXP (op, 0)));
 
+	case ZERO_EXTEND:
+	case SIGN_EXTEND:
+	  temp = simplify_gen_unary (PARITY, GET_MODE (XEXP (op, 0)),
+				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
+	  return simplify_gen_unary (GET_CODE (op), mode, temp,
+				     GET_MODE (temp));
+
 	case ROTATE:
 	case ROTATERT:
 	  /* Rotations don't affect parity.  */
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 1be9b66..dcf50cb 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17010,6 +17010,42 @@
    (set_attr "type" "bitmanip")
    (set_attr "mode" "SI")])
 
+(define_insn_and_split "*clzsi2_lzcnt_zext_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (clz:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_LZCNT && TARGET_64BIT"
+  "lzcnt{l}\t{%1, %k0|%k0, %1}"
+  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+    [(set (match_dup 0)
+	  (zero_extend:DI (clz:SI (match_dup 1))))
+     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+     (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
+; False dependency happens when destination is only updated by tzcnt,
+; lzcnt or popcnt.  There is no false dependency when destination is
+; also used in source.
+(define_insn "*clzsi2_lzcnt_zext_2_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (clz:SI (match_operand:SWI48 1 "nonimmediate_operand" "rm"))))
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+	   UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_LZCNT"
+  "lzcnt{l}\t{%1, %k0|%k0, %1}"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
 (define_int_iterator LT_ZCNT
	[(UNSPEC_TZCNT "TARGET_BMI")
	 (UNSPEC_LZCNT "TARGET_LZCNT")])
@@ -17328,6 +17364,22 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "DI")])
 
+(define_insn "*bmi2_bzhi_zero_extendsidi_5"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(and:DI
+	  (zero_extend:DI
+	    (plus:SI
+	      (ashift:SI (const_int 1)
+			 (match_operand:QI 2 "register_operand" "r"))
+	      (const_int -1)))
+	  (match_operand:DI 1 "nonimmediate_operand" "rm")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI2"
+  "bzhi\t{%q2, %q1, %q0|%q0, %q1, %q2}"
+  [(set_attr "type" "bitmanip")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "DI")])
+
 (define_insn "bmi2_pdep_<mode>3"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
	(unspec:SWI48 [(match_operand:SWI48 1 "register_operand" "r")
@@ -17590,6 +17642,54 @@
    (set_attr "type" "bitmanip")
    (set_attr "mode" "SI")])
 
+(define_insn_and_split "*popcountsi2_zext_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT && TARGET_64BIT"
+{
+#if TARGET_MACHO
+  return "popcnt\t{%1, %k0|%k0, %1}";
+#else
+  return "popcnt{l}\t{%1, %k0|%k0, %1}";
+#endif
+}
+  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+    [(set (match_dup 0)
+	  (zero_extend:DI (popcount:SI (match_dup 1))))
+     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+     (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
+; False dependency happens when destination is only updated by tzcnt,
+; lzcnt or popcnt.  There is no false dependency when destination is
+; also used in source.
+(define_insn "*popcountsi2_zext_2_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+	   UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT && TARGET_64BIT"
+{
+#if TARGET_MACHO
+  return "popcnt\t{%1, %k0|%k0, %1}";
+#else
+  return "popcnt{l}\t{%1, %k0|%k0, %1}";
+#endif
+}
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
 (define_insn_and_split "*popcounthi2_1"
   [(set (match_operand:SI 0 "register_operand")
	(popcount:SI
@@ -17608,6 +17708,24 @@
   DONE;
 })
 
+(define_insn_and_split "*popcounthi2_2"
+  [(set (match_operand:SI 0 "register_operand")
+	(zero_extend:SI
+	  (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  rtx tmp = gen_reg_rtx (HImode);
+
+  emit_insn (gen_popcounthi2 (tmp, operands[1]));
+  emit_insn (gen_zero_extendhisi2 (operands[0], tmp));
+  DONE;
+})
+
 (define_insn "popcounthi2"
   [(set (match_operand:HI 0 "register_operand" "=r")
	(popcount:HI
@@ -17927,6 +18045,39 @@
   PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED);
 })
 
+;; Eliminate HImode popcount&1 using parity flag (variant 2)
+(define_peephole2
+  [(match_scratch:HI 0 "Q")
+   (parallel [(set (match_operand:HI 1 "register_operand")
+		   (popcount:HI
+		     (match_operand:HI 2 "nonimmediate_operand")))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (reg:CCZ FLAGS_REG)
+	(compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
+			     (const_int 1))
+		     (const_int 0)))
+   (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
+			     [(reg:CCZ FLAGS_REG)
+			      (const_int 0)])
+			   (label_ref (match_operand 5))
+			   (pc)))]
+  "REGNO (operands[1]) == REGNO (operands[3])
+   && peep2_reg_dead_p (2, operands[1])
+   && peep2_reg_dead_p (2, operands[3])
+   && peep2_regno_dead_p (3, FLAGS_REG)"
+  [(set (match_dup 0) (match_dup 2))
+   (parallel [(set (reg:CC FLAGS_REG)
+		   (unspec:CC [(match_dup 0)] UNSPEC_PARITY))
+	      (clobber (match_dup 0))])
+   (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG)
+					    (const_int 0)])
+			   (label_ref (match_dup 5))
+			   (pc)))]
+{
+  operands[4] = shallow_copy_rtx (operands[4]);
+  PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED);
+})
+
 ;; Thread-local storage patterns for ELF.
 ;;