From patchwork Wed Sep 15 08:09:48 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45011
To: ubizjak@gmail.com
Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Subject: [PATCH 1/4] x86: Update -mtune=tremont
Date: Wed, 15 Sep 2021 16:09:48 +0800
Message-Id: <20210915080951.10362-2-lili.cui@intel.com>
In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com>
References: <20210915080951.10362-1-lili.cui@intel.com>
Reply-To: lili.cui@intel.com
From: "H.J. Lu"

Initial -mtune=tremont update

1. Use Haswell scheduling model.
2. Assume that the stack engine allows push and pop instructions to
   execute in parallel.
3. Prepare for the scheduling pass in the same way as -mtune=generic.
4. Use the same issue rate as -mtune=generic.
5. Enable sse_partial_reg_dependency.
6. Disable accumulate_outgoing_args.
7. Enable use_leave.
8. Enable push_memory.
9. Disable four_jump_limit.
10. Disable opt_agu.
11. Disable avoid_lea_for_addr.
12. Disable avoid_mem_opnd_for_cmove.
13. Enable misaligned_move_string_pro_epilogues.
14. Enable use_cltd.
15. Enable avoid_false_dep_for_bmi.
16. Enable avoid_mfence.
17. Disable expand_abs.
18. Enable sse_typeless_stores.
19. Enable sse_load0_by_pxor.
20. Disable split_mem_opnd_for_fp_converts.
21. Disable slow_pshufb.

This is the first patch to tune for Tremont.
With all patches applied, performance impacts on SPEC CPU 2017 are:

500.perlbench_r     1.81%
502.gcc_r           0.57%
505.mcf_r           1.16%
520.omnetpp_r       0.00%
523.xalancbmk_r     0.00%
525.x264_r          4.55%
531.deepsjeng_r     0.00%
541.leela_r         0.39%
548.exchange2_r     1.13%
557.xz_r            0.00%
geomean for intrate 0.95%

503.bwaves_r        0.00%
507.cactuBSSN_r     6.94%
508.namd_r         12.37%
510.parest_r        1.01%
511.povray_r        3.70%
519.lbm_r          36.61%
521.wrf_r           8.79%
526.blender_r       2.91%
527.cam4_r          6.23%
538.imagick_r       0.28%
544.nab_r          21.99%
549.fotonik3d_r     3.63%
554.roms_r         -1.20%
geomean for fprate  7.50%

gcc/ChangeLog

	* common/config/i386/i386-common.c: Use Haswell scheduling model
	for Tremont.
	* config/i386/i386.c (ix86_sched_init_global): Prepare for Tremont
	scheduling pass.
	* config/i386/x86-tune-sched.c (ix86_issue_rate): Change Tremont
	issue rate to 4.
	(ix86_adjust_cost): Handle Tremont.
	* config/i386/x86-tune.def (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY):
	Enable for Tremont.
	(X86_TUNE_USE_LEAVE): Likewise.
	(X86_TUNE_PUSH_MEMORY): Likewise.
	(X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Likewise.
	(X86_TUNE_USE_CLTD): Likewise.
	(X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Likewise.
	(X86_TUNE_AVOID_MFENCE): Likewise.
	(X86_TUNE_SSE_TYPELESS_STORES): Likewise.
	(X86_TUNE_SSE_LOAD0_BY_PXOR): Likewise.
	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Disable for Tremont.
	(X86_TUNE_FOUR_JUMP_LIMIT): Likewise.
	(X86_TUNE_OPT_AGU): Likewise.
	(X86_TUNE_AVOID_LEA_FOR_ADDR): Likewise.
	(X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE): Likewise.
	(X86_TUNE_EXPAND_ABS): Likewise.
	(X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS): Likewise.
	(X86_TUNE_SLOW_PSHUFB): Likewise.
---
 gcc/common/config/i386/i386-common.c |  2 +-
 gcc/config/i386/i386.c               |  1 +
 gcc/config/i386/x86-tune-sched.c     |  2 ++
 gcc/config/i386/x86-tune.def         | 37 ++++++++++++++--------------
 4 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/i386-common.c
index 00c65ba15ab..2c9e1ccbc6e 100644
--- a/gcc/common/config/i386/i386-common.c
+++ b/gcc/common/config/i386/i386-common.c
@@ -1935,7 +1935,7 @@ const pta processor_alias_table[] =
     M_CPU_TYPE (INTEL_GOLDMONT), P_PROC_SSE4_2},
   {"goldmont-plus", PROCESSOR_GOLDMONT_PLUS, CPU_GLM, PTA_GOLDMONT_PLUS,
     M_CPU_TYPE (INTEL_GOLDMONT_PLUS), P_PROC_SSE4_2},
-  {"tremont", PROCESSOR_TREMONT, CPU_GLM, PTA_TREMONT,
+  {"tremont", PROCESSOR_TREMONT, CPU_HASWELL, PTA_TREMONT,
     M_CPU_TYPE (INTEL_TREMONT), P_PROC_SSE4_2},
   {"knl", PROCESSOR_KNL, CPU_SLM, PTA_KNL,
     M_CPU_TYPE (INTEL_KNL), P_PROC_AVX512F},
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b173bc0beb..2927e2884c9 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16976,6 +16976,7 @@ ix86_sched_init_global (FILE *, int, int)
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       /* Do not perform multipass scheduling for pre-reload
	 schedule to save compile time.  */
diff --git a/gcc/config/i386/x86-tune-sched.c b/gcc/config/i386/x86-tune-sched.c
index 2e5ee4e4444..56ada99a450 100644
--- a/gcc/config/i386/x86-tune-sched.c
+++ b/gcc/config/i386/x86-tune-sched.c
@@ -71,6 +71,7 @@ ix86_issue_rate (void)
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       return 4;

@@ -429,6 +430,7 @@ ix86_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn, int cost,
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       /* Stack engine allows to execute push&pop instructions in parall.  */
       if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 2f221b1f8c9..385e275bbd9 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -62,7 +62,7 @@ DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
    that can be partly masked by careful scheduling of moves.  */
 DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
-	  | m_BDVER | m_ZNVER | m_GENERIC)
+	  | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)

 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
    are resolved on SSE register parts instead of whole registers, so we may
@@ -136,7 +136,7 @@ DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
 DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
-	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_ATHLON_K8)
+	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_ATHLON_K8)

 /* X86_TUNE_PROLOGUE_USING_MOVE: Do not use push/pop in prologues that are
    considered on critical path.  */
@@ -150,14 +150,15 @@ DEF_TUNE (X86_TUNE_EPILOGUE_USING_MOVE, "epilogue_using_move",

 /* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it fits.  */
 DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
-	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
+	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_TREMONT
+	  | m_GENERIC)

 /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
    Some chips, like 486 and Pentium works faster with separate load
    and push instructions.  */
 DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
	  m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
-	  | m_GENERIC)
+	  | m_TREMONT | m_GENERIC)

 /* X86_TUNE_SINGLE_PUSH: Enable if single push insn is preferred
    over esp subtraction.  */
@@ -198,8 +199,7 @@ DEF_TUNE (X86_TUNE_PAD_RETURNS, "pad_returns",
    than 4 branch instructions in the 16 byte window.  */
 DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM
-	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL | m_ATHLON_K8
-	  | m_AMDFAM10)
+	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL | m_ATHLON_K8 | m_AMDFAM10)

 /*****************************************************************************/
 /* Integer instruction selection tuning                                      */
@@ -240,11 +240,11 @@ DEF_TUNE (X86_TUNE_INTEGER_DFMODE_MOVES, "integer_dfmode_moves",
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_BONNELL | m_SILVERMONT | m_KNL
-	 | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
+	 | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL)

 /* X86_TUNE_AVOID_LEA_FOR_ADDR: Avoid lea for address computation.  */
 DEF_TUNE (X86_TUNE_AVOID_LEA_FOR_ADDR, "avoid_lea_for_addr",
-	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT
+	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS
	  | m_KNL | m_KNM)

 /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
@@ -263,7 +263,7 @@ DEF_TUNE (X86_TUNE_SLOW_IMUL_IMM8, "slow_imul_imm8",
    a conditional move.  */
 DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_KNL
-	  | m_KNM | m_TREMONT | m_INTEL)
+	  | m_KNM | m_INTEL)

 /* X86_TUNE_SINGLE_STRINGOP: Enable use of single string operations, such
    as MOVS and STOS (without a REP prefix) to move/set sequences of bytes.  */
@@ -282,7 +282,8 @@ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
    FIXME: This may actualy be a win on more targets than listed here.  */
 DEF_TUNE (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES,
	  "misaligned_move_string_pro_epilogues",
-	  m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_GENERIC)
+	  m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_TREMONT
+	  | m_GENERIC)

 /* X86_TUNE_USE_SAHF: Controls use of SAHF.  */
 DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
@@ -294,7 +295,7 @@ DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
 /* X86_TUNE_USE_CLTD: Controls use of CLTD and CTQO instructions.  */
 DEF_TUNE (X86_TUNE_USE_CLTD, "use_cltd",
	  ~(m_PENT | m_LAKEMONT | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
-	    | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT))
+	    | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS))

 /* X86_TUNE_USE_BT: Enable use of BT (bit test) instructions.  */
 DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
@@ -305,7 +306,7 @@ DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
 /* X86_TUNE_AVOID_FALSE_DEP_FOR_BMI: Avoid false dependency for
    bit-manipulation instructions.  */
 DEF_TUNE (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI, "avoid_false_dep_for_bmi",
-	  m_SANDYBRIDGE | m_CORE_AVX2 | m_GENERIC)
+	  m_SANDYBRIDGE | m_CORE_AVX2 | m_TREMONT | m_GENERIC)

 /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
    on hardware capabilities. Bdver3 hardware has a loop buffer which makes
@@ -321,14 +322,14 @@ DEF_TUNE (X86_TUNE_ONE_IF_CONV_INSN, "one_if_conv_insn",
 /* X86_TUNE_AVOID_MFENCE: Use lock prefixed instructions instead of mfence.  */
 DEF_TUNE (X86_TUNE_AVOID_MFENCE, "avoid_mfence",
-	 m_CORE_ALL | m_BDVER | m_ZNVER | m_GENERIC)
+	 m_CORE_ALL | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)

 /* X86_TUNE_EXPAND_ABS: This enables a new abs pattern by
    generating instructions for abs (x) = (((signed) x >> (W-1) ^ x) -
    (signed) x >> (W-1)) instead of cmove or SSE max/abs instructions.  */
 DEF_TUNE (X86_TUNE_EXPAND_ABS, "expand_abs",
	  m_CORE_ALL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
-	  | m_GOLDMONT_PLUS | m_TREMONT )
+	  | m_GOLDMONT_PLUS)

 /*****************************************************************************/
 /* 387 instruction selection tuning                                          */
@@ -386,13 +387,13 @@ DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",

 /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.  */
 DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
-	  m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
+	  m_AMD_MULTIPLE | m_CORE_ALL | m_TREMONT | m_GENERIC)

 /* X86_TUNE_SSE_LOAD0_BY_PXOR: Always use pxor to load0 as opposed to
    xorps/xorpd and other variants.  */
 DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BDVER | m_BTVER | m_ZNVER
-	  | m_GENERIC)
+	  | m_TREMONT | m_GENERIC)

 /* X86_TUNE_INTER_UNIT_MOVES_TO_VEC: Enable moves in from integer
    to SSE registers.  If disabled, the moves will be done by storing
@@ -419,7 +420,7 @@ DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSIONS, "inter_unit_conversions",
    fp converts to destination register.  */
 DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
	  m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS
-	  | m_TREMONT | m_INTEL)
+	  | m_INTEL)

 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP.  This form of instructions avoids partial write to the
@@ -434,7 +435,7 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)

 /* X86_TUNE_SLOW_SHUFB: Indicates tunings with slow pshufb instruction.  */
 DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
	  m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
-	  | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
+	  | m_GOLDMONT_PLUS | m_INTEL)

 /* X86_TUNE_AVOID_4BYTE_PREFIXES: Avoid instructions requiring 4+ bytes of prefixes.  */
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",

From patchwork Wed Sep 15 08:09:49 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45013
To: ubizjak@gmail.com
Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Subject: [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont
Date: Wed, 15 Sep 2021 16:09:49 +0800
Message-Id: <20210915080951.10362-3-lili.cui@intel.com>
In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com>
References: <20210915080951.10362-1-lili.cui@intel.com>
Reply-To: lili.cui@intel.com
From: "H.J. Lu"

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=tremont:

1. Create Tremont cost model from generic cost model.
2. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
3. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
4. Use memcpy/memset library function if data size is unknown or > 256.

gcc/ChangeLog

	* config/i386/i386-options.c (processor_cost_table): Use
	tremont_cost for Tremont.
	* config/i386/x86-tune-costs.h (tremont_memcpy): New.
	(tremont_memset): Likewise.
	(tremont_cost): Likewise.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Enable for Tremont.
---
 gcc/config/i386/i386-options.c   |   2 +-
 gcc/config/i386/x86-tune-costs.h | 124 +++++++++++++++++++++++++++++++
 gcc/config/i386/x86-tune.def     |   2 +-
 3 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index c0006b3674b..e7a3bd4aaea 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -724,7 +724,7 @@ static const struct processor_costs *processor_cost_table[] =
   &slm_cost,
   &slm_cost,
   &slm_cost,
-  &slm_cost,
+  &tremont_cost,
   &slm_cost,
   &slm_cost,
   &skylake_cost,
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..93644be9cb3 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2734,6 +2734,130 @@ struct processor_costs slm_cost = {
   "16",				/* Func alignment.  */
 };

+static stringop_algs tremont_memcpy[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+static stringop_algs tremont_memset[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+static const
+struct processor_costs tremont_cost = {
+  {
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
+  6,				/* cost for loading QImode using movzbl */
+  {6, 6, 6},			/* cost of loading integer registers
+				   in QImode, HImode and SImode.
+				   Relative to reg-reg move (2).  */
+  {6, 6, 6},			/* cost of storing integer registers */
+  4,				/* cost of reg,reg fld/fst */
+  {6, 6, 12},			/* cost of loading fp registers
+				   in SFmode, DFmode and XFmode */
+  {6, 6, 12},			/* cost of storing fp registers
+				   in SFmode, DFmode and XFmode */
+  2,				/* cost of moving MMX register */
+  {6, 6},			/* cost of loading MMX registers
+				   in SImode and DImode */
+  {6, 6},			/* cost of storing MMX registers
+				   in SImode and DImode */
+  2, 3, 4,			/* cost of moving XMM,YMM,ZMM register */
+  {6, 6, 6, 10, 15},		/* cost of loading SSE registers
+				   in 32,64,128,256 and 512-bit */
+  {6, 6, 6, 10, 15},		/* cost of storing SSE registers
+				   in 32,64,128,256 and 512-bit */
+  6, 6,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* mask->integer and integer->mask moves */
+  {6, 6, 6},			/* cost of loading mask register
+				   in QImode, HImode, SImode.  */
+  {6, 6, 6},			/* cost if storing mask register
+				   in QImode, HImode, SImode.  */
+  2,				/* cost of moving mask register.  */
+  /* End of register allocator costs.  */
+  },
+
+  COSTS_N_INSNS (1),		/* cost of an add instruction */
+  /* Setting cost to 2 makes our current implementation of synth_mult result in
+     use of unnecessary temporary registers causing regression on several
+     SPECfp benchmarks.  */
+  COSTS_N_INSNS (1) + 1,	/* cost of a lea instruction */
+  COSTS_N_INSNS (1),		/* variable shift costs */
+  COSTS_N_INSNS (1),		/* constant shift costs */
+  {COSTS_N_INSNS (3),		/* cost of starting multiply for QI */
+   COSTS_N_INSNS (4),		/*				 HI */
+   COSTS_N_INSNS (3),		/*				 SI */
+   COSTS_N_INSNS (4),		/*				 DI */
+   COSTS_N_INSNS (4)},		/*			      other */
+  0,				/* cost of multiply per each bit set */
+  {COSTS_N_INSNS (16),		/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (22),		/*			    HI */
+   COSTS_N_INSNS (30),		/*			    SI */
+   COSTS_N_INSNS (74),		/*			    DI */
+   COSTS_N_INSNS (74)},		/*			    other */
+  COSTS_N_INSNS (1),		/* cost of movsx */
+  COSTS_N_INSNS (1),		/* cost of movzx */
+  8,				/* "large" insn */
+  17,				/* MOVE_RATIO */
+  17,				/* CLEAR_RATIO */
+  {6, 6, 6},			/* cost of loading integer registers
+				   in QImode, HImode and SImode.
+				   Relative to reg-reg move (2).  */
+  {6, 6, 6},			/* cost of storing integer registers */
+  {6, 6, 6, 10, 15},		/* cost of loading SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},		/* cost of storing SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},		/* cost of unaligned loads.  */
+  {6, 6, 6, 10, 15},		/* cost of unaligned storess.  */
+  2, 3, 4,			/* cost of moving XMM,YMM,ZMM register */
+  6,				/* cost of moving SSE register to integer.  */
+  18, 6,			/* Gather load static, per_elt.  */
+  18, 6,			/* Gather store static, per_elt.  */
+  32,				/* size of l1 cache.  */
+  512,				/* size of l2 cache.  */
+  64,				/* size of prefetch block */
+  6,				/* number of parallel prefetches */
+  /* Benchmarks shows large regressions on K8 sixtrack benchmark when this
+     value is increased to perhaps more appropriate value of 5.  */
+  3,				/* Branch cost */
+  COSTS_N_INSNS (3),		/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (5),		/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (17),		/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (1),		/* cost of FABS instruction.  */
+  COSTS_N_INSNS (1),		/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (14),		/* cost of FSQRT instruction.  */
+
+  COSTS_N_INSNS (1),		/* cost of cheap SSE instruction.  */
+  COSTS_N_INSNS (3),		/* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_INSNS (4),		/* cost of MULSS instruction.  */
+  COSTS_N_INSNS (5),		/* cost of MULSD instruction.  */
+  COSTS_N_INSNS (5),		/* cost of FMA SS instruction.  */
+  COSTS_N_INSNS (5),		/* cost of FMA SD instruction.  */
+  COSTS_N_INSNS (13),		/* cost of DIVSS instruction.  */
+  COSTS_N_INSNS (17),		/* cost of DIVSD instruction.  */
+  COSTS_N_INSNS (14),		/* cost of SQRTSS instruction.  */
+  COSTS_N_INSNS (18),		/* cost of SQRTSD instruction.  */
+  1, 4, 3, 3,			/* reassoc int, fp, vec_int, vec_fp.  */
+  tremont_memcpy,
+  tremont_memset,
+  COSTS_N_INSNS (4),		/* cond_taken_branch_cost.  */
+  COSTS_N_INSNS (2),		/* cond_not_taken_branch_cost.  */
+  "16:11:8",			/* Loop alignment.  */
+  "16:11:8",			/* Jump alignment.  */
+  "0:0:8",			/* Label alignment.  */
+  "16",				/* Func alignment.  */
+};
+
 static stringop_algs intel_memcpy[2] = {
   {libcall, {{11, loop, false}, {-1, rep_prefix_4_byte, false}}},
   {libcall, {{32, loop, false}, {64, rep_prefix_4_byte, false},
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 385e275bbd9..088edb6c4ca 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
	  "prefer_known_rep_movsb_stosb",
-	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+	  m_SKYLAKE | m_ALDERLAKE | m_TREMONT | m_CORE_AVX512)

 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.
    This

From patchwork Wed Sep 15 08:09:50 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45014
To: ubizjak@gmail.com
Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Subject: [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
Date: Wed, 15 Sep 2021 16:09:50 +0800
Message-Id: <20210915080951.10362-4-lili.cui@intel.com>
In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com>
References: <20210915080951.10362-1-lili.cui@intel.com>
Reply-To: lili.cui@intel.com
From: "H.J. Lu"

Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
handling the avx_partial_xmm_update attribute.  Don't convert an AVX
partial XMM register update if a vector packed SSE conversion should be
used.

gcc/

	PR target/101900
	* config/i386/i386-features.c (remove_partial_avx_dependency):
	Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
	before generating vxorps.

gcc/

	PR target/101900
	* testsuite/gcc.target/i386/pr101900-1.c: New test.
	* testsuite/gcc.target/i386/pr101900-2.c: Likewise.
	* testsuite/gcc.target/i386/pr101900-3.c: Likewise.
---
 gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
 gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
 4 files changed, 73 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 5a99ea7c046..ae5ea02a002 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
	      != AVX_PARTIAL_XMM_UPDATE_TRUE)
	    continue;

-	  if (!v4sf_const0)
-	    v4sf_const0 = gen_reg_rtx (V4SFmode);
-
	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
	     vec_merge with subreg.  */
	  rtx src = SET_SRC (set);
	  rtx dest = SET_DEST (set);
	  machine_mode dest_mode = GET_MODE (dest);
+	  machine_mode src_mode;
+
+	  if (TARGET_USE_VECTOR_FP_CONVERTS)
+	    {
+	      src_mode = GET_MODE (XEXP (src, 0));
+	      if (src_mode == E_SFmode || src_mode == E_DFmode)
+		continue;
+	    }
+
+	  if (TARGET_USE_VECTOR_CONVERTS)
+	    {
+	      src_mode = GET_MODE (XEXP (src, 0));
+	      if (src_mode == E_SImode || src_mode == E_DImode)
+		continue;
+	    }
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);

	  rtx zero;
	  machine_mode dest_vecmode;
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
new file mode 100644
index 00000000000..0a45f8e340a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c b/gcc/testsuite/gcc.target/i386/pr101900-2.c
new file mode 100644
index 00000000000..c8b2d1da5ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c b/gcc/testsuite/gcc.target/i386/pr101900-3.c
new file mode 100644
index 00000000000..6ee565b5bd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */

From patchwork Wed Sep 15 08:09:51 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45012
with ESMTP id 3F5093858018 for ; Wed, 15 Sep 2021 08:12:31 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3F5093858018 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1631693551; bh=LkCcVF39jOTKSFcGcvr4JJDvR7Y+oJcqDf1iFp9CVHg=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=gF43nglw/zE3EBAM4fUcS0s0q2uOqFPTxJUYPN9KvpYy80dKXg/ZpEt3wofESxXl0 anruwF6IdXAmfLF8PT8pc7l/6GLIaJYeuceZt6Scm9rb6bCGH5vvvaNcpkIngGy07m E+xjAwS6NVN2/0w8gtrYfDn0QVGnLNFaFib1UUNg= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by sourceware.org (Postfix) with ESMTPS id 883373857C6B for ; Wed, 15 Sep 2021 08:10:01 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 883373857C6B X-IronPort-AV: E=McAfee;i="6200,9189,10107"; a="209345048" X-IronPort-AV: E=Sophos;i="5.85,294,1624345200"; d="scan'208";a="209345048" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2021 01:10:00 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.85,294,1624345200"; d="scan'208";a="452445322" Received: from scymds02.sc.intel.com ([10.82.73.244]) by orsmga002.jf.intel.com with ESMTP; 15 Sep 2021 01:10:00 -0700 Received: from shgcc10.sh.intel.com (shgcc10.sh.intel.com [10.239.154.125]) by scymds02.sc.intel.com with ESMTP id 18F89p6m023358; Wed, 15 Sep 2021 01:09:58 -0700 To: ubizjak@gmail.com Subject: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY Date: Wed, 15 Sep 2021 16:09:51 +0800 Message-Id: <20210915080951.10362-5-lili.cui@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com> References: <20210915080951.10362-1-lili.cui@intel.com> X-Spam-Status: No, score=-15.7 required=5.0 tests=BAYES_00, GIT_PATCH_0, 
KAM_DMARC_NONE, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: "lili.cui--- via Gcc-patches" From: "Li, Pan2 via Gcc-patches" Reply-To: lili.cui@intel.com Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" From: "H.J. Lu" 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters. 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters. 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and TARGET_SSE_PARTIAL_REG_DEPENDENCY when handling avx_partial_xmm_update attribute. Don't convert AVX partial XMM register update if there is no partial SSE register dependency for SSE conversion. gcc/ * config/i386/i386-features.c (remove_partial_avx_dependency): Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and and TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating vxorps. * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New. (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise. * config/i386/i386.md (SSE FP to FP splitters): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY. (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY. * config/i386/x86-tune.def (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New. (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise. gcc/testsuite/ * gcc.target/i386/avx-covert-1.c: New file. 
* gcc.target/i386/avx-fp-covert-1.c: Likewise. * gcc.target/i386/avx-int-covert-1.c: Likewise. * gcc.target/i386/sse-covert-1.c: Likewise. * gcc.target/i386/sse-fp-covert-1.c: Likewise. * gcc.target/i386/sse-int-covert-1.c: Likewise. --- gcc/config/i386/i386-features.c | 6 ++++-- gcc/config/i386/i386.h | 4 ++++ gcc/config/i386/i386.md | 9 ++++++--- gcc/config/i386/x86-tune.def | 15 +++++++++++++++ gcc/testsuite/gcc.target/i386/avx-covert-1.c | 19 +++++++++++++++++++ .../gcc.target/i386/avx-fp-covert-1.c | 15 +++++++++++++++ .../gcc.target/i386/avx-int-covert-1.c | 14 ++++++++++++++ gcc/testsuite/gcc.target/i386/sse-covert-1.c | 19 +++++++++++++++++++ .../gcc.target/i386/sse-fp-covert-1.c | 15 +++++++++++++++ .../gcc.target/i386/sse-int-covert-1.c | 14 ++++++++++++++ 10 files changed, 125 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index ae5ea02a002..91bfa06d4bf 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void) machine_mode dest_mode = GET_MODE (dest); machine_mode src_mode; - if (TARGET_USE_VECTOR_FP_CONVERTS) + if (TARGET_USE_VECTOR_FP_CONVERTS + || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY) { src_mode = GET_MODE (XEXP (src, 0)); if (src_mode == E_SFmode || src_mode == E_DFmode) continue; } - if (TARGET_USE_VECTOR_CONVERTS) + if (TARGET_USE_VECTOR_CONVERTS + || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY) { src_mode = GET_MODE (XEXP (src, 0)); if (src_mode == E_SImode || src_mode == E_DImode) diff --git 
a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index e76bb55c080..ec60b89753e 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY] #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \ ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY] +#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \ + ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY] +#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \ + ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY] #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \ ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL] #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \ diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 13f6f57cdcc..c82a9dc1f67 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4535,7 +4535,8 @@ (float_extend:DF (match_operand:SF 1 "nonimmediate_operand")))] "!TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY + && epilogue_completed && optimize_function_for_speed_p (cfun) && (!REG_P (operands[1]) || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1]))) @@ -4708,7 +4709,8 @@ (float_truncate:SF (match_operand:DF 1 "nonimmediate_operand")))] "!TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY + && epilogue_completed && optimize_function_for_speed_p (cfun) && (!REG_P (operands[1]) || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1]))) @@ -5243,7 +5245,8 @@ [(set (match_operand:MODEF 0 "sse_reg_operand") (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))] "!TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY + && epilogue_completed && optimize_function_for_speed_p (cfun) && 
(!EXT_REX_SSE_REG_P (operands[0]) || TARGET_AVX512VL)" diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index 088edb6c4ca..58e8ead56b4 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency", m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10 | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC) +/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids + partial write to the destination in scalar SSE conversion from FP + to FP. */ +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY, + "sse_partial_reg_fp_converts_dependency", + m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10 + | m_BDVER | m_ZNVER | m_GENERIC) + +/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial + write to the destination in scalar SSE conversion from integer to FP. */ +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY, + "sse_partial_reg_converts_dependency", + m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10 + | m_BDVER | m_ZNVER | m_GENERIC) + /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies are resolved on SSE register parts instead of whole registers, so we may maintain just lower part of scalar values in proper format leaving the diff --git a/gcc/testsuite/gcc.target/i386/avx-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-covert-1.c new file mode 100644 index 00000000000..b6c794ecbb8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx-covert-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern double d; +extern int i; + +void +foo (void) +{ + d = f; + f = i; +} + +/* { dg-final { scan-assembler "vcvtss2sd" } } */ +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */ +/* { dg-final { 
scan-assembler-not "vcvtps2pd" } } */ +/* { dg-final { scan-assembler-not "vcvtdq2ps" } } */ +/* { dg-final { scan-assembler-not "vxorps" } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c new file mode 100644 index 00000000000..c40c48b1b2d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */ + +extern float f; +extern double d; + +void +foo (void) +{ + d = f; +} + +/* { dg-final { scan-assembler "vcvtss2sd" } } */ +/* { dg-final { scan-assembler-not "vcvtps2pd" } } */ +/* { dg-final { scan-assembler-not "vxorps" } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c new file mode 100644 index 00000000000..01bb64e66cc --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern int i; + +void +foo (void) +{ + f = i; +} + +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */ +/* { dg-final { scan-assembler-not "vxorps" } } */ diff --git a/gcc/testsuite/gcc.target/i386/sse-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-covert-1.c new file mode 100644 index 00000000000..c30af694505 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse-covert-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern double d; +extern int i; + +void +foo (void) +{ + d = f; + f = i; +} + +/* { dg-final { scan-assembler "cvtss2sd" } } */ +/* { dg-final { scan-assembler "cvtsi2ssl" } } */ +/* { dg-final { scan-assembler-not "cvtps2pd" } } */ +/* { dg-final { 
scan-assembler-not "cvtdq2ps" } } */ +/* { dg-final { scan-assembler-not "pxor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c new file mode 100644 index 00000000000..b6567e60e3e --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */ + +extern float f; +extern double d; + +void +foo (void) +{ + d = f; +} + +/* { dg-final { scan-assembler "cvtss2sd" } } */ +/* { dg-final { scan-assembler-not "cvtps2pd" } } */ +/* { dg-final { scan-assembler-not "pxor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c new file mode 100644 index 00000000000..107f7241def --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern int i; + +void +foo (void) +{ + f = i; +} + +/* { dg-final { scan-assembler "cvtsi2ssl" } } */ +/* { dg-final { scan-assembler-not "pxor" } } */