From patchwork Wed Sep 15 08:09:48 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45011
To: ubizjak@gmail.com
Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Subject: [PATCH 1/4] x86: Update -mtune=tremont
Date: Wed, 15 Sep 2021 16:09:48 +0800
Message-Id: <20210915080951.10362-2-lili.cui@intel.com>
In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com>
References: <20210915080951.10362-1-lili.cui@intel.com>
Reply-To: lili.cui@intel.com
From: "H.J. Lu"

Initial -mtune=tremont update

1. Use Haswell scheduling model.
2. Assume that the stack engine allows push and pop instructions to
   execute in parallel.
3. Prepare for the scheduling pass in the same way as -mtune=generic.
4. Use the same issue rate as -mtune=generic.
5. Enable sse_partial_reg_dependency.
6. Disable accumulate_outgoing_args.
7. Enable use_leave.
8. Enable push_memory.
9. Disable four_jump_limit.
10. Disable opt_agu.
11. Disable avoid_lea_for_addr.
12. Disable avoid_mem_opnd_for_cmove.
13. Enable misaligned_move_string_pro_epilogues.
14. Enable use_cltd.
15. Enable avoid_false_dep_for_bmi.
16. Enable avoid_mfence.
17. Disable expand_abs.
18. Enable sse_typeless_stores.
19. Enable sse_load0_by_pxor.
20. Disable split_mem_opnd_for_fp_converts.
21. Disable slow_pshufb.

This is the first patch to tune for Tremont.
With all patches applied, performance impacts on SPEC CPU 2017 are:

500.perlbench_r     1.81%
502.gcc_r           0.57%
505.mcf_r           1.16%
520.omnetpp_r       0.00%
523.xalancbmk_r     0.00%
525.x264_r          4.55%
531.deepsjeng_r     0.00%
541.leela_r         0.39%
548.exchange2_r     1.13%
557.xz_r            0.00%
geomean for intrate 0.95%

503.bwaves_r        0.00%
507.cactuBSSN_r     6.94%
508.namd_r         12.37%
510.parest_r        1.01%
511.povray_r        3.70%
519.lbm_r          36.61%
521.wrf_r           8.79%
526.blender_r       2.91%
527.cam4_r          6.23%
538.imagick_r       0.28%
544.nab_r          21.99%
549.fotonik3d_r     3.63%
554.roms_r         -1.20%
geomean for fprate  7.50%

gcc/ChangeLog

	* common/config/i386/i386-common.c: Use Haswell scheduling model
	for Tremont.
	* config/i386/i386.c (ix86_sched_init_global): Prepare for Tremont
	scheduling pass.
	* config/i386/x86-tune-sched.c (ix86_issue_rate): Change Tremont
	issue rate to 4.
	(ix86_adjust_cost): Handle Tremont.
	* config/i386/x86-tune.def (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY):
	Enable for Tremont.
	(X86_TUNE_USE_LEAVE): Likewise.
	(X86_TUNE_PUSH_MEMORY): Likewise.
	(X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Likewise.
	(X86_TUNE_USE_CLTD): Likewise.
	(X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Likewise.
	(X86_TUNE_AVOID_MFENCE): Likewise.
	(X86_TUNE_SSE_TYPELESS_STORES): Likewise.
	(X86_TUNE_SSE_LOAD0_BY_PXOR): Likewise.
	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Disable for Tremont.
	(X86_TUNE_FOUR_JUMP_LIMIT): Likewise.
	(X86_TUNE_OPT_AGU): Likewise.
	(X86_TUNE_AVOID_LEA_FOR_ADDR): Likewise.
	(X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE): Likewise.
	(X86_TUNE_EXPAND_ABS): Likewise.
	(X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS): Likewise.
	(X86_TUNE_SLOW_PSHUFB): Likewise.
---
 gcc/common/config/i386/i386-common.c |  2 +-
 gcc/config/i386/i386.c               |  1 +
 gcc/config/i386/x86-tune-sched.c     |  2 ++
 gcc/config/i386/x86-tune.def         | 37 ++++++++++++++--------------
 4 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/i386-common.c
index 00c65ba15ab..2c9e1ccbc6e 100644
--- a/gcc/common/config/i386/i386-common.c
+++ b/gcc/common/config/i386/i386-common.c
@@ -1935,7 +1935,7 @@ const pta processor_alias_table[] =
     M_CPU_TYPE (INTEL_GOLDMONT), P_PROC_SSE4_2},
   {"goldmont-plus", PROCESSOR_GOLDMONT_PLUS, CPU_GLM, PTA_GOLDMONT_PLUS,
     M_CPU_TYPE (INTEL_GOLDMONT_PLUS), P_PROC_SSE4_2},
-  {"tremont", PROCESSOR_TREMONT, CPU_GLM, PTA_TREMONT,
+  {"tremont", PROCESSOR_TREMONT, CPU_HASWELL, PTA_TREMONT,
     M_CPU_TYPE (INTEL_TREMONT), P_PROC_SSE4_2},
   {"knl", PROCESSOR_KNL, CPU_SLM, PTA_KNL,
     M_CPU_TYPE (INTEL_KNL), P_PROC_AVX512F},
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b173bc0beb..2927e2884c9 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16976,6 +16976,7 @@ ix86_sched_init_global (FILE *, int, int)
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       /* Do not perform multipass scheduling for pre-reload
	 schedule to save compile time.  */
diff --git a/gcc/config/i386/x86-tune-sched.c b/gcc/config/i386/x86-tune-sched.c
index 2e5ee4e4444..56ada99a450 100644
--- a/gcc/config/i386/x86-tune-sched.c
+++ b/gcc/config/i386/x86-tune-sched.c
@@ -71,6 +71,7 @@ ix86_issue_rate (void)
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       return 4;

@@ -429,6 +430,7 @@ ix86_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn, int cost,
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       /* Stack engine allows to execute push&pop instructions in parall.  */
       if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 2f221b1f8c9..385e275bbd9 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -62,7 +62,7 @@ DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
    that can be partly masked by careful scheduling of moves.  */
 DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
-	  | m_BDVER | m_ZNVER | m_GENERIC)
+	  | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)

 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
    are resolved on SSE register parts instead of whole registers, so we may
@@ -136,7 +136,7 @@ DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
 DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
-	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_ATHLON_K8)
+	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_ATHLON_K8)

 /* X86_TUNE_PROLOGUE_USING_MOVE: Do not use push/pop in prologues that are
    considered on critical path.  */
@@ -150,14 +150,15 @@ DEF_TUNE (X86_TUNE_EPILOGUE_USING_MOVE, "epilogue_using_move",

 /* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it fits.  */
 DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
-	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
+	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_TREMONT
+	  | m_GENERIC)

 /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
    Some chips, like 486 and Pentium works faster with separate load
    and push instructions.  */
 DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
	  m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
-	  | m_GENERIC)
+	  | m_TREMONT | m_GENERIC)

 /* X86_TUNE_SINGLE_PUSH: Enable if single push insn is preferred
    over esp subtraction.  */
@@ -198,8 +199,7 @@ DEF_TUNE (X86_TUNE_PAD_RETURNS, "pad_returns",
    than 4 branch instructions in the 16 byte window.  */
 DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM
-	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL | m_ATHLON_K8
-	  | m_AMDFAM10)
+	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL | m_ATHLON_K8 | m_AMDFAM10)

 /*****************************************************************************/
 /* Integer instruction selection tuning                                      */
@@ -240,11 +240,11 @@ DEF_TUNE (X86_TUNE_INTEGER_DFMODE_MOVES, "integer_dfmode_moves",
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_BONNELL | m_SILVERMONT | m_KNL
-	 | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
+	 | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL)

 /* X86_TUNE_AVOID_LEA_FOR_ADDR: Avoid lea for address computation.  */
 DEF_TUNE (X86_TUNE_AVOID_LEA_FOR_ADDR, "avoid_lea_for_addr",
-	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT
+	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS
	  | m_KNL | m_KNM)

 /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
@@ -263,7 +263,7 @@ DEF_TUNE (X86_TUNE_SLOW_IMUL_IMM8, "slow_imul_imm8",
    a conditional move.  */
 DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_KNL
-	  | m_KNM | m_TREMONT | m_INTEL)
+	  | m_KNM | m_INTEL)

 /* X86_TUNE_SINGLE_STRINGOP: Enable use of single string operations, such
    as MOVS and STOS (without a REP prefix) to move/set sequences of bytes.  */
@@ -282,7 +282,8 @@ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
    FIXME: This may actualy be a win on more targets than listed here.  */
 DEF_TUNE (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES,
	  "misaligned_move_string_pro_epilogues",
-	  m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_GENERIC)
+	  m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_TREMONT
+	  | m_GENERIC)

 /* X86_TUNE_USE_SAHF: Controls use of SAHF.  */
 DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
@@ -294,7 +295,7 @@ DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
 /* X86_TUNE_USE_CLTD: Controls use of CLTD and CTQO instructions.  */
 DEF_TUNE (X86_TUNE_USE_CLTD, "use_cltd",
	  ~(m_PENT | m_LAKEMONT | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
-	    | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT))
+	    | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS))

 /* X86_TUNE_USE_BT: Enable use of BT (bit test) instructions.  */
 DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
@@ -305,7 +306,7 @@ DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
 /* X86_TUNE_AVOID_FALSE_DEP_FOR_BMI: Avoid false dependency for
    bit-manipulation instructions.  */
 DEF_TUNE (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI, "avoid_false_dep_for_bmi",
-	  m_SANDYBRIDGE | m_CORE_AVX2 | m_GENERIC)
+	  m_SANDYBRIDGE | m_CORE_AVX2 | m_TREMONT | m_GENERIC)

 /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
    on hardware capabilities. Bdver3 hardware has a loop buffer which makes
@@ -321,14 +322,14 @@ DEF_TUNE (X86_TUNE_ONE_IF_CONV_INSN, "one_if_conv_insn",
 /* X86_TUNE_AVOID_MFENCE: Use lock prefixed instructions instead of mfence.  */
 DEF_TUNE (X86_TUNE_AVOID_MFENCE, "avoid_mfence",
-	 m_CORE_ALL | m_BDVER | m_ZNVER | m_GENERIC)
+	 m_CORE_ALL | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)

 /* X86_TUNE_EXPAND_ABS: This enables a new abs pattern by
    generating instructions for abs (x) = (((signed) x >> (W-1) ^ x) -
    (signed) x >> (W-1)) instead of cmove or SSE max/abs instructions.  */
 DEF_TUNE (X86_TUNE_EXPAND_ABS, "expand_abs",
	  m_CORE_ALL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
-	  | m_GOLDMONT_PLUS | m_TREMONT )
+	  | m_GOLDMONT_PLUS)

 /*****************************************************************************/
 /* 387 instruction selection tuning                                          */
@@ -386,13 +387,13 @@ DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",

 /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.  */
 DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
-	  m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
+	  m_AMD_MULTIPLE | m_CORE_ALL | m_TREMONT | m_GENERIC)

 /* X86_TUNE_SSE_LOAD0_BY_PXOR: Always use pxor to load0 as opposed to
    xorps/xorpd and other variants.  */
 DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BDVER | m_BTVER | m_ZNVER
-	  | m_GENERIC)
+	  | m_TREMONT | m_GENERIC)

 /* X86_TUNE_INTER_UNIT_MOVES_TO_VEC: Enable moves in from integer
    to SSE registers.  If disabled, the moves will be done by storing
@@ -419,7 +420,7 @@ DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSIONS, "inter_unit_conversions",
    fp converts to destination register.  */
 DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
	  m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS
-	  | m_TREMONT | m_INTEL)
+	  | m_INTEL)

 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP.  This form of instructions avoids partial write to the
@@ -434,7 +435,7 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)

 /* X86_TUNE_SLOW_SHUFB: Indicates tunings with slow pshufb instruction.  */
 DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
	  m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
-	  | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
+	  | m_GOLDMONT_PLUS | m_INTEL)

 /* X86_TUNE_AVOID_4BYTE_PREFIXES: Avoid instructions requiring 4+ bytes of prefixes.  */
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",

From patchwork Wed Sep 15 08:09:49 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45013
To: ubizjak@gmail.com
Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Subject: [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont
Date: Wed, 15 Sep 2021 16:09:49 +0800
Message-Id: <20210915080951.10362-3-lili.cui@intel.com>
In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com>
References: <20210915080951.10362-1-lili.cui@intel.com>
Reply-To: lili.cui@intel.com
From: "H.J. Lu"

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=tremont:

1. Create Tremont cost model from generic cost model.
2. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
3. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
4. Use memcpy/memset library function if data size is unknown or > 256.

gcc/ChangeLog

	* config/i386/i386-options.c (processor_cost_table): Use
	tremont_cost for Tremont.
	* config/i386/x86-tune-costs.h (tremont_memcpy): New.
	(tremont_memset): Likewise.
	(tremont_cost): Likewise.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Enable for Tremont.
---
 gcc/config/i386/i386-options.c   |   2 +-
 gcc/config/i386/x86-tune-costs.h | 124 +++++++++++++++++++++++++++++++
 gcc/config/i386/x86-tune.def     |   2 +-
 3 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index c0006b3674b..e7a3bd4aaea 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -724,7 +724,7 @@ static const struct processor_costs *processor_cost_table[] =
   &slm_cost,
   &slm_cost,
   &slm_cost,
-  &slm_cost,
+  &tremont_cost,
   &slm_cost,
   &slm_cost,
   &skylake_cost,
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..93644be9cb3 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2734,6 +2734,130 @@ struct processor_costs slm_cost = {
   "16",				/* Func alignment.  */
 };

+static stringop_algs tremont_memcpy[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+static stringop_algs tremont_memset[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+static const
+struct processor_costs tremont_cost = {
+  {
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
+  6,				/* cost for loading QImode using movzbl */
+  {6, 6, 6},			/* cost of loading integer registers
+				   in QImode, HImode and SImode.
+				   Relative to reg-reg move (2).  */
+  {6, 6, 6},			/* cost of storing integer registers */
+  4,				/* cost of reg,reg fld/fst */
+  {6, 6, 12},			/* cost of loading fp registers
+				   in SFmode, DFmode and XFmode */
+  {6, 6, 12},			/* cost of storing fp registers
+				   in SFmode, DFmode and XFmode */
+  2,				/* cost of moving MMX register */
+  {6, 6},			/* cost of loading MMX registers
+				   in SImode and DImode */
+  {6, 6},			/* cost of storing MMX registers
+				   in SImode and DImode */
+  2, 3, 4,			/* cost of moving XMM,YMM,ZMM register */
+  {6, 6, 6, 10, 15},		/* cost of loading SSE registers
+				   in 32,64,128,256 and 512-bit */
+  {6, 6, 6, 10, 15},		/* cost of storing SSE registers
+				   in 32,64,128,256 and 512-bit */
+  6, 6,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* mask->integer and integer->mask moves */
+  {6, 6, 6},			/* cost of loading mask register
+				   in QImode, HImode, SImode.  */
+  {6, 6, 6},			/* cost if storing mask register
+				   in QImode, HImode, SImode.  */
+  2,				/* cost of moving mask register.  */
+  /* End of register allocator costs.  */
+  },
+
+  COSTS_N_INSNS (1),		/* cost of an add instruction */
+  /* Setting cost to 2 makes our current implementation of synth_mult result in
+     use of unnecessary temporary registers causing regression on several
+     SPECfp benchmarks.  */
+  COSTS_N_INSNS (1) + 1,	/* cost of a lea instruction */
+  COSTS_N_INSNS (1),		/* variable shift costs */
+  COSTS_N_INSNS (1),		/* constant shift costs */
+  {COSTS_N_INSNS (3),		/* cost of starting multiply for QI */
+   COSTS_N_INSNS (4),		/*				 HI */
+   COSTS_N_INSNS (3),		/*				 SI */
+   COSTS_N_INSNS (4),		/*				 DI */
+   COSTS_N_INSNS (4)},		/*			      other */
+  0,				/* cost of multiply per each bit set */
+  {COSTS_N_INSNS (16),		/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (22),		/*			    HI */
+   COSTS_N_INSNS (30),		/*			    SI */
+   COSTS_N_INSNS (74),		/*			    DI */
+   COSTS_N_INSNS (74)},		/*			    other */
+  COSTS_N_INSNS (1),		/* cost of movsx */
+  COSTS_N_INSNS (1),		/* cost of movzx */
+  8,				/* "large" insn */
+  17,				/* MOVE_RATIO */
+  17,				/* CLEAR_RATIO */
+  {6, 6, 6},			/* cost of loading integer registers
+				   in QImode, HImode and SImode.
+				   Relative to reg-reg move (2).  */
+  {6, 6, 6},			/* cost of storing integer registers */
+  {6, 6, 6, 10, 15},		/* cost of loading SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},		/* cost of storing SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},		/* cost of unaligned loads.  */
+  {6, 6, 6, 10, 15},		/* cost of unaligned storess.  */
+  2, 3, 4,			/* cost of moving XMM,YMM,ZMM register */
+  6,				/* cost of moving SSE register to integer.  */
+  18, 6,			/* Gather load static, per_elt.  */
+  18, 6,			/* Gather store static, per_elt.  */
+  32,				/* size of l1 cache.  */
+  512,				/* size of l2 cache.  */
+  64,				/* size of prefetch block */
+  6,				/* number of parallel prefetches */
+  /* Benchmarks shows large regressions on K8 sixtrack benchmark when this
+     value is increased to perhaps more appropriate value of 5.  */
+  3,				/* Branch cost */
+  COSTS_N_INSNS (3),		/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (5),		/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (17),		/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (1),		/* cost of FABS instruction.  */
+  COSTS_N_INSNS (1),		/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (14),		/* cost of FSQRT instruction.  */
+
+  COSTS_N_INSNS (1),		/* cost of cheap SSE instruction.  */
+  COSTS_N_INSNS (3),		/* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_INSNS (4),		/* cost of MULSS instruction.  */
+  COSTS_N_INSNS (5),		/* cost of MULSD instruction.  */
+  COSTS_N_INSNS (5),		/* cost of FMA SS instruction.  */
+  COSTS_N_INSNS (5),		/* cost of FMA SD instruction.  */
+  COSTS_N_INSNS (13),		/* cost of DIVSS instruction.  */
+  COSTS_N_INSNS (17),		/* cost of DIVSD instruction.  */
+  COSTS_N_INSNS (14),		/* cost of SQRTSS instruction.  */
+  COSTS_N_INSNS (18),		/* cost of SQRTSD instruction.  */
+  1, 4, 3, 3,			/* reassoc int, fp, vec_int, vec_fp.  */
+  tremont_memcpy,
+  tremont_memset,
+  COSTS_N_INSNS (4),		/* cond_taken_branch_cost.  */
+  COSTS_N_INSNS (2),		/* cond_not_taken_branch_cost.  */
+  "16:11:8",			/* Loop alignment.  */
+  "16:11:8",			/* Jump alignment.  */
+  "0:0:8",			/* Label alignment.  */
+  "16",				/* Func alignment.  */
+};
+
 static stringop_algs intel_memcpy[2] = {
   {libcall, {{11, loop, false}, {-1, rep_prefix_4_byte, false}}},
   {libcall, {{32, loop, false}, {64, rep_prefix_4_byte, false},
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 385e275bbd9..088edb6c4ca 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
	  "prefer_known_rep_movsb_stosb",
-	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+	  m_SKYLAKE | m_ALDERLAKE | m_TREMONT | m_CORE_AVX512)

 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.
    This

From patchwork Wed Sep 15 08:09:50 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45014
To: ubizjak@gmail.com
Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Subject: [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
Date: Wed, 15 Sep 2021 16:09:50 +0800
Message-Id: <20210915080951.10362-4-lili.cui@intel.com>
In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com>
References: <20210915080951.10362-1-lili.cui@intel.com>
Reply-To: lili.cui@intel.com
From: "H.J. Lu"

Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
handling the avx_partial_xmm_update attribute.  Don't convert an AVX
partial XMM register update if a vector packed SSE conversion should be
used.

gcc/

	PR target/101900
	* config/i386/i386-features.c (remove_partial_avx_dependency):
	Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
	before generating vxorps.

gcc/

	PR target/101900
	* testsuite/gcc.target/i386/pr101900-1.c: New test.
	* testsuite/gcc.target/i386/pr101900-2.c: Likewise.
	* testsuite/gcc.target/i386/pr101900-3.c: Likewise.
---
 gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
 gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
 4 files changed, 73 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 5a99ea7c046..ae5ea02a002 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
	      != AVX_PARTIAL_XMM_UPDATE_TRUE)
	    continue;

-	  if (!v4sf_const0)
-	    v4sf_const0 = gen_reg_rtx (V4SFmode);
-
	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
	     vec_merge with subreg.  */
	  rtx src = SET_SRC (set);
	  rtx dest = SET_DEST (set);
	  machine_mode dest_mode = GET_MODE (dest);
+	  machine_mode src_mode;
+
+	  if (TARGET_USE_VECTOR_FP_CONVERTS)
+	    {
+	      src_mode = GET_MODE (XEXP (src, 0));
+	      if (src_mode == E_SFmode || src_mode == E_DFmode)
+		continue;
+	    }
+
+	  if (TARGET_USE_VECTOR_CONVERTS)
+	    {
+	      src_mode = GET_MODE (XEXP (src, 0));
+	      if (src_mode == E_SImode || src_mode == E_DImode)
+		continue;
+	    }
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);

	  rtx zero;
	  machine_mode dest_vecmode;
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
new file mode 100644
index 00000000000..0a45f8e340a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c b/gcc/testsuite/gcc.target/i386/pr101900-2.c
new file mode 100644
index 00000000000..c8b2d1da5ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c b/gcc/testsuite/gcc.target/i386/pr101900-3.c
new file mode 100644
index 00000000000..6ee565b5bd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */

From patchwork Wed Sep 15 08:09:51 2021
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 45012
with ESMTP id 3F5093858018 for ; Wed, 15 Sep 2021 08:12:31 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3F5093858018 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1631693551; bh=LkCcVF39jOTKSFcGcvr4JJDvR7Y+oJcqDf1iFp9CVHg=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=gF43nglw/zE3EBAM4fUcS0s0q2uOqFPTxJUYPN9KvpYy80dKXg/ZpEt3wofESxXl0 anruwF6IdXAmfLF8PT8pc7l/6GLIaJYeuceZt6Scm9rb6bCGH5vvvaNcpkIngGy07m E+xjAwS6NVN2/0w8gtrYfDn0QVGnLNFaFib1UUNg= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by sourceware.org (Postfix) with ESMTPS id 883373857C6B for ; Wed, 15 Sep 2021 08:10:01 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 883373857C6B X-IronPort-AV: E=McAfee;i="6200,9189,10107"; a="209345048" X-IronPort-AV: E=Sophos;i="5.85,294,1624345200"; d="scan'208";a="209345048" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2021 01:10:00 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.85,294,1624345200"; d="scan'208";a="452445322" Received: from scymds02.sc.intel.com ([10.82.73.244]) by orsmga002.jf.intel.com with ESMTP; 15 Sep 2021 01:10:00 -0700 Received: from shgcc10.sh.intel.com (shgcc10.sh.intel.com [10.239.154.125]) by scymds02.sc.intel.com with ESMTP id 18F89p6m023358; Wed, 15 Sep 2021 01:09:58 -0700 To: ubizjak@gmail.com Subject: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY Date: Wed, 15 Sep 2021 16:09:51 +0800 Message-Id: <20210915080951.10362-5-lili.cui@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20210915080951.10362-1-lili.cui@intel.com> References: <20210915080951.10362-1-lili.cui@intel.com> X-Spam-Status: No, score=-15.7 required=5.0 tests=BAYES_00, GIT_PATCH_0, 
KAM_DMARC_NONE, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: "lili.cui--- via Gcc-patches" From: "Li, Pan2 via Gcc-patches" Reply-To: lili.cui@intel.com Cc: hongtao.liu@intel.com, gcc-patches@gcc.gnu.org Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" From: "H.J. Lu" 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters. 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters. 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and TARGET_SSE_PARTIAL_REG_DEPENDENCY when handling avx_partial_xmm_update attribute. Don't convert AVX partial XMM register update if there is no partial SSE register dependency for SSE conversion. gcc/ * config/i386/i386-features.c (remove_partial_avx_dependency): Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and and TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating vxorps. * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New. (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise. * config/i386/i386.md (SSE FP to FP splitters): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY. (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY. * config/i386/x86-tune.def (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New. (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise. gcc/testsuite/ * gcc.target/i386/avx-covert-1.c: New file. 
* gcc.target/i386/avx-fp-covert-1.c: Likewise. * gcc.target/i386/avx-int-covert-1.c: Likewise. * gcc.target/i386/sse-covert-1.c: Likewise. * gcc.target/i386/sse-fp-covert-1.c: Likewise. * gcc.target/i386/sse-int-covert-1.c: Likewise. --- gcc/config/i386/i386-features.c | 6 ++++-- gcc/config/i386/i386.h | 4 ++++ gcc/config/i386/i386.md | 9 ++++++--- gcc/config/i386/x86-tune.def | 15 +++++++++++++++ gcc/testsuite/gcc.target/i386/avx-covert-1.c | 19 +++++++++++++++++++ .../gcc.target/i386/avx-fp-covert-1.c | 15 +++++++++++++++ .../gcc.target/i386/avx-int-covert-1.c | 14 ++++++++++++++ gcc/testsuite/gcc.target/i386/sse-covert-1.c | 19 +++++++++++++++++++ .../gcc.target/i386/sse-fp-covert-1.c | 15 +++++++++++++++ .../gcc.target/i386/sse-int-covert-1.c | 14 ++++++++++++++ 10 files changed, 125 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index ae5ea02a002..91bfa06d4bf 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void) machine_mode dest_mode = GET_MODE (dest); machine_mode src_mode; - if (TARGET_USE_VECTOR_FP_CONVERTS) + if (TARGET_USE_VECTOR_FP_CONVERTS + || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY) { src_mode = GET_MODE (XEXP (src, 0)); if (src_mode == E_SFmode || src_mode == E_DFmode) continue; } - if (TARGET_USE_VECTOR_CONVERTS) + if (TARGET_USE_VECTOR_CONVERTS + || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY) { src_mode = GET_MODE (XEXP (src, 0)); if (src_mode == E_SImode || src_mode == E_DImode) diff --git 
a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index e76bb55c080..ec60b89753e 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY] #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \ ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY] +#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \ + ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY] +#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \ + ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY] #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \ ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL] #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \ diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 13f6f57cdcc..c82a9dc1f67 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4535,7 +4535,8 @@ (float_extend:DF (match_operand:SF 1 "nonimmediate_operand")))] "!TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY + && epilogue_completed && optimize_function_for_speed_p (cfun) && (!REG_P (operands[1]) || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1]))) @@ -4708,7 +4709,8 @@ (float_truncate:SF (match_operand:DF 1 "nonimmediate_operand")))] "!TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY + && epilogue_completed && optimize_function_for_speed_p (cfun) && (!REG_P (operands[1]) || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1]))) @@ -5243,7 +5245,8 @@ [(set (match_operand:MODEF 0 "sse_reg_operand") (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))] "!TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY + && epilogue_completed && optimize_function_for_speed_p (cfun) && 
(!EXT_REX_SSE_REG_P (operands[0]) || TARGET_AVX512VL)" diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index 088edb6c4ca..58e8ead56b4 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency", m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10 | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC) +/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids + partial write to the destination in scalar SSE conversion from FP + to FP. */ +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY, + "sse_partial_reg_fp_converts_dependency", + m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10 + | m_BDVER | m_ZNVER | m_GENERIC) + +/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial + write to the destination in scalar SSE conversion from integer to FP. */ +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY, + "sse_partial_reg_converts_dependency", + m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10 + | m_BDVER | m_ZNVER | m_GENERIC) + /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies are resolved on SSE register parts instead of whole registers, so we may maintain just lower part of scalar values in proper format leaving the diff --git a/gcc/testsuite/gcc.target/i386/avx-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-covert-1.c new file mode 100644 index 00000000000..b6c794ecbb8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx-covert-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern double d; +extern int i; + +void +foo (void) +{ + d = f; + f = i; +} + +/* { dg-final { scan-assembler "vcvtss2sd" } } */ +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */ +/* { dg-final { 
scan-assembler-not "vcvtps2pd" } } */ +/* { dg-final { scan-assembler-not "vcvtdq2ps" } } */ +/* { dg-final { scan-assembler-not "vxorps" } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c new file mode 100644 index 00000000000..c40c48b1b2d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */ + +extern float f; +extern double d; + +void +foo (void) +{ + d = f; +} + +/* { dg-final { scan-assembler "vcvtss2sd" } } */ +/* { dg-final { scan-assembler-not "vcvtps2pd" } } */ +/* { dg-final { scan-assembler-not "vxorps" } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c new file mode 100644 index 00000000000..01bb64e66cc --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern int i; + +void +foo (void) +{ + f = i; +} + +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */ +/* { dg-final { scan-assembler-not "vxorps" } } */ diff --git a/gcc/testsuite/gcc.target/i386/sse-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-covert-1.c new file mode 100644 index 00000000000..c30af694505 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse-covert-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern double d; +extern int i; + +void +foo (void) +{ + d = f; + f = i; +} + +/* { dg-final { scan-assembler "cvtss2sd" } } */ +/* { dg-final { scan-assembler "cvtsi2ssl" } } */ +/* { dg-final { scan-assembler-not "cvtps2pd" } } */ +/* { dg-final { 
scan-assembler-not "cvtdq2ps" } } */ +/* { dg-final { scan-assembler-not "pxor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c new file mode 100644 index 00000000000..b6567e60e3e --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */ + +extern float f; +extern double d; + +void +foo (void) +{ + d = f; +} + +/* { dg-final { scan-assembler "cvtss2sd" } } */ +/* { dg-final { scan-assembler-not "cvtps2pd" } } */ +/* { dg-final { scan-assembler-not "pxor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c new file mode 100644 index 00000000000..107f7241def --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */ + +extern float f; +extern int i; + +void +foo (void) +{ + f = i; +} + +/* { dg-final { scan-assembler "cvtsi2ssl" } } */ +/* { dg-final { scan-assembler-not "pxor" } } */