From patchwork Mon Dec 4 05:32:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: liuhongt X-Patchwork-Id: 81247 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 4A9763857C58 for ; Mon, 4 Dec 2023 05:32:33 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12]) by sourceware.org (Postfix) with ESMTPS id 926B03858C78 for ; Mon, 4 Dec 2023 05:32:14 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 926B03858C78 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=intel.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 926B03858C78 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=198.175.65.12 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1701667937; cv=none; b=WYWHlWFJwprt0MX0ufGPUMAW3dseUM/TGkhqKXQRcjoZDo2rPGBND0Q3Kyky0GFjFUpz1yNSpo2MKJyGRhyrPxmVvKbKB51JHc50Is8WJAJ/ofFNv2N5f/Gm/VuGPi1ULc1NNm22BygBXj501pT2q1vU7pQ8necdflLP/zvRxVs= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1701667937; c=relaxed/simple; bh=g5paXve4KjIfqlwjfzZQ86XmkcsDeNU2UxxOL6bX+nU=; h=DKIM-Signature:From:To:Subject:Date:Message-Id:MIME-Version; b=bpPWuFjSM0CNZYn+6Av3GGQO9mapdaZjoRPv4YHE+TKfsVCzPOgp3+PN3/fWtwObd9tEoSPQrgNslJ1TrM7mtCRn9JuBbzcsfK4TbVb5hK0EnVzjNHFL2w3GUxEVCAZgL+DhFIUD4zmN8RpJrPXaw9KbBaWeAhS6CHkaTPJfau4= ARC-Authentication-Results: i=1; server2.sourceware.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701667934; x=1733203934; h=from:to:cc:subject:date:message-id:mime-version: content-transfer-encoding; bh=g5paXve4KjIfqlwjfzZQ86XmkcsDeNU2UxxOL6bX+nU=; b=BXmlhRUR+E29dcEaipbbq6nHJ8XU51rLxgM12PDeIb6mLE4I4I8lbniY psbi32SBcttEtcV5uJvPC2UIYlco0xZrOx6NIe7kLAlTH5C06rL6pXDXx nlY4c8BqQZgKYCHvr9Dm6idqCy/rOuhfBCaDkFwcq0AyDHHodaIuhAYFp e8iVhbbpJDpxvvrCskZ7SlClw+6x2OZovYowUUUC1cPL9ufpHvjGqph29 c2xEr8g4mMReIBtNAtmgexnWtcxycDtHzBJSg2pdqc1s7NM0a4/5JUr2Q JgtcySx+h5JTjMVtIcqWld+3S5PSMkZfOES3zFI03PUE4akMTUUEvh7HX g==; X-IronPort-AV: E=McAfee;i="6600,9927,10913"; a="736407" X-IronPort-AV: E=Sophos;i="6.04,248,1695711600"; d="scan'208";a="736407" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2023 21:32:13 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10913"; a="861261515" X-IronPort-AV: E=Sophos;i="6.04,248,1695711600"; d="scan'208";a="861261515" Received: from shvmail03.sh.intel.com ([10.239.245.20]) by FMSMGA003.fm.intel.com with ESMTP; 03 Dec 2023 21:32:09 -0800 Received: from shliclel4217.sh.intel.com (shliclel4217.sh.intel.com [10.239.240.127]) by shvmail03.sh.intel.com (Postfix) with ESMTP id 9D3871005689; Mon, 4 Dec 2023 13:32:08 +0800 (CST) From: liuhongt To: gcc-patches@gcc.gnu.org Cc: crazylht@gmail.com, hjl.tools@gmail.com Subject: [PATCH] Don't vectorize when vector stmts are only vec_contruct and stores Date: Mon, 4 Dec 2023 13:32:08 +0800 Message-Id: <20231204053208.908533-1-hongtao.liu@intel.com> X-Mailer: git-send-email 2.31.1 MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org .i.e. for below cases. a[0] = b1; a[1] = b2; .. a[n] = bn; There're extra dependences when contructing the vector, but not for scalar store. According to experiments, it's generally worse. The patch adds an cut-off heuristic when vec_stmt is just vec_construct and vector store. It improves SPEC2017 a little bit. BenchMarks Ratio 500.perlbench_r 2.60% 502.gcc_r 0.30% 505.mcf_r 0.40% 520.omnetpp_r -1.00% 523.xalancbmk_r 0.90% 525.x264_r 0.00% 531.deepsjeng_r 0.30% 541.leela_r 0.90% 548.exchange2_r 3.20% 557.xz_r 1.40% 503.bwaves_r 0.00% 507.cactuBSSN_r 0.00% 508.namd_r 0.30% 510.parest_r 0.00% 511.povray_r 0.20% 519.lbm_r SAME BIN 521.wrf_r -0.30% 526.blender_r -1.20% 527.cam4_r -0.20% 538.imagick_r 4.00% 544.nab_r 0.40% 549.fotonik3d_r 0.00% 554.roms_r 0.00% Geomean-int 0.90% Geomean-fp 0.30% Geomean-all 0.50% And Regressed testcases: gcc.target/i386/part-vect-absneghf.c gcc.target/i386/part-vect-copysignhf.c gcc.target/i386/part-vect-xorsignhf.c Regressed under -m32 since it generates 2 vector .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The original testcases are used to test vectorization capability for .ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT. gcc.target/i386/pr111023-2.c gcc.target/i386/pr111023.c Regressed under -m32 testcase as below void v8hi_v8qi (v8hi *dst, v16qi src) { short tem[8]; tem[0] = src[0]; tem[1] = src[1]; tem[2] = src[2]; tem[3] = src[3]; tem[4] = src[4]; tem[5] = src[5]; tem[6] = src[6]; tem[7] = src[7]; dst[0] = *(v8hi *) tem; } under 64-bit target, vectorizer realize it's just permutation of original src vector, but under -m32, vectorizer relies on vec_construct for vectorization. I think optimziation for this case under 32-bit target maynot impact much, so just add -fno-vect-cost-model. gcc.target/i386/pr91446.c: This testcase is guard for cost model of vector store, not vectorization capability, so just adjust testcase. gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to optimize for bswap, like other optimziation vectorizer can't realize optimization after it. So the current solution is add -fno-vect-cost-model to the testcase. costmodel-pr104582-1.c costmodel-pr104582-2.c costmodel-pr104582-4.c Failed since it's now not vectorized, looked at the PR, it's exactly what's wanted, so adjust testcase to scan-tree-dump-not. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/99881 PR target/104582 * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Check if kind is vec_construct or vector store. (ix86_vector_costs::finish_cost): Don't do vectorization when vector stmts are only vec_construct and stores. (ix86_vector_costs::ix86_vect_construct_store_only_p): New function. (ix86_vector_costs::ix86_vect_cut_off): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/part-vect-absneghf.c: Restrict testcase to TARGET_64BIT. * gcc.target/i386/part-vect-copysignhf.c: Ditto. * gcc.target/i386/part-vect-xorsignhf.c: Ditto. * gcc.target/i386/pr91446.c: Adjust testcase. * gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model. * gcc.target/i386/pr111023-2.c: Ditto. * gcc.target/i386/pr111023.c: Ditto. * gcc.target/i386/pr99881.c: Remove xfail. * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Changed to Scan-tree-dump-not. * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto. * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto. --- gcc/config/i386/i386.cc | 81 ++++++++++++++++++- .../costmodel/x86_64/costmodel-pr104582-1.c | 2 +- .../costmodel/x86_64/costmodel-pr104582-3.c | 2 +- .../costmodel/x86_64/costmodel-pr104582-4.c | 2 +- .../gcc.target/i386/part-vect-absneghf.c | 4 +- .../gcc.target/i386/part-vect-copysignhf.c | 4 +- .../gcc.target/i386/part-vect-xorsignhf.c | 4 +- gcc/testsuite/gcc.target/i386/pr108938-3.c | 2 +- gcc/testsuite/gcc.target/i386/pr111023-2.c | 2 +- gcc/testsuite/gcc.target/i386/pr111023.c | 2 +- gcc/testsuite/gcc.target/i386/pr91446.c | 14 ++-- gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- 12 files changed, 99 insertions(+), 22 deletions(-) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index dcaea6c2096..a4b23e29eba 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -24573,6 +24573,10 @@ public: private: + /* Don't do vectorization for certain patterns. */ + void ix86_vect_cut_off (); + + bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info); /* Estimate register pressure of the vectorized code. */ void ix86_vect_estimate_reg_pressure (); /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for @@ -24581,12 +24585,15 @@ private: where we know it's not loaded from memory. */ unsigned m_num_gpr_needed[3]; unsigned m_num_sse_needed[3]; + + bool m_vec_construct_store_only; }; ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar) : vector_costs (vinfo, costing_for_scalar), m_num_gpr_needed (), - m_num_sse_needed () + m_num_sse_needed (), + m_vec_construct_store_only (true) { } @@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store); int stmt_cost = - 1; + if (m_vec_construct_store_only + && !ix86_vect_construct_store_only_p (kind, stmt_info)) + m_vec_construct_store_only = false; + bool fp = false; machine_mode mode = scalar_p ? SImode : TImode; @@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure () } } +/* Return true if KIND and STMT_INFO is either vec_construct or vector + store, stmt_info could be promote/demote between vec_construct and + vector store, also count that. */ +bool +ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind, + stmt_vec_info stmt_info) +{ + switch (kind) + { + case vec_construct: + case vector_store: + case unaligned_store: + case vector_scatter_store: + case vec_promote_demote: + return true; + + /* Vectorizer will try VEC_PACK_TRUNK_EXPR for things likes + char* a; + short b1, b2, b3, b4; + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; + Also don't vectorized it. */ + case vector_stmt: + if (stmt_info && stmt_info->stmt + && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN + && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR) + return true; + + default: + break; + } + + return false; +} + void -ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) +ix86_vector_costs::ix86_vect_cut_off () { loop_vec_info loop_vinfo = dyn_cast (m_vinfo); if (loop_vinfo && !m_costing_for_scalar) @@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) m_costs[vect_body] = INT_MAX; + return; + } + + /* Don't do vectorization for things like + + a[0] = b1; + a[1] = b2; + .. + a[n] = bn; + + There're extra dependences when contructing the vector, but not for + scalar store. According to experiments, it's generally worse. */ + if (m_vec_construct_store_only) + { + m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Skip vectorization for stmts which only contains" + " vec_construct and vector store.\n"); + return; } + return; +} + +void +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) +{ + ix86_vect_estimate_reg_pressure (); + ix86_vect_cut_off (); + vector_costs::finish_cost (scalar_costs); } diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c index 992a845ad7a..f940af70b72 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c @@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b) s.b = b_; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c index 999c4905708..eff60b2a82a 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c @@ -10,4 +10,4 @@ foo (double a, double b) s.b = b; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c index cc471e1ed73..2f354338e96 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c @@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b) s.b = b_; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c index 48aed14d604..4052210ec39 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c @@ -85,7 +85,7 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR