Message ID | 20220304072714.55713-1-hongtao.liu@intel.com |
---|---|
State | New |
Headers |
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 16E613857C5A for <patchwork@sourceware.org>; Fri, 4 Mar 2022 07:28:29 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 16E613857C5A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1646378909; bh=ZQmfK3cyxwiWUbii05odrlreNChFn1cH3NooXOXM2PU=; h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=L0WpUNZB96n1MSZUGaaO5s9QLp/ENV+8vUiQwx+ju4o/xl0OTJtISCu0O2ww/70+u Ffqj4Ng3sIUyHc3lSUhPUb3K1vVrXP8kVdzdjTYIgsCVpsy27v8ZqruDBgDijgXkzB YHLfaYilr0cD5ZaVoma0D1/Jc2NrkK+INJiB130c= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by sourceware.org (Postfix) with ESMTPS id BD0163857C49 for <gcc-patches@gcc.gnu.org>; Fri, 4 Mar 2022 07:27:23 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org BD0163857C49 X-IronPort-AV: E=McAfee;i="6200,9189,10275"; a="253646909" X-IronPort-AV: E=Sophos;i="5.90,154,1643702400"; d="scan'208";a="253646909" Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Mar 2022 23:27:22 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,154,1643702400"; d="scan'208";a="609867843" Received: from scymds01.sc.intel.com ([10.148.94.138]) by fmsmga004.fm.intel.com with ESMTP; 03 Mar 2022 23:27:16 -0800 Received: from shliclel051.sh.intel.com (shliclel051.sh.intel.com [10.239.236.51]) by scymds01.sc.intel.com with ESMTP id 2247RF7W013034; Thu, 3 Mar 2022 23:27:15 -0800 To: gcc-patches@gcc.gnu.org Subject: [PATCH] [i386] Prevent vectorization for load from parm_decl at O2 to avoid STF issue. Date: Fri, 4 Mar 2022 15:27:14 +0800 Message-Id: <20220304072714.55713-1-hongtao.liu@intel.com> X-Mailer: git-send-email 2.18.1 X-Spam-Status: No, score=-12.4 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> From: liuhongt via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: liuhongt <hongtao.liu@intel.com> Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> |
Series |
[i386] Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
|
|
Commit Message
Liu, Hongtao
March 4, 2022, 7:27 a.m. UTC
For parameter passing through stack, vectorized load from parm_decl in callee may trigger serious STF issue. This is why GCC12 regresses 50% for cray at -O2 compared to GCC11. The patch add an extremely large number to stmt_cost to prevent vectorization for loads from parm_decl under very-cheap cost model, this can at least prevent O2 regression due to STF issue, but may lose some perf where there's no such issue(1 vector_load vs n scalar_load + CTOR). No impact for SPEC2017 for both plain O2 and native O2 on ICX. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/101908 * config/i386/i386.cc (ix86_load_maybe_stfs_p): New. (ix86_vector_costs::add_stmt_cost): Add extra cost for vector_load/unsigned_load which may have stall forward issue. gcc/testsuite/ChangeLog: * gcc.target/i386/pr101908-1.c: New test. * gcc.target/i386/pr101908-2.c: New test. --- gcc/config/i386/i386.cc | 31 ++++++++++++++++++++++ gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++ gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++ 3 files changed, 55 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
Comments
On Fri, Mar 4, 2022 at 3:28 PM liuhongt via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > For parameter passing through stack, vectorized load from parm_decl > in callee may trigger serious STF issue. This is why GCC12 regresses > 50% for cray at -O2 compared to GCC11. > > The patch add an extremely large number to stmt_cost to prevent > vectorization for loads from parm_decl under very-cheap cost model, > this can at least prevent O2 regression due to STF issue, but may lose > some perf where there's no such issue(1 vector_load vs n scalar_load + > CTOR). > > No impact for SPEC2017 for both plain O2 and native O2 on ICX. > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. > Ok for trunk? > > gcc/ChangeLog: > > PR target/101908 > * config/i386/i386.cc (ix86_load_maybe_stfs_p): New. > (ix86_vector_costs::add_stmt_cost): Add extra cost for > vector_load/unsigned_load which may have stall forward issue. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr101908-1.c: New test. > * gcc.target/i386/pr101908-2.c: New test. > --- > gcc/config/i386/i386.cc | 31 ++++++++++++++++++++++ > gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++ > gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++ > 3 files changed, 55 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index b2bf90576d5..3bbaaf65ea8 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info) > return default_noce_conversion_profitable_p (seq, if_info); > } > > +/* Return true if REF may have STF issue, otherwise false. */ > +static bool > +ix86_load_maybe_stfs_p (tree ref) > +{ > + tree addr = get_base_address (ref); > + > + if (TREE_CODE (addr) != PARM_DECL > + || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr))) > + || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD) > + return false; > + return true; > +} > + > /* x86-specific vector costs. */ > class ix86_vector_costs : public vector_costs > { > @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > if (TREE_CODE (op) == SSA_NAME) > TREE_VISITED (op) = 0; > } > + > + /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue. > + Performance may lose when there's no STF issue(1 vector_load vs n > + scalar_load + CTOR). > + TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine > + tuned. */ > + if ((kind == vector_load || kind == unaligned_load) > + && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP > + && stmt_info > + && stmt_info->slp_type == pure_slp > + && stmt_info->stmt > + && gimple_assign_load_p (stmt_info->stmt) > + && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt))) > + { > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > + stmt_cost += 2000; > + } > + > if (stmt_cost == -1) > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c > new file mode 100644 > index 00000000000..f8e0f2e26bb > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ > + > +struct X { double x[2]; }; > +typedef double v2df __attribute__((vector_size(16))); > + > +v2df __attribute__((noipa)) > +foo (struct X* x, struct X* y) > +{ > + return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] }; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c > new file mode 100644 > index 00000000000..7f2f00cebab > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ > + > +struct X { double x[2]; }; > +typedef double v2df __attribute__((vector_size(16))); > + > +v2df __attribute__((noipa)) > +foo (struct X x, struct X y) > +{ > + return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] }; > +} > -- > 2.18.1 >
On Fri, Mar 4, 2022 at 8:27 AM liuhongt <hongtao.liu@intel.com> wrote: > > For parameter passing through stack, vectorized load from parm_decl > in callee may trigger serious STF issue. This is why GCC12 regresses > 50% for cray at -O2 compared to GCC11. > > The patch add an extremely large number to stmt_cost to prevent > vectorization for loads from parm_decl under very-cheap cost model, > this can at least prevent O2 regression due to STF issue, but may lose > some perf where there's no such issue(1 vector_load vs n scalar_load + > CTOR). Note this is just heuristics in that by-value passed parameters are usually stored to the stack close before the function call. It does not catch the similar case from foo (const X &bar) { ... } where a foo ({ 1., 2. }) will have the object passed by reference constructed right before the call. In the end a full solution will need to perform some IPA analysis that computes the initialization distance from the call and uses should factor in the use distance from function entry. > No impact for SPEC2017 for both plain O2 and native O2 on ICX. > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. > Ok for trunk? > > gcc/ChangeLog: > > PR target/101908 > * config/i386/i386.cc (ix86_load_maybe_stfs_p): New. > (ix86_vector_costs::add_stmt_cost): Add extra cost for > vector_load/unsigned_load which may have stall forward issue. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr101908-1.c: New test. > * gcc.target/i386/pr101908-2.c: New test. > --- > gcc/config/i386/i386.cc | 31 ++++++++++++++++++++++ > gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++ > gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++ > 3 files changed, 55 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index b2bf90576d5..3bbaaf65ea8 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info) > return default_noce_conversion_profitable_p (seq, if_info); > } > > +/* Return true if REF may have STF issue, otherwise false. */ > +static bool > +ix86_load_maybe_stfs_p (tree ref) > +{ > + tree addr = get_base_address (ref); > + > + if (TREE_CODE (addr) != PARM_DECL > + || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr))) > + || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD) > + return false; > + return true; > +} > + > /* x86-specific vector costs. */ > class ix86_vector_costs : public vector_costs > { > @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > if (TREE_CODE (op) == SSA_NAME) > TREE_VISITED (op) = 0; > } > + > + /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue. > + Performance may lose when there's no STF issue(1 vector_load vs n > + scalar_load + CTOR). > + TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine > + tuned. */ > + if ((kind == vector_load || kind == unaligned_load) > + && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP This check doesn't make much sense, I'd rather remove it. > + && stmt_info > + && stmt_info->slp_type == pure_slp > + && stmt_info->stmt > + && gimple_assign_load_p (stmt_info->stmt) > + && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt))) I'd pass down STMT_VINFO_DATA_REF instead and have ix86_load_maybe_stfs_p and use tree addr = DR_BASE_ADDRESS (dr); if (TREE_CODE (addr) != ADDR_EXPR) return false; addr = get_base_address (TREE_OPERAND (addr, 0)); ... since that gets you a more reliable way to look at the actual object referenced. > + { > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > + stmt_cost += 2000; > + } > + Maybe handle this like the Bonell case and thus put it after the stmt_cost == -1 handling, just bumping the cost (also noting the actual number is arbitrary). It would be nice to have a better estimate on the penalty than "2000", maybe formulate it in terms of the target costs simple-sse op at least. That said, it might be interesting to micro-benchmark v2df __attribute__((noipa)) foo (struct X* x, struct X* y) { double temx0, temx1, temy0, temy1; temx0 = x->x[0]; temx0 += temx0; ... temx1 = x->x[1]; temx1 += temx1; ... return (v2df) {temx1, temx0 } + (v2df) { temy1, temy0 }; } (without -ffast-math) to see how many vector adds we'd need to compensate the STLF penalty (just to have an idea whether the magic number is closer to 200, 2000 or 20000). Maybe also put that respective kernel into the i386 testsuite with a specific -mtune (and make the thing a target tunable?). > if (stmt_cost == -1) > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c > new file mode 100644 > index 00000000000..f8e0f2e26bb > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ > + > +struct X { double x[2]; }; > +typedef double v2df __attribute__((vector_size(16))); > + > +v2df __attribute__((noipa)) > +foo (struct X* x, struct X* y) > +{ > + return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] }; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c > new file mode 100644 > index 00000000000..7f2f00cebab > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ > + > +struct X { double x[2]; }; > +typedef double v2df __attribute__((vector_size(16))); > + > +v2df __attribute__((noipa)) > +foo (struct X x, struct X y) > +{ > + return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] }; > +} > -- > 2.18.1 >
On Mon, Mar 7, 2022 at 5:37 PM Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > On Fri, Mar 4, 2022 at 8:27 AM liuhongt <hongtao.liu@intel.com> wrote: > > > > For parameter passing through stack, vectorized load from parm_decl > > in callee may trigger serious STF issue. This is why GCC12 regresses > > 50% for cray at -O2 compared to GCC11. > > > > The patch add an extremely large number to stmt_cost to prevent > > vectorization for loads from parm_decl under very-cheap cost model, > > this can at least prevent O2 regression due to STF issue, but may lose > > some perf where there's no such issue(1 vector_load vs n scalar_load + > > CTOR). > > Note this is just heuristics in that by-value passed parameters are usually > stored to the stack close before the function call. It does not catch the > similar case from > > foo (const X &bar) { ... } > > where a > > foo ({ 1., 2. }) > > will have the object passed by reference constructed right before the > call. In the end a full solution will need to perform some IPA analysis > that computes the initialization distance from the call and uses should > factor in the use distance from function entry. Yes, this patch only deals with by-value passed objects which has nothing to do with ipa, but only depends on psABI. For the case of passing references, IPA is necessary to help analyze some obvious STFS scenarios, but static analysis still has limitations, for example, for the store across the cache line (or page) boundary case, IPA can not give a judgment. > > > No impact for SPEC2017 for both plain O2 and native O2 on ICX. > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. > > Ok for trunk? > > > > gcc/ChangeLog: > > > > PR target/101908 > > * config/i386/i386.cc (ix86_load_maybe_stfs_p): New. > > (ix86_vector_costs::add_stmt_cost): Add extra cost for > > vector_load/unsigned_load which may have stall forward issue. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/pr101908-1.c: New test. > > * gcc.target/i386/pr101908-2.c: New test. > > --- > > gcc/config/i386/i386.cc | 31 ++++++++++++++++++++++ > > gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++ > > gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++ > > 3 files changed, 55 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index b2bf90576d5..3bbaaf65ea8 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info) > > return default_noce_conversion_profitable_p (seq, if_info); > > } > > > > +/* Return true if REF may have STF issue, otherwise false. */ > > +static bool > > +ix86_load_maybe_stfs_p (tree ref) > > +{ > > + tree addr = get_base_address (ref); > > + > > + if (TREE_CODE (addr) != PARM_DECL > > + || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr))) > > + || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD) > > + return false; > > + return true; > > +} > > + > > /* x86-specific vector costs. */ > > class ix86_vector_costs : public vector_costs > > { > > @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > if (TREE_CODE (op) == SSA_NAME) > > TREE_VISITED (op) = 0; > > } > > + > > + /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue. > > + Performance may lose when there's no STF issue(1 vector_load vs n > > + scalar_load + CTOR). > > + TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine > > + tuned. */ > > + if ((kind == vector_load || kind == unaligned_load) > > + && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP > > This check doesn't make much sense, I'd rather remove it. Will change. > > > + && stmt_info > > + && stmt_info->slp_type == pure_slp > > + && stmt_info->stmt > > + && gimple_assign_load_p (stmt_info->stmt) > > + && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt))) > > I'd pass down STMT_VINFO_DATA_REF instead and have ix86_load_maybe_stfs_p > and use > > tree addr = DR_BASE_ADDRESS (dr); > if (TREE_CODE (addr) != ADDR_EXPR) > return false; > addr = get_base_address (TREE_OPERAND (addr, 0)); > ... > > since that gets you a more reliable way to look at the actual object referenced. Yes. > > > + { > > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > + stmt_cost += 2000; > > + } > > + > > Maybe handle this like the Bonell case and thus put it after the > stmt_cost == -1 handling, just bumping the cost (also noting the actual number > is arbitrary). It would be nice to have a better estimate on the penalty than > "2000", maybe formulate it in terms of the target costs simple-sse op at least. > I'll add a member to ix86_cost for STFS to make it target specific, the initial value will come from [1] and real experiments. [1] https://www.agner.org/optimize/microarchitecture.pdf > That said, it might be interesting to micro-benchmark > > v2df __attribute__((noipa)) > foo (struct X* x, struct X* y) > { > double temx0, temx1, temy0, temy1; > temx0 = x->x[0]; > temx0 += temx0; > ... > temx1 = x->x[1]; > temx1 += temx1; > ... > return (v2df) {temx1, temx0 } + (v2df) { temy1, temy0 }; > } > > (without -ffast-math) to see how many vector adds we'd need to compensate the > STLF penalty (just to have an idea whether the magic number is closer to 200, > 2000 or 20000). Maybe also put that respective kernel into the i386 testsuite > with a specific -mtune (and make the thing a target tunable?). > > > if (stmt_cost == -1) > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c > > new file mode 100644 > > index 00000000000..f8e0f2e26bb > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c > > @@ -0,0 +1,12 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > > +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ > > + > > +struct X { double x[2]; }; > > +typedef double v2df __attribute__((vector_size(16))); > > + > > +v2df __attribute__((noipa)) > > +foo (struct X* x, struct X* y) > > +{ > > + return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] }; > > +} > > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c > > new file mode 100644 > > index 00000000000..7f2f00cebab > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c > > @@ -0,0 +1,12 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ > > + > > +struct X { double x[2]; }; > > +typedef double v2df __attribute__((vector_size(16))); > > + > > +v2df __attribute__((noipa)) > > +foo (struct X x, struct X y) > > +{ > > + return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] }; > > +} > > -- > > 2.18.1 > >
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index b2bf90576d5..3bbaaf65ea8 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info) return default_noce_conversion_profitable_p (seq, if_info); } +/* Return true if REF may have STF issue, otherwise false. */ +static bool +ix86_load_maybe_stfs_p (tree ref) +{ + tree addr = get_base_address (ref); + + if (TREE_CODE (addr) != PARM_DECL + || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr))) + || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD) + return false; + return true; +} + /* x86-specific vector costs. */ class ix86_vector_costs : public vector_costs { @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, if (TREE_CODE (op) == SSA_NAME) TREE_VISITED (op) = 0; } + + /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue. + Performance may lose when there's no STF issue(1 vector_load vs n + scalar_load + CTOR). + TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine + tuned. */ + if ((kind == vector_load || kind == unaligned_load) + && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP + && stmt_info + && stmt_info->slp_type == pure_slp + && stmt_info->stmt + && gimple_assign_load_p (stmt_info->stmt) + && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt))) + { + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); + stmt_cost += 2000; + } + if (stmt_cost == -1) stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c new file mode 100644 index 00000000000..f8e0f2e26bb --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-slp-details" } */ +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ + +struct X { double x[2]; }; +typedef double v2df __attribute__((vector_size(16))); + +v2df __attribute__((noipa)) +foo (struct X* x, struct X* y) +{ + return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] }; +} diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c new file mode 100644 index 00000000000..7f2f00cebab --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-slp-details" } */ +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */ + +struct X { double x[2]; }; +typedef double v2df __attribute__((vector_size(16))); + +v2df __attribute__((noipa)) +foo (struct X x, struct X y) +{ + return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] }; +}