Message ID | 20220304022839.33024-1-hongtao.liu@intel.com |
---|---|
State | New |
Headers | To: gcc-patches@gcc.gnu.org Subject: [PATCH] [i386] Optimize v4si broadcast for noavx512vl. Date: Fri, 4 Mar 2022 10:28:39 +0800 From: liuhongt via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: liuhongt <hongtao.liu@intel.com> |
Series | [i386] Optimize v4si broadcast for noavx512vl. |
Commit Message
Liu, Hongtao
March 4, 2022, 2:28 a.m. UTC
This is an incremental patch based on [1]. It enables the following
optimization:

-       vbroadcastss    .LC1(%rip), %xmm0
+       movl    $-45, %edx
+       vmovd   %edx, %xmm0
+       vpshufd $0, %xmm0, %xmm0

According to a microbenchmark, the new sequence is faster than a
broadcast from memory.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591162.html

Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

        PR target/104704
        * config/i386/sse.md (*vec_dupv4si): Add alternative $r and
        corresponding post_reload splitter.

gcc/testsuite/ChangeLog:

        * gcc.target/i386/pr100865-8a.c: Adjust testcase.
        * gcc.target/i386/pr100865-8c.c: Ditto.
        * gcc.target/i386/pr100865-9c.c: Ditto.
---
 gcc/config/i386/sse.md                      | 41 ++++++++++++++++-----
 gcc/testsuite/gcc.target/i386/pr100865-8a.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr100865-8c.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr100865-9c.c |  2 +-
 4 files changed, 35 insertions(+), 12 deletions(-)
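To make the replacement sequence concrete, here is a minimal plain-C model of what `vmovd` followed by `vpshufd $0` computes. It is only a sketch: the `v4si_model` type and the `model_*` function names are hypothetical and are not part of GCC or of the patch. It models the lane semantics only and says nothing about performance.

```c
#include <stdint.h>

/* Illustrative stand-in for a 128-bit vector of four 32-bit lanes (V4SI).
   This type is an assumption made for the sketch, not a GCC type.  */
typedef struct { int32_t lane[4]; } v4si_model;

/* Models vmovd %r32, %xmm: lane 0 receives the scalar and the
   upper three lanes are zeroed.  */
static v4si_model
model_vmovd (int32_t x)
{
  v4si_model v = { { x, 0, 0, 0 } };
  return v;
}

/* Models vpshufd $imm8, %xmm, %xmm: result lane i is the source lane
   selected by the two-bit field at bits [2*i+1 : 2*i] of imm8.  */
static v4si_model
model_vpshufd (v4si_model src, unsigned imm8)
{
  v4si_model dst;
  for (int i = 0; i < 4; i++)
    dst.lane[i] = src.lane[(imm8 >> (2 * i)) & 3];
  return dst;
}

/* The two-step sequence from the commit message: move the scalar into
   lane 0, then replicate lane 0 with a shuffle immediate of 0.  */
static v4si_model
model_broadcast_from_gpr (int32_t x)
{
  return model_vpshufd (model_vmovd (x), 0);
}
```

With an immediate of 0 every two-bit selector is 0, so all four result lanes copy lane 0, which is why the pair of instructions behaves as a broadcast of the value held in the general-purpose register.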
Comments
On Fri, Mar 4, 2022 at 3:28 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> +   (set_attr "mode" "TI,V4SF,V4SF,TI")
> +   (set (attr "preferred_for_speed")
> +     (cond [(eq_attr "alternative" "3")
> +              (symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
> +           ]
> +           (symbol_ref "true")))])

What happens if you set preferred_for_speed to false for alternative 1?

> +(define_split
> +  [(set (match_operand:V4SI 0 "sse_reg_operand")
> +       (vec_duplicate:V4SI
> +         (match_operand:SI 1 "general_reg_operand")))]
> +  "TARGET_SSE && reload_completed
> +   /* Disable this splitter if avx512vl_vec_dup_gprv4si insn is
> +      available, because then we can broadcast from GPRs directly.  */

I think avx512vl_vec_dup_gprv4si should be merged with the above
pattern instead.

Uros.
> On 04.03.2022 at 03:30, Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, Mar 4, 2022 at 10:29 AM liuhongt via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> -       vbroadcastss    .LC1(%rip), %xmm0
>> +       movl    $-45, %edx
>> +       vmovd   %edx, %xmm0
>> +       vpshufd $0, %xmm0, %xmm0
>>
>> According to microbenchmark, it's faster than broadcast from memory

Is that true even on AMD uarchs?
On Fri, Mar 4, 2022 at 8:40 AM Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>
> >> -       vbroadcastss    .LC1(%rip), %xmm0
> >> +       movl    $-45, %edx
> >> +       vmovd   %edx, %xmm0
> >> +       vpshufd $0, %xmm0, %xmm0
> >>
> >> According to microbenchmark, it's faster than broadcast from memory
>
> Is that true even on AMD uarchs?

Please check TARGET_INTER_UNIT_MOVES_TO_VEC.
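The pointer to TARGET_INTER_UNIT_MOVES_TO_VEC refers to the `preferred_for_speed` attribute that the patch adds. As a rough sketch, the `cond` expression in the patch reads like the C function below. The function name and its boolean parameter are hypothetical stand-ins for the attribute and the target macro; this models only the decision logic and is not GCC code.

```c
#include <stdbool.h>

/* Model of the patch's preferred_for_speed attribute for *vec_dupv4si.
   ALTERNATIVE is the zero-based constraint alternative; the flag
   INTER_UNIT_MOVES_TO_VEC stands in for TARGET_INTER_UNIT_MOVES_TO_VEC.
   Both names are assumptions made for illustration.  */
static bool
model_preferred_for_speed (int alternative, bool inter_unit_moves_to_vec)
{
  /* Alternative 3 is the new GPR source ("$r"); it is preferred for
     speed only when the tuning allows GPR-to-vector moves.  */
  if (alternative == 3)
    return inter_unit_moves_to_vec;

  /* All other alternatives fall through to the default of true.  */
  return true;
}
```

Under this reading, a tuning that turns the flag off leaves the memory broadcast alternative as the one preferred for speed, which seems to be the point of the reply to the question about AMD microarchitectures.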
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 3066ea3734a..d124545aa5d 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -25121,20 +25121,43 @@ (define_insn "vec_dupv4sf"
    (set_attr "mode" "V4SF")])
 
 (define_insn "*vec_dupv4si"
-  [(set (match_operand:V4SI 0 "register_operand" "=v,v,x")
+  [(set (match_operand:V4SI 0 "register_operand" "=v,v,x,v")
        (vec_duplicate:V4SI
-         (match_operand:SI 1 "nonimmediate_operand" "Yv,m,0")))]
+         (match_operand:SI 1 "nonimmediate_operand" "Yv,m,0,$r")))]
   "TARGET_SSE"
   "@
    %vpshufd\t{$0, %1, %0|%0, %1, 0}
    vbroadcastss\t{%1, %0|%0, %1}
-   shufps\t{$0, %0, %0|%0, %0, 0}"
-  [(set_attr "isa" "sse2,avx,noavx")
-   (set_attr "type" "sselog1,ssemov,sselog1")
-   (set_attr "length_immediate" "1,0,1")
-   (set_attr "prefix_extra" "0,1,*")
-   (set_attr "prefix" "maybe_vex,maybe_evex,orig")
-   (set_attr "mode" "TI,V4SF,V4SF")])
+   shufps\t{$0, %0, %0|%0, %0, 0}
+   #"
+  [(set_attr "isa" "sse2,avx,noavx,noavx512vl")
+   (set_attr "type" "sselog1,ssemov,sselog1,sselog1")
+   (set_attr "length_immediate" "1,0,1,1")
+   (set_attr "prefix_extra" "0,1,*,0")
+   (set_attr "prefix" "maybe_vex,maybe_evex,orig,maybe_vex")
+   (set_attr "mode" "TI,V4SF,V4SF,TI")
+   (set (attr "preferred_for_speed")
+     (cond [(eq_attr "alternative" "3")
+              (symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
+           ]
+           (symbol_ref "true")))])
+
+(define_split
+  [(set (match_operand:V4SI 0 "sse_reg_operand")
+       (vec_duplicate:V4SI
+         (match_operand:SI 1 "general_reg_operand")))]
+  "TARGET_SSE && reload_completed
+   /* Disable this splitter if avx512vl_vec_dup_gprv4si insn is
+      available, because then we can broadcast from GPRs directly.  */
+   && !TARGET_AVX512VL"
+  [(const_int 0)]
+{
+  emit_insn (gen_vec_setv4si_0 (gen_lowpart (V4SImode, operands[0]),
+                                CONST0_RTX (V4SImode),
+                                gen_lowpart (SImode, operands[1])));
+  emit_insn (gen_vec_duplicatev4si (operands[0], operands[0]));
+  DONE;
+})
 
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,v,v,x")
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8a.c b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
index 911b14d4a25..544a14db6f7 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-8a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
@@ -20,5 +20,5 @@ foo (void)
     array[i] = MK_CONST128_BROADCAST_SIGNED (-45);
 }
 
-/* { dg-final { scan-assembler-times "(?:vpbroadcastd|vpshufd)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times "(?:vpbroadcastd|vpshufd)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8c.c b/gcc/testsuite/gcc.target/i386/pr100865-8c.c
index 00682edb8c9..efee0488614 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-8c.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-8c.c
@@ -3,5 +3,5 @@
 
 #include "pr100865-8a.c"
 
-/* { dg-final { scan-assembler-times "vpshufd\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times "vpshufd\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9c.c b/gcc/testsuite/gcc.target/i386/pr100865-9c.c
index 8ffcdc1629d..e6f25902c1d 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-9c.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9c.c
@@ -3,5 +3,5 @@
 
 #include "pr100865-9a.c"
 
-/* { dg-final { scan-assembler-times "vpshufd\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times "vpshufd\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */