Message ID | 20220319000857.75054-1-hongyu.wang@intel.com |
---|---|
State | New |
Headers |
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id E5CAA3888C6B for <patchwork@sourceware.org>; Sat, 19 Mar 2022 00:09:29 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E5CAA3888C6B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1647648569; bh=BZcF1DGBwLvlLjcpavjRbUMggTMYnXgp4QLCaRpRoqs=; h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:Cc:From; b=nlraOHCnAL76O7CuhVdGZ8Fq0gbKOq69Gl3TWmFoVn8xNJjtT49DWs5HFZjqALsL+ m12Gny/gk2QS9EZqPNC9AHB6+atcsU2zMG5vHt7kmFp3cRDBiJBr/edTT/vpGYPO44 lUs4tH4mIXOXeP6FDqXMf/BCiPXWrGhYpEejj8lg= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by sourceware.org (Postfix) with ESMTPS id 52A71385840C for <gcc-patches@gcc.gnu.org>; Sat, 19 Mar 2022 00:09:00 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 52A71385840C X-IronPort-AV: E=McAfee;i="6200,9189,10290"; a="237864559" X-IronPort-AV: E=Sophos;i="5.90,192,1643702400"; d="scan'208";a="237864559" Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Mar 2022 17:08:59 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,192,1643702400"; d="scan'208";a="645750505" Received: from scymds02.sc.intel.com ([10.82.73.244]) by fmsmga002.fm.intel.com with ESMTP; 18 Mar 2022 17:08:59 -0700 Received: from shliclel320.sh.intel.com (shliclel320.sh.intel.com [10.239.236.50]) by scymds02.sc.intel.com with ESMTP id 22J08vYi020601; Fri, 18 Mar 2022 17:08:58 -0700 To: hongtao.liu@intel.com Subject: [PATCH] AVX512FP16: Fix wrong code for _mm_mask_f[c]madd.*sch [PR 104978] Date: Sat, 19 Mar 2022 08:08:57 +0800 Message-Id: <20220319000857.75054-1-hongyu.wang@intel.com> X-Mailer: git-send-email 2.18.1 X-Spam-Status: No, score=-10.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, KAM_SHORT, SPF_HELO_NONE, SPF_SOFTFAIL, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> From: Hongyu Wang via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Hongyu Wang <hongyu.wang@intel.com> Cc: gcc-patches@gcc.gnu.org Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> |
Series |
AVX512FP16: Fix wrong code for _mm_mask_f[c]madd.*sch [PR 104978]
|
|
Commit Message
Hongyu Wang
March 19, 2022, 12:08 a.m. UTC
Hi, For complex scalar intrinsic like _mm_mask_fcmadd_sch, the mask should be and by 1 to ensure the mask is bind to lowest byte. Bootstraped/regtested on x86_64-pc-linux-gnu{-m32,} and sde. Ok for master? gcc/ChangeLog: PR target/104978 * config/i386/sse.md (avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name): Generate mask & 1 before move to dest under TARGET_AVX512VL. (avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name): Likewise. gcc/testsuite/ChangeLog: PR target/104978 * gcc.target/i386/pr104978.c: New test. --- gcc/config/i386/sse.md | 16 ++++++++++------ gcc/testsuite/gcc.target/i386/pr104978.c | 18 ++++++++++++++++++ 2 files changed, 28 insertions(+), 6 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr104978.c
Comments
On Sat, Mar 19, 2022 at 8:09 AM Hongyu Wang via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > Hi, > > For complex scalar intrinsic like _mm_mask_fcmadd_sch, the > mask should be and by 1 to ensure the mask is bind to lowest byte. > > Bootstraped/regtested on x86_64-pc-linux-gnu{-m32,} and sde. > > Ok for master? > > gcc/ChangeLog: > > PR target/104978 > * config/i386/sse.md > (avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name): > Generate mask & 1 before move to dest under TARGET_AVX512VL. > (avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name): Likewise. > > gcc/testsuite/ChangeLog: > > PR target/104978 > * gcc.target/i386/pr104978.c: New test. > --- > gcc/config/i386/sse.md | 16 ++++++++++------ > gcc/testsuite/gcc.target/i386/pr104978.c | 18 ++++++++++++++++++ > 2 files changed, 28 insertions(+), 6 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr104978.c > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index ed98120be59..cc4c5542ee6 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -6576,7 +6576,7 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > (match_operand:QI 4 "register_operand")] > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > { > - rtx op0, op1; > + rtx op0, op1, mask; > > if (<round_embedded_complex>) > emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask<round_expand_name> ( > @@ -6590,11 +6590,13 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > { > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > + mask = gen_reg_rtx (QImode); > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > } > else > { > - rtx mask, tmp, vec_mask; > + rtx tmp, vec_mask; > mask = lowpart_subreg (SImode, operands[4], QImode), > tmp = gen_reg_rtx (SImode); > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > @@ -6631,7 +6633,7 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > (match_operand:QI 4 "register_operand")] > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > { > - rtx op0, op1; > + rtx op0, op1, mask; > > if (<round_embedded_complex>) > emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask<round_expand_name> ( > @@ -6645,11 +6647,13 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > { > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > + mask = gen_reg_rtx (QImode); > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); Would it be better to use vmovss under avx512vl without & 1 for mask. > } > else > { > - rtx mask, tmp, vec_mask; > + rtx tmp, vec_mask; > mask = lowpart_subreg (SImode, operands[4], QImode), > tmp = gen_reg_rtx (SImode); > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > diff --git a/gcc/testsuite/gcc.target/i386/pr104978.c b/gcc/testsuite/gcc.target/i386/pr104978.c > new file mode 100644 > index 00000000000..fd22a6c3f43 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr104978.c > @@ -0,0 +1,18 @@ > +/* PR target/104978 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl" } */ > +/* { dg-final { scan-assembler-times "and\[^\\n\\r\]*\\\$1" 2 } } */ > + > +#include<immintrin.h> > + > +__m128h > +foo (__m128h a, __m128h b, __m128h c, __mmask8 m) > +{ > + return _mm_mask_fmadd_round_sch (a, m, b, c, 8); > +} > + > +__m128h > +foo2 (__m128h a, __m128h b, __m128h c, __mmask8 m) > +{ > + return _mm_mask_fcmadd_round_sch (a, m, b, c, 8); > +} > -- > 2.18.1 >
> Would it be better to use vmovss under avx512vl without & 1 for mask. vmovss clears the upper bits, but the intrinsic requires src1. We still need either a mask move or blend for the high part. LLVM generates mask & 1 for these intrinsics. Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年3月21日周一 09:08写道: > > On Sat, Mar 19, 2022 at 8:09 AM Hongyu Wang via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > > > Hi, > > > > For complex scalar intrinsic like _mm_mask_fcmadd_sch, the > > mask should be and by 1 to ensure the mask is bind to lowest byte. > > > > Bootstraped/regtested on x86_64-pc-linux-gnu{-m32,} and sde. > > > > Ok for master? > > > > gcc/ChangeLog: > > > > PR target/104978 > > * config/i386/sse.md > > (avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name): > > Generate mask & 1 before move to dest under TARGET_AVX512VL. > > (avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name): Likewise. > > > > gcc/testsuite/ChangeLog: > > > > PR target/104978 > > * gcc.target/i386/pr104978.c: New test. > > --- > > gcc/config/i386/sse.md | 16 ++++++++++------ > > gcc/testsuite/gcc.target/i386/pr104978.c | 18 ++++++++++++++++++ > > 2 files changed, 28 insertions(+), 6 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr104978.c > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > > index ed98120be59..cc4c5542ee6 100644 > > --- a/gcc/config/i386/sse.md > > +++ b/gcc/config/i386/sse.md > > @@ -6576,7 +6576,7 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > > (match_operand:QI 4 "register_operand")] > > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > > { > > - rtx op0, op1; > > + rtx op0, op1, mask; > > > > if (<round_embedded_complex>) > > emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask<round_expand_name> ( > > @@ -6590,11 +6590,13 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > > { > > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > > + mask = gen_reg_rtx (QImode); > > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > > } > > else > > { > > - rtx mask, tmp, vec_mask; > > + rtx tmp, vec_mask; > > mask = lowpart_subreg (SImode, operands[4], QImode), > > tmp = gen_reg_rtx (SImode); > > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > > @@ -6631,7 +6633,7 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > > (match_operand:QI 4 "register_operand")] > > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > > { > > - rtx op0, op1; > > + rtx op0, op1, mask; > > > > if (<round_embedded_complex>) > > emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask<round_expand_name> ( > > @@ -6645,11 +6647,13 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > > { > > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > > + mask = gen_reg_rtx (QImode); > > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > Would it be better to use vmovss under avx512vl without & 1 for mask. > > } > > else > > { > > - rtx mask, tmp, vec_mask; > > + rtx tmp, vec_mask; > > mask = lowpart_subreg (SImode, operands[4], QImode), > > tmp = gen_reg_rtx (SImode); > > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > > diff --git a/gcc/testsuite/gcc.target/i386/pr104978.c b/gcc/testsuite/gcc.target/i386/pr104978.c > > new file mode 100644 > > index 00000000000..fd22a6c3f43 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr104978.c > > @@ -0,0 +1,18 @@ > > +/* PR target/104978 */ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl" } */ > > +/* { dg-final { scan-assembler-times "and\[^\\n\\r\]*\\\$1" 2 } } */ > > + > > +#include<immintrin.h> > > + > > +__m128h > > +foo (__m128h a, __m128h b, __m128h c, __mmask8 m) > > +{ > > + return _mm_mask_fmadd_round_sch (a, m, b, c, 8); > > +} > > + > > +__m128h > > +foo2 (__m128h a, __m128h b, __m128h c, __mmask8 m) > > +{ > > + return _mm_mask_fcmadd_round_sch (a, m, b, c, 8); > > +} > > -- > > 2.18.1 > > > > > -- > BR, > Hongtao
On Mon, Mar 21, 2022 at 9:22 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > Would it be better to use vmovss under avx512vl without & 1 for mask. > > vmovss clears the upper bits, but the intrinsic requires src1. We > still need either a mask move or blend for the high part. not for __m128 _mm_mask_move_ss (__m128 src, __mmask8 k, __m128 a, __m128 b) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vmovss&ig_expand=3807,3081,3082,3084,3083,4837,4838 > > LLVM generates mask & 1 for these intrinsics. > > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年3月21日周一 09:08写道: > > > > On Sat, Mar 19, 2022 at 8:09 AM Hongyu Wang via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > Hi, > > > > > > For complex scalar intrinsic like _mm_mask_fcmadd_sch, the > > > mask should be and by 1 to ensure the mask is bind to lowest byte. > > > > > > Bootstraped/regtested on x86_64-pc-linux-gnu{-m32,} and sde. > > > > > > Ok for master? > > > > > > gcc/ChangeLog: > > > > > > PR target/104978 > > > * config/i386/sse.md > > > (avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name): > > > Generate mask & 1 before move to dest under TARGET_AVX512VL. > > > (avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name): Likewise. > > > > > > gcc/testsuite/ChangeLog: > > > > > > PR target/104978 > > > * gcc.target/i386/pr104978.c: New test. > > > --- > > > gcc/config/i386/sse.md | 16 ++++++++++------ > > > gcc/testsuite/gcc.target/i386/pr104978.c | 18 ++++++++++++++++++ > > > 2 files changed, 28 insertions(+), 6 deletions(-) > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr104978.c > > > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > > > index ed98120be59..cc4c5542ee6 100644 > > > --- a/gcc/config/i386/sse.md > > > +++ b/gcc/config/i386/sse.md > > > @@ -6576,7 +6576,7 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > > > (match_operand:QI 4 "register_operand")] > > > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > > > { > > > - rtx op0, op1; > > > + rtx op0, op1, mask; > > > > > > if (<round_embedded_complex>) > > > emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask<round_expand_name> ( > > > @@ -6590,11 +6590,13 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > > > { > > > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > > > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > > > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > > > + mask = gen_reg_rtx (QImode); > > > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > > > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > > > } > > > else > > > { > > > - rtx mask, tmp, vec_mask; > > > + rtx tmp, vec_mask; > > > mask = lowpart_subreg (SImode, operands[4], QImode), > > > tmp = gen_reg_rtx (SImode); > > > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > > > @@ -6631,7 +6633,7 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > > > (match_operand:QI 4 "register_operand")] > > > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > > > { > > > - rtx op0, op1; > > > + rtx op0, op1, mask; > > > > > > if (<round_embedded_complex>) > > > emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask<round_expand_name> ( > > > @@ -6645,11 +6647,13 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > > > { > > > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > > > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > > > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > > > + mask = gen_reg_rtx (QImode); > > > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > > > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > > Would it be better to use vmovss under avx512vl without & 1 for mask. > > > } > > > else > > > { > > > - rtx mask, tmp, vec_mask; > > > + rtx tmp, vec_mask; > > > mask = lowpart_subreg (SImode, operands[4], QImode), > > > tmp = gen_reg_rtx (SImode); > > > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > > > diff --git a/gcc/testsuite/gcc.target/i386/pr104978.c b/gcc/testsuite/gcc.target/i386/pr104978.c > > > new file mode 100644 > > > index 00000000000..fd22a6c3f43 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/i386/pr104978.c > > > @@ -0,0 +1,18 @@ > > > +/* PR target/104978 */ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl" } */ > > > +/* { dg-final { scan-assembler-times "and\[^\\n\\r\]*\\\$1" 2 } } */ > > > + > > > +#include<immintrin.h> > > > + > > > +__m128h > > > +foo (__m128h a, __m128h b, __m128h c, __mmask8 m) > > > +{ > > > + return _mm_mask_fmadd_round_sch (a, m, b, c, 8); > > > +} > > > + > > > +__m128h > > > +foo2 (__m128h a, __m128h b, __m128h c, __mmask8 m) > > > +{ > > > + return _mm_mask_fcmadd_round_sch (a, m, b, c, 8); > > > +} > > > -- > > > 2.18.1 > > > > > > > > > -- > > BR, > > Hongtao
> > > Would it be better to use vmovss under avx512vl without & 1 for mask. > > > > vmovss clears the upper bits, but the intrinsic requires src1. We > > still need either a mask move or blend for the high part. > not for __m128 _mm_mask_move_ss (__m128 src, __mmask8 k, __m128 a, __m128 b) > https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vmovss&ig_expand=3807,3081,3082,3084,3083,4837,4838 Oh, if this works, the non-avx512vl part could also be adjusted. Will try this, thanks. Hongtao Liu <crazylht@gmail.com> 于2022年3月21日周一 09:48写道: > > On Mon, Mar 21, 2022 at 9:22 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > > > Would it be better to use vmovss under avx512vl without & 1 for mask. > > > > vmovss clears the upper bits, but the intrinsic requires src1. We > > still need either a mask move or blend for the high part. > not for __m128 _mm_mask_move_ss (__m128 src, __mmask8 k, __m128 a, __m128 b) > https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vmovss&ig_expand=3807,3081,3082,3084,3083,4837,4838 > > > > LLVM generates mask & 1 for these intrinsics. > > > > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年3月21日周一 09:08写道: > > > > > > On Sat, Mar 19, 2022 at 8:09 AM Hongyu Wang via Gcc-patches > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > Hi, > > > > > > > > For complex scalar intrinsic like _mm_mask_fcmadd_sch, the > > > > mask should be and by 1 to ensure the mask is bind to lowest byte. > > > > > > > > Bootstraped/regtested on x86_64-pc-linux-gnu{-m32,} and sde. > > > > > > > > Ok for master? > > > > > > > > gcc/ChangeLog: > > > > > > > > PR target/104978 > > > > * config/i386/sse.md > > > > (avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name): > > > > Generate mask & 1 before move to dest under TARGET_AVX512VL. > > > > (avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name): Likewise. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > PR target/104978 > > > > * gcc.target/i386/pr104978.c: New test. > > > > --- > > > > gcc/config/i386/sse.md | 16 ++++++++++------ > > > > gcc/testsuite/gcc.target/i386/pr104978.c | 18 ++++++++++++++++++ > > > > 2 files changed, 28 insertions(+), 6 deletions(-) > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr104978.c > > > > > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > > > > index ed98120be59..cc4c5542ee6 100644 > > > > --- a/gcc/config/i386/sse.md > > > > +++ b/gcc/config/i386/sse.md > > > > @@ -6576,7 +6576,7 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > > > > (match_operand:QI 4 "register_operand")] > > > > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > > > > { > > > > - rtx op0, op1; > > > > + rtx op0, op1, mask; > > > > > > > > if (<round_embedded_complex>) > > > > emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask<round_expand_name> ( > > > > @@ -6590,11 +6590,13 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" > > > > { > > > > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > > > > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > > > > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > > > > + mask = gen_reg_rtx (QImode); > > > > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > > > > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > > > > } > > > > else > > > > { > > > > - rtx mask, tmp, vec_mask; > > > > + rtx tmp, vec_mask; > > > > mask = lowpart_subreg (SImode, operands[4], QImode), > > > > tmp = gen_reg_rtx (SImode); > > > > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > > > > @@ -6631,7 +6633,7 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > > > > (match_operand:QI 4 "register_operand")] > > > > "TARGET_AVX512FP16 && <round_mode512bit_condition>" > > > > { > > > > - rtx op0, op1; > > > > + rtx op0, op1, mask; > > > > > > > > if (<round_embedded_complex>) > > > > emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask<round_expand_name> ( > > > > @@ -6645,11 +6647,13 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" > > > > { > > > > op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); > > > > op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); > > > > - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); > > > > + mask = gen_reg_rtx (QImode); > > > > + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); > > > > + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); > > > Would it be better to use vmovss under avx512vl without & 1 for mask. > > > > } > > > > else > > > > { > > > > - rtx mask, tmp, vec_mask; > > > > + rtx tmp, vec_mask; > > > > mask = lowpart_subreg (SImode, operands[4], QImode), > > > > tmp = gen_reg_rtx (SImode); > > > > emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr104978.c b/gcc/testsuite/gcc.target/i386/pr104978.c > > > > new file mode 100644 > > > > index 00000000000..fd22a6c3f43 > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.target/i386/pr104978.c > > > > @@ -0,0 +1,18 @@ > > > > +/* PR target/104978 */ > > > > +/* { dg-do compile } */ > > > > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl" } */ > > > > +/* { dg-final { scan-assembler-times "and\[^\\n\\r\]*\\\$1" 2 } } */ > > > > + > > > > +#include<immintrin.h> > > > > + > > > > +__m128h > > > > +foo (__m128h a, __m128h b, __m128h c, __mmask8 m) > > > > +{ > > > > + return _mm_mask_fmadd_round_sch (a, m, b, c, 8); > > > > +} > > > > + > > > > +__m128h > > > > +foo2 (__m128h a, __m128h b, __m128h c, __mmask8 m) > > > > +{ > > > > + return _mm_mask_fcmadd_round_sch (a, m, b, c, 8); > > > > +} > > > > -- > > > > 2.18.1 > > > > > > > > > > > > > -- > > > BR, > > > Hongtao > > > > -- > BR, > Hongtao
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index ed98120be59..cc4c5542ee6 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -6576,7 +6576,7 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" (match_operand:QI 4 "register_operand")] "TARGET_AVX512FP16 && <round_mode512bit_condition>" { - rtx op0, op1; + rtx op0, op1, mask; if (<round_embedded_complex>) emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask<round_expand_name> ( @@ -6590,11 +6590,13 @@ (define_expand "avx512fp16_fmaddcsh_v8hf_mask1<round_expand_name>" { op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); + mask = gen_reg_rtx (QImode); + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); } else { - rtx mask, tmp, vec_mask; + rtx tmp, vec_mask; mask = lowpart_subreg (SImode, operands[4], QImode), tmp = gen_reg_rtx (SImode); emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); @@ -6631,7 +6633,7 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" (match_operand:QI 4 "register_operand")] "TARGET_AVX512FP16 && <round_mode512bit_condition>" { - rtx op0, op1; + rtx op0, op1, mask; if (<round_embedded_complex>) emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask<round_expand_name> ( @@ -6645,11 +6647,13 @@ (define_expand "avx512fp16_fcmaddcsh_v8hf_mask1<round_expand_name>" { op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode); op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode); - emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4])); + mask = gen_reg_rtx (QImode); + emit_insn (gen_andqi3 (mask, operands[4], GEN_INT (1))); + emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, mask)); } else { - rtx mask, tmp, vec_mask; + rtx tmp, vec_mask; mask = lowpart_subreg (SImode, operands[4], QImode), tmp = gen_reg_rtx (SImode); emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31))); diff --git a/gcc/testsuite/gcc.target/i386/pr104978.c b/gcc/testsuite/gcc.target/i386/pr104978.c new file mode 100644 index 00000000000..fd22a6c3f43 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr104978.c @@ -0,0 +1,18 @@ +/* PR target/104978 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl" } */ +/* { dg-final { scan-assembler-times "and\[^\\n\\r\]*\\\$1" 2 } } */ + +#include<immintrin.h> + +__m128h +foo (__m128h a, __m128h b, __m128h c, __mmask8 m) +{ + return _mm_mask_fmadd_round_sch (a, m, b, c, 8); +} + +__m128h +foo2 (__m128h a, __m128h b, __m128h c, __mmask8 m) +{ + return _mm_mask_fcmadd_round_sch (a, m, b, c, 8); +}