[x86] UNSPEC_PALIGNR optimizations and clean-ups.

Message ID 006901d88cb1$1a06e870$4e14b950$@nextmovesoftware.com
State New
Series [x86] UNSPEC_PALIGNR optimizations and clean-ups.

Commit Message

Roger Sayle June 30, 2022, 6:42 p.m. UTC
  This patch is a follow-up to Hongtao's fix for PR target/105854.  That
fix is perfectly correct, but the thing that caught my eye was why the
compiler was generating a shift by zero at all.  Digging deeper, it
turns out that we can easily optimize __builtin_ia32_palignr for
alignments of 0 and 64, which can be simplified to moves from the
highpart or lowpart of the (conceptually concatenated) operands.
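
For example (a minimal sketch of the folding, not part of the patch; the
comments give my reading of which operand survives, compiled with -mssse3
as in the new test case):

  typedef long long __attribute__ ((__vector_size__ (8))) v1di;

  v1di f0  (v1di x, v1di y) { return __builtin_ia32_palignr (x, y, 0); }  /* just y */
  v1di f64 (v1di x, v1di y) { return __builtin_ia32_palignr (x, y, 64); } /* just x */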

After adding optimizations to simplify the 64-bit DImode palignr,
I started to add the corresponding optimizations for vpalignr (i.e.
128-bit).  The first oddity is that sse.md uses TImode and a special
SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
above SSESCALARMODE hints that this should be "dropped in favor of
VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
<ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
using V1TImode instead of TImode for 128-bit palignr.

But it was only after I'd implemented this clean-up that I stumbled
across the strange semantics of 128-bit [v]palignr.  According to
https://www.felixcloutier.com/x86/palignr, the semantics are subtly
different based upon how the instruction is encoded.  PALIGNR leaves
the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
highpart, and (unless I'm mistaken) it looks like GCC currently uses
the exact same RTL/templates for both, treating one as an alternative
for the other.
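
To make the difference concrete (an illustrative example, not part of the
patch), the same intrinsic can end up with either encoding depending on
the enabled ISA:

  #include <tmmintrin.h>

  __m128i align4 (__m128i a, __m128i b)
  {
    /* With -mssse3 this uses the legacy "palignr" encoding, which on AVX
       hardware leaves bits 255:128 of the destination register unmodified;
       with -mavx it uses the VEX.128 "vpalignr" encoding, which zeroes
       those upper bits.  */
    return _mm_alignr_epi8 (a, b, 4);
  }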

Hence I thought I'd post what I have so far (part optimization and
part clean-up), to then ask the x86 experts for their opinions.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-,32},
with no new failures.  Ok for mainline?


2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
        CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
        * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
        and gen_ssse3_palignv1ti instead of TImode.
        * config/i386/sse.md (SSESCALARMODE): Delete.
        (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
        (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
        iterator instead of SSESCALARMODE.

        (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
        using a single move instruction (if required).
        (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
        (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.

gcc/testsuite/ChangeLog
        * gcc.target/i386/ssse3-palignr-2.c: New test case.


Thanks in advance,
Roger
--
  

Comments

Hongtao Liu July 1, 2022, 2:12 a.m. UTC | #1
On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch is a follow-up to Hongtao's fix for PR target/105854.  That
> fix is perfectly correct, but the thing that caught my eye was why is
> the compiler generating a shift by zero at all.  Digging deeper it
> turns out that we can easily optimize __builtin_ia32_palignr for
> alignments of 0 and 64 respectively, which may be simplified to moves
> from the highpart or lowpart.
>
> After adding optimizations to simplify the 64-bit DImode palignr,
> I started to add the corresponding optimizations for vpalignr (i.e.
> 128-bit).  The first oddity is that sse.md uses TImode and a special
> SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> above SSESCALARMODE hints that this should be "dropped in favor of
> VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> using V1TImode instead of TImode for 128-bit palignr.
>
> But it was only after I'd implemented this clean-up that I stumbled
> across the strange semantics of 128-bit [v]palignr.  According to
> https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> different based upon how the instruction is encoded.  PALIGNR leaves
> the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> highpart, and (unless I'm mistaken) it looks like GCC currently uses
> the exact same RTL/templates for both, treating one as an alternative
> for the other.
I think as long as patterns or intrinsics only care about the low
part, they should be ok.
But if we want to rely on the default behavior for the upper bits, we need
to restrict them to a specific ISA (e.g. vmovq in vec_set<mode>_0).
Generally, 128-bit legacy SSE instructions have different behavior for
the upper bits than the AVX ones, and that's why vzeroupper was
introduced for SSE <-> AVX instruction transitions.
>
> Hence I thought I'd post what I have so far (part optimization and
> part clean-up), to then ask the x86 experts for their opinions.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-,32},
> with no new failures.  Ok for mainline?
>
>
> 2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
>         CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
>         * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
>         and gen_ssse3_palignv1ti instead of TImode.
>         * config/i386/sse.md (SSESCALARMODE): Delete.
>         (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
>         (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
>         iterator instead of SSESCALARMODE.
>
>         (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
>         using a single move instruction (if required).
>         (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
>         (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/ssse3-palignr-2.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>

+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+        (unspec:DI [(match_operand:DI 1 "register_operand")
+                    (match_operand:DI 2 "register_mmxmem_operand")
+                    (const_int 0)]
+                   UNSPEC_PALIGNR))]
+  ""
+  [(set (match_dup 0) (match_dup 2))])
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+        (unspec:DI [(match_operand:DI 1 "register_operand")
+                    (match_operand:DI 2 "register_mmxmem_operand")
+                    (const_int 64)]
+                   UNSPEC_PALIGNR))]
+  ""
+  [(set (match_dup 0) (match_dup 1))])
+
A define_split is assumed to split into 2 (or more) insns, hence
pass_combine will only try a define_split if the number of merged insns
is greater than 2.
For palignr, I think most of the time there would be only 2 merged
insns (constant propagation), so it would be better to turn these into a
pre-reload splitter
(e.g. define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").


--
BR,
Hongtao
  
Hongtao Liu July 1, 2022, 2:40 a.m. UTC | #2
On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > This patch is a follow-up to Hongtao's fix for PR target/105854.  That
> > fix is perfectly correct, but the thing that caught my eye was why is
> > the compiler generating a shift by zero at all.  Digging deeper it
> > turns out that we can easily optimize __builtin_ia32_palignr for
> > alignments of 0 and 64 respectively, which may be simplified to moves
> > from the highpart or lowpart.
> >
> > After adding optimizations to simplify the 64-bit DImode palignr,
> > I started to add the corresponding optimizations for vpalignr (i.e.
> > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > above SSESCALARMODE hints that this should be "dropped in favor of
> > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > using V1TImode instead of TImode for 128-bit palignr.
> >
> > But it was only after I'd implemented this clean-up that I stumbled
> > across the strange semantics of 128-bit [v]palignr.  According to
> > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > different based upon how the instruction is encoded.  PALIGNR leaves
> > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > the exact same RTL/templates for both, treating one as an alternative
> > for the other.
> I think as long as patterns or intrinsics only care about the low
> part, they should be ok.
> But if we want to use default behavior for upper bits, we need to
> restrict them under specific isa(.i.e. vmovq in vec_set<mode>_0).
> Generally, 128-bit sse legacy instructions have different behaviors
> for upper bits from AVX ones, and that's why vzeroupper is introduced
> for sse <-> avx instructions transition.
> >
> > Hence I thought I'd post what I have so far (part optimization and
> > part clean-up), to then ask the x86 experts for their opinions.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-,32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> >         CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> >         * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
> >         and gen_ssse3_palignv1ti instead of TImode.
> >         * config/i386/sse.md (SSESCALARMODE): Delete.
> >         (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> >         (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
> >         iterator instead of SSESCALARMODE.
> >
> >         (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> >         using a single move instruction (if required).
> >         (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> >         (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/ssse3-palignr-2.c: New test case.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
>
> +(define_split
> +  [(set (match_operand:DI 0 "register_operand")
> + (unspec:DI [(match_operand:DI 1 "register_operand")
> +    (match_operand:DI 2 "register_mmxmem_operand")
> +    (const_int 0)]
> +   UNSPEC_PALIGNR))]
> +  ""
> +  [(set (match_dup 0) (match_dup 2))])
> +
> +(define_split
> +  [(set (match_operand:DI 0 "register_operand")
> + (unspec:DI [(match_operand:DI 1 "register_operand")
> +    (match_operand:DI 2 "register_mmxmem_operand")
> +    (const_int 64)]
> +   UNSPEC_PALIGNR))]
> +  ""
> +  [(set (match_dup 0) (match_dup 1))])
> +
> define_split is assumed to be splitted to 2(or more) insns, hence
> pass_combine will only try define_split if the number of merged insns
> is greater than 2.
> For palignr, i think most time there would be only 2 merged
> insns(constant propagation), so better to change them as pre_reload
> splitter.
> (.i.e. (define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
I think you can just merge the 2 define_splits into the define_insn_and_split
"ssse3_palignrdi" by relaxing its split condition to

-  "TARGET_SSSE3 && reload_completed
-   && SSE_REGNO_P (REGNO (operands[0]))"
+  "(TARGET_SSSE3 && reload_completed
+   && SSE_REGNO_P (REGNO (operands[0])))
+   || INTVAL (operands[3]) == 0
+   || INTVAL (operands[3]) == 64"

and you have already handled them by

+  if (operands[3] == const0_rtx)
+    {
+      if (!rtx_equal_p (operands[0], operands[2]))
+        emit_move_insn (operands[0], operands[2]);
+      else
+        emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+  else if (INTVAL (operands[3]) == 64)
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+        emit_move_insn (operands[0], operands[1]);
+      else
+        emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+

>

>
> --
> BR,
> Hongtao



--
BR,
Hongtao
  
Roger Sayle July 4, 2022, 5:48 p.m. UTC | #3
Hi Hongtao,
Many thanks for your review.  This revised patch implements your
suggestion of removing the combine splitters and instead reusing the
functionality of the ssse3_palignrdi define_insn_and_split.

This revised patch has been tested on x86_64-pc-linux-gnu with make
bootstrap and make -k check, both with and without --target_board=unix{-,32},
with no new failures.  Is this revised version Ok for mainline?


2022-07-04  Roger Sayle  <roger@nextmovesoftware.com>
            Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog
        * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
        CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
        * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
        and gen_ssse3_palignv1ti instead of TImode.
        * config/i386/sse.md (SSESCALARMODE): Delete.
        (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
        (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
        iterator instead of SSESCALARMODE.

        (ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
        using a single move instruction (if required).

gcc/testsuite/ChangeLog
        * gcc.target/i386/ssse3-palignr-2.c: New test case.


Thanks in advance,
Roger
--

> -----Original Message-----
> From: Hongtao Liu <crazylht@gmail.com>
> Sent: 01 July 2022 03:40
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
> 
> On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > >
> > >
> > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > That fix is perfectly correct, but the thing that caught my eye was
> > > why is the compiler generating a shift by zero at all.  Digging
> > > deeper it turns out that we can easily optimize
> > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > which may be simplified to moves from the highpart or lowpart.
> > >
> > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > started to add the corresponding optimizations for vpalignr (i.e.
> > > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > > using V1TImode instead of TImode for 128-bit palignr.
> > >
> > > But it was only after I'd implemented this clean-up that I stumbled
> > > across the strange semantics of 128-bit [v]palignr.  According to
> > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > different based upon how the instruction is encoded.  PALIGNR leaves
> > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > the exact same RTL/templates for both, treating one as an
> > > alternative for the other.
> > I think as long as patterns or intrinsics only care about the low
> > part, they should be ok.
> > But if we want to use default behavior for upper bits, we need to
> > restrict them under specific isa(.i.e. vmovq in vec_set<mode>_0).
> > Generally, 128-bit sse legacy instructions have different behaviors
> > for upper bits from AVX ones, and that's why vzeroupper is introduced
> > for sse <-> avx instructions transition.
> > >
> > > Hence I thought I'd post what I have so far (part optimization and
> > > part clean-up), to then ask the x86 experts for their opinions.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-,32}, with no new failures.  Ok for mainline?
> > >
> > >
> > > 2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > >         CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > >         * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use
> V1TImode
> > >         and gen_ssse3_palignv1ti instead of TImode.
> > >         * config/i386/sse.md (SSESCALARMODE): Delete.
> > >         (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > >         (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a
> mode
> > >         iterator instead of SSESCALARMODE.
> > >
> > >         (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > >         using a single move instruction (if required).
> > >         (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > >         (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/ssse3-palignr-2.c: New test case.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
> >
> > +(define_split
> > +  [(set (match_operand:DI 0 "register_operand")  (unspec:DI
> > +[(match_operand:DI 1 "register_operand")
> > +    (match_operand:DI 2 "register_mmxmem_operand")
> > +    (const_int 0)]
> > +   UNSPEC_PALIGNR))]
> > +  ""
> > +  [(set (match_dup 0) (match_dup 2))])
> > +
> > +(define_split
> > +  [(set (match_operand:DI 0 "register_operand")  (unspec:DI
> > +[(match_operand:DI 1 "register_operand")
> > +    (match_operand:DI 2 "register_mmxmem_operand")
> > +    (const_int 64)]
> > +   UNSPEC_PALIGNR))]
> > +  ""
> > +  [(set (match_dup 0) (match_dup 1))])
> > +
> > define_split is assumed to be splitted to 2(or more) insns, hence
> > pass_combine will only try define_split if the number of merged insns
> > is greater than 2.
> > For palignr, i think most time there would be only 2 merged
> > insns(constant propagation), so better to change them as pre_reload
> > splitter.
> > (.i.e. (define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
> I think you can just merge 2 define_split into define_insn_and_split
> "ssse3_palignrdi" by relaxing split condition as
> 
> -  "TARGET_SSSE3 && reload_completed
> -   && SSE_REGNO_P (REGNO (operands[0]))"
> +  "(TARGET_SSSE3 && reload_completed
> +   && SSE_REGNO_P (REGNO (operands[0])))
> +   || INVAL(operands[3]) == 0
> +   || INVAL(operands[3]) == 64"
> 
> and you have already handled them by
> 
> +  if (operands[3] == const0_rtx)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[2])) emit_move_insn
> + (operands[0], operands[2]);
> +      else
> + emit_note (NOTE_INSN_DELETED);
> +      DONE;
> +    }
> +  else if (INTVAL (operands[3]) == 64)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[1])) emit_move_insn
> + (operands[0], operands[1]);
> +      else
> + emit_note (NOTE_INSN_DELETED);
> +      DONE;
> +    }
> +
> 
> >
> 
> >
> > --
> > BR,
> > Hongtao
> 
> 
> 
> --
> BR,
> Hongtao
diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index e6daad4..fd16093 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -900,7 +900,7 @@ BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_psignv4si3, "__builtin_ia32_psig
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_psignv2si3, "__builtin_ia32_psignd", IX86_BUILTIN_PSIGND, UNKNOWN, (int) V2SI_FTYPE_V2SI_V2SI)
 
 /* SSSE3.  */
-BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
+BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrv1ti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_palignrdi, "__builtin_ia32_palignr", IX86_BUILTIN_PALIGNR, UNKNOWN, (int) V1DI_FTYPE_V1DI_V1DI_INT_CONVERT)
 
 /* SSE4.1 */
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 8bc5430..6a3fcde 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -19548,9 +19548,11 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p)
   shift = GEN_INT (min * GET_MODE_UNIT_BITSIZE (d->vmode));
   if (GET_MODE_SIZE (d->vmode) == 16)
     {
-      target = gen_reg_rtx (TImode);
-      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, dcopy.op1),
-				      gen_lowpart (TImode, dcopy.op0), shift));
+      target = gen_reg_rtx (V1TImode);
+      emit_insn (gen_ssse3_palignrv1ti (target,
+					gen_lowpart (V1TImode, dcopy.op1),
+					gen_lowpart (V1TImode, dcopy.op0),
+					shift));
     }
   else
     {
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f2f72e8..adf05bf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -575,10 +575,6 @@
 (define_mode_iterator VIMAX_AVX2
   [(V2TI "TARGET_AVX2") V1TI])
 
-;; ??? This should probably be dropped in favor of VIMAX_AVX2_AVX512BW.
-(define_mode_iterator SSESCALARMODE
-  [(V4TI "TARGET_AVX512BW") (V2TI "TARGET_AVX2") TI])
-
 (define_mode_iterator VI12_AVX2
   [(V32QI "TARGET_AVX2") V16QI
    (V16HI "TARGET_AVX2") V8HI])
@@ -712,7 +708,7 @@
     (V4HI "ssse3") (V8HI "ssse3") (V16HI "avx2") (V32HI "avx512bw")
     (V4SI "ssse3") (V8SI "avx2")
     (V2DI "ssse3") (V4DI "avx2")
-    (TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
+    (V1TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
 
 (define_mode_attr sse4_1_avx2
    [(V16QI "sse4_1") (V32QI "avx2") (V64QI "avx512bw")
@@ -21092,10 +21088,10 @@
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "<ssse3_avx2>_palignr<mode>"
-  [(set (match_operand:SSESCALARMODE 0 "register_operand" "=x,<v_Yw>")
-	(unspec:SSESCALARMODE
-	  [(match_operand:SSESCALARMODE 1 "register_operand" "0,<v_Yw>")
-	   (match_operand:SSESCALARMODE 2 "vector_operand" "xBm,<v_Yw>m")
+  [(set (match_operand:VIMAX_AVX2_AVX512BW 0 "register_operand" "=x,<v_Yw>")
+	(unspec:VIMAX_AVX2_AVX512BW
+	  [(match_operand:VIMAX_AVX2_AVX512BW 1 "register_operand" "0,<v_Yw>")
+	   (match_operand:VIMAX_AVX2_AVX512BW 2 "vector_operand" "xBm,<v_Yw>m")
 	   (match_operand:SI 3 "const_0_to_255_mul_8_operand")]
 	  UNSPEC_PALIGNR))]
   "TARGET_SSSE3"
@@ -21141,11 +21137,30 @@
       gcc_unreachable ();
     }
 }
-  "TARGET_SSSE3 && reload_completed
-   && SSE_REGNO_P (REGNO (operands[0]))"
+  "(TARGET_SSSE3 && reload_completed
+    && SSE_REGNO_P (REGNO (operands[0])))
+   || operands[3] == const0_rtx
+   || INTVAL (operands[3]) == 64"
   [(set (match_dup 0)
 	(lshiftrt:V1TI (match_dup 0) (match_dup 3)))]
 {
+  if (operands[3] == const0_rtx)
+    {
+      if (!rtx_equal_p (operands[0], operands[2]))
+	emit_move_insn (operands[0], operands[2]);
+      else
+	emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+  else if (INTVAL (operands[3]) == 64)
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+	emit_move_insn (operands[0], operands[1]);
+      else
+	emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+
   /* Emulate MMX palignrdi with SSE psrldq.  */
   rtx op0 = lowpart_subreg (V2DImode, operands[0],
 			    GET_MODE (operands[0]));
diff --git a/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
new file mode 100644
index 0000000..791222d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mssse3" } */
+
+typedef long long __attribute__ ((__vector_size__ (8))) T;
+
+T x;
+T y;
+T z;
+
+void foo()
+{
+  z = __builtin_ia32_palignr (x, y, 0);
+}
+
+void bar()
+{
+  z = __builtin_ia32_palignr (x, y, 64);
+}
+/* { dg-final { scan-assembler-not "punpcklqdq" } } */
+/* { dg-final { scan-assembler-not "pshufd" } } */
+/* { dg-final { scan-assembler-not "psrldq" } } */
  
Hongtao Liu July 5, 2022, 12:30 a.m. UTC | #4
On Tue, Jul 5, 2022 at 1:48 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Hongtao,
> Many thanks for your review.  This revised patch implements your
> suggestions of removing the combine splitters, and instead reusing
> the functionality of the ssse3_palignrdi define_insn_and split.
>
> This revised patch has been tested on x86_64-pc-linux-gnu with make
> bootstrap and make -k check, both with and with --target_board=unix{-32},
> with no new failures.  Is this revised version Ok for mainline?
Ok.
>
>
> 2022-07-04  Roger Sayle  <roger@nextmovesoftware.com>
>             Hongtao Liu  <hongtao.liu@intel.com>
>
> gcc/ChangeLog
>         * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
>         CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
>         * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
>         and gen_ssse3_palignv1ti instead of TImode.
>         * config/i386/sse.md (SSESCALARMODE): Delete.
>         (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
>         (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
>         iterator instead of SSESCALARMODE.
>
>         (ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
>         using a single move instruction (if required).
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/ssse3-palignr-2.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
> > -----Original Message-----
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: 01 July 2022 03:40
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
> >
> > On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > >
> > > >
> > > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > > That fix is perfectly correct, but the thing that caught my eye was
> > > > why is the compiler generating a shift by zero at all.  Digging
> > > > deeper it turns out that we can easily optimize
> > > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > > which may be simplified to moves from the highpart or lowpart.
> > > >
> > > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > > started to add the corresponding optimizations for vpalignr (i.e.
> > > > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > > > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > > > using V1TImode instead of TImode for 128-bit palignr.
> > > >
> > > > But it was only after I'd implemented this clean-up that I stumbled
> > > > across the strange semantics of 128-bit [v]palignr.  According to
> > > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > > different based upon how the instruction is encoded.  PALIGNR leaves
> > > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > > the exact same RTL/templates for both, treating one as an
> > > > alternative for the other.
> > > I think as long as patterns or intrinsics only care about the low
> > > part, they should be ok.
> > > But if we want to use default behavior for upper bits, we need to
> > > restrict them under specific isa(.i.e. vmovq in vec_set<mode>_0).
> > > Generally, 128-bit sse legacy instructions have different behaviors
> > > for upper bits from AVX ones, and that's why vzeroupper is introduced
> > > for sse <-> avx instructions transition.
> > > >
> > > > Hence I thought I'd post what I have so far (part optimization and
> > > > part clean-up), to then ask the x86 experts for their opinions.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-,32}, with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > > >         CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > > >         * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use
> > V1TImode
> > > >         and gen_ssse3_palignv1ti instead of TImode.
> > > >         * config/i386/sse.md (SSESCALARMODE): Delete.
> > > >         (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > > >         (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a
> > mode
> > > >         iterator instead of SSESCALARMODE.
> > > >
> > > >         (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > > >         using a single move instruction (if required).
> > > >         (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > > >         (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >         * gcc.target/i386/ssse3-palignr-2.c: New test case.
> > > >
> > > >
> > > > Thanks in advance,
> > > > Roger
> > > > --
> > > >
> > >
> > > +(define_split
> > > +  [(set (match_operand:DI 0 "register_operand")  (unspec:DI
> > > +[(match_operand:DI 1 "register_operand")
> > > +    (match_operand:DI 2 "register_mmxmem_operand")
> > > +    (const_int 0)]
> > > +   UNSPEC_PALIGNR))]
> > > +  ""
> > > +  [(set (match_dup 0) (match_dup 2))])
> > > +
> > > +(define_split
> > > +  [(set (match_operand:DI 0 "register_operand")  (unspec:DI
> > > +[(match_operand:DI 1 "register_operand")
> > > +    (match_operand:DI 2 "register_mmxmem_operand")
> > > +    (const_int 64)]
> > > +   UNSPEC_PALIGNR))]
> > > +  ""
> > > +  [(set (match_dup 0) (match_dup 1))])
> > > +
> > > define_split is assumed to be splitted to 2(or more) insns, hence
> > > pass_combine will only try define_split if the number of merged insns
> > > is greater than 2.
> > > For palignr, i think most time there would be only 2 merged
> > > insns(constant propagation), so better to change them as pre_reload
> > > splitter.
> > > (.i.e. (define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
> > I think you can just merge 2 define_split into define_insn_and_split
> > "ssse3_palignrdi" by relaxing split condition as
> >
> > -  "TARGET_SSSE3 && reload_completed
> > -   && SSE_REGNO_P (REGNO (operands[0]))"
> > +  "(TARGET_SSSE3 && reload_completed
> > +   && SSE_REGNO_P (REGNO (operands[0])))
> > +   || INVAL(operands[3]) == 0
> > +   || INVAL(operands[3]) == 64"
> >
> > and you have already handled them by
> >
> > +  if (operands[3] == const0_rtx)
> > +    {
> > +      if (!rtx_equal_p (operands[0], operands[2])) emit_move_insn
> > + (operands[0], operands[2]);
> > +      else
> > + emit_note (NOTE_INSN_DELETED);
> > +      DONE;
> > +    }
> > +  else if (INTVAL (operands[3]) == 64)
> > +    {
> > +      if (!rtx_equal_p (operands[0], operands[1])) emit_move_insn
> > + (operands[0], operands[1]);
> > +      else
> > + emit_note (NOTE_INSN_DELETED);
> > +      DONE;
> > +    }
> > +
> >
> > >
> >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> > --
> > BR,
> > Hongtao
  

Patch

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index e6daad4..fd16093 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -900,7 +900,7 @@  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_psignv4si3, "__builtin_ia32_psig
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_psignv2si3, "__builtin_ia32_psignd", IX86_BUILTIN_PSIGND, UNKNOWN, (int) V2SI_FTYPE_V2SI_V2SI)
 
 /* SSSE3.  */
-BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
+BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrv1ti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_palignrdi, "__builtin_ia32_palignr", IX86_BUILTIN_PALIGNR, UNKNOWN, (int) V1DI_FTYPE_V1DI_V1DI_INT_CONVERT)
 
 /* SSE4.1 */
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 8bc5430..6a3fcde 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -19548,9 +19548,11 @@  expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p)
   shift = GEN_INT (min * GET_MODE_UNIT_BITSIZE (d->vmode));
   if (GET_MODE_SIZE (d->vmode) == 16)
     {
-      target = gen_reg_rtx (TImode);
-      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, dcopy.op1),
-				      gen_lowpart (TImode, dcopy.op0), shift));
+      target = gen_reg_rtx (V1TImode);
+      emit_insn (gen_ssse3_palignrv1ti (target,
+					gen_lowpart (V1TImode, dcopy.op1),
+					gen_lowpart (V1TImode, dcopy.op0),
+					shift));
     }
   else
     {
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8cd0f61..974deca 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -575,10 +575,6 @@ 
 (define_mode_iterator VIMAX_AVX2
   [(V2TI "TARGET_AVX2") V1TI])
 
-;; ??? This should probably be dropped in favor of VIMAX_AVX2_AVX512BW.
-(define_mode_iterator SSESCALARMODE
-  [(V4TI "TARGET_AVX512BW") (V2TI "TARGET_AVX2") TI])
-
 (define_mode_iterator VI12_AVX2
   [(V32QI "TARGET_AVX2") V16QI
    (V16HI "TARGET_AVX2") V8HI])
@@ -712,7 +708,7 @@ 
     (V4HI "ssse3") (V8HI "ssse3") (V16HI "avx2") (V32HI "avx512bw")
     (V4SI "ssse3") (V8SI "avx2")
     (V2DI "ssse3") (V4DI "avx2")
-    (TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
+    (V1TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
 
 (define_mode_attr sse4_1_avx2
    [(V16QI "sse4_1") (V32QI "avx2") (V64QI "avx512bw")
@@ -21092,10 +21088,10 @@ 
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "<ssse3_avx2>_palignr<mode>"
-  [(set (match_operand:SSESCALARMODE 0 "register_operand" "=x,<v_Yw>")
-	(unspec:SSESCALARMODE
-	  [(match_operand:SSESCALARMODE 1 "register_operand" "0,<v_Yw>")
-	   (match_operand:SSESCALARMODE 2 "vector_operand" "xBm,<v_Yw>m")
+  [(set (match_operand:VIMAX_AVX2_AVX512BW 0 "register_operand" "=x,<v_Yw>")
+	(unspec:VIMAX_AVX2_AVX512BW
+	  [(match_operand:VIMAX_AVX2_AVX512BW 1 "register_operand" "0,<v_Yw>")
+	   (match_operand:VIMAX_AVX2_AVX512BW 2 "vector_operand" "xBm,<v_Yw>m")
 	   (match_operand:SI 3 "const_0_to_255_mul_8_operand")]
 	  UNSPEC_PALIGNR))]
   "TARGET_SSSE3"
@@ -21146,6 +21142,23 @@ 
   [(set (match_dup 0)
 	(lshiftrt:V1TI (match_dup 0) (match_dup 3)))]
 {
+  if (operands[3] == const0_rtx)
+    {
+      if (!rtx_equal_p (operands[0], operands[2]))
+	emit_move_insn (operands[0], operands[2]);
+      else
+	emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+  else if (INTVAL (operands[3]) == 64)
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+	emit_move_insn (operands[0], operands[1]);
+      else
+	emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+
   /* Emulate MMX palignrdi with SSE psrldq.  */
   rtx op0 = lowpart_subreg (V2DImode, operands[0],
 			    GET_MODE (operands[0]));
@@ -21175,6 +21188,24 @@ 
    (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
    (set_attr "mode" "DI,TI,TI")])
 
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+	(unspec:DI [(match_operand:DI 1 "register_operand")
+		    (match_operand:DI 2 "register_mmxmem_operand")
+		    (const_int 0)]
+		   UNSPEC_PALIGNR))]
+  ""
+  [(set (match_dup 0) (match_dup 2))])
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+	(unspec:DI [(match_operand:DI 1 "register_operand")
+		    (match_operand:DI 2 "register_mmxmem_operand")
+		    (const_int 64)]
+		   UNSPEC_PALIGNR))]
+  ""
+  [(set (match_dup 0) (match_dup 1))])
+
 ;; Mode iterator to handle singularity w/ absence of V2DI and V4DI
 ;; modes for abs instruction on pre AVX-512 targets.
 (define_mode_iterator VI1248_AVX512VL_AVX512BW
diff --git a/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
new file mode 100644
index 0000000..791222d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mssse3" } */
+
+typedef long long __attribute__ ((__vector_size__ (8))) T;
+
+T x;
+T y;
+T z;
+
+void foo()
+{
+  z = __builtin_ia32_palignr (x, y, 0);
+}
+
+void bar()
+{
+  z = __builtin_ia32_palignr (x, y, 64);
+}
+/* { dg-final { scan-assembler-not "punpcklqdq" } } */
+/* { dg-final { scan-assembler-not "pshufd" } } */
+/* { dg-final { scan-assembler-not "psrldq" } } */