[x86_64] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

Message ID 000901d8938d$ead4dc40$c07e94c0$@nextmovesoftware.com
State New
Headers
Series [x86_64] Improved Scalar-To-Vector (STV) support for TImode to V1TImode. |

Commit Message

Roger Sayle July 9, 2022, 12:17 p.m. UTC
  This patch upgrades x86_64's scalar-to-vector (STV) pass to more
aggressively transform 128-bit scalar TImode operations into vector
V1TImode operations performed on SSE registers.  TImode functionality
already exists in STV, but only for move operations; this change
brings support for logical operations (AND, IOR, XOR, NOT and ANDN)
and comparisons.

The effect of these changes is conveniently demonstrated by the new
sse4_1-stv-5.c test case:

__int128 a[16];
__int128 b[16];
__int128 c[16];

void foo()
{
  for (unsigned int i=0; i<16; i++)
    a[i] = b[i] & ~c[i];
}

which when currently compiled on mainline with -O2 -msse4 produces:

foo:    xorl    %eax, %eax
.L2:    movq    c(%rax), %rsi
        movq    c+8(%rax), %rdi
        addq    $16, %rax
        notq    %rsi
        notq    %rdi
        andq    b-16(%rax), %rsi
        andq    b-8(%rax), %rdi
        movq    %rsi, a-16(%rax)
        movq    %rdi, a-8(%rax)
        cmpq    $256, %rax
        jne     .L2
        ret

but with this patch now produces:

foo:    xorl    %eax, %eax
.L2:    movdqa  c(%rax), %xmm0
        pandn   b(%rax), %xmm0
        addq    $16, %rax
        movaps  %xmm0, a-16(%rax)
        cmpq    $256, %rax
        jne     .L2
        ret

Technically, the STV pass is implemented by three C++ classes, a common
abstract base class "scalar_chain" that contains common functionality,
and two derived classes: general_scalar_chain (which handles SI and
DI modes) and timode_scalar_chain (which handles TI modes).  As
mentioned previously, because only TImode moves were handled, the
two worker classes behaved significantly differently.  These changes
bring the functionality of these two classes closer together, which
is reflected by refactoring more shared code from general_scalar_chain
to the parent scalar_chain and reusing it from timode.  There still
remain significant differences (and simplifications) so the existing
division of classes (as specializations) continues to make sense.

Obviously, there are more changes to come (shifts and rotates),
and compute_convert_gain doesn't yet have its final (tuned) form,
but is already an improvement over the "return 1;" used previously.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2022-07-09  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-features.h (scalar_chain): Add fields
        insns_conv, n_sse_to_integer and n_integer_to_sse to this
        parent class, moved from general_scalar_chain.
        (scalar_chain::convert_compare): Protected method moved
        from general_scalar_chain.
        (mark_dual_mode_def): Make protected, not private virtual.
        (scalar_chain::convert_op): New private virtual method.

        (general_scalar_chain::general_scalar_chain): Simplify constructor.
        (general_scalar_chain::~general_scalar_chain): Delete destructor.
        (general_scalar_chain): Move insns_conv, n_sse_to_integer and
        n_integer_to_sse fields to parent class, scalar_chain.
        (general_scalar_chain::mark_dual_mode_def): Delete prototype.
        (general_scalar_chain::convert_compare): Delete prototype.

        (timode_scalar_chain::compute_convert_gain): Remove simplistic
        implementation, convert to a method prototype.
        (timode_scalar_chain::mark_dual_mode_def): Delete prototype.
        (timode_scalar_chain::convert_op): Prototype new virtual method.

        * config/i386/i386-features.cc (scalar_chain::scalar_chain):
        Allocate insns_conv and initialize n_sse_to_integer and
        n_integer_to_sse fields in constructor.
        (scalar_chain::~scalar_chain): Free insns_conv in destructor.

        (general_scalar_chain::general_scalar_chain): Delete
        constructor, now defined in the class declaration.
        (general_scalar_chain::~general_scalar_chain): Delete destructor.

        (scalar_chain::mark_dual_mode_def): Renamed from
        general_scalar_chain::mark_dual_mode_def.
        (timode_scalar_chain::mark_dual_mode_def): Delete.
        (scalar_chain::convert_compare): Renamed from
        general_scalar_chain::convert_compare.

        (timode_scalar_chain::compute_convert_gain): New method to
        determine the gain from converting a TImode chain to V1TImode.
        (timode_scalar_chain::convert_op): New method to convert an
        operand from TImode to V1TImode.

        (timode_scalar_chain::convert_insn) <case REG>: Only PUT_MODE
        on REG_EQUAL notes that were originally TImode (not CONST_INT).
        Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
        (timode_mem_p): Helper predicate to check whether an operand is
        a memory reference with sufficient alignment for TImode STV.
        (timode_scalar_to_vector_candidate_p): Use convertible_comparison_p
        to check whether COMPARE is convertible.  Handle SET_DESTs that
        are REG_P or MEM_P and SET_SRCs that are REG, CONST_INT,
        CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT.

gcc/testsuite/ChangeLog
        * gcc.target/i386/sse4_1-stv-2.c: New test case, pand.
        * gcc.target/i386/sse4_1-stv-3.c: New test case, por.
        * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor.
        * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn.
        * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest.

Roger
--
  

Comments

Uros Bizjak July 10, 2022, 6:05 p.m. UTC | #1

Please note that there are in fact two STV passes, one before combine
and the other after combine.  The TImode pass that previously handled
only loads and stores is positioned before combine (there was a reason
for this decision, but I don't remember the details - let's ask
HJ...).  However, the DImode STV pass transforms many more
instructions, and the reason it was positioned after the combine pass
is that it then operates on an optimized insn stream in which forward
propagation has already been performed.

What is not clear to me from the above explanation is: is the new
TImode STV pass positioned after the combine pass, and if so, how does
the change affect the current load/store TImode STV pass?  I must
admit, I don't like two separate STV passes, so if TImode is now
similar to DImode, I suggest we abandon the STV1 pass and do
everything concerning TImode after the combine pass.  HJ, what is your
opinion on this?

Other than the above, the patch LGTM.

Uros.

Roger Sayle July 10, 2022, 6:36 p.m. UTC | #2
Hi Uros,
Yes, I agree.  I think it makes sense to have a single STV pass (after
combine and before reload).  Let's hear what HJ thinks, but I'm
happy to investigate a follow-up patch that unifies the STV passes.
But it'll be easier to confirm there are no "code generation" changes
if those modifications are pushed independently of these ones.
Time to look into the (git) history of multiple STV passes...

Thanks for the review.  I'll wait for HJ's thoughts.
Cheers,
Roger
--

H.J. Lu July 10, 2022, 7:15 p.m. UTC | #3

The TImode STV pass is run before the CSE pass so that
instructions changed or generated by the STV pass can be CSEed.

Roger Sayle July 10, 2022, 9:38 p.m. UTC | #4
Hi HJ,

I believe this should now be handled by the post-reload (CSE) pass.
Consider the simple test case:

__int128 a, b, c;
void foo()
{
  a = 0;
  b = 0;
  c = 0;
}

Without any STV, i.e. -O2 -msse4 -mno-stv, GCC lowers the TImode writes
to pairs of DImode stores:
        movq    $0, a(%rip)
        movq    $0, a+8(%rip)
        movq    $0, b(%rip)
        movq    $0, b+8(%rip)
        movq    $0, c(%rip)
        movq    $0, c+8(%rip)
        ret

But with STV, i.e. -O2 -msse4, things get converted to V1TImode:
        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        movaps  %xmm0, b(%rip)
        movaps  %xmm0, c(%rip)
        ret

You're quite right that internally the STV pass actually generates the equivalent of:
        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        ret

And currently, because STV runs before cse2 and combine, the const0_rtx
gets CSE'd by the cse2 pass to produce the code we see.  However, if you
specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we
continue to generate the same optimized code, as the same const0_rtx
gets CSE'd in postreload.

I can't be certain until I try the experiment, but I believe that the postreload
CSE will clean up all of the same common subexpressions.  Hence, it should
be safe to perform all STV at the same point (after combine), which allows
for a few additional optimizations.

Does this make sense?  Do you have a test case where -fno-rerun-cse-after-loop
produces different/inferior code for TImode STV chains?

My guess is that the RTL passes have changed so much in the last six or
seven years that some of the original motivation no longer applies.
Certainly we now try to keep TImode operations visible longer, and
then allow STV to behave like a pre-reload pass to decide which set of
registers to use (vector V1TI or scalar doubleword DI).  Any CSE
opportunities that cse2 finds with V1TImode could/should equally well
be found for TImode (mostly).

Cheers,
Roger
--

> > > >         cmpq    $256, %rax
> > > >         jne     .L2
> > > >         ret
> > > >
> > > > Technically, the STV pass is implemented by three C++ classes, a
> > > > common abstract base class "scalar_chain" that contains common
> > > > functionality, and two derived classes: general_scalar_chain
> > > > (which handles SI and DI modes) and timode_scalar_chain (which
> > > > handles TI modes).  As mentioned previously, because only TI mode
> > > > moves were handled the two worker classes behaved significantly
> differently.
> > > > These changes bring the functionality of these two classes closer
> > > > together, which is reflected by refactoring more shared code from
> > > > general_scalar_chain to the parent scalar_chain and reusing it
> > > > from timode.  There still remain significant differences (and
> > > > simplifications) so the existing division of classes (as
> > > > specializations) continues
> > > to make sense.
> > >
> > > Please note that there are in fact two STV passes, one before
> > > combine and the other after combine. The TImode pass that previously
> > > handled only loads and stores is positioned before combine (there
> > > was a reason for this decision, but I don't remember the details -
> > > let's ask HJ...). However, DImode STV pass transforms much more
> > > instructions and the reason it was positioned after the combine pass
> > > was that STV pass transforms optimized insn stream where forward
> propagation was already performed.
> > >
> > > What is not clear to me from the above explanation is: is the new
> > > TImode STV pass positioned after the combine pass, and if this is
> > > the case, how the change affects current load/store TImode STV pass.
> > > I must admit, I don't like two separate STV passess, so if TImode is
> > > now similar to DImode, I suggest we abandon STV1 pass and do
> > > everything concerning TImode after the combine pass. HJ, what is your
> opinion on this?
> > >
> > > Other than the above, the patch LGTM to me.
> > >
> > > Uros.
> > >
> > > > Obviously, there are more changes to come (shifts and rotates),
> > > > and compute_convert_gain doesn't yet have its final (tuned) form,
> > > > but is already an improvement over the "return 1;" used previously.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > boostrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2022-07-09  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         * config/i386/i386-features.h (scalar_chain): Add fields
> > > >         insns_conv, n_sse_to_integer and n_integer_to_sse to this
> > > >         parent class, moved from general_scalar_chain.
> > > >         (scalar_chain::convert_compare): Protected method moved
> > > >         from general_scalar_chain.
> > > >         (mark_dual_mode_def): Make protected, not private virtual.
> > > >         (scalar_chain:convert_op): New private virtual method.
> > > >
> > > >         (general_scalar_chain::general_scalar_chain): Simplify constructor.
> > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > >         (general_scalar_chain): Move insns_conv, n_sse_to_integer and
> > > >         n_integer_to_sse fields to parent class, scalar_chain.
> > > >         (general_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > >         (general_scalar_chain::convert_compare): Delete prototype.
> > > >
> > > >         (timode_scalar_chain::compute_convert_gain): Remove simplistic
> > > >         implementation, convert to a method prototype.
> > > >         (timode_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > >         (timode_scalar_chain::convert_op): Prototype new virtual method.
> > > >
> > > >         * config/i386/i386-features.cc (scalar_chain::scalar_chain):
> > > >         Allocate insns_conv and initialize n_sse_to_integer and
> > > >         n_integer_to_sse fields in constructor.
> > > >         (scalar_chain::scalar_chain): Free insns_conv in destructor.
> > > >
> > > >         (general_scalar_chain::general_scalar_chain): Delete
> > > >         constructor, now defined in the class declaration.
> > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > >
> > > >         (scalar_chain::mark_dual_mode_def): Renamed from
> > > >         general_scalar_chain::mark_dual_mode_def.
> > > >         (timode_scalar_chain::mark_dual_mode_def): Delete.
> > > >         (scalar_chain::convert_compare): Renamed from
> > > >         general_scalar_chain::convert_compare.
> > > >
> > > >         (timode_scalar_chain::compute_convert_gain): New method to
> > > >         determine the gain from converting a TImode chain to V1TImode.
> > > >         (timode_scalar_chain::convert_op): New method to convert an
> > > >         operand from TImode to V1TImode.
> > > >
> > > >         (timode_scalar_chain::convert_insn) <case REG>: Only PUT_MODE
> > > >         on REG_EQUAL notes that were originally TImode (not CONST_INT).
> > > >         Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
> > > >         (timode_mem_p): Helper predicate to check where operand is
> > > >         memory reference with sufficient alignment for TImode STV.
> > > >         (timode_scalar_to_vector_candidate_p): Use
> convertible_comparison_p
> > > >         to check whether COMPARE is convertible.  Handle SET_DESTs that
> > > >         that are REG_P or MEM_P and SET_SRCs that are REG, CONST_INT,
> > > >         CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >         * gcc.target/i386/sse4_1-stv-2.c: New test case, pand.
> > > >         * gcc.target/i386/sse4_1-stv-3.c: New test case, por.
> > > >         * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor.
> > > >         * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn.
> > > >         * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest.
> > > >
> > > > Roger
> > > > --
> > > >
> >
> 
> 
> --
> H.J.
  
H.J. Lu July 10, 2022, 11:56 p.m. UTC | #5
On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi HJ,
>
> I believe this should now be handled by the post-reload (CSE) pass.
> Consider the simple test case:
>
> __int128 a, b, c;
> void foo()
> {
>   a = 0;
>   b = 0;
>   c = 0;
> }
>
> Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
>         movq    $0, a(%rip)
>         movq    $0, a+8(%rip)
>         movq    $0, b(%rip)
>         movq    $0, b+8(%rip)
>         movq    $0, c(%rip)
>         movq    $0, c+8(%rip)
>         ret
>
> But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         movaps  %xmm0, b(%rip)
>         movaps  %xmm0, c(%rip)
>         ret
>
> You're quite right internally the STV actually generates the equivalent of:
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         ret
>
> And currently because STV run before cse2 and combine, the const0_rtx
> gets CSE'd be the cse2 pass to produce the code we see.  However, if you
> specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we
> continue to generate the same optimized code, as the same const0_rtx
> gets CSE'd in postreload.
>
> I can't be certain until I try the experiment, but I believe that the postreload
> CSE will clean-up, all of the same common subexpressions.  Hence, it should
> be safe to perform all STV at the same point (after combine), which for a few
> additional optimizations.
>
> Does this make sense?  Do you have a test case, -fno-rerun-cse-after-loop
> produces different/inferior code for TImode STV chains?
>
> My guess is that the RTL passes have changed so much in the last six or
> seven years, that some of the original motivation no longer applies.
> Certainly we now try to keep TI mode operations visible longer, and
> then allow STV to behave like a pre-reload pass to decide which set of
> registers to use (vector V1TI or scalar doubleword DI).  Any CSE opportunities
> that cse2 finds with V1TI mode, could/should equally well be found for
> TI mode (mostly).

You are probably right.  If there are no regressions in GCC testsuite,
my original motivation is no longer valid.

Thanks.

> Cheers,
> Roger
> --
>
> > -----Original Message-----
> > From: H.J. Lu <hjl.tools@gmail.com>
> > Sent: 10 July 2022 20:15
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: Uros Bizjak <ubizjak@gmail.com>; GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for
> > TImode to V1TImode.
> >
> > On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > Hi Uros,
> > > Yes, I agree.  I think it makes sense to have a single STV pass (after
> > > combine and before reload).  Let's hear what HJ thinks, but I'm happy
> > > to investigate a follow-up patch that unifies the STV passes.
> > > But it'll be easier to confirm there are no "code generation" changes
> > > if those modifications are pushed independently of these ones.
> > > Time to look into the (git) history of multiple STV passes...
> > >
> > > Thanks for the review.  I'll wait for HJ's thoughts.
> >
> > The TImode STV pass is run before the CSE pass so that instructions changed or
> > generated by the STV pass can be CSEed.
> >
> > > Cheers,
> > > Roger
> > > --
> > >
> > > > -----Original Message-----
> > > > From: Uros Bizjak <ubizjak@gmail.com>
> > > > Sent: 10 July 2022 19:06
> > > > To: Roger Sayle <roger@nextmovesoftware.com>
> > > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu <hjl.tools@gmail.com>
> > > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support
> > > > for TImode to V1TImode.
> > > >
> > > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle
> > > > <roger@nextmovesoftware.com>
> > > > wrote:
> > > > >
> > > > >
> > > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more
> > > > > aggressively transform 128-bit scalar TImode operations into
> > > > > vector V1TImode operations performed on SSE registers.  TImode
> > > > > functionality already exists in STV, but only for move operations,
> > > > > this changes brings support for logical operations (AND, IOR, XOR,
> > > > > NOT and ANDN) and comparisons.
> > > > >
> > > > > The effect of these changes are conveniently demonstrated by the
> > > > > new sse4_1-stv-5.c test case:
> > > > >
> > > > > __int128 a[16];
> > > > > __int128 b[16];
> > > > > __int128 c[16];
> > > > >
> > > > > void foo()
> > > > > {
> > > > >   for (unsigned int i=0; i<16; i++)
> > > > >     a[i] = b[i] & ~c[i];
> > > > > }
> > > > >
> > > > > which when currently compiled on mainline wtih -O2 -msse4 produces:
> > > > >
> > > > > foo:    xorl    %eax, %eax
> > > > > .L2:    movq    c(%rax), %rsi
> > > > >         movq    c+8(%rax), %rdi
> > > > >         addq    $16, %rax
> > > > >         notq    %rsi
> > > > >         notq    %rdi
> > > > >         andq    b-16(%rax), %rsi
> > > > >         andq    b-8(%rax), %rdi
> > > > >         movq    %rsi, a-16(%rax)
> > > > >         movq    %rdi, a-8(%rax)
> > > > >         cmpq    $256, %rax
> > > > >         jne     .L2
> > > > >         ret
> > > > >
> > > > > but with this patch now produces:
> > > > >
> > > > > foo:    xorl    %eax, %eax
> > > > > .L2:    movdqa  c(%rax), %xmm0
> > > > >         pandn   b(%rax), %xmm0
> > > > >         addq    $16, %rax
> > > > >         movaps  %xmm0, a-16(%rax)
> > > > >         cmpq    $256, %rax
> > > > >         jne     .L2
> > > > >         ret
> > > > >
> > > > > Technically, the STV pass is implemented by three C++ classes, a
> > > > > common abstract base class "scalar_chain" that contains common
> > > > > functionality, and two derived classes: general_scalar_chain
> > > > > (which handles SI and DI modes) and timode_scalar_chain (which
> > > > > handles TI modes).  As mentioned previously, because only TI mode
> > > > > moves were handled the two worker classes behaved significantly
> > differently.
> > > > > These changes bring the functionality of these two classes closer
> > > > > together, which is reflected by refactoring more shared code from
> > > > > general_scalar_chain to the parent scalar_chain and reusing it
> > > > > from timode.  There still remain significant differences (and
> > > > > simplifications) so the existing division of classes (as
> > > > > specializations) continues
> > > > to make sense.
> > > >
> > > > Please note that there are in fact two STV passes, one before
> > > > combine and the other after combine. The TImode pass that previously
> > > > handled only loads and stores is positioned before combine (there
> > > > was a reason for this decision, but I don't remember the details -
> > > > let's ask HJ...). However, DImode STV pass transforms much more
> > > > instructions and the reason it was positioned after the combine pass
> > > > was that STV pass transforms optimized insn stream where forward
> > propagation was already performed.
> > > >
> > > > What is not clear to me from the above explanation is: is the new
> > > > TImode STV pass positioned after the combine pass, and if this is
> > > > the case, how the change affects current load/store TImode STV pass.
> > > > I must admit, I don't like two separate STV passess, so if TImode is
> > > > now similar to DImode, I suggest we abandon STV1 pass and do
> > > > everything concerning TImode after the combine pass. HJ, what is your
> > opinion on this?
> > > >
> > > > Other than the above, the patch LGTM to me.
> > > >
> > > > Uros.
> > > >
> > > > > Obviously, there are more changes to come (shifts and rotates),
> > > > > and compute_convert_gain doesn't yet have its final (tuned) form,
> > > > > but is already an improvement over the "return 1;" used previously.
> > > > >
> > > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > > boostrap and make -k check, both with and without
> > > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > > >
> > > > >
> > > > > 2022-07-09  Roger Sayle  <roger@nextmovesoftware.com>
> > > > >
> > > > > gcc/ChangeLog
> > > > >         * config/i386/i386-features.h (scalar_chain): Add fields
> > > > >         insns_conv, n_sse_to_integer and n_integer_to_sse to this
> > > > >         parent class, moved from general_scalar_chain.
> > > > >         (scalar_chain::convert_compare): Protected method moved
> > > > >         from general_scalar_chain.
> > > > >         (mark_dual_mode_def): Make protected, not private virtual.
> > > > >         (scalar_chain:convert_op): New private virtual method.
> > > > >
> > > > >         (general_scalar_chain::general_scalar_chain): Simplify constructor.
> > > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > > >         (general_scalar_chain): Move insns_conv, n_sse_to_integer and
> > > > >         n_integer_to_sse fields to parent class, scalar_chain.
> > > > >         (general_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > > >         (general_scalar_chain::convert_compare): Delete prototype.
> > > > >
> > > > >         (timode_scalar_chain::compute_convert_gain): Remove simplistic
> > > > >         implementation, convert to a method prototype.
> > > > >         (timode_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > > >         (timode_scalar_chain::convert_op): Prototype new virtual method.
> > > > >
> > > > >         * config/i386/i386-features.cc (scalar_chain::scalar_chain):
> > > > >         Allocate insns_conv and initialize n_sse_to_integer and
> > > > >         n_integer_to_sse fields in constructor.
> > > > >         (scalar_chain::scalar_chain): Free insns_conv in destructor.
> > > > >
> > > > >         (general_scalar_chain::general_scalar_chain): Delete
> > > > >         constructor, now defined in the class declaration.
> > > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > > >
> > > > >         (scalar_chain::mark_dual_mode_def): Renamed from
> > > > >         general_scalar_chain::mark_dual_mode_def.
> > > > >         (timode_scalar_chain::mark_dual_mode_def): Delete.
> > > > >         (scalar_chain::convert_compare): Renamed from
> > > > >         general_scalar_chain::convert_compare.
> > > > >
> > > > >         (timode_scalar_chain::compute_convert_gain): New method to
> > > > >         determine the gain from converting a TImode chain to V1TImode.
> > > > >         (timode_scalar_chain::convert_op): New method to convert an
> > > > >         operand from TImode to V1TImode.
> > > > >
> > > > >         (timode_scalar_chain::convert_insn) <case REG>: Only PUT_MODE
> > > > >         on REG_EQUAL notes that were originally TImode (not CONST_INT).
> > > > >         Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
> > > > >         (timode_mem_p): Helper predicate to check where operand is
> > > > >         memory reference with sufficient alignment for TImode STV.
> > > > >         (timode_scalar_to_vector_candidate_p): Use
> > convertible_comparison_p
> > > > >         to check whether COMPARE is convertible.  Handle SET_DESTs that
> > > > >         that are REG_P or MEM_P and SET_SRCs that are REG, CONST_INT,
> > > > >         CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT.
> > > > >
> > > > > gcc/testsuite/ChangeLog
> > > > >         * gcc.target/i386/sse4_1-stv-2.c: New test case, pand.
> > > > >         * gcc.target/i386/sse4_1-stv-3.c: New test case, por.
> > > > >         * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor.
> > > > >         * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn.
> > > > >         * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest.
> > > > >
> > > > > Roger
> > > > > --
> > > > >
> > >
> >
> >
> > --
> > H.J.
>
  
Uros Bizjak July 11, 2022, 6:41 a.m. UTC | #6
On Sun, Jul 10, 2022 at 8:36 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Yes, I agree.  I think it makes sense to have a single STV pass (after
> combine and before reload).  Let's hear what HJ thinks, but I'm
> happy to investigate a follow-up patch that unifies the STV passes.
> But it'll be easier to confirm there are no "code generation" changes
> if those modifications are pushed independently of these ones.
> Time to look into the (git) history of multiple STV passes...

OK, let's go forward with the proposed patch.

Thanks,
Uros.
  
Roger Sayle July 14, 2022, 5:31 a.m. UTC | #7
On Mon, Jul 11, 2022, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
> >         movq    $0, a(%rip)
> >         movq    $0, a+8(%rip)
> >         movq    $0, b(%rip)
> >         movq    $0, b+8(%rip)
> >         movq    $0, c(%rip)
> >         movq    $0, c+8(%rip)
> >         ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         movaps  %xmm0, b(%rip)
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > You're quite right internally the STV actually generates the equivalent of:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, b(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > And currently because STV run before cse2 and combine, the const0_rtx
> > gets CSE'd be the cse2 pass to produce the code we see.  However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that the
> > postreload CSE will clean-up, all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which for a few additional optimizations.
> >
> > Does this make sense?  Do you have a test case,
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years, that some of the original motivation no longer applies.
> > Certainly we now try to keep TI mode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set of
> > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > opportunities that cse2 finds with V1TI mode, could/should equally
> > well be found for TI mode (mostly).
> 
> You are probably right.  If there are no regressions in GCC testsuite, my original
> motivation is no longer valid.

It was good to try the experiment, but H.J. is right: there are still some benefits
(as well as some disadvantages) to running STV lowering before CSE2/combine.
A clean-up patch to perform all STV conversion as a single pass (removing a
pass from the compiler) results in just a single regression in the test suite:
FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

By performing STV after combine (without CSE), reload prefers to implement
this function using a single register, which then requires 12 instructions rather
than 8 (if using two registers).  Alas, there's nothing that postreload CSE/GCSE
can do.  Doh!

        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, d(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, e(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, f(%rip)
        ret

I also note that even without STV, the scalar implementation of this function,
when compiled with -Os, is also larger than it needs to be due to poor CSE (notice in the
following we only need a single zero register, and an all-ones register would be helpful).

        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret
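
At the source level, the missed opportunity looks like the following sketch (hypothetical function and variable names, not part of the patch): materialise one zero and one all-ones __int128 and reuse them for every store, so only two registers' worth of constants are needed, which is the form the single-zero-register / all-ones-register remark above is asking CSE to discover:

```c
/* Hypothetical sketch (not from the patch): one zero and one all-ones
   __int128 value are created once and reused for all six stores, the
   CSE opportunity missed in both listings in the surrounding message.  */
__int128 ga, gb, gc, gd, ge, gf;

void
bar (void)
{
  __int128 zero = 0;
  __int128 ones = ~zero;   /* all-ones, i.e. (__int128) -1 */
  ga = zero;
  gb = ones;
  gc = zero;
  gd = ones;
  ge = zero;
  gf = ones;
}
```

With ideal CSE this would need just one pxor and one pcmpeqd (vector), or two GPRs (scalar -Os), instead of rematerialising each constant per store.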

I need to give the problem some more thought.  It would be good to clean up/unify
the STV passes, but I/we need to solve/CSE H.J.'s last test case before we do.  Perhaps
forbidding "(set (mem:ti) (const_int 0))" in movti_internal would force the zero
register to become visible, and be CSE'd, benefiting both vector code and scalar -Os code;
postreload/peephole2 could then fix up the remaining scalar cases.  It's tricky.

Cheers,
Roger
--
  
Richard Biener July 14, 2022, 7:10 a.m. UTC | #8
On Thu, Jul 14, 2022 at 7:32 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> On Mon, Jul 11, 2022, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > Hi HJ,
> > >
> > > I believe this should now be handled by the post-reload (CSE) pass.
> > > Consider the simple test case:
> > >
> > > __int128 a, b, c;
> > > void foo()
> > > {
> > >   a = 0;
> > >   b = 0;
> > >   c = 0;
> > > }
> > >
> > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
> > >         movq    $0, a(%rip)
> > >         movq    $0, a+8(%rip)
> > >         movq    $0, b(%rip)
> > >         movq    $0, b+8(%rip)
> > >         movq    $0, c(%rip)
> > >         movq    $0, c+8(%rip)
> > >         ret
> > >
> > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         movaps  %xmm0, b(%rip)
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > You're quite right internally the STV actually generates the equivalent of:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, b(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > And currently because STV run before cse2 and combine, the const0_rtx
> > > gets CSE'd be the cse2 pass to produce the code we see.  However, if
> > > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > > you'll see we continue to generate the same optimized code, as the
> > > same const0_rtx gets CSE'd in postreload.
> > >
> > > I can't be certain until I try the experiment, but I believe that the
> > > postreload CSE will clean-up, all of the same common subexpressions.
> > > Hence, it should be safe to perform all STV at the same point (after
> > > combine), which for a few additional optimizations.
> > >
> > > Does this make sense?  Do you have a test case,
> > > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> > chains?
> > >
> > > My guess is that the RTL passes have changed so much in the last six
> > > or seven years, that some of the original motivation no longer applies.
> > > Certainly we now try to keep TI mode operations visible longer, and
> > > then allow STV to behave like a pre-reload pass to decide which set of
> > > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > > opportunities that cse2 finds with V1TI mode, could/should equally
> > > well be found for TI mode (mostly).
> >
> > You are probably right.  If there are no regressions in GCC testsuite, my original
> > motivation is no longer valid.
>
> It was good to try the experiment, but H.J. is right, there is still some benefit
> (as well as some disadvantages)  to running STV lowering before CSE2/combine.
> A clean-up patch to perform all STV conversion as a single pass (removing a
> pass from the compiler) results in just a single regression in the test suite:
> FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
> which looks like:
>
> __int128 a, b, c, d, e, f;
> void foo (void)
> {
>   a = 0;
>   b = -1;
>   c = 0;
>   d = -1;
>   e = 0;
>   f = -1;
> }
>
> By performing STV after combine (without CSE), reload prefers to implement
> this function using a single register, that then requires 12 instructions rather
> than 8 (if using two registers).  Alas there's nothing that postreload CSE/GCSE
> can do.  Doh!

Hmm, the RA could be taught to make use of more of the register file, I suppose
(shouldn't regrename do this job? But it runs after postreload-cse).

>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, d(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, e(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, f(%rip)
>         ret
>
> I also note that even without STV, the scalar implementation of this function when
> compiled with -Os is also larger than it needs to be due to poor CSE (notice in the
> following we only need a single zero register, and  an all_ones reg would be helpful).
>
>         xorl    %eax, %eax
>         xorl    %edx, %edx
>         xorl    %ecx, %ecx
>         movq    $-1, b(%rip)
>         movq    %rax, a(%rip)
>         movq    %rax, a+8(%rip)
>         movq    $-1, b+8(%rip)
>         movq    %rdx, c(%rip)
>         movq    %rdx, c+8(%rip)
>         movq    $-1, d(%rip)
>         movq    $-1, d+8(%rip)
>         movq    %rcx, e(%rip)
>         movq    %rcx, e+8(%rip)
>         movq    $-1, f(%rip)
>         movq    $-1, f+8(%rip)
>         ret
>
> I need to give the problem some more thought.  It would be good to clean-up/unify
> the STV passes, but I/we need to solve/CSE HJ's last test case before we do.  Perhaps
> by forbidding "(set (mem:ti) (const_int 0))" in movti_internal, would force the zero
> register to become visible, and CSE'd, benefiting both vector code and scalar -Os code,
> then use postreload/peephole2 to fix up the remaining scalar cases.  It's tricky.

Not sure if related, but ppc(?) folks recently tried to massage CSE to avoid
propagating constants by making sure that rtx_cost handles
(set (...) (const_int ...)) "properly".  But IIRC CSE never does the reverse
transform - splitting a constant out to a pseudo shared by multiple uses of
that constant - that's probably the job of reload + postreload-CSE right now,
but reload probably does not know that there are multiple uses of the
constant, so it cannot tell that the splitting is worthwhile.

> Cheers,
> Roger
> --
>
>
  

Patch

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index be38586..a17327e 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -291,7 +291,11 @@  scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
   insns = BITMAP_ALLOC (NULL);
   defs = BITMAP_ALLOC (NULL);
   defs_conv = BITMAP_ALLOC (NULL);
+  insns_conv = BITMAP_ALLOC (NULL);
   queue = NULL;
+
+  n_sse_to_integer = 0;
+  n_integer_to_sse = 0;
 }
 
 /* Free chain's data.  */
@@ -301,6 +305,7 @@  scalar_chain::~scalar_chain ()
   BITMAP_FREE (insns);
   BITMAP_FREE (defs);
   BITMAP_FREE (defs_conv);
+  BITMAP_FREE (insns_conv);
   bitmap_obstack_release (NULL);
 }
 
@@ -319,25 +324,11 @@  scalar_chain::add_to_queue (unsigned insn_uid)
   bitmap_set_bit (queue, insn_uid);
 }
 
-general_scalar_chain::general_scalar_chain (enum machine_mode smode_,
-					    enum machine_mode vmode_)
-     : scalar_chain (smode_, vmode_)
-{
-  insns_conv = BITMAP_ALLOC (NULL);
-  n_sse_to_integer = 0;
-  n_integer_to_sse = 0;
-}
-
-general_scalar_chain::~general_scalar_chain ()
-{
-  BITMAP_FREE (insns_conv);
-}
-
 /* For DImode conversion, mark register defined by DEF as requiring
    conversion.  */
 
 void
-general_scalar_chain::mark_dual_mode_def (df_ref def)
+scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -364,14 +355,6 @@  general_scalar_chain::mark_dual_mode_def (df_ref def)
 	     DF_REF_REGNO (def), DF_REF_INSN_UID (def), chain_id);
 }
 
-/* For TImode conversion, it is unused.  */
-
-void
-timode_scalar_chain::mark_dual_mode_def (df_ref)
-{
-  gcc_unreachable ();
-}
-
 /* Check REF's chain to add new insns into a queue
    and find registers requiring conversion.  */
 
@@ -934,7 +917,7 @@  general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 /* Convert COMPARE to vector mode.  */
 
 rtx
-general_scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn)
+scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn)
 {
   rtx tmp = gen_reg_rtx (vmode);
   rtx src;
@@ -1156,6 +1139,83 @@  general_scalar_chain::convert_insn (rtx_insn *insn)
   df_insn_rescan (insn);
 }
 
+/* Compute a gain for chain conversion.  */
+
+int
+timode_scalar_chain::compute_convert_gain ()
+{
+  /* Assume that if we have to move TImode values between units,
+     then transforming this chain isn't worth it.  */
+  if (n_sse_to_integer || n_integer_to_sse)
+    return -1;
+
+  bitmap_iterator bi;
+  unsigned insn_uid;
+
+  /* Split ties to prefer V1TImode when not optimizing for size.  */
+  int gain = optimize_size ? 0 : 1;
+
+  if (dump_file)
+    fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
+
+  EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
+    {
+      rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
+      rtx def_set = single_set (insn);
+      rtx src = SET_SRC (def_set);
+      rtx dst = SET_DEST (def_set);
+      int igain = 0;
+
+      switch (GET_CODE (src))
+	{
+	case REG:
+	  if (optimize_insn_for_size_p ())
+	    igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3);
+	  else
+	    igain = COSTS_N_INSNS (1);
+	  break;
+
+	case MEM:
+	  igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7)
+					      : COSTS_N_INSNS (1);
+	  break;
+
+	case CONST_INT:
+	  if (MEM_P (dst)
+	      && standard_sse_constant_p (src, V1TImode))
+	    igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (11) : 1;
+	  break;
+
+	case NOT:
+	  if (MEM_P (dst))
+	    igain = -COSTS_N_INSNS (1);
+	  break;
+
+	case AND:
+	case XOR:
+	case IOR:
+	  if (!MEM_P (dst))
+	    igain = COSTS_N_INSNS (1);
+	  break;
+
+	default:
+	  break;
+	}
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "  Total gain: %d\n", gain);
+
+  return gain;
+}
+
 /* Fix uses of converted REG in debug insns.  */
 
 void
@@ -1194,6 +1254,61 @@  timode_scalar_chain::fix_debug_reg_uses (rtx reg)
     }
 }
 
+/* Convert operand OP in INSN from TImode to V1TImode.  */
+
+void
+timode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+{
+  *op = copy_rtx_if_shared (*op);
+
+  if (REG_P (*op))
+    *op = gen_rtx_SUBREG (V1TImode, *op, 0);
+  else if (MEM_P (*op))
+    {
+      rtx tmp = gen_reg_rtx (V1TImode);
+      emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (V1TImode, tmp, 0),
+				     gen_gpr_to_xmm_move_src (V1TImode, *op)),
+			insn);
+      *op = gen_rtx_SUBREG (V1TImode, tmp, 0);
+
+      if (dump_file)
+	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
+		 INSN_UID (insn), REGNO (tmp));
+    }
+  else if (CONST_INT_P (*op))
+    {
+      rtx vec_cst;
+      rtx tmp = gen_rtx_SUBREG (V1TImode, gen_reg_rtx (TImode), 0);
+
+      /* Prefer all ones vector in case of -1.  */
+      if (constm1_operand (*op, TImode))
+	vec_cst = CONSTM1_RTX (V1TImode);
+      else
+	{
+	  rtx *v = XALLOCAVEC (rtx, 1);
+	  v[0] = *op;
+	  vec_cst = gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec_v (1, v));
+	}
+
+      if (!standard_sse_constant_p (vec_cst, V1TImode))
+	{
+	  start_sequence ();
+	  vec_cst = validize_mem (force_const_mem (V1TImode, vec_cst));
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+	  emit_insn_before (seq, insn);
+	}
+
+      emit_insn_before (gen_move_insn (copy_rtx (tmp), vec_cst), insn);
+      *op = tmp;
+    }
+  else
+    {
+      gcc_assert (SUBREG_P (*op));
+      gcc_assert (GET_MODE (*op) == vmode);
+    }
+}
+
 /* Convert INSN from TImode to V1T1mode.  */
 
 void
@@ -1202,13 +1317,14 @@  timode_scalar_chain::convert_insn (rtx_insn *insn)
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
   rtx dst = SET_DEST (def_set);
+  rtx tmp;
 
   switch (GET_CODE (dst))
     {
     case REG:
       {
-	rtx tmp = find_reg_equal_equiv_note (insn);
-	if (tmp)
+	tmp = find_reg_equal_equiv_note (insn);
+	if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode)
 	  PUT_MODE (XEXP (tmp, 0), V1TImode);
 	PUT_MODE (dst, V1TImode);
 	fix_debug_reg_uses (dst);
@@ -1275,6 +1391,49 @@  timode_scalar_chain::convert_insn (rtx_insn *insn)
 	}
       break;
 
+    case AND:
+      if (GET_CODE (XEXP (src, 0)) == NOT)
+	{
+	  convert_op (&XEXP (XEXP (src, 0), 0), insn);
+	  convert_op (&XEXP (src, 1), insn);
+	  PUT_MODE (XEXP (src, 0), V1TImode);
+	  PUT_MODE (src, V1TImode);
+	  break;
+	}
+      /* FALLTHRU */
+
+    case XOR:
+    case IOR:
+      convert_op (&XEXP (src, 0), insn);
+      convert_op (&XEXP (src, 1), insn);
+      PUT_MODE (src, V1TImode);
+      if (MEM_P (dst))
+	{
+	  tmp = gen_reg_rtx (V1TImode);
+          emit_insn_before (gen_rtx_SET (tmp, src), insn);
+          src = tmp;
+	}
+      break;
+
+    case NOT:
+      src = XEXP (src, 0);
+      convert_op (&src, insn);
+      tmp = gen_reg_rtx (V1TImode);
+      emit_insn_before (gen_move_insn (tmp, CONSTM1_RTX (V1TImode)), insn);
+      src = gen_rtx_XOR (V1TImode, src, tmp);
+      if (MEM_P (dst))
+	{
+	  tmp = gen_reg_rtx (V1TImode);
+          emit_insn_before (gen_rtx_SET (tmp, src), insn);
+          src = tmp;
+	}
+      break;
+
+    case COMPARE:
+      dst = gen_rtx_REG (CCmode, FLAGS_REG);
+      src = convert_compare (XEXP (src, 0), XEXP (src, 1), insn);
+      break;
+
     default:
       gcc_unreachable ();
     }
@@ -1522,6 +1681,16 @@  general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
   return true;
 }
 
+/* Check for a suitable TImode memory operand.  */
+
+static bool
+timode_mem_p (rtx x)
+{
+  return MEM_P (x)
+	 && (TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	     || !misaligned_operand (x, TImode));
+}
+
 /* The TImode version of scalar_to_vector_candidate_p.  */
 
 static bool
@@ -1535,45 +1704,64 @@  timode_scalar_to_vector_candidate_p (rtx_insn *insn)
   rtx src = SET_SRC (def_set);
   rtx dst = SET_DEST (def_set);
 
-  /* Only TImode load and store are allowed.  */
-  if (GET_MODE (dst) != TImode)
+  if (GET_CODE (src) == COMPARE)
+    return convertible_comparison_p (insn, TImode);
+
+  if (GET_MODE (dst) != TImode
+      || (GET_MODE (src) != TImode
+          && !CONST_SCALAR_INT_P (src)))
     return false;
 
-  if (MEM_P (dst))
-    {
-      /* Check for store.  Memory must be aligned or unaligned store
-	 is optimal.  Only support store from register, standard SSE
-	 constant or CONST_WIDE_INT generated from piecewise store.
+  if (!REG_P (dst) && !MEM_P (dst))
+    return false;
 
-	 ??? Verify performance impact before enabling CONST_INT for
-	 __int128 store.  */
-      if (misaligned_operand (dst, TImode)
-	  && !TARGET_SSE_UNALIGNED_STORE_OPTIMAL)
-	return false;
+  if (MEM_P (dst)
+      && misaligned_operand (dst, TImode)
+      && !TARGET_SSE_UNALIGNED_STORE_OPTIMAL)
+    return false;
 
-      switch (GET_CODE (src))
-	{
-	default:
-	  return false;
+  switch (GET_CODE (src))
+    {
+    case REG:
+    case CONST_WIDE_INT:
+      return true;
 
-	case REG:
-	case CONST_WIDE_INT:
-	  return true;
+    case CONST_INT:
+      /* ??? Verify performance impact before enabling CONST_INT for
+	 __int128 store.  */
+      return standard_sse_constant_p (src, TImode);
 
-	case CONST_INT:
-	  return standard_sse_constant_p (src, TImode);
-	}
-    }
-  else if (MEM_P (src))
-    {
-      /* Check for load.  Memory must be aligned or unaligned load is
-	 optimal.  */
+    case MEM:
+      /* Memory must be aligned or unaligned load is optimal.  */
       return (REG_P (dst)
 	      && (!misaligned_operand (src, TImode)
 		  || TARGET_SSE_UNALIGNED_LOAD_OPTIMAL));
-    }
 
-  return false;
+    case AND:
+      if (MEM_P (dst))
+	return false;
+      if (GET_CODE (XEXP (src, 0)) == NOT
+	  && REG_P (XEXP (XEXP (src, 0), 0))
+	  && (REG_P (XEXP (src, 1)) || timode_mem_p (XEXP (src, 1))))
+	return true;
+      return REG_P (XEXP (src, 0))
+	     && (REG_P (XEXP (src, 1)) || timode_mem_p (XEXP (src, 1)));
+
+    case IOR:
+    case XOR:
+      if (MEM_P (dst))
+	return false;
+      return REG_P (XEXP (src, 0))
+	     && (REG_P (XEXP (src, 1)) || timode_mem_p (XEXP (src, 1)));
+
+    case NOT:
+      if (MEM_P (dst))
+	return false;
+      return REG_P (XEXP (src, 0)) || timode_mem_p (XEXP (src, 0));
+
+    default:
+      return false;
+    }
 }
 
 /* For a register REGNO, scan instructions for its defs and uses.
diff --git a/gcc/config/i386/i386-features.h b/gcc/config/i386/i386-features.h
index 839b63c..88b222d 100644
--- a/gcc/config/i386/i386-features.h
+++ b/gcc/config/i386/i386-features.h
@@ -148,6 +148,10 @@  class scalar_chain
   /* Registers used in both vector and sclar modes.  */
   bitmap defs_conv;
 
+  bitmap insns_conv;
+  unsigned n_sse_to_integer;
+  unsigned n_integer_to_sse;
+
   void build (bitmap candidates, unsigned insn_uid);
   virtual int compute_convert_gain () = 0;
   int convert ();
@@ -155,33 +159,31 @@  class scalar_chain
  protected:
   void add_to_queue (unsigned insn_uid);
   void emit_conversion_insns (rtx insns, rtx_insn *pos);
+  rtx convert_compare (rtx op1, rtx op2, rtx_insn *insn);
+  void mark_dual_mode_def (df_ref def);
 
  private:
   void add_insn (bitmap candidates, unsigned insn_uid);
   void analyze_register_chain (bitmap candidates, df_ref ref);
-  virtual void mark_dual_mode_def (df_ref def) = 0;
   virtual void convert_insn (rtx_insn *insn) = 0;
   virtual void convert_registers () = 0;
+  virtual void convert_op (rtx *op, rtx_insn *insn) = 0;
 };
 
 class general_scalar_chain : public scalar_chain
 {
  public:
-  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_);
-  ~general_scalar_chain ();
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain () final override;
+
  private:
   hash_map<rtx, rtx> defs_map;
-  bitmap insns_conv;
-  unsigned n_sse_to_integer;
-  unsigned n_integer_to_sse;
-  void mark_dual_mode_def (df_ref def) final override;
   void convert_insn (rtx_insn *insn) final override;
-  void convert_op (rtx *op, rtx_insn *insn);
   void convert_reg (rtx_insn *insn, rtx dst, rtx src);
+  void convert_op (rtx *op, rtx_insn *insn);
   void make_vector_copies (rtx_insn *, rtx);
   void convert_registers () final override;
-  rtx convert_compare (rtx op1, rtx op2, rtx_insn *insn);
   int vector_const_cost (rtx exp);
 };
 
@@ -189,21 +191,20 @@  class timode_scalar_chain : public scalar_chain
 {
  public:
   timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
-
-  /* Convert from TImode to V1TImode is always faster.  */
-  int compute_convert_gain () final override { return 1; }
+  int compute_convert_gain () final override;
 
  private:
-  void mark_dual_mode_def (df_ref def) final override;
   void fix_debug_reg_uses (rtx reg);
   void convert_insn (rtx_insn *insn) final override;
-  /* We don't convert registers to difference size.  */
+  void convert_op (rtx *op, rtx_insn *insn);
+  /* We don't convert registers to different size.  */
   void convert_registers () final override {}
 };
 
 } // anon namespace
 
-bool ix86_save_reg (unsigned int regno, bool maybe_eh_return, bool ignore_outlined);
+bool ix86_save_reg (unsigned int regno, bool maybe_eh_return,
+		    bool ignore_outlined);
 int ix86_compare_version_priority (tree decl1, tree decl2);
 tree ix86_generate_version_dispatcher_body (void *node_p);
 tree ix86_get_function_versions_dispatcher (void *decl);
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-2.c b/gcc/testsuite/gcc.target/i386/sse4_1-stv-2.c
new file mode 100644
index 0000000..e637927
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-2.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */
+
+__int128 a[16];
+__int128 b[16];
+__int128 c[16];
+
+void foo()
+{
+  for (unsigned int i=0; i<16; i++)
+    a[i] = b[i] & c[i];
+}
+
+/* { dg-final { scan-assembler "pand" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-3.c b/gcc/testsuite/gcc.target/i386/sse4_1-stv-3.c
new file mode 100644
index 0000000..bdc0bac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-3.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */
+
+__int128 a[16];
+__int128 b[16];
+__int128 c[16];
+
+void foo()
+{
+  for (unsigned int i=0; i<16; i++)
+    a[i] = b[i] | c[i];
+}
+
+/* { dg-final { scan-assembler "por" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-4.c b/gcc/testsuite/gcc.target/i386/sse4_1-stv-4.c
new file mode 100644
index 0000000..a9ef619
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-4.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */
+
+__int128 a[16];
+__int128 b[16];
+__int128 c[16];
+
+void foo()
+{
+  for (unsigned int i=0; i<16; i++)
+    a[i] = b[i] ^ c[i];
+}
+
+/* { dg-final { scan-assembler "pxor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-5.c b/gcc/testsuite/gcc.target/i386/sse4_1-stv-5.c
new file mode 100644
index 0000000..7eff3c1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-5.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */
+
+__int128 a[16];
+__int128 b[16];
+__int128 c[16];
+
+void foo()
+{
+  for (unsigned int i=0; i<16; i++)
+    a[i] = b[i] & ~c[i];
+}
+
+/* { dg-final { scan-assembler "pandn" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-6.c b/gcc/testsuite/gcc.target/i386/sse4_1-stv-6.c
new file mode 100644
index 0000000..407bc55
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-6.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */
+
+__int128 a[16];
+__int128 b[16];
+
+int foo()
+{
+  for (unsigned int i=0; i<16; i++)
+    if (a[i] == b[i])
+      return i;
+  return -1;
+}
+
+/* { dg-final { scan-assembler "ptest" } } */
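For reference, the scalar semantics that the new AND-with-NOT handling maps onto a single `pandn` per iteration (the sse4_1-stv-5.c case) can be spelled out in plain C; nothing below is specific to the patch, it just states the `b[i] & ~c[i]` operation on `__int128` values that the pass now keeps in SSE registers:

```c
/* Plain-C statement of the operation STV now vectorizes to pandn:
   a[i] = b[i] & ~c[i] over 128-bit elements.  With the patch, each
   iteration becomes a movdqa load, a pandn, and a movaps store,
   instead of two notq/andq/movq sequences on 64-bit halves.  */
#define N 16

__int128 a[N], b[N], c[N];

void
andn_loop (void)
{
  for (unsigned int i = 0; i < N; i++)
    a[i] = b[i] & ~c[i];
}
```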