[x86] Double word implementation of and; cmp to not; test optimization.

Message ID 016e01d87900$615a7a30$240f6e90$@nextmovesoftware.com
State New
Series [x86] Double word implementation of and; cmp to not; test optimization.

Commit Message

Roger Sayle June 5, 2022, 5:19 p.m. UTC
  This patch extends the recent and;cmp to not;test optimization to also
perform this transformation for TImode on TARGET_64BIT and for DImode on
-m32.  One motivation for this is that it is a step toward fixing the
current failure of gcc.target/i386/pr65105-5.c on -m32.
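
To illustrate the identity the transformation relies on (this snippet is
illustrative only and not part of the patch): testing whether all bits of
y are also set in x can avoid comparing the AND result against y by
comparing an inverted AND against zero instead, which maps naturally onto
not;test (or andn;test on BMI targets).

int all_bits_of_y_in_x (__int128 x, __int128 y)
{
  /* (x & y) == y  is equivalent to  (~x & y) == 0.  */
  return (~x & y) == 0;
}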

A more direct benefit for x86_64 is that the following code:

int foo(__int128 x, __int128 y)
{
  return (x & y) == y;
}

improves (with -O2 -mbmi) from:

        movq    %rdi, %r8
        movq    %rsi, %rdi
        movq    %rdx, %rsi
        andq    %rcx, %rdi
        movq    %r8, %rax
        andq    %rdx, %rax
        movq    %rdi, %rdx
        xorq    %rsi, %rax
        xorq    %rcx, %rdx
        orq     %rdx, %rax
        sete    %al
        movzbl  %al, %eax
        ret

to the much better:

        movq    %rdi, %r8
        movq    %rsi, %rdi
        andn    %rdx, %r8, %rax
        andn    %rcx, %rdi, %rsi
        orq     %rsi, %rax
        sete    %al
        movzbl  %al, %eax
        ret

The major theme of this patch is to generalize many of i386.md's
*di3_doubleword patterns to become *<dwi>_doubleword patterns, i.e.
whenever there exists a "double word" optimization for DImode with -m32,
there should be an equivalent TImode optimization on TARGET_64BIT.
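
For readers less familiar with the machine-description iterators
involved, the relevant definitions in i386.md look roughly like this
(paraphrased; the exact conditions in the tree may differ):

;; Double-word integer mode: DImode with -m32, TImode on TARGET_64BIT.
(define_mode_iterator DWI [(DI "!TARGET_64BIT") (TI "TARGET_64BIT")])

;; The corresponding word (half of a double word) mode.
(define_mode_iterator DWIH [(SI "!TARGET_64BIT") (DI "TARGET_64BIT")])

;; <DWI>/<dwi> give the double of the current mode, e.g. SI -> DI, DI -> TI.
(define_mode_attr DWI [(QI "HI") (HI "SI") (SI "DI") (DI "TI")])
(define_mode_attr dwi [(QI "hi") (HI "si") (SI "di") (DI "ti")])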

The following patch has been tested on x86_64-pc-linux-gnu with
make bootstrap and make -k check, where on TARGET_64BIT there are
no new failures, but paradoxically with --target_board=unix{-m32}
the other dg-final clause in gcc.target/i386/pr65105-5.c now fails.
Counter-intuitively, this is progress, and pr65105-5.c may now be
fixed (without using peephole2) simply by tweaking the STV pass to
handle andn/test (in a follow-up patch).
OK for mainline?


2022-06-05  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.cc (ix86_rtx_costs) <COMPARE>: Provide costs
        for double word comparisons and tests (comparisons against zero).
        * config/i386/i386.md (*test<mode>_not_doubleword): Split DWI
        and;cmp into andn;cmp $0 as a pre-reload splitter.
        (define_expand and<mode>3): Generalize from SWIM1248x to SWIDWI.
        (define_insn_and_split "*anddi3_doubleword"): Rename/generalize...
        (define_insn_and_split "*and<dwi>3_doubleword"): ... to this.
        (define_insn "*andndi3_doubleword"): Rename and generalize...
        (define_insn "*andn<mode>3_doubleword): ... to this.
        (define_split): Split andn when TARGET_BMI for both <DWI> modes.
        (define_split): Split andn when !TARGET_BMI for both <DWI> modes.
        (define_expand <any_or><mode>3): Generalize from SWIM1248x
        to SWIDWI.
        (define_insn_and_split "*<any_or><dwi>3_doubleword"): Generalize
        from DI mode to both <DWI> modes.

gcc/testsuite/ChangeLog
        * gcc.target/i386/testnot-3.c: New test case.


Thanks again,
Roger
--
  

Comments

Uros Bizjak June 5, 2022, 7:12 p.m. UTC | #1
On Sun, Jun 5, 2022 at 7:19 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> [patch description, motivating example, and generated assembly snipped]
>
> The major theme of this patch is to generalize many of i386.md's
> *di3_doubleword patterns to become *<dwi>_doubleword patterns, i.e.
> whenever there exists a "double word" optimization for DImode with -m32,
> there should be an equivalent TImode optimization on TARGET_64BIT.

No, please do not mix two different themes in one patch.

OTOH, the only TImode operations that can be implemented with SSE
registers are logic instructions and some constant shifts; there is no
TImode arithmetic. I assume your end goal is to introduce STV for
TImode on 64-bit targets, because the DImode patterns for x86_32 were
introduced to avoid early decomposition by the middle end and to split
the instructions that STV didn't convert to vector instructions after
the STV pass. So, let's start with basic V1TImode support before
optimizations are introduced.

Uros.

  
Roger Sayle June 6, 2022, 8:23 a.m. UTC | #2
Hi Uros,
> > The major theme of this patch is to generalize many of i386.md's
> > *di3_doubleword patterns to become *<dwi>_doubleword patterns, i.e.
> > whenever there exists a "double word" optimization for DImode with
> > -m32, there should be an equivalent TImode optimization on TARGET_64BIT.
> 
> No, please do not mix two different themes in one patch.
> 
> OTOH, the only TImode optimization that can be used with SSE registers is with
> logic instructions and some constant shifts, but there is no TImode arithmetic. I
> assume your end goal is to introduce STV for TImode on 64-bit targets, because
> DImode patterns for x86_32 were introduced to avoid early decomposition by
> middle end and to split instructions that STV didn't convert to vector instructions
> after STV pass. So, let's start with basic V1TImode support before optimizations
> are introduced.

I'm not sure I understand.  What basic V1TImode support do you/we want next?

The testcase and worked example above show this patch's benefits without STV
and without using V1TImode vectors.  As explained in the subject, and;cmp can
be turned into the cheaper not;cmp $0 for TImode (and DImode with -m32) in
the same way as we currently do for SImode everywhere.  Having double word
modes visible to combine allows it to work its magic.  A recent patch ensured
that double word compares were visible to combine; this optimization just
requires that double word logic operations (AND, IOR and XOR) remain visible
until after combine, and in fact for -m32 DImode they already are; it's just
that TImode is inconsistent, leading to missed optimizations.  Likewise, STV
can't choose between implementations until there are alternative
implementations to choose from.

As always, I'm happy to do things in the order you want (modulo my 36 hour
spin cycle).  In fact, the reason this is being done now is that you
recommended it was best to fix pr65105-5.c after the "double word comparison"
patch, which I fully agree with, as it leads to a better solution that
doesn't require peephole2 (in your own words, "why isn't this being done in
combine?").

I'm almost certainly misunderstanding something.  Which piece needs to be done next?

Perhaps I should have used the term "the common theme" rather than "the
major theme", which may have made it sound like there were unrelated or
independent bits in this patch.  But there are no V1TI changes in it.

Thanks in advance, for any clarification.

Cheers,
Roger
--
  
Uros Bizjak June 6, 2022, 9:22 a.m. UTC | #3
On Mon, Jun 6, 2022 at 10:23 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> > > The major theme of this patch is to generalize many of i386.md's
> > > *di3_doubleword patterns to become *<dwi>_doubleword patterns, i.e.
> > > whenever there exists a "double word" optimization for DImode with
> > > -m32, there should be an equivalent TImode optimization on TARGET_64BIT.
> >
> > No, please do not mix two different themes in one patch.
> >
> > OTOH, the only TImode optimization that can be used with SSE registers is with
> > logic instructions and some constant shifts, but there is no TImode arithmetic. I
> > assume your end goal is to introduce STV for TImode on 64-bit targets, because
> > DImode patterns for x86_32 were introduced to avoid early decomposition by
> > middle end and to split instructions that STV didn't convert to vector instructions
> > after STV pass. So, let's start with basic V1TImode support before optimizations
> > are introduced.
>
> I'm not sure I understand.  What basic V1TImode support do you/we want next?
>
> This testcase and worked example with this patch shows its benefits without STV
> nor using V1TI mode vectors.  As explained in the subject, and;cmp can be turned
> into the cheaper not;cmp $0, for TImode (and DImode with -m32) in the same way
> as we currently do for SImode everywhere.  Having double word modes visible to
> combine, allows it to work its magic.  A recent patch ensured that double word
> compares were visible to combine, this optimization just required that double
> word logic (AND, IOR and XOR) are visible after combine, and in fact for -m32 DImode
> they already are, it's just that TImode is inconsistent, leading to missed optimizations.
> Likewise, STV can't choose between implementations before there are alternative
> Implementations to choose from.

Let me clarify my statement:

When double-word-mode patterns are NOT present, the middle-end splits
double-word operations to word mode at expansion time, taking into
account constant propagation on the split operations, etc.  The reason
that DImode patterns are present is STV on 32-bit targets, which wants
double-word operations to remain unsplit until the STV pass.
Unfortunately, this approach inhibits constant propagation, and the
missing functionality had to be implemented in a "manual way" when the
operations are split to word mode.  Without targeting STV, the
optimization opportunities are quite small (one of them is the
conversion you proposed above), so there is no pressing need to
introduce TImode operations.
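
As a small (hypothetical) illustration of the constant handling that has
to be redone manually when a double-word operation is split late: when
one half of an AND mask is all-zeros or all-ones, that half should
degenerate into a plain move rather than an and instruction.

unsigned __int128 keep_low (unsigned __int128 x)
{
  /* Low half of the mask is all-ones, high half is zero: after the
     split, the low word is a simple copy and the high word is zeroed,
     with no and instructions needed.  */
  return x & 0xffffffffffffffffULL;
}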

So, by extending all DImode logic patterns to also handle TImode on
x86_64, we can also use them to implement a TImode STV pass on x86_64.
That is something that would have a noticeable impact on the generated
code.

Uros.

  
Uros Bizjak June 6, 2022, 10:25 a.m. UTC | #4
On Sun, Jun 5, 2022 at 7:19 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> [patch description, motivating example, and ChangeLog quoted in full; snipped]

-(define_insn_and_split "*anddi3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
- (and:DI
- (match_operand:DI 1 "nonimmediate_operand")
- (match_operand:DI 2 "x86_64_szext_general_operand")))
+(define_insn_and_split "*and<dwi>3_doubleword"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand")
+ (and:<DWI>
+ (match_operand:<DWI> 1 "nonimmediate_operand")
+ (match_operand:<DWI> 2 "<general_operand>")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT
-   && ix86_binary_operator_ok (AND, DImode, operands)
+  "ix86_binary_operator_ok (AND, <DWI>mode, operands)
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
-  [(const_int 0)]
+  [(parallel
+    [(set (match_dup 0) (and:DWIH (match_dup 1) (match_dup 2)))
+     (clobber (reg:CC FLAGS_REG))])
+   (parallel
+    [(set (match_dup 3) (and:DWIH (match_dup 4) (match_dup 5)))
+     (clobber (reg:CC FLAGS_REG))])]

Please also note that the above pattern generation will never be
reached, since there is an unconditional DONE in the insn preparation
statement.

 {
-  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);

   if (operands[2] == const0_rtx)
     emit_move_insn (operands[0], const0_rtx);
   else if (operands[2] == constm1_rtx)
     emit_move_insn (operands[0], operands[1]);
   else
-    emit_insn (gen_andsi3 (operands[0], operands[1], operands[2]));
+    ix86_expand_binary_operator (AND, <MODE>mode, &operands[0]);

   if (operands[5] == const0_rtx)
     emit_move_insn (operands[3], const0_rtx);
   else if (operands[5] == constm1_rtx)
     emit_move_insn (operands[3], operands[4]);
   else
-    emit_insn (gen_andsi3 (operands[3], operands[4], operands[5]));
+    ix86_expand_binary_operator (AND, <MODE>mode, &operands[3]);

   DONE;
 })

-(define_insn_and_split "*<code>di3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
- (any_or:DI
- (match_operand:DI 1 "nonimmediate_operand")
- (match_operand:DI 2 "x86_64_szext_general_operand")))
+(define_insn_and_split "*<code><dwi>3_doubleword"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand")
+ (any_or:<DWI>
+ (match_operand:<DWI> 1 "nonimmediate_operand")
+ (match_operand:<DWI> 2 "<general_hilo_operand>")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT
-   && ix86_binary_operator_ok (<CODE>, DImode, operands)
+  "ix86_binary_operator_ok (<CODE>, <DWI>mode, operands)
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
-  [(const_int 0)]
+  [(parallel
+    [(set (match_dup 0) (any_or:DWIH (match_dup 1) (match_dup 2)))
+     (clobber (reg:CC FLAGS_REG))])
+   (parallel
+    [(set (match_dup 3) (any_or:DWIH (match_dup 4) (match_dup 5)))
+     (clobber (reg:CC FLAGS_REG))])]

Also here.
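
To make the point about the unconditional DONE concrete (an editorial
illustration, not text from the review): with a preparation statement of
the form

  {
    split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
    /* ... emit the two word-mode operations directly ... */
    DONE;
  }

control never falls through to the RTL template that follows "&& 1", so
listing the two parallels there has no effect; the template could equally
remain the [(const_int 0)] placeholder unless the DONE is removed.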

Uros.
  
Uros Bizjak June 6, 2022, 11:28 a.m. UTC | #5
On Sun, Jun 5, 2022 at 7:19 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> [patch description, motivating example, and ChangeLog quoted in full; snipped]

 (define_expand "and<mode>3"
-  [(set (match_operand:SWIM1248x 0 "nonimmediate_operand")
-    (and:SWIM1248x (match_operand:SWIM1248x 1 "nonimmediate_operand")
-               (match_operand:SWIM1248x 2 "<general_szext_operand>")))]
+  [(set (match_operand:SWIDWI 0 "nonimmediate_operand")
+    (and:SWIDWI (match_operand:SWIDWI 1 "nonimmediate_operand")
+            (match_operand:SWIDWI 2 "<general_operand>")))]


SWIM1248x should be changed to the SDWIM mode iterator, not SWIDWI:

;; Math-dependant integer modes with DImode.
(define_mode_iterator SWIM1248x
    [(QI "TARGET_QIMODE_MATH")
     (HI "TARGET_HIMODE_MATH")
     SI DI])

;; All math-dependant single and double word integer modes.
(define_mode_iterator SDWIM [(QI "TARGET_QIMODE_MATH")
                 (HI "TARGET_HIMODE_MATH")
                 SI DI (TI "TARGET_64BIT")])

but

;; SWI and DWI together.
(define_mode_iterator SWIDWI [QI HI SI DI (TI "TARGET_64BIT")])

does not handle QI and HI correctly w.r.t. TARGET_[QI,HI]MODE_MATH.
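
A sketch of what the suggested change would look like in the expander (an
editorial illustration of the review comment, not text from it):

(define_expand "and<mode>3"
  [(set (match_operand:SDWIM 0 "nonimmediate_operand")
	(and:SDWIM (match_operand:SDWIM 1 "nonimmediate_operand")
		   (match_operand:SDWIM 2 "<general_operand>")))]
  ...)

i.e. SDWIM provides the TImode case on TARGET_64BIT while preserving the
TARGET_QIMODE_MATH/TARGET_HIMODE_MATH conditions that SWIDWI drops.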

Uros.
  
Uros Bizjak June 7, 2022, 6:08 a.m. UTC | #6
On Mon, Jun 6, 2022 at 1:28 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Sun, Jun 5, 2022 at 7:19 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> > [patch description, motivating example, and ChangeLog quoted in full; snipped]
>
>  (define_expand "and<mode>3"
> -  [(set (match_operand:SWIM1248x 0 "nonimmediate_operand")
> -    (and:SWIM1248x (match_operand:SWIM1248x 1 "nonimmediate_operand")
> -               (match_operand:SWIM1248x 2 "<general_szext_operand>")))]
> +  [(set (match_operand:SWIDWI 0 "nonimmediate_operand")
> +    (and:SWIDWI (match_operand:SWIDWI 1 "nonimmediate_operand")
> +            (match_operand:SWIDWI 2 "<general_operand>")))]

(I didn't notice this in yesterday's review.)

Please do not change the existing predicates; instead, extend the
general_szext_operand mode attribute to handle TImode.

-(define_insn_and_split "*anddi3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
-    (and:DI
-     (match_operand:DI 1 "nonimmediate_operand")
-     (match_operand:DI 2 "x86_64_szext_general_operand")))
+(define_insn_and_split "*and<dwi>3_doubleword"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand")
+    (and:<DWI>
+     (match_operand:<DWI> 1 "nonimmediate_operand")
+     (match_operand:<DWI> 2 "<general_operand>")))

Also here, the extended general_szext_operand attribute should handle both DImode and TImode.
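
To make this concrete, the kind of extension being suggested would look
roughly like the following (an editorial sketch; the exact TImode
predicate to use is an assumption, not something stated in the review):

(define_mode_attr general_szext_operand
  [(QI "general_operand")
   (HI "general_operand")
   (SI "x86_64_szext_general_operand")
   (DI "x86_64_szext_general_operand")
   (TI "x86_64_hilo_general_operand")])

so that the existing patterns keep their DImode predicates and only the
new TImode double-word case picks up a TImode-capable predicate.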

Uros.
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index df5c80d..af11669 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20918,6 +20918,19 @@  ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
 	  return true;
 	}
 
+      if (SCALAR_INT_MODE_P (GET_MODE (op0))
+	  && GET_MODE_SIZE (GET_MODE (op0)) > UNITS_PER_WORD)
+	{
+	  if (op1 == const0_rtx)
+	    *total = cost->add
+		     + rtx_cost (op0, GET_MODE (op0), outer_code, opno, speed);
+	  else
+	    *total = 3*cost->add
+		     + rtx_cost (op0, GET_MODE (op0), outer_code, opno, speed)
+		     + rtx_cost (op1, GET_MODE (op0), outer_code, opno, speed);
+	  return true;
+	}
+
       /* The embedded comparison operand is completely free.  */
       if (!general_operand (op0, GET_MODE (op0)) && op1 == const0_rtx)
 	*total = 0;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2b1d65b..502416b 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9785,9 +9785,24 @@ 
    (set (reg:CCZ FLAGS_REG)
 	(compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
 		     (const_int 0)))]
-{
-  operands[2] = gen_reg_rtx (<MODE>mode);
-})
+  "operands[2] = gen_reg_rtx (<MODE>mode);")
+
+;; Split and;cmp (as optimized by combine) into andn;cmp $0
+(define_insn_and_split "*test<mode>_not_doubleword"
+  [(set (reg:CCZ FLAGS_REG)
+	(compare:CCZ
+	  (and:DWI
+	    (not:DWI (match_operand:DWI 0 "register_operand"))
+	    (match_operand:DWI 1 "nonimmediate_operand"))
+	  (const_int 0)))]
+  "ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(parallel
+      [(set (match_dup 2) (and:DWI (not:DWI (match_dup 0)) (match_dup 1)))
+       (clobber (reg:CC FLAGS_REG))])
+   (set (reg:CCZ FLAGS_REG) (compare:CCZ (match_dup 2) (const_int 0)))]
+  "operands[2] = gen_reg_rtx (<MODE>mode);")
 
 ;; Convert HImode/SImode test instructions with immediate to QImode ones.
 ;; i386 does not allow to encode test with 8bit sign extended immediate, so
@@ -9846,19 +9861,21 @@ 
 ;; it should be done with splitters.
 
 (define_expand "and<mode>3"
-  [(set (match_operand:SWIM1248x 0 "nonimmediate_operand")
-	(and:SWIM1248x (match_operand:SWIM1248x 1 "nonimmediate_operand")
-		       (match_operand:SWIM1248x 2 "<general_szext_operand>")))]
+  [(set (match_operand:SWIDWI 0 "nonimmediate_operand")
+	(and:SWIDWI (match_operand:SWIDWI 1 "nonimmediate_operand")
+		    (match_operand:SWIDWI 2 "<general_operand>")))]
   ""
 {
   machine_mode mode = <MODE>mode;
 
-  if (<MODE>mode == DImode && !TARGET_64BIT)
-    ;
-  else if (const_int_operand (operands[2], <MODE>mode)
-	   && register_operand (operands[0], <MODE>mode)
-	   && !(TARGET_ZERO_EXTEND_WITH_AND
-		&& optimize_function_for_speed_p (cfun)))
+  if (!x86_64_hilo_general_operand (operands[2], <MODE>mode))
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+
+  if (GET_MODE_SIZE (<MODE>mode) <= UNITS_PER_WORD
+      && const_int_operand (operands[2], <MODE>mode)
+      && register_operand (operands[0], <MODE>mode)
+      && !(TARGET_ZERO_EXTEND_WITH_AND
+	   && optimize_function_for_speed_p (cfun)))
     {
       unsigned HOST_WIDE_INT ival = UINTVAL (operands[2]);
 
@@ -9880,34 +9897,38 @@ 
   DONE;
 })
 
-(define_insn_and_split "*anddi3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
-	(and:DI
-	 (match_operand:DI 1 "nonimmediate_operand")
-	 (match_operand:DI 2 "x86_64_szext_general_operand")))
+(define_insn_and_split "*and<dwi>3_doubleword"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand")
+	(and:<DWI>
+	 (match_operand:<DWI> 1 "nonimmediate_operand")
+	 (match_operand:<DWI> 2 "<general_operand>")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT
-   && ix86_binary_operator_ok (AND, DImode, operands)
+  "ix86_binary_operator_ok (AND, <DWI>mode, operands)
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
-  [(const_int 0)]
+  [(parallel
+    [(set (match_dup 0) (and:DWIH (match_dup 1) (match_dup 2)))
+     (clobber (reg:CC FLAGS_REG))])
+   (parallel
+    [(set (match_dup 3) (and:DWIH (match_dup 4) (match_dup 5)))
+     (clobber (reg:CC FLAGS_REG))])]
 {
-  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
 
   if (operands[2] == const0_rtx)
     emit_move_insn (operands[0], const0_rtx);
   else if (operands[2] == constm1_rtx)
     emit_move_insn (operands[0], operands[1]);
   else
-    emit_insn (gen_andsi3 (operands[0], operands[1], operands[2]));
+    ix86_expand_binary_operator (AND, <MODE>mode, &operands[0]);
 
   if (operands[5] == const0_rtx)
     emit_move_insn (operands[3], const0_rtx);
   else if (operands[5] == constm1_rtx)
     emit_move_insn (operands[3], operands[4]);
   else
-    emit_insn (gen_andsi3 (operands[3], operands[4], operands[5]));
+    ix86_expand_binary_operator (AND, <MODE>mode, &operands[3]);
 
   DONE;
 })
@@ -10391,53 +10412,52 @@ 
   operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
 })
 
-(define_insn "*andndi3_doubleword"
-  [(set (match_operand:DI 0 "register_operand")
-	(and:DI
-	  (not:DI (match_operand:DI 1 "register_operand"))
-	  (match_operand:DI 2 "nonimmediate_operand")))
+(define_insn "*andn<mode>3_doubleword"
+  [(set (match_operand:DWI 0 "register_operand")
+	(and:DWI
+	  (not:DWI (match_operand:DWI 1 "register_operand"))
+	  (match_operand:DWI 2 "nonimmediate_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_SSE2
-   && ix86_pre_reload_split ()"
+  "ix86_pre_reload_split ()"
   "#")
 
 (define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(and:DI
-	  (not:DI (match_operand:DI 1 "register_operand"))
-	  (match_operand:DI 2 "nonimmediate_operand")))
+  [(set (match_operand:<DWI> 0 "register_operand")
+	(and:<DWI>
+	  (not:<DWI> (match_operand:<DWI> 1 "register_operand"))
+	  (match_operand:<DWI> 2 "nonimmediate_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_BMI && TARGET_STV && TARGET_SSE2
+  "TARGET_BMI
    && can_create_pseudo_p ()"
   [(parallel [(set (match_dup 0)
-		   (and:SI (not:SI (match_dup 1)) (match_dup 2)))
+		   (and:DWIH (not:DWIH (match_dup 1)) (match_dup 2)))
 	      (clobber (reg:CC FLAGS_REG))])
    (parallel [(set (match_dup 3)
-		   (and:SI (not:SI (match_dup 4)) (match_dup 5)))
+		   (and:DWIH (not:DWIH (match_dup 4)) (match_dup 5)))
 	      (clobber (reg:CC FLAGS_REG))])]
-  "split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);")
+  "split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);")
 
 (define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(and:DI
-	  (not:DI (match_operand:DI 1 "register_operand"))
-	  (match_operand:DI 2 "nonimmediate_operand")))
+  [(set (match_operand:<DWI> 0 "register_operand")
+	(and:<DWI>
+	  (not:<DWI> (match_operand:<DWI> 1 "register_operand"))
+	  (match_operand:<DWI> 2 "nonimmediate_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && !TARGET_BMI && TARGET_STV && TARGET_SSE2
+  "!TARGET_BMI
    && can_create_pseudo_p ()"
-  [(set (match_dup 6) (not:SI (match_dup 1)))
+  [(set (match_dup 6) (not:DWIH (match_dup 1)))
    (parallel [(set (match_dup 0)
-		   (and:SI (match_dup 6) (match_dup 2)))
+		   (and:DWIH (match_dup 6) (match_dup 2)))
 	      (clobber (reg:CC FLAGS_REG))])
-   (set (match_dup 7) (not:SI (match_dup 4)))
+   (set (match_dup 7) (not:DWIH (match_dup 4)))
    (parallel [(set (match_dup 3)
-		   (and:SI (match_dup 7) (match_dup 5)))
+		   (and:DWIH (match_dup 7) (match_dup 5)))
 	      (clobber (reg:CC FLAGS_REG))])]
 {
-  operands[6] = gen_reg_rtx (SImode);
-  operands[7] = gen_reg_rtx (SImode);
+  operands[6] = gen_reg_rtx (<MODE>mode);
+  operands[7] = gen_reg_rtx (<MODE>mode);
 
-  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
 })
 
 (define_insn "*andn<mode>_1"
@@ -10532,26 +10552,35 @@ 
 ;; If this is considered useful, it should be done with splitters.
 
 (define_expand "<code><mode>3"
-  [(set (match_operand:SWIM1248x 0 "nonimmediate_operand")
-	(any_or:SWIM1248x (match_operand:SWIM1248x 1 "nonimmediate_operand")
-			  (match_operand:SWIM1248x 2 "<general_operand>")))]
+  [(set (match_operand:SWIDWI 0 "nonimmediate_operand")
+	(any_or:SWIDWI (match_operand:SWIDWI 1 "nonimmediate_operand")
+		       (match_operand:SWIDWI 2 "<general_hilo_operand>")))]
   ""
-  "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;")
+{
+  if (!x86_64_hilo_general_operand (operands[2], <MODE>mode))
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+  ix86_expand_binary_operator (<CODE>, <MODE>mode, operands);
+  DONE;
+})
 
-(define_insn_and_split "*<code>di3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
-	(any_or:DI
-	 (match_operand:DI 1 "nonimmediate_operand")
-	 (match_operand:DI 2 "x86_64_szext_general_operand")))
+(define_insn_and_split "*<code><dwi>3_doubleword"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand")
+	(any_or:<DWI>
+	 (match_operand:<DWI> 1 "nonimmediate_operand")
+	 (match_operand:<DWI> 2 "<general_hilo_operand>")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT
-   && ix86_binary_operator_ok (<CODE>, DImode, operands)
+  "ix86_binary_operator_ok (<CODE>, <DWI>mode, operands)
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
-  [(const_int 0)]
+  [(parallel
+    [(set (match_dup 0) (any_or:DWIH (match_dup 1) (match_dup 2)))
+     (clobber (reg:CC FLAGS_REG))])
+   (parallel
+    [(set (match_dup 3) (any_or:DWIH (match_dup 4) (match_dup 5)))
+     (clobber (reg:CC FLAGS_REG))])]
 {
-  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
 
   if (operands[2] == const0_rtx)
     emit_move_insn (operands[0], operands[1]);
@@ -10560,10 +10589,10 @@ 
       if (<CODE> == IOR)
 	emit_move_insn (operands[0], constm1_rtx);
       else
-	ix86_expand_unary_operator (NOT, SImode, &operands[0]);
+	ix86_expand_unary_operator (NOT, <MODE>mode, &operands[0]);
     }
   else
-    ix86_expand_binary_operator (<CODE>, SImode, &operands[0]);
+    ix86_expand_binary_operator (<CODE>, <MODE>mode, &operands[0]);
 
   if (operands[5] == const0_rtx)
     emit_move_insn (operands[3], operands[4]);
@@ -10572,10 +10601,10 @@ 
       if (<CODE> == IOR)
 	emit_move_insn (operands[3], constm1_rtx);
       else
-	ix86_expand_unary_operator (NOT, SImode, &operands[3]);
+	ix86_expand_unary_operator (NOT, <MODE>mode, &operands[3]);
     }
   else
-    ix86_expand_binary_operator (<CODE>, SImode, &operands[3]);
+    ix86_expand_binary_operator (<CODE>, <MODE>mode, &operands[3]);
 
   DONE;
 })
diff --git a/gcc/testsuite/gcc.target/i386/testnot-3.c b/gcc/testsuite/gcc.target/i386/testnot-3.c
new file mode 100644
index 0000000..56438df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/testnot-3.c
@@ -0,0 +1,10 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+int foo(__int128 x, __int128 y)
+{
+  return (x & y) == y;
+}
+
+/* { dg-final { scan-assembler-times "notq" 2 } } */
+/* { dg-final { scan-assembler-not "xorq" } } */