[tree-ssa] PR target/113560: Enhance is_widening_mult_rhs_p.

Message ID 003f01da534e$94918450$bdb48cf0$@nextmovesoftware.com
State New

Checks

Context                                         Check    Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm      success  Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm      fail     Testing failed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64  success  Testing passed

Commit Message

Roger Sayle Jan. 30, 2024, 7:33 a.m. UTC
  This patch resolves PR113560, a code quality regression from GCC12
affecting x86_64, by enhancing the middle-end's tree-ssa-math-opts.cc
to recognize more instances of widening multiplications.

The widening multiplication recognition code identifies cases like:

        _1 = (unsigned __int128) x;
        __res = _1 * 100;

but in the reported test case, the original input looks like:

        _1 = (unsigned long long) x;
        _2 = (unsigned __int128) _1;
        __res = _2 * 100;

which gets optimized by constant folding during tree-ssa to:

        _2 = x & 18446744073709551615;  // x & 0xffffffffffffffff
        __res = _2 * 100;

where the BIT_AND_EXPR hides (has consumed) the extension operation.
This reveals a more general deficiency (missed optimization
opportunity) in widening multiplication recognition: additionally,
both

__int128 foo(__int128 x, __int128 y) {
  return (x & 1000) * (y & 1000);
}

and

unsigned __int128 bar(unsigned __int128 x, unsigned __int128 y) {
  return (x >> 80) * (y >> 80);
}

should be recognized as widening multiplications.  Hence, rather than
test explicitly for BIT_AND_EXPR (as in the first version of this patch),
the more general solution is to make use of range information, as
provided by tree_nonzero_bits.

As a demonstration of the observed improvements, function foo above
currently compiles with -O2 on x86_64 to:

foo:    movq    %rdi, %rsi
        movq    %rdx, %r8
        xorl    %edi, %edi
        xorl    %r9d, %r9d
        andl    $1000, %esi
        andl    $1000, %r8d
        movq    %rdi, %rcx
        movq    %r9, %rdx
        imulq   %rsi, %rdx
        movq    %rsi, %rax
        imulq   %r8, %rcx
        addq    %rdx, %rcx
        mulq    %r8
        addq    %rdx, %rcx
        movq    %rcx, %rdx
        ret

With this patch, GCC recognizes the widening multiplication (w*) and
instead generates:

foo:    movq    %rdi, %rsi
        movq    %rdx, %r8
        andl    $1000, %esi
        andl    $1000, %r8d
        movq    %rsi, %rax
        imulq   %r8
        ret

which is perhaps easier to understand at the tree-level where

__int128 foo (__int128 x, __int128 y)
{
  __int128 _1;
  __int128 _2;
  __int128 _5;

  <bb 2> [local count: 1073741824]:
  _1 = x_3(D) & 1000;
  _2 = y_4(D) & 1000;
  _5 = _1 * _2;
  return _5;
}

gets transformed to:

__int128 foo (__int128 x, __int128 y)
{
  __int128 _1;
  __int128 _2;
  __int128 _5;
  signed long _7;
  signed long _8;

  <bb 2> [local count: 1073741824]:
  _1 = x_3(D) & 1000;
  _2 = y_4(D) & 1000;
  _7 = (signed long) _1;
  _8 = (signed long) _2;
  _5 = _7 w* _8;
  return _5;
}

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-01-30  Roger Sayle  <roger@nextmovesoftware.com>
2024-01-30  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/113560
        * tree-ssa-math-opts.cc (is_widening_mult_rhs_p): Use range
        information via tree_nonzero_bits to check if this operand
        is suitably extended for a widening (or highpart) multiplication.
        (convert_mult_to_widen): Insert explicit casts if the RHS or LHS
        isn't already of the claimed type.

gcc/testsuite/ChangeLog
        PR target/113560
        * g++.target/i386/pr113560.C: New test case.
        * gcc.target/i386/pr113560.c: Likewise.


Thanks in advance,
Roger
--
  

Comments

Richard Biener Jan. 30, 2024, 10:04 a.m. UTC | #1
On Tue, Jan 30, 2024 at 8:33 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch resolves PR113560, a code quality regression from GCC12
> affecting x86_64, by enhancing the middle-end's tree-ssa-math-opts.cc
> to recognize more instances of widening multiplications.
>
> The widening multiplication recognition code identifies cases like:
>
>         _1 = (unsigned __int128) x;
>         __res = _1 * 100;
>
> but in the reported test case, the original input looks like:
>
>         _1 = (unsigned long long) x;
>         _2 = (unsigned __int128) _1;
>         __res = _2 * 100;
>
> which gets optimized by constant folding during tree-ssa to:
>
>         _2 = x & 18446744073709551615;  // x & 0xffffffffffffffff
>         __res = _2 * 100;
>
> where the BIT_AND_EXPR hides (has consumed) the extension operation.
> This reveals a more general deficiency (missed optimization
> opportunity) in widening multiplication recognition: additionally,
> both
>
> __int128 foo(__int128 x, __int128 y) {
>   return (x & 1000) * (y & 1000);
> }
>
> and
>
> unsigned __int128 bar(unsigned __int128 x, unsigned __int128 y) {
>   return (x >> 80) * (y >> 80);
> }
>
> should be recognized as widening multiplications.  Hence, rather than
> test explicitly for BIT_AND_EXPR (as in the first version of this patch),
> the more general solution is to make use of range information, as
> provided by tree_nonzero_bits.
>
> As a demonstration of the observed improvements, function foo above
> currently compiles with -O2 on x86_64 to:
>
> foo:    movq    %rdi, %rsi
>         movq    %rdx, %r8
>         xorl    %edi, %edi
>         xorl    %r9d, %r9d
>         andl    $1000, %esi
>         andl    $1000, %r8d
>         movq    %rdi, %rcx
>         movq    %r9, %rdx
>         imulq   %rsi, %rdx
>         movq    %rsi, %rax
>         imulq   %r8, %rcx
>         addq    %rdx, %rcx
>         mulq    %r8
>         addq    %rdx, %rcx
>         movq    %rcx, %rdx
>         ret
>
> With this patch, GCC recognizes the widening multiplication (w*) and
> instead generates:
>
> foo:    movq    %rdi, %rsi
>         movq    %rdx, %r8
>         andl    $1000, %esi
>         andl    $1000, %r8d
>         movq    %rsi, %rax
>         imulq   %r8
>         ret
>
> which is perhaps easier to understand at the tree-level where
>
> __int128 foo (__int128 x, __int128 y)
> {
>   __int128 _1;
>   __int128 _2;
>   __int128 _5;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = x_3(D) & 1000;
>   _2 = y_4(D) & 1000;
>   _5 = _1 * _2;
>   return _5;
> }
>
> gets transformed to:
>
> __int128 foo (__int128 x, __int128 y)
> {
>   __int128 _1;
>   __int128 _2;
>   __int128 _5;
>   signed long _7;
>   signed long _8;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = x_3(D) & 1000;
>   _2 = y_4(D) & 1000;
>   _7 = (signed long) _1;
>   _8 = (signed long) _2;
>   _5 = _7 w* _8;
>   return _5;
> }
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?

Nice.  I'll note that the range check works on non-assign defs ('stmt')
as well, so can you put this outside of

       stmt = SSA_NAME_DEF_STMT (rhs);
       if (is_gimple_assign (stmt))
        {

and then of course, for

+                 /* X & MODE_MASK can be simplified to (T)X.  */
+                 if (gimple_assign_rhs_code (stmt) == BIT_AND_EXPR
+                     && TREE_CODE (gimple_assign_rhs2 (stmt)) == INTEGER_CST
+                     && wi::to_wide (gimple_assign_rhs2 (stmt))
+                        == wi::mask (hprec, false, prec))

add is_gimple_assign (stmt) in the condition?

In particular this might help to detect cases where the operand is defined
by a PHI node (aka a conditional).

OK with that change.

Thanks,
Richard.

>
> 2024-01-30  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/113560
>         * tree-ssa-math-opts.cc (is_widening_mult_rhs_p): Use range
>         information via tree_nonzero_bits to check if this operand
>         is suitably extended for a widening (or highpart) multiplication.
>         (convert_mult_to_widen): Insert explicit casts if the RHS or LHS
>         isn't already of the claimed type.
>
> gcc/testsuite/ChangeLog
>         PR target/113560
>         * g++.target/i386/pr113560.C: New test case.
>         * gcc.target/i386/pr113560.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>
  

Patch

diff --git a/gcc/testsuite/g++.target/i386/pr113560.C b/gcc/testsuite/g++.target/i386/pr113560.C
new file mode 100644
index 0000000..179b68f
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr113560.C
@@ -0,0 +1,19 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-Ofast -std=c++23 -march=znver4" } */
+
+#include <immintrin.h>
+auto f(char *buf, unsigned long long in) noexcept
+{
+    unsigned long long hi{};
+    auto lo{_mulx_u64(in, 0x2af31dc462ull, &hi)};
+    lo = _mulx_u64(lo, 100, &hi);
+    __builtin_memcpy(buf + 2, &hi, 2);
+    return buf + 10;
+}
+
+/* { dg-final { scan-assembler-times "mulx" 1 } } */
+/* { dg-final { scan-assembler-times "mulq" 1 } } */
+/* { dg-final { scan-assembler-not "addq" } } */
+/* { dg-final { scan-assembler-not "adcq" } } */
+/* { dg-final { scan-assembler-not "salq" } } */
+/* { dg-final { scan-assembler-not "shldq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr113560.c b/gcc/testsuite/gcc.target/i386/pr113560.c
new file mode 100644
index 0000000..ac2e01a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr113560.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+unsigned __int128 foo(unsigned __int128 x, unsigned __int128 y)
+{
+  return (x & 1000) * (y & 1000);
+}
+
+__int128 bar(__int128 x, __int128 y)
+{
+  return (x & 1000) * (y & 1000);
+}
+
+/* { dg-final { scan-assembler-times "\tmulq" 1 } } */
+/* { dg-final { scan-assembler-times "\timulq" 1 } } */
+/* { dg-final { scan-assembler-not "addq" } } */
+/* { dg-final { scan-assembler-not "xorl" } } */
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 2db26e4..010fec4 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -2555,9 +2555,43 @@  is_widening_mult_rhs_p (tree type, tree rhs, tree *type_out,
       stmt = SSA_NAME_DEF_STMT (rhs);
       if (is_gimple_assign (stmt))
 	{
-	  if (! widening_mult_conversion_strippable_p (type, stmt))
-	    rhs1 = rhs;
-	  else
+	  /* Use tree_non_zero_bits to see if this operand is zero_extended
+	     for unsigned widening multiplications or non-negative for
+	     signed widening multiplications.  */
+	  if (TREE_CODE (type) == INTEGER_TYPE
+	      && (TYPE_PRECISION (type) & 1) == 0
+	      && int_mode_for_size (TYPE_PRECISION (type) / 2, 1).exists ())
+	    {
+	      unsigned int prec = TYPE_PRECISION (type);
+	      unsigned int hprec = prec / 2;
+	      wide_int bits = wide_int::from (tree_nonzero_bits (rhs),
+					      prec,
+					      TYPE_SIGN (TREE_TYPE (rhs)));
+	      if (TYPE_UNSIGNED (type)
+		  && wi::bit_and (bits, wi::mask (hprec, true, prec)) == 0)
+		{
+		  *type_out = build_nonstandard_integer_type (hprec, true);
+		  /* X & MODE_MASK can be simplified to (T)X.  */
+		  if (gimple_assign_rhs_code (stmt) == BIT_AND_EXPR
+		      && TREE_CODE (gimple_assign_rhs2 (stmt)) == INTEGER_CST
+		      && wi::to_wide (gimple_assign_rhs2 (stmt))
+			 == wi::mask (hprec, false, prec))
+		    *new_rhs_out = gimple_assign_rhs1 (stmt);
+		  else
+		    *new_rhs_out = rhs;
+		  return true;
+		}
+	      else if (!TYPE_UNSIGNED (type)
+		       && wi::bit_and (bits, wi::mask (hprec - 1, true, prec))
+			  == 0)
+		{
+		  *type_out = build_nonstandard_integer_type (hprec, false);
+		  *new_rhs_out = rhs;
+		  return true;
+		}
+	    }
+
+	  if (widening_mult_conversion_strippable_p (type, stmt))
 	    {
 	      rhs1 = gimple_assign_rhs1 (stmt);
 
@@ -2568,6 +2602,8 @@  is_widening_mult_rhs_p (tree type, tree rhs, tree *type_out,
 		  return true;
 		}
 	    }
+	  else
+	    rhs1 = rhs;
 	}
       else
 	rhs1 = rhs;
@@ -2827,12 +2863,16 @@  convert_mult_to_widen (gimple *stmt, gimple_stmt_iterator *gsi)
   if (2 * actual_precision > TYPE_PRECISION (type))
     return false;
   if (actual_precision != TYPE_PRECISION (type1)
-      || from_unsigned1 != TYPE_UNSIGNED (type1))
+      || from_unsigned1 != TYPE_UNSIGNED (type1)
+      || (TREE_TYPE (rhs1) != type1
+	  && TREE_CODE (rhs1) != INTEGER_CST))
     rhs1 = build_and_insert_cast (gsi, loc,
 				  build_nonstandard_integer_type
 				    (actual_precision, from_unsigned1), rhs1);
   if (actual_precision != TYPE_PRECISION (type2)
-      || from_unsigned2 != TYPE_UNSIGNED (type2))
+      || from_unsigned2 != TYPE_UNSIGNED (type2)
+      || (TREE_TYPE (rhs2) != type2
+	  && TREE_CODE (rhs2) != INTEGER_CST))
     rhs2 = build_and_insert_cast (gsi, loc,
 				  build_nonstandard_integer_type
 				    (actual_precision, from_unsigned2), rhs2);