rs6000: Add support for vmsumcud and vec_msumc

Message ID b75675f7-a746-150e-1b48-31619d039893@linux.ibm.com
State New
Headers
Series rs6000: Add support for vmsumcud and vec_msumc |

Commit Message

Li, Pan2 via Gcc-patches Feb. 7, 2022, 10:20 p.m. UTC
  Hi!

I observed recently that a couple of Power10 instructions and built-in functions
were somehow not implemented.  This patch adds one of them (vmsumcud).  Although
this isn't normally stage-4 material, this is really simple and carries no
discernible risk, so I hope it can be considered.

Bootstrapped and tested on powerpc64le-linux-gnu with no regressions.  Is this
okay for trunk?

Thanks!
Bill


2022-02-07  Bill Schmidt  <wschmidt@linux.ibm.com>

gcc/
	* config/rs6000/rs6000-builtins.def (VMSUMCUD): New.
	* config/rs6000/rs6000-overload.def (VEC_MSUMC): New.
	* config/rs6000/vsx.md (UNSPEC_VMSUMCUD): New constant.
	(vmsumcud): New define_insn.

gcc/testsuite/
	* gcc.target/powerpc/vec-msumc.c: New test.
---
 gcc/config/rs6000/rs6000-builtins.def        |  3 ++
 gcc/config/rs6000/rs6000-overload.def        |  4 ++
 gcc/config/rs6000/vsx.md                     | 13 +++++++
 gcc/testsuite/gcc.target/powerpc/vec-msumc.c | 39 ++++++++++++++++++++
 4 files changed, 59 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-msumc.c
  

Comments

Segher Boessenkool Feb. 7, 2022, 11:05 p.m. UTC | #1
Hi!

On Mon, Feb 07, 2022 at 04:20:24PM -0600, Bill Schmidt wrote:
> I observed recently that a couple of Power10 instructions and built-in functions
> were somehow not implemented.  This patch adds one of them (vmsumcud).  Although
> this isn't normally stage-4 material, this is really simple and carries no
> discernible risk, so I hope it can be considered.

But what is the advantage?  That will be very tiny as well, afaics?

Ah, this implements a builtin as well.  But that builtin is not in the
PVIPR, so no one yet uses it most likely?

> gcc/
> 	* config/rs6000/rs6000-builtins.def (VMSUMCUD): New.
> 	* config/rs6000/rs6000-overload.def (VEC_MSUMC): New.
> 	* config/rs6000/vsx.md (UNSPEC_VMSUMCUD): New constant.
> 	(vmsumcud): New define_insn.
> 
> gcc/testsuite/
> 	* gcc.target/powerpc/vec-msumc.c: New test.

> +;; vmsumcud
> +(define_insn "vmsumcud"
> +[(set (match_operand:V1TI 0 "register_operand" "+v")
> +      (unspec:V1TI [(match_operand:V2DI 1 "register_operand" "v")
> +                    (match_operand:V2DI 2 "register_operand" "v")
> +		    (match_operand:V1TI 3 "register_operand" "v")]
> +		   UNSPEC_VMSUMCUD))]
> +  "TARGET_POWER10"
> +  "vmsumcud %0,%1,%2,%3"
> +  [(set_attr "type" "vecsimple")]
> +)

This can be properly described in RTL instead of using an unspec.  This
is much preferable.  I would say compare to maddhd[u], but those insns
aren't implemented either (maddld is though).


Segher
  
Li, Pan2 via Gcc-patches Feb. 8, 2022, 4:06 a.m. UTC | #2
Hi!

On 2/7/22 5:05 PM, Segher Boessenkool wrote:
> Hi!
>
> On Mon, Feb 07, 2022 at 04:20:24PM -0600, Bill Schmidt wrote:
>> I observed recently that a couple of Power10 instructions and built-in functions
>> were somehow not implemented.  This patch adds one of them (vmsumcud).  Although
>> this isn't normally stage-4 material, this is really simple and carries no
>> discernible risk, so I hope it can be considered.
> But what is the advantage?  That will be very tiny as well, afaics?
>
> Ah, this implements a builtin as well.  But that builtin is not in the
> PVIPR, so no one yet uses it most likely?

It's in the yet unpublished version of PVIPR that adds ISA 3.1 support,
currently awaiting public review.  It should have been implemented with
the rest of the ISA 3.1 built-ins.  (There are two more that were missed
as well, which I haven't yet addressed.)

>> gcc/
>> 	* config/rs6000/rs6000-builtins.def (VMSUMCUD): New.
>> 	* config/rs6000/rs6000-overload.def (VEC_MSUMC): New.
>> 	* config/rs6000/vsx.md (UNSPEC_VMSUMCUD): New constant.
>> 	(vmsumcud): New define_insn.
>>
>> gcc/testsuite/
>> 	* gcc.target/powerpc/vec-msumc.c: New test.
>> +;; vmsumcud
>> +(define_insn "vmsumcud"
>> +[(set (match_operand:V1TI 0 "register_operand" "+v")
>> +      (unspec:V1TI [(match_operand:V2DI 1 "register_operand" "v")
>> +                    (match_operand:V2DI 2 "register_operand" "v")
>> +		    (match_operand:V1TI 3 "register_operand" "v")]
>> +		   UNSPEC_VMSUMCUD))]
>> +  "TARGET_POWER10"
>> +  "vmsumcud %0,%1,%2,%3"
>> +  [(set_attr "type" "vecsimple")]
>> +)
> This can be properly described in RTL instead of using an unspec.  This
> is much preferable.  I would say compare to maddhd[u], but those insns
> aren't implemented either (maddld is though).

Is it?  Note that vmsumcud produces the carry out of the final
result, not the result itself.  I couldn't immediately see how
to express this in RTL.

The full operation multiplies the corresponding lanes of each
doubleword of arguments 1 and 2, adds them together with the
128-bit value in argument 3, and produces the carry out of the
result as a 128-bit value in the result.  I think I'd need to
have a 256-bit mode to express this properly in RTL, right?

Thanks,
Bill

>
>
> Segher
  
Segher Boessenkool Feb. 8, 2022, 3:45 p.m. UTC | #3
On Mon, Feb 07, 2022 at 10:06:36PM -0600, Bill Schmidt wrote:
> On 2/7/22 5:05 PM, Segher Boessenkool wrote:
> > On Mon, Feb 07, 2022 at 04:20:24PM -0600, Bill Schmidt wrote:
> >> I observed recently that a couple of Power10 instructions and built-in functions
> >> were somehow not implemented.  This patch adds one of them (vmsumcud).  Although
> >> this isn't normally stage-4 material, this is really simple and carries no
> >> discernible risk, so I hope it can be considered.
> > But what is the advantage?  That will be very tiny as well, afaics?
> >
> > Ah, this implements a builtin as well.  But that builtin is not in the
> > PVIPR, so no one yet uses it most likely?
> 
> It's in the yet unpublished version of PVIPR that adds ISA 3.1 support,
> currently awaiting public review.  It should have been implemented with
> the rest of the ISA 3.1 built-ins.  (There are two more that were missed
> as well, which I haven't yet addressed.)

Ugh.  Too much process, not enough speed.

> >> +;; vmsumcud
> >> +(define_insn "vmsumcud"
> >> +[(set (match_operand:V1TI 0 "register_operand" "+v")
> >> +      (unspec:V1TI [(match_operand:V2DI 1 "register_operand" "v")
> >> +                    (match_operand:V2DI 2 "register_operand" "v")
> >> +		    (match_operand:V1TI 3 "register_operand" "v")]
> >> +		   UNSPEC_VMSUMCUD))]
> >> +  "TARGET_POWER10"
> >> +  "vmsumcud %0,%1,%2,%3"
> >> +  [(set_attr "type" "vecsimple")]
> >> +)
> > This can be properly described in RTL instead of using an unspec.  This
> > is much preferable.  I would say compare to maddhd[u], but those insns
> > aren't implemented either (maddld is though).
> 
> Is it?  Note that vmsumcud produces the carry out of the final
> result, not the result itself.  I couldn't immediately see how
> to express this in RTL.

It produces thw top 128 bits of the (infinitely precise) result.  But
yeah that requires an OImode here (for the temp itself), and we do not
have that in the backend yet.

> The full operation multiplies the corresponding lanes of each
> doubleword of arguments 1 and 2, adds them together with the
> 128-bit value in argument 3, and produces the carry out of the
> result as a 128-bit value in the result.  I think I'd need to
> have a 256-bit mode to express this properly in RTL, right?

Not if you actually calculate the carry, instead of computing the
256-bit result and truncating it.  But this is very unwieldy (it
would be fine if adding just two datums, but here there are three).

Should the type be vecsimple?  Don't we have a type for multiplications?
Hrm it looks like we use veccomplex usually.

Okay for trunk with that taken care of.  Thanks!


Segher
  
Li, Pan2 via Gcc-patches Feb. 8, 2022, 6:40 p.m. UTC | #4
On 2/8/22 9:45 AM, Segher Boessenkool wrote:
> On Mon, Feb 07, 2022 at 10:06:36PM -0600, Bill Schmidt wrote:
>> On 2/7/22 5:05 PM, Segher Boessenkool wrote:
>>> On Mon, Feb 07, 2022 at 04:20:24PM -0600, Bill Schmidt wrote:
>>>> I observed recently that a couple of Power10 instructions and built-in functions
>>>> were somehow not implemented.  This patch adds one of them (vmsumcud).  Although
>>>> this isn't normally stage-4 material, this is really simple and carries no
>>>> discernible risk, so I hope it can be considered.
>>> But what is the advantage?  That will be very tiny as well, afaics?
>>>
>>> Ah, this implements a builtin as well.  But that builtin is not in the
>>> PVIPR, so no one yet uses it most likely?
>> It's in the yet unpublished version of PVIPR that adds ISA 3.1 support,
>> currently awaiting public review.  It should have been implemented with
>> the rest of the ISA 3.1 built-ins.  (There are two more that were missed
>> as well, which I haven't yet addressed.)
> Ugh.  Too much process, not enough speed.
>
>>>> +;; vmsumcud
>>>> +(define_insn "vmsumcud"
>>>> +[(set (match_operand:V1TI 0 "register_operand" "+v")
>>>> +      (unspec:V1TI [(match_operand:V2DI 1 "register_operand" "v")
>>>> +                    (match_operand:V2DI 2 "register_operand" "v")
>>>> +		    (match_operand:V1TI 3 "register_operand" "v")]
>>>> +		   UNSPEC_VMSUMCUD))]
>>>> +  "TARGET_POWER10"
>>>> +  "vmsumcud %0,%1,%2,%3"
>>>> +  [(set_attr "type" "vecsimple")]
>>>> +)
>>> This can be properly described in RTL instead of using an unspec.  This
>>> is much preferable.  I would say compare to maddhd[u], but those insns
>>> aren't implemented either (maddld is though).
>> Is it?  Note that vmsumcud produces the carry out of the final
>> result, not the result itself.  I couldn't immediately see how
>> to express this in RTL.
> It produces thw top 128 bits of the (infinitely precise) result.  But
> yeah that requires an OImode here (for the temp itself), and we do not
> have that in the backend yet.
>
>> The full operation multiplies the corresponding lanes of each
>> doubleword of arguments 1 and 2, adds them together with the
>> 128-bit value in argument 3, and produces the carry out of the
>> result as a 128-bit value in the result.  I think I'd need to
>> have a 256-bit mode to express this properly in RTL, right?
> Not if you actually calculate the carry, instead of computing the
> 256-bit result and truncating it.  But this is very unwieldy (it
> would be fine if adding just two datums, but here there are three).
>
> Should the type be vecsimple?  Don't we have a type for multiplications?
> Hrm it looks like we use veccomplex usually.
>
> Okay for trunk with that taken care of.  Thanks!

Thanks!  Revised as requested and pushed as r12-7110 (943d631abdd7be623c).

Bill

>
>
> Segher
  

Patch

diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
index d0ea54d77e4..846c0bafd45 100644
--- a/gcc/config/rs6000/rs6000-builtins.def
+++ b/gcc/config/rs6000/rs6000-builtins.def
@@ -3497,6 +3497,9 @@ 
   const signed int __builtin_altivec_vstrihr_p (vss);
     VSTRIHR_P vstrir_p_v8hi {}
 
+  const vuq __builtin_vsx_vmsumcud (vull, vull, vuq);
+    VMSUMCUD vmsumcud {}
+
   const signed int __builtin_vsx_xvtlsbb_all_ones (vsc);
     XVTLSBB_ONES xvtlsbbo {}
 
diff --git a/gcc/config/rs6000/rs6000-overload.def b/gcc/config/rs6000/rs6000-overload.def
index 5e38d597722..44e2945aaa0 100644
--- a/gcc/config/rs6000/rs6000-overload.def
+++ b/gcc/config/rs6000/rs6000-overload.def
@@ -2456,6 +2456,10 @@ 
   vuq __builtin_vec_msum (vull, vull, vuq);
     VMSUMUDM  VMSUMUDM_U
 
+[VEC_MSUMC, vec_msumc, __builtin_vec_msumc]
+  vuq __builtin_vec_msumc (vull, vull, vuq);
+    VMSUMCUD
+
 [VEC_MSUMS, vec_msums, __builtin_vec_msums]
   vui __builtin_vec_msums (vus, vus, vui);
     VMSUMUHS
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 88053f11e29..e4904102526 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -372,6 +372,7 @@  (define_c_enum "unspec"
    UNSPEC_REPLACE_UN
    UNSPEC_VDIVES
    UNSPEC_VDIVEU
+   UNSPEC_VMSUMCUD
    UNSPEC_XXEVAL
    UNSPEC_XXSPLTIW
    UNSPEC_XXSPLTIDP
@@ -6615,3 +6616,15 @@  (define_split
   emit_move_insn (operands[0], tmp4);
   DONE;
 })
+
+;; vmsumcud
+(define_insn "vmsumcud"
+[(set (match_operand:V1TI 0 "register_operand" "+v")
+      (unspec:V1TI [(match_operand:V2DI 1 "register_operand" "v")
+                    (match_operand:V2DI 2 "register_operand" "v")
+		    (match_operand:V1TI 3 "register_operand" "v")]
+		   UNSPEC_VMSUMCUD))]
+  "TARGET_POWER10"
+  "vmsumcud %0,%1,%2,%3"
+  [(set_attr "type" "vecsimple")]
+)
diff --git a/gcc/testsuite/gcc.target/powerpc/vec-msumc.c b/gcc/testsuite/gcc.target/powerpc/vec-msumc.c
new file mode 100644
index 00000000000..524a2225c6c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/vec-msumc.c
@@ -0,0 +1,39 @@ 
+/* { dg-do run { target { power10_hw } } } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
+#include <altivec.h>
+
+#define DEBUG 0
+
+#if DEBUG
+#include <stdio.h>
+#endif
+
+extern void abort (void);
+
+int
+main ()
+{
+  vector unsigned long long arg1, arg2;
+  vector unsigned __int128 arg3, result, expected;
+  unsigned __int128 c = (unsigned __int128) (-1); /* 2^128 - 1 */
+
+  arg1 = (vector unsigned long long) { 111ULL, 300ULL };
+  arg2 = (vector unsigned long long) { 700ULL, 222ULL };
+  arg3 = (vector unsigned __int128) { c };
+  expected = (vector unsigned __int128) { 1 };
+
+  result = vec_msumc (arg1, arg2, arg3);
+  if (result[0] != expected[0])
+    {
+#if DEBUG
+      printf ("ERROR, expected %d, result %d\n",
+	      (unsigned int) expected[0],
+	      (unsigned int) result[0]);
+#else
+      abort ();
+#endif
+    }
+
+  return 0;
+}