On 5/10/22 5:35 PM, Segher Boessenkool wrote:
> Out of interest, did you try using v,?wa (so just two alternatives, not
> four)? Or did you think it wouldresult in measurably worse code? Or
> did you decide it is not such bad backend code size explosion after
> all :-)
So I tried using just "v,?wa" instead of the 4 alternative "v,v,?d,?d"
version and that fixes the performance issue too and is simpler too.
The other option is "better", in that it can allow one operand to get
a "v" reg when the other gets a "d" reg, but I think that's just a
micro-optimization and not worth the extra complexity in the pattern.
Thanks for the suggestion!
Changes since v1:
* Use v,?wa constraints rather than v,v,?d,?d.
* Update git log entry and ChangeLog text.
When optimizing the DGEMM kernel in OpenBLAS to use MMA, the MMA code
uses all 8 accumulators, which overlap all vs0-vs31 vector registers.
Current trunk assigns one of the normal vector inputs to one of the MMA
instructions, which forces us to spill one of the accumulators to memory,
leading to poor performance. The solution here is to replace the "wa"
constraints for the vector input operands in the MMA instruction patterns
with "v,?wa" so that we prefer using the altivec registers vs32-vs63
over the vs0-vs31 registers.
Bootstrap and regtesting on powerpc64le-linux showed no regressions.
Ok for trunk and the GCC12 release branch after some burn-in time on
trunk?
Technically, the same issue exists in GCC11 and GCC10, but the RA
assignment is OK with the current code, so unless/until we have a
test case that exhibits the issue, I'm only asking for a backport
to GCC12 which does show the performance problem.
Peter
gcc/
PR target/105556
* config/rs6000/mma.md (mma_<vv>, mma_<avv>, mma_<pv>, mma_<apv>,
mma_<vvi4i4i8>, mma_<avvi4i4i8>, mma_<vvi4i4i2>, mma_<avvi4i4i2>,
mma_<vvi4i4>, mma_<avvi4i4>, mma_<pvi4i2>, mma_<apvi4i2>,
mma_<vvi4i4i4>, mma_<avvi4i4i4>): Replace "wa" constraints with "v,?wa".
Update other operands accordingly.
On Mon, May 16, 2022 at 05:31:31PM -0500, Peter Bergner wrote:
> On 5/10/22 5:35 PM, Segher Boessenkool wrote:
> > Out of interest, did you try using v,?wa (so just two alternatives, not
> > four)? Or did you think it wouldresult in measurably worse code? Or
> > did you decide it is not such bad backend code size explosion after
> > all :-)
>
> So I tried using just "v,?wa" instead of the 4 alternative "v,v,?d,?d"
> version and that fixes the performance issue too and is simpler too.
> The other option is "better", in that it can allow one operand to get
> a "v" reg when the other gets a "d" reg, but I think that's just a
> micro-optimization and not worth the extra complexity in the pattern.
> Thanks for the suggestion!
The difference is that "v,?wa" makes no difference between one or more
lower VSRs used. But whenever you would see that there is so much
register allocation pressure already that it does not change anything
materially.
> gcc/
> PR target/105556
> * config/rs6000/mma.md (mma_<vv>, mma_<avv>, mma_<pv>, mma_<apv>,
> mma_<vvi4i4i8>, mma_<avvi4i4i8>, mma_<vvi4i4i2>, mma_<avvi4i4i2>,
> mma_<vvi4i4>, mma_<avvi4i4>, mma_<pvi4i2>, mma_<apvi4i2>,
> mma_<vvi4i4i4>, mma_<avvi4i4i4>): Replace "wa" constraints with "v,?wa".
> Update other operands accordingly.
> (define_insn "mma_<vv>"
> - [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> - (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
> - (match_operand:V16QI 2 "vsx_register_operand" "wa")]
> + [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> + (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> + (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
You now have two "?" on alternative 1, instead of just one. This is the
same as if you had had
[(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,??wa")
(match_operand:V16QI 2 "vsx_register_operand" "v,wa")]
The "?" are per alternative, not really per operand. It won't change
much here of course, just penalise more than you perhaps expected.
With or without that changed: okay for trunk and for 12 (after the usual
cooldown). Thanks! Also okay for 11 and 10, shoukd you want that later
anyway.
Segher
@@ -490,50 +490,50 @@ (define_insn "mma_xxsetaccz"
[(set_attr "type" "mma")])
(define_insn "mma_<vv>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
MMA_VV))]
"TARGET_MMA"
"<vv> %A0,%x1,%x2"
[(set_attr "type" "mma")])
(define_insn "mma_<avv>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
MMA_AVV))]
"TARGET_MMA"
"<avv> %A0,%x2,%x3"
[(set_attr "type" "mma")])
(define_insn "mma_<pv>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
MMA_PV))]
"TARGET_MMA"
"<pv> %A0,%x1,%x2"
[(set_attr "type" "mma")])
(define_insn "mma_<apv>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:OO 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:OO 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
MMA_APV))]
"TARGET_MMA"
"<apv> %A0,%x2,%x3"
[(set_attr "type" "mma")])
(define_insn "mma_<vvi4i4i8>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:SI 3 "const_0_to_15_operand" "n")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "u8bit_cint_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "u8bit_cint_operand" "n,n")]
MMA_VVI4I4I8))]
"TARGET_MMA"
"<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
@@ -541,13 +541,13 @@ (define_insn "mma_<vvi4i4i8>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<avvi4i4i8>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_15_operand" "n")
- (match_operand:SI 6 "u8bit_cint_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 6 "u8bit_cint_operand" "n,n")]
MMA_AVVI4I4I8))]
"TARGET_MMA"
"<avvi4i4i8> %A0,%x2,%x3,%4,%5,%6"
@@ -555,12 +555,12 @@ (define_insn "mma_<avvi4i4i8>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<vvi4i4i2>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:SI 3 "const_0_to_15_operand" "n")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_3_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
MMA_VVI4I4I2))]
"TARGET_MMA"
"<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
@@ -568,13 +568,13 @@ (define_insn "mma_<vvi4i4i2>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<avvi4i4i2>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_15_operand" "n")
- (match_operand:SI 6 "const_0_to_3_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 6 "const_0_to_3_operand" "n,n")]
MMA_AVVI4I4I2))]
"TARGET_MMA"
"<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
@@ -582,11 +582,11 @@ (define_insn "mma_<avvi4i4i2>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<vvi4i4>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:SI 3 "const_0_to_15_operand" "n")
- (match_operand:SI 4 "const_0_to_15_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")]
MMA_VVI4I4))]
"TARGET_MMA"
"<vvi4i4> %A0,%x1,%x2,%3,%4"
@@ -594,12 +594,12 @@ (define_insn "mma_<vvi4i4>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<avvi4i4>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_15_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
MMA_AVVI4I4))]
"TARGET_MMA"
"<avvi4i4> %A0,%x2,%x3,%4,%5"
@@ -607,11 +607,11 @@ (define_insn "mma_<avvi4i4>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<pvi4i2>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:SI 3 "const_0_to_15_operand" "n")
- (match_operand:SI 4 "const_0_to_3_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 4 "const_0_to_3_operand" "n,n")]
MMA_PVI4I2))]
"TARGET_MMA"
"<pvi4i2> %A0,%x1,%x2,%3,%4"
@@ -619,12 +619,12 @@ (define_insn "mma_<pvi4i2>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<apvi4i2>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:OO 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_3_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:OO 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
MMA_APVI4I2))]
"TARGET_MMA"
"<apvi4i2> %A0,%x2,%x3,%4,%5"
@@ -632,12 +632,12 @@ (define_insn "mma_<apvi4i2>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<vvi4i4i4>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:SI 3 "const_0_to_15_operand" "n")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_15_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
MMA_VVI4I4I4))]
"TARGET_MMA"
"<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
@@ -645,13 +645,13 @@ (define_insn "mma_<vvi4i4i4>"
(set_attr "prefixed" "yes")])
(define_insn "mma_<avvi4i4i4>"
- [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
- (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
- (match_operand:V16QI 2 "vsx_register_operand" "wa")
- (match_operand:V16QI 3 "vsx_register_operand" "wa")
- (match_operand:SI 4 "const_0_to_15_operand" "n")
- (match_operand:SI 5 "const_0_to_15_operand" "n")
- (match_operand:SI 6 "const_0_to_15_operand" "n")]
+ [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+ (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+ (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+ (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+ (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 5 "const_0_to_15_operand" "n,n")
+ (match_operand:SI 6 "const_0_to_15_operand" "n,n")]
MMA_AVVI4I4I4))]
"TARGET_MMA"
"<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"