[0/3] Disable generating store vector pair.

Message ID Yp6hdmJqK1oDsbzB@toto.the-meissners.org

Michael Meissner June 7, 2022, 12:53 a.m. UTC
  [PATCH 0/3] Disable generating store vector pair.

Testing has revealed that power10 slows down in some cases when the store
vector pair instructions are generated.  This patch disables generating
the store vector pair instructions (stxvp, pstxvp, and stxvpx) unless an
undocumented switch (-mstore-vector-pair) is used.  It is anticipated that
on future machines we can generate the store vector pair instructions.

This patch does a split after reload to convert a store vector pair
instruction into a pair of store vector instructions.
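
As a rough illustration of the effect of that split (a portable C model, not
the actual rs6000 machine description), a 32-byte pair store becomes two
16-byte stores at offsets 0 and 16.  The types vec16 and vec_pair and all
function names are made up for this sketch, and the model ignores how the
real backend orders the two registers of the pair on big vs. little endian.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model types: a 16-byte vector and a 32-byte vector pair.  */
typedef struct { uint8_t b[16]; } vec16;
typedef struct { vec16 half[2]; } vec_pair;

/* Models one 32-byte store, as a store vector pair would do.  */
void store_pair (void *dst, const vec_pair *src)
{
  memcpy (dst, src, sizeof (vec_pair));
}

/* Models the form after the split: two separate 16-byte stores.  */
void store_pair_split (void *dst, const vec_pair *src)
{
  memcpy ((uint8_t *) dst, &src->half[0], sizeof (vec16));
  memcpy ((uint8_t *) dst + 16, &src->half[1], sizeof (vec16));
}

/* Returns 1 if both forms leave identical bytes in memory.  */
int split_matches_pair_store (void)
{
  vec_pair p;
  uint8_t a[32], b[32];

  assert (sizeof (vec_pair) == 32);
  for (int i = 0; i < 32; i++)
    ((uint8_t *) &p)[i] = (uint8_t) (i * 7 + 1);

  store_pair (a, &p);
  store_pair_split (b, &p);
  return memcmp (a, b, 32) == 0;
}
```

The point of the model is only that the split changes the number of store
instructions, not the bytes that end up in memory.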

We do continue to generate the load vector pair instructions (lxvp, plxvp,
and lxvpx), since we have found that in code that heavily uses MMA, it is
still a win to generate the load vector pair instructions.

There are 3 patches in this set:

    1)  Disable the generation of the stxvp, stxvpx, and pstxvp instructions
        for stores of OOmode and XOmode.

    2)  Disable block moves from generating load/store vector pair
        instructions unless the store vector pair instructions are
        being generated.  With patch #1 installed, the block move code will
        generate a load vector pair and store vector pair combination, but
        after reload, the store vector pair instructions are split into two
        separate store vector instructions.

    3)  Fix up the mma test suite to deal with store vector pair not being
        generated by default.  In most of the tests, I just deleted the lines
        that counted the store vector pair instructions.  In a few of the
        tests, I explicitly passed the -mstore-vector-pair option since
        the point of the test was to generate store vector pair instructions.
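
For readers unfamiliar with the block move case in patch 2, the following is
a minimal sketch of the kind of source that can reach the block move code: a
fixed-size 32-byte copy, which is the size of one vector pair.  The struct
and function names are hypothetical, and whether a given compiler and option
set expands this inline, and with which instructions, is an assumption that
has to be checked in the generated assembly.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 32-byte aggregate, the same size as one vector pair.  */
struct blk32 { unsigned char bytes[32]; };

/* Fixed-size 32-byte copy.  A compiler may expand this inline as a
   block move instead of calling memcpy.  */
void copy_blk32 (struct blk32 *dst, const struct blk32 *src)
{
  memcpy (dst, src, sizeof *dst);
}

/* Returns 1 if the copy reproduces the source bytes.  */
int copy_blk32_works (void)
{
  struct blk32 src, dst;

  assert (sizeof (struct blk32) == 32);
  for (int i = 0; i < 32; i++)
    {
      src.bytes[i] = (unsigned char) (i + 1);
      dst.bytes[i] = 0;
    }

  copy_blk32 (&dst, &src);
  return memcmp (&dst, &src, sizeof src) == 0;
}
```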

There is a 4th patch that Peter Bergner will be developing.  This patch will
update the built-in functions for load and store vector pair, so that these
built-ins will always generate the lxvp and stxvp instructions.
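
For reference, the built-ins in question are used roughly as follows.  This
sketch assumes the signatures given in the GCC manual,
__builtin_vsx_lxvp (size_t, __vector_pair *) and
__builtin_vsx_stxvp (__vector_pair, size_t, __vector_pair *), and assumes
that __MMA__ is defined whenever they are available.  The function name
copy_pair and the memcpy fallback are hypothetical and exist only so the
example also builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined (__MMA__)
/* Power10 with MMA enabled: load 32 bytes with the load vector pair
   built-in and store them with the store vector pair built-in.  */
void copy_pair (void *dst, const void *src)
{
  __vector_pair vp = __builtin_vsx_lxvp (0, (__vector_pair *) src);
  __builtin_vsx_stxvp (vp, 0, (__vector_pair *) dst);
}
#else
/* Fallback for other targets (an assumption of this sketch): a plain
   32-byte copy with the same observable result.  */
void copy_pair (void *dst, const void *src)
{
  memcpy (dst, src, 32);
}
#endif

/* Returns 1 if copy_pair moves all 32 bytes unchanged.  */
int copy_pair_works (void)
{
  unsigned char src[32], dst[32];

  for (int i = 0; i < 32; i++)
    {
      src[i] = (unsigned char) (255 - i);
      dst[i] = 0;
    }

  copy_pair (dst, src);
  return memcmp (dst, src, 32) == 0;
}
```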

I have built bootstrap compilers and run the regression tests on three
different systems:

    1)	Little endian power10 using the --with-cpu=power10 option.

    2)	Little endian power9 using the --with-cpu=power9 option.

    3)	Big endian power8 using the --with-cpu=power8 option.  On this system,
	both 64-bit and 32-bit code generation was tested.

Once all 3 patches have been applied, there are no regressions in the runs.