gcn/gcn-valu.md: Disable fold_left_plus for TARGET_RDNA2_PLUS [PR113615]

Message ID f911f018-ce39-470a-9c56-fc7e596db8f2@baylibre.com
State New
Headers
Series gcn/gcn-valu.md: Disable fold_left_plus for TARGET_RDNA2_PLUS [PR113615] |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm warning Patch is already merged
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 warning Patch is already merged

Commit Message

Tobias Burnus Jan. 29, 2024, 10:34 a.m. UTC
  Andrew wrote off list:
   "Vector reductions don't work on RDNA, as is, but they're
    supposed to be disabled by the insn condition"

This patch disables "fold_left_plus_<mode>", which is about
vectorization and in the code path shown in the backtrace.
I can also confirm manually that it fixes the ICE I saw and
also the ICE for the testfile that Richard's PR shows at the
end of his backtrace.  (-O3 is needed to trigger the ICE.)

OK for mainline?

Tobias

* * *

PS: We could add testcase(s) that is/are explicitly compiled with
gfx1100 and/or gfx1030 + '-O3' to ensure that this gets tested
with AMDGPU enabled, but I am not sure whether it is really worthwhile.


PPS: Running the testsuite, I see the following fails with
gfx1100 offloading:

FAIL: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors)
Excess errors:
/tmp/ccrsHfVQ.mkoffload.2.s:788736:27: error: value out of range
           .amdhsa_next_free_vgpr        516                                        ^~~ [Obviously, likewise forlibgomp.c++/../libgomp.c-c++-common/for-5.c]
FAIL:libgomp.c/pr104783-2.c execution test FAIL:libgomp.c/pr104783.c 
execution test (The .log unfortunately does not show more details) 
FAIL:libgomp.fortran/optional-map.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors) FAIL:libgomp.fortran/optional-map.f90   -O3 -g  (test for 
excess errors) FAIL: libgomp.fortran/target1.f90   -O3 
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer 
-finline-functions  (test for excess errors) FAIL: 
libgomp.fortran/target1.f90   -O3 -g  (test for excess errors)Same 'out 
of range' as above. * * * Manual testing shows for the two execution 
fails: Memory access fault by GPU node-1 (Agent handle: 0x8d1aa0) on 
address (nil). Reason: Page not present or supervisor privilege. 
Interestingly, it only fails with -O1 or higher, for -O0 it works. Tobias
  

Comments

Andrew Stubbs Jan. 29, 2024, 12:34 p.m. UTC | #1
On 29/01/2024 10:34, Tobias Burnus wrote:
> Andrew wrote off list:
>    "Vector reductions don't work on RDNA, as is, but they're
>     supposed to be disabled by the insn condition"
> 
> This patch disables "fold_left_plus_<mode>", which is about
> vectorization and in the code path shown in the backtrace.
> I can also confirm manually that it fixes the ICE I saw and
> also the ICE for the testfile that Richard's PR shows at the
> end of his backtrace.  (-O3 is needed to trigger the ICE.)
> 
> OK for mainline?

OK.

> Tobias
> 
> * * *
> 
> PS: We could add testcase(s) that is/are explicitly compiled with
> gfx1100 and/or gfx1030 + '-O3' to ensure that this gets tested
> with AMDGPU enabled, but I am not sure whether it is really worthwhile.
> 
> 
> PPS: Running the testsuite, I see the following fails with
> gfx1100 offloading:
> 
> FAIL: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors)
> Excess errors:
> /tmp/ccrsHfVQ.mkoffload.2.s:788736:27: error: value out of range
>            .amdhsa_next_free_vgpr        516 
>                                         ^~~ [Obviously, likewise 
> forlibgomp.c++/../libgomp.c-c++-common/for-5.c]
> FAIL:libgomp.c/pr104783-2.c execution test FAIL:libgomp.c/pr104783.c 
> execution test (The .log unfortunately does not show more details) 
> FAIL:libgomp.fortran/optional-map.f90   -O3 -fomit-frame-pointer 
> -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
> excess errors) FAIL:libgomp.fortran/optional-map.f90   -O3 -g  (test for 
> excess errors) FAIL: libgomp.fortran/target1.f90   -O3 
> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer 
> -finline-functions  (test for excess errors) FAIL: 
> libgomp.fortran/target1.f90   -O3 -g  (test for excess errors)Same 'out 
> of range' as above. * * * Manual testing shows for the two execution 
> fails: Memory access fault by GPU node-1 (Agent handle: 0x8d1aa0) on 
> address (nil). Reason: Page not present or supervisor privilege. 
> Interestingly, it only fails with -O1 or higher, for -O0 it works. Tobias

Hmm, supposedly there are 768 registers allocated in groups of 12, on 
gfx1100 (8 on other devices), which number you have to double on 
wavefrontsize64 because that field actually counts the number of 32-lane 
registers. The ISA can only actually reference 256 registers, so the 
limit here should be 512. (The remaining registers are intended for 
other wavefronts to use.)

But 256 is not divisible by 12, and it looks like we've rounded up. I 
guess we need to set the limit at 252 (504), for gfx1100.

for-5.c is a register allocation nightmare!

Andrew
  
Tobias Burnus Jan. 29, 2024, 12:50 p.m. UTC | #2
Andrew Stubbs wrote:
>> /tmp/ccrsHfVQ.mkoffload.2.s:788736:27: error: value out of range
>>            .amdhsa_next_free_vgpr        516 
>>                                         ^~~ [Obviously, likewise 
>> forlibgomp.c++/..
> Hmm, supposedly there are 768 registers allocated in groups of 12, on 
> gfx1100 (8 on other devices), which number you have to double on 
> wavefrontsize64 because that field actually counts the number of 
> 32-lane registers. The ISA can only actually reference 256 registers, 
> so the limit here should be 512. (The remaining registers are intended 
> for other wavefronts to use.)
>
> But 256 is not divisible by 12, and it looks like we've rounded up. I 
> guess we need to set the limit at 252 (504), for gfx1100.

BTW: The LLVM source code has,
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1066

unsigned getTotalNumVGPRs(const MCSubtargetInfo *STI) {
   if (STI->getFeatureBits().test(FeatureGFX90AInsts))
     return 512;
   if (!isGFX10Plus(*STI))
     return 256;
   bool IsWave32 = STI->getFeatureBits().test(FeatureWavefrontSize32);
   if (STI->getFeatureBits().test(FeatureGFX11FullVGPRs))
     return IsWave32 ? 1536 : 768;
   return IsWave32 ? 1024 : 512;
}


Tobias
  
Andrew Stubbs Jan. 29, 2024, 3:17 p.m. UTC | #3
On 29/01/2024 12:50, Tobias Burnus wrote:
> Andrew Stubbs wrote:
>>> /tmp/ccrsHfVQ.mkoffload.2.s:788736:27: error: value out of range
>>>            .amdhsa_next_free_vgpr        516 
>>>                                         ^~~ [Obviously, likewise 
>>> forlibgomp.c++/..
>> Hmm, supposedly there are 768 registers allocated in groups of 12, on 
>> gfx1100 (8 on other devices), which number you have to double on 
>> wavefrontsize64 because that field actually counts the number of 
>> 32-lane registers. The ISA can only actually reference 256 registers, 
>> so the limit here should be 512. (The remaining registers are intended 
>> for other wavefronts to use.)
>>
>> But 256 is not divisible by 12, and it looks like we've rounded up. I 
>> guess we need to set the limit at 252 (504), for gfx1100.
> 
> BTW: The LLVM source code has,
> https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1066
> 
> unsigned getTotalNumVGPRs(const MCSubtargetInfo *STI) {
>    if (STI->getFeatureBits().test(FeatureGFX90AInsts))
>      return 512;
>    if (!isGFX10Plus(*STI))
>      return 256;
>    bool IsWave32 = STI->getFeatureBits().test(FeatureWavefrontSize32);
>    if (STI->getFeatureBits().test(FeatureGFX11FullVGPRs))
>      return IsWave32 ? 1536 : 768;
>    return IsWave32 ? 1024 : 512;
> }

That matches what we have in libgomp.

LLVM must have another configuration somewhere for how many registers it 
can actually use in code (the ISA can encode 256, but that doesn't mean 
it should always do so). This may be a moot point because allowing too 
many registers limits how many threads can run in parallel, so they may 
have chosen to impose an artificial limit at all times.

In GCC, non-kernel functions are limited to 24 registers (for maximum 
occupancy -- we could probably increase that 50% on "GFX11Full" 
devices), but the kernel entry point is permitted to go crazy.

Andrew
  

Patch

gcn/gcn-valu.md: Disable fold_left_plus for TARGET_RDNA2_PLUS [PR113615]

gcc/ChangeLog:

	PR target/113615
	* config/gcn/gcn-valu.md (fold_left_plus_<mode>): Only
	define for !TARGET_RDNA2_PLUS.

Signed-off-by: Tobias Burnus <tburnus@baylibre.com>

 gcc/config/gcn/gcn-valu.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index cd027f8b369..23b441f8e8b 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -4274,7 +4274,8 @@  (define_expand "fold_left_plus_<mode>"
  [(match_operand:<SCALAR_MODE> 0 "register_operand")
   (match_operand:<SCALAR_MODE> 1 "gcn_alu_operand")
   (match_operand:V_FP 2 "gcn_alu_operand")]
-  "can_create_pseudo_p ()
+  "!TARGET_RDNA2_PLUS
+   && can_create_pseudo_p ()
    && (flag_openacc || flag_openmp
        || flag_associative_math)"
   {