gcn/gcn-hsa.h: Always pass --amdhsa-code-object-version= in ASM_SPEC

Message ID b9b445cf-03ef-4010-9fe7-a7f29dcfa7b0@baylibre.com
State New
Headers
Series gcn/gcn-hsa.h: Always pass --amdhsa-code-object-version= in ASM_SPEC |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed

Commit Message

Tobias Burnus Jan. 25, 2024, 11:03 p.m. UTC
  When targeting AMD GPUs, the LLVM assembler (and linker) are used.

Two days ago LLVM changed the default for theAMDHSA code object version (COV) from 4 to 5. In principle, we do not 
care which COV is used as long as it works; unfortunately, 
"mkoffload.cc" also generates an object file directly, bypassing the AMD 
GPU compiler as it copies debugging data to that file. That object file 
must have the same COV version (ELF ABI version) as compiler + llvm-mc 
assembler generated files. In order to ensure those are the same, this 
patch forces the use of COV 4 instead of using the default. Once GCC 
requires LLVM >= 14 instead of LLVM >= 13.0.1 we could change it. 
(Assuming that COV 5 is sufficiently stable in LLVM 14.) - But for now 
COV 4 will do.
If you wonder how this LLVM issue shows up, simply compile any OpenMP
or OpenACC program with AMD GPU offloading and enable debugging ("-g"),
e.g.
   gcc -fopenmp -g test.f90 -foffload=amdgcn-amdhsa -foffload-options=-march=gfx908

With LLVM main (to become LLVM 18), you will then get the error:

   ld: error: incompatible ABI version: /tmp/ccAKx5cz.mkoffload.dbg.o

OK for mainline?

Tobias
  

Comments

Richard Biener Jan. 26, 2024, 7:29 a.m. UTC | #1
On Fri, Jan 26, 2024 at 12:04 AM Tobias Burnus <tburnus@baylibre.com> wrote:
>
> When targeting AMD GPUs, the LLVM assembler (and linker) are used.
>
> Two days ago LLVM changed the default for the AMDHSA code object
> version (COV) from 4 to 5.
>
> In principle, we do not care which COV is used as long as it works;
> unfortunately, "mkoffload.cc" also generates an object file directly,
> bypassing the AMD GPU compiler as it copies debugging data to that
> file. That object file must have the same COV version (ELF ABI version)
> as compiler + llvm-mc assembler generated files.
>
> In order to ensure those are the same, this patch forces the use of
> COV 4 instead of using the default. Once GCC requires LLVM >= 14
> instead of LLVM >= 13.0.1 we could change it. (Assuming that COV 5
> is sufficiently stable in LLVM 14.) - But for now COV 4 will do.
>
> If you wonder how this LLVM issue shows up, simply compile any OpenMP
> or OpenACC program with AMD GPU offloading and enable debugging ("-g"),
> e.g.
>   gcc -fopenmp -g test.f90 -foffload=amdgcn-amdhsa -foffload-options=-march=gfx908
>
> With LLVM main (to become LLVM 18), you will then get the error:
>
>   ld: error: incompatible ABI version: /tmp/ccAKx5cz.mkoffload.dbg.o
>
> OK for mainline?

If you link against prebuilt objects with COV 5 it seems there's no way to
override the COV version GCC uses?  That is, do we want to add
a -mcode-object-version=... option to allow the user to override this
(and ABI_VERSION_SPEC honoring that, if specified and of course
mkoffload following suit)?

Otherwise looks OK in the meantime.

Richard.

> Tobias
  
Andrew Stubbs Jan. 26, 2024, 9:59 a.m. UTC | #2
On 25/01/2024 23:03, Tobias Burnus wrote:
> When targeting AMD GPUs, the LLVM assembler (and linker) are used.
> 
> Two days ago LLVM changed the default for theAMDHSA code object version (COV) from 4 to 5. In principle, we do not 
> care which COV is used as long as it works; unfortunately, 
> "mkoffload.cc" also generates an object file directly, bypassing the AMD 
> GPU compiler as it copies debugging data to that file. That object file 
> must have the same COV version (ELF ABI version) as compiler + llvm-mc 
> assembler generated files. In order to ensure those are the same, this 
> patch forces the use of COV 4 instead of using the default. Once GCC 
> requires LLVM >= 14 instead of LLVM >= 13.0.1 we could change it. 
> (Assuming that COV 5 is sufficiently stable in LLVM 14.) - But for now 
> COV 4 will do.
> If you wonder how this LLVM issue shows up, simply compile any OpenMP
> or OpenACC program with AMD GPU offloading and enable debugging ("-g"),
> e.g.
>    gcc -fopenmp -g test.f90 -foffload=amdgcn-amdhsa -foffload-options=-march=gfx908
> 
> With LLVM main (to become LLVM 18), you will then get the error:
> 
>    ld: error: incompatible ABI version: /tmp/ccAKx5cz.mkoffload.dbg.o
> 
> OK for mainline?

Looks good to me.

The alternative would be to copy the elf flags from another object file; 
that probably has it's own pitfalls.

OK.

Andrew
  
Andrew Stubbs Jan. 26, 2024, 10:09 a.m. UTC | #3
On 26/01/2024 07:29, Richard Biener wrote:
> On Fri, Jan 26, 2024 at 12:04 AM Tobias Burnus <tburnus@baylibre.com> wrote:
>>
>> When targeting AMD GPUs, the LLVM assembler (and linker) are used.
>>
>> Two days ago LLVM changed the default for the AMDHSA code object
>> version (COV) from 4 to 5.
>>
>> In principle, we do not care which COV is used as long as it works;
>> unfortunately, "mkoffload.cc" also generates an object file directly,
>> bypassing the AMD GPU compiler as it copies debugging data to that
>> file. That object file must have the same COV version (ELF ABI version)
>> as compiler + llvm-mc assembler generated files.
>>
>> In order to ensure those are the same, this patch forces the use of
>> COV 4 instead of using the default. Once GCC requires LLVM >= 14
>> instead of LLVM >= 13.0.1 we could change it. (Assuming that COV 5
>> is sufficiently stable in LLVM 14.) - But for now COV 4 will do.
>>
>> If you wonder how this LLVM issue shows up, simply compile any OpenMP
>> or OpenACC program with AMD GPU offloading and enable debugging ("-g"),
>> e.g.
>>    gcc -fopenmp -g test.f90 -foffload=amdgcn-amdhsa -foffload-options=-march=gfx908
>>
>> With LLVM main (to become LLVM 18), you will then get the error:
>>
>>    ld: error: incompatible ABI version: /tmp/ccAKx5cz.mkoffload.dbg.o
>>
>> OK for mainline?
> 
> If you link against prebuilt objects with COV 5 it seems there's no way to
> override the COV version GCC uses?  That is, do we want to add
> a -mcode-object-version=... option to allow the user to override this
> (and ABI_VERSION_SPEC honoring that, if specified and of course
> mkoffload following suit)?
> 
> Otherwise looks OK in the meantime.

We don't have a stable ABI, so trying to link against foreign binaries 
is already a problem. Most recently, the SIMD clone implementation 
required a change to the procedure calling ABI, the reverse-offload 
changes reimplemented the stack setup, and the low-latency memory 
patches changed the way we use local memories and needed more info 
passed into the device runtime. I expect more of this in future.

Compatibility across GCC versions doesn't really exist, and 
compatibility with LLVM-binaries is a non-starter.

Andrew
  
Tobias Burnus Jan. 26, 2024, 10:39 a.m. UTC | #4
Hi all,

Andrew Stubbs wrote:
> On 26/01/2024 07:29, Richard Biener wrote:
>> If you link against prebuilt objects with COV 5 it seems there's no 
>> way to
>> override the COV version GCC uses?  That is, do we want to add
>> a -mcode-object-version=... option to allow the user to override this
>> (and ABI_VERSION_SPEC honoring that, if specified and of course
>> mkoffload following suit)?

For completeness, I added such a feature, see attachment. (Actually, 
'=0' could be permitted for mkoffload without "-g" debugging enabled.)

However, the real problem is that one usually also has libraries build 
with the default such as libc, libm, libgomp, ... Thus, specifying 
anything else but GCC's default is likely to break.

Hence and also because of the following, I think it doesn't make sense 
to add:

> We don't have a stable ABI, so trying to link against foreign binaries 
> is already a problem. Most recently, the SIMD clone implementation 
> required a change to the procedure calling ABI, the reverse-offload 
> changes reimplemented the stack setup, and the low-latency memory 
> patches changed the way we use local memories and needed more info 
> passed into the device runtime. I expect more of this in future.


PS: The original patch has been committed as r14-8449-g4b5650acb31072.

Tobias
  
Andrew Stubbs Jan. 26, 2024, 11:33 a.m. UTC | #5
On 26/01/2024 10:39, Tobias Burnus wrote:
> Hi all,
> 
> Andrew Stubbs wrote:
>> On 26/01/2024 07:29, Richard Biener wrote:
>>> If you link against prebuilt objects with COV 5 it seems there's no 
>>> way to
>>> override the COV version GCC uses?  That is, do we want to add
>>> a -mcode-object-version=... option to allow the user to override this
>>> (and ABI_VERSION_SPEC honoring that, if specified and of course
>>> mkoffload following suit)?
> 
> For completeness, I added such a feature, see attachment. (Actually, 
> '=0' could be permitted for mkoffload without "-g" debugging enabled.)
> 
> However, the real problem is that one usually also has libraries build 
> with the default such as libc, libm, libgomp, ... Thus, specifying 
> anything else but GCC's default is likely to break.
> 
> Hence and also because of the following, I think it doesn't make sense 
> to add:
> 
>> We don't have a stable ABI, so trying to link against foreign binaries 
>> is already a problem. Most recently, the SIMD clone implementation 
>> required a change to the procedure calling ABI, the reverse-offload 
>> changes reimplemented the stack setup, and the low-latency memory 
>> patches changed the way we use local memories and needed more info 
>> passed into the device runtime. I expect more of this in future.
> 
> 
> PS: The original patch has been committed as r14-8449-g4b5650acb31072.
> 
> Tobias

Agreed, there's no point in having a knob that only has one valid 
setting, especially when we want to be able to change that setting 
without breaking third-party scripts that choose to that knob "for 
completeness", or something.

The toolchain can have an opinion about which is the correct COV, and I 
think not relying on the LLVM default choice makes sense also. I think 
COV 5 is only a small update, but there's no reason to imagine that COV6 
will Just Work (COV2 to COV3, and COV3 to COV4 required real effort).

We can move on to COV5 for GCC 15, probably. I'm not aware of any great 
blocker, but it sets a minimum LLVM.

Andrew
  
Tobias Burnus Jan. 26, 2024, 11:38 a.m. UTC | #6
Andrew Stubbs wrote:
> We can move on to COV5 for GCC 15, probably. I'm not aware of any 
> great blocker, but it sets a minimum LLVM.

And as our testing hardware showed, it also bumps the minimal ROCm to 
5.2 (as 5.1 fails with COV5).

Otherwise, as mentioned, COV5 was added to LLVM 14, but as we already 
require 13.0.1; thus, we would just have to require a half-a-year newer 
LLVM, which could be indeed fine for GCC 15.

Tobias
  

Patch

gcn/gcn-hsa.h: Always pass --amdhsa-code-object-version= in ASM_SPEC

Since LLVM commit 082f87c9d418 (Pull Req. #79038; will become LLVM 18)
  "[AMDGPU] Change default AMDHSA Code Object version to 5"
the default - when no --amdhsa-code-object-version= is used - was bumped.

Using --amdhsa-code-object-version=5 is supported (with unknown limitations)
since LLVM 14. GCC required for proper support at least LLVM 13.0.1 such
that explicitly using COV5 is not possible.

Unfortunately, the COV number matters for debugging ("-g") as mkoffload.cc
extracts debugging data from the host's object file and writes into an
an AMD GPU object file it creates. And all object files linked together
must have the same ABI version. 

gcc/ChangeLog:

	* config/gcn/gcn-hsa.h (ABI_VERSION_SPEC): New; creates the
	"--amdhsa-code-object-version=" argument.
	(ASM_SPEC): Use it; replace previous version of it.

Signed-off-by: Tobias Burnus <tburnus@baylibre.com>

diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
index f5de0d2969f..e5b93f7d9e5 100644
--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -75,6 +75,21 @@  extern unsigned int gcn_local_sym_hash (const char *name);
    supported for gcn.  */
 #define GOMP_SELF_SPECS ""
 
+/* Explicitly set the ABI version; in principle, we could use just the
+   default; however, when debugging symbols are turned on, mkoffload.cc
+   writes a new AMD GPU object file and the ABI version needs to be the
+   same. - LLVM <= 17 defaults to 4 while LLVM >= 18 defaults to 5.
+   GCC supports LLVM >= 13.0.1 and only LLVM >= 14 supports version 5.
+   Note that Fiji is only suppored with LLVM <= 17 as version 3 i no longer
+   supported in LLVM >= 18.  */
+#define ABI_VERSION_SPEC "march=fiji:--amdhsa-code-object-version=3;" \
+			 "!march=*|march=*:--amdhsa-code-object-version=4"
+
+/* Note that the XNACK and SRAM-ECC settings must match those in mkoffload.cc
+   as the latter creates new ELF object file when debugging is enabled and
+   the ELF flags (e_flags) of that generated file must be identical to those
+   generated by the compiler.  */
+
 #define NO_XNACK "march=fiji:;march=gfx1030:;march=gfx1100:;" \
     /* These match the defaults set in gcn.cc.  */ \
     "!mxnack*|mxnack=default:%{march=gfx900|march=gfx906|march=gfx908:-mattr=-xnack};"
@@ -88,7 +103,7 @@  extern unsigned int gcn_local_sym_hash (const char *name);
 /* Use LLVM assembler and linker options.  */
 #define ASM_SPEC  "-triple=amdgcn--amdhsa "  \
 		  "%{march=*:-mcpu=%*} " \
-		  "%{!march=*|march=fiji:--amdhsa-code-object-version=3} " \
+		  "%{" ABI_VERSION_SPEC "} " \
 		  "%{" NO_XNACK XNACKOPT "} " \
 		  "%{" NO_SRAM_ECC SRAMOPT "} " \
 		  "%{march=gfx1030|march=gfx1100:-mattr=+wavefrontsize64} " \