gcn/mkoffload.cc: Fix SRAM_ECC and XNACK handling [PR111966]

Message ID 45e2515e-b9aa-42a4-908d-80b2189c7aeb@baylibre.com
State New
Headers
Series gcn/mkoffload.cc: Fix SRAM_ECC and XNACK handling [PR111966] |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Testing passed

Commit Message

Tobias Burnus Jan. 24, 2024, 10:12 p.m. UTC
  This patch fixes "-g" debug compilation for gfx1100 and gfx1030,
which fail to link when "-g" is specified. The reason is:

When using gfx1100 and compiling with '-g' I was running into an error
because the eflags used for the debugger file has additional eflags
(elf flags) set - contrary to the compiled files; mkoffload writes files
itself, hence, it also needs to get the elf flags right.

It turned out that the ASM_SPEC handling was insufficiently replicated
in mkoffload, leading to issues with gfx1100 and gfx1030. I think in
some corner case, gfx906 also behaved differently; for gfx900 and fiji,
the eflags were different before, but got reset inside
copy_early_debug_info such that those difference did not matter.

OK for mainline?

Tobias

PS: I tried hard to look at the ASM_SPEC and played with different
options, looking at what really got passed to the assembler, but I
might have missed something as the code is somewhat confusing. Naming
wise, there is both UNSUPPORTED and UNSET for the same thing; it should
be a tad more consistent (flag = UNSUPPORTED, SET/TEST functions: UNSET),
still, one could also argue that a single name would do.

PPS: I think the PR is about other things in addition, but it also
kind of covers this "-g" issue and the one of previous commit. Even
if not directly addressing the issue, it is related and having the
commits listed there makes IMHO sense.
  

Comments

Andrew Stubbs Jan. 25, 2024, 10:17 a.m. UTC | #1
On 24/01/2024 22:12, Tobias Burnus wrote:
> This patch fixes "-g" debug compilation for gfx1100 and gfx1030,
> which fail to link when "-g" is specified. The reason is:
> 
> When using gfx1100 and compiling with '-g' I was running into an error
> because the eflags used for the debugger file has additional eflags
> (elf flags) set - contrary to the compiled files; mkoffload writes files
> itself, hence, it also needs to get the elf flags right.
> 
> It turned out that the ASM_SPEC handling was insufficiently replicated
> in mkoffload, leading to issues with gfx1100 and gfx1030. I think in
> some corner case, gfx906 also behaved differently; for gfx900 and fiji,
> the eflags were different before, but got reset inside
> copy_early_debug_info such that those difference did not matter.
> 
> OK for mainline?

I've got so confused trying to figure out this stuff and how it works 
with different LLVM, different defaults, different devices.

I think this patch is fine, but we should wait until we can test it on 
all those devices.

Andrew

> Tobias
> 
> PS: I tried hard to look at the ASM_SPEC and played with different
> options, looking at what really got passed to the assembler, but I
> might have missed something as the code is somewhat confusing. Naming
> wise, there is both UNSUPPORTED and UNSET for the same thing; it should
> be a tad more consistent (flag = UNSUPPORTED, SET/TEST functions: UNSET),
> still, one could also argue that a single name would do.

Sometimes not passing the -mattr flag gives "any", and sometimes 
"unsupported", and sometimes leaves the flag unset. I think it's changed 
over time as well, but mkoffload has to match precisely or it won't link. :(

> PPS: I think the PR is about other things in addition, but it also
> kind of covers this "-g" issue and the one of previous commit. Even
> if not directly addressing the issue, it is related and having the
> commits listed there makes IMHO sense.
  
Tobias Burnus Jan. 25, 2024, 12:06 p.m. UTC | #2
Andrew Stubbs wrote:
> I've got so confused trying to figure out this stuff and how it works 
> with different LLVM, different defaults, different devices.
> 
> I think this patch is fine, but we should wait until we can test it on 
> all those devices.

Well, we can test the compiling part without the need for any device.

I have now run the attached script and the result running yesterday's 
build with both my patch and your patch applied.


That's with Ubuntu 22's system LLVM 15 and readelf.

The "2>/dev/null" silences the warning:
"Missing knowledge of 32-bit reloc types used in DWARF sections of 
machine number 224"

It looks as if some cleanup is needed - both for consistency between the 
code generated for the '.s' file vs. the 'llvm-mc' ("as") command line 
and also "mkoffload.cc" does not seem to be fully correct, either.

* * *

RESULTS:

* Expected:

lto1: error: '-mxnack=on' is incompatible with '-march=fiji'
lto1: error: '-mxnack=on' is incompatible with '-march=gfx1030'
lto1: error: '-mxnack=on' is incompatible with '-march=gfx1100'


Compiler bugs (ASM_SPEC vs. gcn.cc), i.e. causing llvm-mc assember errors:

* gfx900 and gfx906 for
     -foffload-options=-mxnack=any
   or
     -foffload-options=-mxnack=on
as that sets
     ".amdhsa_reserve_xnack_mask    1"
but passes
     -mattr=-xnack


* For gfx908 and gfx90a, the SRAM argument is wrongly passed, e.g.,
      -mattr=-xnack-mattr=-sramecc
such that one gets the messages:
'-xnack-mattr=-sramecc' is not a recognized feature for this target
'-xnack-mattr=+sramecc' is not a recognized feature for this target
'-xnack-mattr=-sramecc' is not a recognized feature for this target
'-xnack-mattr=+sramecc' is not a recognized feature for this target
'-xnack-mattr=-sramecc' is not a recognized feature for this target
'-xnack-mattr=+sramecc' is not a recognized feature for this target
'-xnack-mattr=-sramecc' is not a recognized feature for this target
'-xnack-mattr=+sramecc' is not a recognized feature for this target
'-xnack-mattr=-sramecc' is not a recognized feature for this target
'-xnack-mattr=+sramecc' is not a recognized feature for this target
'+xnack-mattr=-sramecc' is not a recognized feature for this target
'+xnack-mattr=+sramecc' is not a recognized feature for this target


Now to my patch for mkoffload:

* For the debug-file generation, it fails with
   "incompatible sramecc"
(a) for gfx900:
    -foffload-options=-msram-ecc=off
    -foffload-options=-msram-ecc=any
    -foffload-options=-msram-ecc=on
those cause that SRAM ECC is set but it shouldn't:
   Flags:   0x22c  - a.xamdgcn-amdhsa.mkoffload.hsaco
   Flags:   0x62c  - a.xamdgcn-amdhsa.mkoffload.dbg0.o

=> FIX: unconditional use
     case EF_AMDGPU_MACH_AMDGCN_GFX900:
       SET_SRAM_ECC_UNSET (elf_flags);

(b) for gfx906, here is the reverse true:
   Flags:   0x62f  - a.xamdgcn-
   Flags:   0x22f  - a.xamdgcn-amdhsa.mkoffload.dbg0.o
Namely, it should set SRAMECC_ANY_V4 (0x400) independent of the passed 
-msram=... value.

=> FIX: unconditional use
     case EF_AMDGPU_MACH_AMDGCN_GFX906:
       SET_XNACK_ANY (elf_flags);

And for gfx908/gfx90a, I think it makes sense to fix first the compiler
issue; I think gfx90a should be fine, for gfx908 I am less sure.

Tobias
  

Patch

gcn/mkoffload.cc: Fix SRAM_ECC and XNACK handling [PR111966]

Some more '-g' fixes as the .mkoffload.dbg.o debug file's has elf flags
which did not match those generated for the compilation, leading to linker
errors.  For .mkoffload.dbg.o, the elf flags are generated by mkoffload
itself - while for the other .o files, that's done by the compiler via
setting default and mainly via the ASM_SPEC.

This is a follow up to r14-8332-g13127dac106724 which fixed an issue
caused by the default arch.  In this patch, it is mainly for gfx1100
and gfx1030 which always failed.  It also affects gfx906 and possibly
gfx900 but only when using the -mxnack/-msram-ecc flags explicitly.

What happens on the compiler side is mainly determined by gcn-hsa.h's
and otherwise by some default setting. In particular for xnack and
sram_ecc, there is:

For gfx1100 and gfx1030, neither xnack nor sram_ecc is set (only
'+wavefrontsize64'). 

For fiji, gfx900 and gfx906 there is always -mattr=-xnack and
no -msram-ecc - independent of what has been passed to the compiler.
On elf flags level, that's not quite true as for fiji those flags are
set to 0 because of HSACOv3 and for gfx900 there is a remark that even
if -msram-ecc were set (which it isn't), the elf flags would be still
zero because of a buggy assembler. See copy_early_debug_info.

For gfx90a, the -msram-ecc= and -mxnack= are passed on, or if not present,
...=any is passed on.  Note that this "any" is different from argument
nor present at elf flag level:
For XNACK: unset/unsupported is 0, any = 0x100, off = 0x200, on = 0x300.
For SRAMECC: unset/unsupported is 0, any = 0x400, off = 0x800, on = 0xc00.

The obstack_ptr_grow changes are more to avoid confusion than having an
actual effect as they would overwise be filtered out via the ASM_SPEC.

gcc/ChangeLog:

	PR other/111966
	* config/gcn/mkoffload.cc (SET_XNACK_UNSET, TEST_SRAM_ECC_UNSET): New.
	(SET_SRAM_ECC_UNSUPPORTED): Renamed to ...
	(SET_SRAM_ECC_UNSET): ... this.
	(main): Update SRAM_ECC and XNACK for the -march as done in gcn-hsa.h.

Signed-off-by: Tobias Burnus <tburnus@baylibre.com>


 gcc/config/gcn/mkoffload.cc | 38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc
index 0d0e7bac9b2..218ad888aea 100644
--- a/gcc/config/gcn/mkoffload.cc
+++ b/gcc/config/gcn/mkoffload.cc
@@ -80,6 +80,8 @@ 
 				  | EF_AMDGPU_FEATURE_XNACK_ANY_V4)
 #define SET_XNACK_OFF(VAR) VAR = ((VAR & ~EF_AMDGPU_FEATURE_XNACK_V4) \
 				  | EF_AMDGPU_FEATURE_XNACK_OFF_V4)
+#define SET_XNACK_UNSET(VAR) VAR = ((VAR & ~EF_AMDGPU_FEATURE_XNACK_V4) \
+				    | EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4)
 #define TEST_XNACK_ANY(VAR) ((VAR & EF_AMDGPU_FEATURE_XNACK_V4) \
 			     == EF_AMDGPU_FEATURE_XNACK_ANY_V4)
 #define TEST_XNACK_ON(VAR) ((VAR & EF_AMDGPU_FEATURE_XNACK_V4) \
@@ -94,13 +96,14 @@ 
 				     | EF_AMDGPU_FEATURE_SRAMECC_ANY_V4)
 #define SET_SRAM_ECC_OFF(VAR) VAR = ((VAR & ~EF_AMDGPU_FEATURE_SRAMECC_V4) \
 				     | EF_AMDGPU_FEATURE_SRAMECC_OFF_V4)
-#define SET_SRAM_ECC_UNSUPPORTED(VAR) \
+#define SET_SRAM_ECC_UNSET(VAR) \
   VAR = ((VAR & ~EF_AMDGPU_FEATURE_SRAMECC_V4) \
 	 | EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4)
 #define TEST_SRAM_ECC_ANY(VAR) ((VAR & EF_AMDGPU_FEATURE_SRAMECC_V4) \
 				== EF_AMDGPU_FEATURE_SRAMECC_ANY_V4)
 #define TEST_SRAM_ECC_ON(VAR) ((VAR & EF_AMDGPU_FEATURE_SRAMECC_V4) \
 			       == EF_AMDGPU_FEATURE_SRAMECC_ON_V4)
+#define TEST_SRAM_ECC_UNSET(VAR) ((VAR & EF_AMDGPU_FEATURE_SRAMECC_V4) == 0)
 
 #ifndef R_AMDGPU_NONE
 #define R_AMDGPU_NONE		0
@@ -125,7 +128,7 @@  static struct obstack files_to_cleanup;
 
 enum offload_abi offload_abi = OFFLOAD_ABI_UNSET;
 uint32_t elf_arch = EF_AMDGPU_MACH_AMDGCN_GFX900;  // Default GPU architecture.
-uint32_t elf_flags = EF_AMDGPU_FEATURE_SRAMECC_ANY_V4;
+uint32_t elf_flags = EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4;
 
 static int gcn_stack_size = 0;  /* Zero means use default.  */
 
@@ -1007,21 +1010,26 @@  main (int argc, char **argv)
       gcc_unreachable ();
     }
 
-  /* Disable XNACK mode on architectures where it doesn't work (well).
-     Set default to "any" otherwise.  */
+  /* This must match gcn-hsa.h's settings for NO_XNACK, NO_SRAM_ECC
+     and ASM_SPEC.  */
   switch (elf_arch)
     {
-    case EF_AMDGPU_MACH_AMDGCN_GFX803:
     case EF_AMDGPU_MACH_AMDGCN_GFX900:
     case EF_AMDGPU_MACH_AMDGCN_GFX906:
     case EF_AMDGPU_MACH_AMDGCN_GFX908:
+      SET_XNACK_OFF (elf_flags);
+      break;
+    case EF_AMDGPU_MACH_AMDGCN_GFX803:
     case EF_AMDGPU_MACH_AMDGCN_GFX1030:
     case EF_AMDGPU_MACH_AMDGCN_GFX1100:
-      SET_XNACK_OFF (elf_flags);
+      SET_XNACK_UNSET (elf_flags);
+      SET_SRAM_ECC_UNSET (elf_flags);
       break;
     case EF_AMDGPU_MACH_AMDGCN_GFX90a:
       if (TEST_XNACK_UNSET (elf_flags))
 	SET_XNACK_ANY (elf_flags);
+      if (TEST_SRAM_ECC_UNSET (elf_flags))
+	SET_SRAM_ECC_ANY (elf_flags);
       break;
     default:
       fatal_error (input_location, "unhandled architecture");
@@ -1145,14 +1153,16 @@  main (int argc, char **argv)
 	}
       obstack_ptr_grow (&ld_argv_obstack, gcn_s2_name);
       obstack_ptr_grow (&ld_argv_obstack, "-lgomp");
-      obstack_ptr_grow (&ld_argv_obstack,
-			(TEST_XNACK_ON (elf_flags) ? "-mxnack=on"
-			 : TEST_XNACK_ANY (elf_flags) ? "-mxnack=any"
-			 : "-mxnack=off"));
-      obstack_ptr_grow (&ld_argv_obstack,
-			(TEST_SRAM_ECC_ON (elf_flags) ? "-msram-ecc=on"
-			 : TEST_SRAM_ECC_ANY (elf_flags) ? "-msram-ecc=any"
-			 : "-msram-ecc=off"));
+      if (!TEST_XNACK_UNSET (elf_flags))
+	obstack_ptr_grow (&ld_argv_obstack,
+			  (TEST_XNACK_ON (elf_flags) ? "-mxnack=on"
+			   : TEST_XNACK_ANY (elf_flags) ? "-mxnack=any"
+			   : "-mxnack=off"));
+      if (!TEST_SRAM_ECC_UNSET (elf_flags))
+	obstack_ptr_grow (&ld_argv_obstack,
+			  (TEST_SRAM_ECC_ON (elf_flags) ? "-msram-ecc=on"
+			   : TEST_SRAM_ECC_ANY (elf_flags) ? "-msram-ecc=any"
+			   : "-msram-ecc=off"));
       if (verbose)
 	obstack_ptr_grow (&ld_argv_obstack, "-v");