[1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

Message ID 20240314112359.50713-2-vigbalas@amd.com
State New
Headers
Series Add XSAVE layout description to Core files for debuggers to support varying XSAVE layouts |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_binutils_build--master-arm fail Patch failed to apply
linaro-tcwg-bot/tcwg_binutils_build--master-aarch64 fail Patch failed to apply

Commit Message

Balasubrmanian, Vignesh March 14, 2024, 11:23 a.m. UTC
  Add a new .note section containing type, size, offset and flags of
every xfeature that is present.

This information will be used by the debuggers to understand the XSAVE
layout of the machine where the core file is dumped, and to read XSAVE
registers, especially during cross-platform debugging.

Co-developed-by: Jini Susan George <jinisusan.george@amd.com>
Signed-off-by: Jini Susan George <jinisusan.george@amd.com>
Signed-off-by: Vignesh Balasubramanian <vigbalas@amd.com>
---
 arch/Kconfig                   |   9 +++
 arch/powerpc/Kconfig           |   1 +
 arch/powerpc/include/asm/elf.h |   2 -
 arch/x86/Kconfig               |   1 +
 arch/x86/include/asm/elf.h     |   7 +++
 arch/x86/kernel/fpu/xstate.c   | 101 +++++++++++++++++++++++++++++++++
 include/linux/elf.h            |   2 +-
 include/uapi/linux/elf.h       |   1 +
 8 files changed, 121 insertions(+), 3 deletions(-)
  

Comments

Dave Hansen March 14, 2024, 3:37 p.m. UTC | #1
On 3/14/24 04:23, Vignesh Balasubramanian wrote:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.

Mechanically, I'd much rather have all of that info in the cover letter
in the actual changelog instead.

I'd also love to see a practical example of what an actual example core
dump looks like on two conflicting systems:

   * Total XSAVE size
   * XCR0 value
   * XSTATE_BV from the core dump
   * XFEATURE offsets for each feature

Do you have any information about what other OSes are doing in this
area?  I thought Windows, for instance, was even less flexible about the
XSAVE format than Linux is.

Why didn't LWP cause this problem?

From the cover letter:

> But this patch series depends on heuristics based on the total XSAVE
> register set size and the XCR0 mask to infer the layouts of the
> various register blocks for core dumps, and hence, is not a foolproof
> mechanism to determine the layout of the XSAVE area.

It may not be theoretically foolproof.  But I'm struggling to think of a
case where it would matter in practice.  Is there any CPU from any
vendor where this is actually _needed_?

Sure, it's ugly as hell, but these notes aren't going to be available
universally _ever_, so it's not like the crummy heuristic code gets to
go away.

Have you seen the APX spec?

>
https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

It makes this even more fun because it adds a new XSAVE state component,
but reuses the MPX offsets.

> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.

This is pretty close to just a raw dump of the XSAVE CPUID leaves.
Rather than come up with an XSAVE-specific ABI that depends on CPUID
*ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
should just bite the bullet and dump out (some of) the raw CPUID space.
  
Borislav Petkov March 14, 2024, 4:08 p.m. UTC | #2
On Thu, Mar 14, 2024 at 08:37:09AM -0700, Dave Hansen wrote:
> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
> Rather than come up with an XSAVE-specific ABI that depends on CPUID
> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
> should just bite the bullet and dump out (some of) the raw CPUID space.

Funny you should say that. This was what they had done originally but if
you dump CPUID and you want to add another component in the future which
is *not* described by CPUID, your scheme breaks.

So the idea is to have a self-describing buffers layout, independent
from any x86-ism. You can extend this in a straight-forward way then
later.
  
Kees Cook March 14, 2024, 4:13 p.m. UTC | #3
On Thu, Mar 14, 2024 at 04:53:28PM +0530, Vignesh Balasubramanian wrote:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.
> 
> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.

I see binutils in CC. Can someone from gdb confirm that this solution
can be used?

> 
> Co-developed-by: Jini Susan George <jinisusan.george@amd.com>
> Signed-off-by: Jini Susan George <jinisusan.george@amd.com>
> Signed-off-by: Vignesh Balasubramanian <vigbalas@amd.com>
> ---
>  arch/Kconfig                   |   9 +++
>  arch/powerpc/Kconfig           |   1 +
>  arch/powerpc/include/asm/elf.h |   2 -
>  arch/x86/Kconfig               |   1 +
>  arch/x86/include/asm/elf.h     |   7 +++
>  arch/x86/kernel/fpu/xstate.c   | 101 +++++++++++++++++++++++++++++++++
>  include/linux/elf.h            |   2 +-
>  include/uapi/linux/elf.h       |   1 +
>  8 files changed, 121 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index fd18b7db2c77..3bd8a0b2bba1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -502,6 +502,15 @@ config MMU_LAZY_TLB_SHOOTDOWN
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
>  
> +config ARCH_HAVE_EXTRA_ELF_NOTES
> +	bool
> +	help
> +	  An architecture should select this in order to enable adding an
> +	  arch-specific ELF note section to core files. It must provide two
> +	  functions: elf_coredump_extra_notes_size() and
> +	  elf_coredump_extra_notes_write() which are invoked by the ELF core
> +	  dumper.
> +
>  config ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
>  	bool
>  
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a91cb070ca4a..3b31bd7490e2 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -156,6 +156,7 @@ config PPC
>  	select ARCH_HAS_UACCESS_FLUSHCACHE
>  	select ARCH_HAS_UBSAN
>  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
> +	select ARCH_HAVE_EXTRA_ELF_NOTES        if SPU_BASE
>  	select ARCH_KEEP_MEMBLOCK
>  	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE	if PPC_RADIX_MMU
>  	select ARCH_MIGHT_HAVE_PC_PARPORT
> diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
> index 79f1c480b5eb..bb4b94444d3e 100644
> --- a/arch/powerpc/include/asm/elf.h
> +++ b/arch/powerpc/include/asm/elf.h
> @@ -127,8 +127,6 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
>  /* Notes used in ET_CORE. Note name is "SPU/<fd>/<filename>". */
>  #define NT_SPU		1
>  
> -#define ARCH_HAVE_EXTRA_ELF_NOTES
> -
>  #endif /* CONFIG_SPU_BASE */
>  
>  #ifdef CONFIG_PPC64
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 78050d5d7fac..35e8d1201099 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -104,6 +104,7 @@ config X86
>  	select ARCH_HAS_DEBUG_WX
>  	select ARCH_HAS_ZONE_DMA_SET if EXPERT
>  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
> +	select ARCH_HAVE_EXTRA_ELF_NOTES
>  	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>  	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
>  	select ARCH_MIGHT_HAVE_PC_PARPORT
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 1fb83d47711f..1b9f0b4bf6bc 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -13,6 +13,13 @@
>  #include <asm/auxvec.h>
>  #include <asm/fsgsbase.h>
>  
> +struct xfeat_component {
> +	u32 xfeat_type;
> +	u32 xfeat_sz;
> +	u32 xfeat_off;
> +	u32 xfeat_flags;
> +} __packed;

While it is currently true, just for robustness, can you add
a _Static_assert that sizeof(struct xfeat_component) % 4 == 0 ?
Notes must be 4-byte aligned.

> +
>  typedef unsigned long elf_greg_t;
>  
>  #define ELF_NGREG (sizeof(struct user_regs_struct) / sizeof(elf_greg_t))
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 117e74c44e75..6e5ea483ec1d 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -13,6 +13,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/proc_fs.h>
>  #include <linux/vmalloc.h>
> +#include <linux/coredump.h>
>  
>  #include <asm/fpu/api.h>
>  #include <asm/fpu/regset.h>
> @@ -1836,3 +1837,103 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
>  	return 0;
>  }
>  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
> +
> +/*
> + * Dump type, size, offset and flag values for every xfeature that is present.
> + */
> +static int dump_xsave_layout_desc(struct coredump_params *cprm)
> +{
> +
> +	struct xfeat_component xc;
> +	int num_records = 0;
> +	int i;
> +
> +	/* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
> +	for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
> +		xc.xfeat_type = i;
> +		xc.xfeat_sz = xstate_sizes[i];
> +		xc.xfeat_off = xstate_offsets[i];
> +		xc.xfeat_flags = xstate_flags[i];
> +
> +		if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
> +			return 0;
> +		num_records++;
> +	}
> +
> +	for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
> +		xc.xfeat_type = i;
> +		xc.xfeat_sz = xstate_sizes[i];
> +		xc.xfeat_off = xstate_offsets[i];
> +		xc.xfeat_flags = xstate_flags[i];
> +
> +		if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
> +			return 0;
> +		num_records++;
> +	}
> +
> +	return num_records;
> +}
> +
> +static int get_xsave_desc_size(void)
> +{
> +	/* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
> +	int xfeatures_count = 2;
> +	int i;
> +
> +	for_each_extended_xfeature(i, fpu_user_cfg.max_features)
> +		xfeatures_count++;
> +
> +	return xfeatures_count * (sizeof(struct xfeat_component));
> +}
> +
> +int elf_coredump_extra_notes_write(struct coredump_params *cprm)
> +{
> +	const char *owner_name = "LINUX";

If you use an array instead of a pointer, there's no need for strlen(),
and you can make it a static outside of the function to refer to it
later.

static const char owner_name[] = "LINUX";

> +	int num_records = 0;
> +	struct elf_note en;
> +
> +	en.n_namesz = strlen(owner_name) + 1;

en.n_namesz = sizeof(owner_name);

> +	en.n_descsz = get_xsave_desc_size();
> +	en.n_type = NT_X86_XSAVE_LAYOUT;
> +
> +	if (!dump_emit(cprm, &en, sizeof(en)))
> +		return 1;
> +	if (!dump_emit(cprm, owner_name, en.n_namesz))
> +		return 1;
> +	if (!dump_align(cprm, 4))
> +		return 1;
> +
> +	num_records = dump_xsave_layout_desc(cprm);
> +	if (!num_records) {
> +		pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");

Can you make this pr_warn_ratelimited() (and below)?

> +		return 1;
> +	}
> +
> +	/* Total size should be equal to the number of records */
> +	if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
> +		pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
> +		return 1;
> +	}
> +
> +	if (!dump_align(cprm, 4))
> +		return 1;

I don't think this call is needed?

> +
> +	return 0;
> +}
> +
> +/*
> + * Return the size of new note.
> + */
> +int elf_coredump_extra_notes_size(void)
> +{
> +	const char *fullname = "LINUX";

Now this can be dropped.

> +	int size = 0;
> +
> +	/* NOTE Header */
> +	size += sizeof(struct elf_note);
> +	/* name + align */
> +	size += roundup(strlen(fullname) + 1, 4);

	size += roundup(sizeof(owner_name), 4);

> +	size += get_xsave_desc_size();
> +
> +	return size;
> +}
> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index c9a46c4e183b..5c402788da19 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -65,7 +65,7 @@ extern Elf64_Dyn _DYNAMIC [];
>  struct file;
>  struct coredump_params;
>  
> -#ifndef ARCH_HAVE_EXTRA_ELF_NOTES
> +#ifndef CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES
>  static inline int elf_coredump_extra_notes_size(void) { return 0; }
>  static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) { return 0; }
>  #else
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 9417309b7230..3325488cb39b 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -411,6 +411,7 @@ typedef struct elf64_shdr {
>  #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
>  /* Old binutils treats 0x203 as a CET state */
>  #define NT_X86_SHSTK	0x204		/* x86 SHSTK state */
> +#define NT_X86_XSAVE_LAYOUT	0x205	/* XSAVE layout description */
>  #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
>  #define NT_S390_TIMER	0x301		/* s390 timer register */
>  #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */
> -- 
> 2.43.0
> 

Otherwise looks reasonable, though I see Dave has feedback to address
too. :)

Thanks for working on this!

-Kees
  
Dave Hansen March 14, 2024, 4:19 p.m. UTC | #4
On 3/14/24 09:08, Borislav Petkov wrote:
> On Thu, Mar 14, 2024 at 08:37:09AM -0700, Dave Hansen wrote:
>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>> should just bite the bullet and dump out (some of) the raw CPUID space.
> 
> Funny you should say that. This was what they had done originally but if
> you dump CPUID and you want to add another component in the future which
> is *not* described by CPUID, your scheme breaks.

Are you envisioning an *XSAVE* state component that's not described by
CPUID?

Or some _other_ (non-XSAVE) component in a core dump that isn't
described by CPUID?

> So the idea is to have a self-describing buffers layout, independent
> from any x86-ism. You can extend this in a straight-forward way then
> later.

That argument breaks down a bit on the flags though:

	xc.xfeat_flags = xstate_flags[i];

Because it comes _directly_ from CPUID with zero filtering:

	cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
	...
	xstate_flags[i] = ecx;

So this layout is quite dependent on what's in x86's CPUID.
  
Borislav Petkov March 14, 2024, 4:29 p.m. UTC | #5
On Thu, Mar 14, 2024 at 09:19:15AM -0700, Dave Hansen wrote:
> Are you envisioning an *XSAVE* state component that's not described by
> CPUID?

I want to be prepared. You can imagine some of the short cuts and
corners cutting hw guys would do so I'd want to be prepared there and
not tie this to CPUID.

> Or some _other_ (non-XSAVE) component in a core dump that isn't
> described by CPUID?

Yes, that too.

Since the format of this buffer is so simple and machine-independent, it
can be extended as needed without issues.

> That argument breaks down a bit on the flags though:
> 
> 	xc.xfeat_flags = xstate_flags[i];
> 
> Because it comes _directly_ from CPUID with zero filtering:
> 
> 	cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
> 	...
> 	xstate_flags[i] = ecx;
> 
> So this layout is quite dependent on what's in x86's CPUID.

Yeah, no, this should not be copying CPUID flags - those flags should be
*translated* to independently defined flags which describe those
buffers.

This is even more important if we change our xstate_flags[] machinery.
This buffer should not use any kernel-internal definitions and
structures but be completely self-describing.

Vignesh, pls fix that.

Thx.
  
Dave Hansen March 14, 2024, 4:39 p.m. UTC | #6
On 3/14/24 09:29, Borislav Petkov wrote:
> 
>> That argument breaks down a bit on the flags though:
>>
>> 	xc.xfeat_flags = xstate_flags[i];
>>
>> Because it comes _directly_ from CPUID with zero filtering:
>>
>> 	cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
>> 	...
>> 	xstate_flags[i] = ecx;
>>
>> So this layout is quite dependent on what's in x86's CPUID.
> Yeah, no, this should not be copying CPUID flags - those flags should be
> *translated* to independently defined flags which describe those
> buffers.

Ditto for:

	xc.xfeat_type = i;

Right now, that's bound to CPUID and XSAVE.  "feat_type==10" can only
ever be PKRU and that's derived from the XSAVE architecture.

If you want this to be extensible to things outside of the XSAVE
architecture, it needs to be something actually extensible and not
entangled with XSAVE.

In other words "xc.xfeat_type" can enumerate XSAVE state components
being in the dump, but it should not be limited to XSAVE.  Just as an
example:

enum feat_type {
	FEATURE_XSAVE_PKRU,
	FEATURE_XSAVE__YMM,
	FEATURE_XSAVE_BNDREGS,
	FEATURE_XSAVE_BNDCSR,
	...
	RANDOM_STATE_NOT_XSAVE
};

See how feat_type==1 is PKRU and *NOT* feat_type==10?  That opens the
door to RANDOM_STATE_NOT_XSAVE or anything else you want.  This would be
_actually_ extensible.
  
John Baldwin March 14, 2024, 4:45 p.m. UTC | #7
On 3/14/24 8:37 AM, Dave Hansen wrote:
> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>> Add a new .note section containing type, size, offset and flags of
>> every xfeature that is present.
> 
> Mechanically, I'd much rather have all of that info in the cover letter
> in the actual changelog instead.
> 
> I'd also love to see a practical example of what an actual example core
> dump looks like on two conflicting systems:
> 
>     * Total XSAVE size
>     * XCR0 value
>     * XSTATE_BV from the core dump
>     * XFEATURE offsets for each feature

I noticed this when I bought an AMD Ryzen 9 5900X based system for my desktop
running FreeBSD and found that the XSAVE core dump notes were not recognized
by GDB (FreeBSD dumps an XSAVE register set note that matches the same
layout of NT_X86_XSTATE used by Linux).

In particular, as the cover letter notes, on this AMD processor, there is
no "gap" for MPX, so the PKRU registers it provides are at a different offset
than on Intel CPUs.  Furthermore, my reading of the SDM is that there is no
guarantee of architectural offsets of a given XSAVE feature and that software
should be querying CPUID to determine the layout.

FWIW, the relevant CPUID leaves for my AMD system:

    XSAVE features (0xd/0):
       XCR0 valid bit field mask               = 0x0000000000000207
          x87 state                            = true
          SSE state                            = true
          AVX state                            = true
          MPX BNDREGS                          = false
          MPX BNDCSR                           = false
          AVX-512 opmask                       = false
          AVX-512 ZMM_Hi256                    = false
          AVX-512 Hi16_ZMM                     = false
          PKRU state                           = true
          XTILECFG state                       = false
          XTILEDATA state                      = false
       bytes required by fields in XCR0        = 0x00000988 (2440)
       bytes required by XSAVE/XRSTOR area     = 0x00000988 (2440)
       XSAVEOPT instruction                    = true
       XSAVEC instruction                      = true
       XGETBV instruction                      = true
       XSAVES/XRSTORS instructions             = true
       XFD: extended feature disable supported = false
       SAVE area size in bytes                 = 0x00000348 (840)
       IA32_XSS valid bit field mask           = 0x0000000000001800
          PT state                             = false
          PASID state                          = false
          CET_U user state                     = true
          CET_S supervisor state               = true
          HDC state                            = false
          UINTR state                          = false
          LBR state                            = false
          HWP state                            = false

> Do you have any information about what other OSes are doing in this
> area?  I thought Windows, for instance, was even less flexible about the
> XSAVE format than Linux is.

I have an implementation of a similar note for FreeBSD already as well as
patches for GDB to make use of the note (for FreeBSD) and generate it
via 'gcore' (on FreeBSD).  However, I would very much like to reach
consensus on a shared format of the note to avoid gratuitous differences
between FreeBSD and Linux.  The AMD folks were gracious enough to work on
the Linux kernel implementation.  A bit more on that below though.

> Why didn't LWP cause this problem?
> 
>  From the cover letter:
> 
>> But this patch series depends on heuristics based on the total XSAVE
>> register set size and the XCR0 mask to infer the layouts of the
>> various register blocks for core dumps, and hence, is not a foolproof
>> mechanism to determine the layout of the XSAVE area.
> 
> It may not be theoretically foolproof.  But I'm struggling to think of a
> case where it would matter in practice.  Is there any CPU from any
> vendor where this is actually _needed_?
> 
> Sure, it's ugly as hell, but these notes aren't going to be available
> universally _ever_, so it's not like the crummy heuristic code gets to
> go away.
> 
> Have you seen the APX spec?
> 
>>
> https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html
> 
> It makes this even more fun because it adds a new XSAVE state component,
> but reuses the MPX offsets.
> 
>> This information will be used by the debuggers to understand the XSAVE
>> layout of the machine where the core file is dumped, and to read XSAVE
>> registers, especially during cross-platform debugging.
> 
> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
> Rather than come up with an XSAVE-specific ABI that depends on CPUID
> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
> should just bite the bullet and dump out (some of) the raw CPUID space.

So the current note I initially proposed and implemented for FreeBSD
(https://reviews.freebsd.org/D42136) and an initial patch set for GDB
(https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
do indeed dump a raw set of CPUID leaves.  The version I have for FreeBSD
only dumps the raw leaf values for leaf 0x0d though the note format is
extensible should additional leaves be needed in the future.  One of the
questions if we wanted to use a CPUID leaf note is which leaves to dump
(e.g. do you dump all of them, or do you just dump the subset that is
currently needed).  Another quirky question is what to do about systems
with hetergeneous cores (E vs P for example).  Currently those systems
use the same XSAVE layout across all cores, but other CPUID leaves do
already vary across cores on those systems.  Some options considered for
that are to 1) use a separate note type for "other" core types (e.g.
a separate note type for "E" cores), or 2) make this new note a per-thread
note that matches the core the given thread was running on when the
register state stored in the process core dump was saved.

However, there are other wrinkles with the leaf approach.  Namely, one
of the use cases that I currently have an ugly hack for in GDB is if
you are using gdb against a remote host running gdbserver and then use
'gcore' to generate a core dump.  GDB needs to write out a NT_X86_XSTATE
note, but that note requires a layout.  What GDB does today is just pick
a known Intel layout based on the XCR0 mask.  However, GDB should ideally
start writing out whatever new note we adopt here, so if we dump raw
CPUID leaves it means extending the GDB remote protocol so we can query
the CPUID leaves from the remote host.  On the other hand, if we choose a
more abstract format as proposed in this patch, the local GDB (or LLDB
or whatever) can generate whatever synthetic layout it wants to write
the local NT_X86_XSTATE.  (NB: A relevant detail here is that the GDB
remote protocol does not pass the entire XSAVE state across as a block,
instead gdbserver parses individual register values for AVX, etc.
registers and those decoded register values are passed over the
protocol.)

Another question is potentially supporting compact XSAVE format in
for NT_X86_XSTATE.  Today Linux has some complicated code to re-expand
the compat XSAVE format back out to the standard layout for ptrace() and
process core dumps.  (FreeBSD doesn't yet make use of XSAVEC so we
haven't yet dealt with that problem.)  The CPUID leaf approach would
allow us to support compact formats, though GDB would have to check for
the flag in the XSAVE header to decide which format to use, etc.  On
the other hand, if we use the more abstract format in this patch, then
GDB wouldn't actually have to care at all.  The kernel would just dump
out the "compact" form of the layout note and the direct XSAVEC output
as the note.  (I will probably do this in FreeBSD eventually, but
using a policy knob (sysctl on FreeBSD) to control if it is enabled
that FreeBSD would default to on at some point in the future.)

I don't really have a strong preference for which type of note to dump
myself, I really just want to have a shared format so that there is
less work to do on the tools side (e.g. GDB).

Also, FWIW, I did try to raise this topic on LLDB's discussion forums
and got a simple "sounds ok" type response but no detailed feedback.
That was a proposal for the CPUID leaf note, but I suspect LLDB will
be fine with either approach.  Certainly I will update GDB to work
with whatever approach is adopted.
  
John Baldwin March 14, 2024, 5:05 p.m. UTC | #8
On 3/14/24 8:37 AM, Dave Hansen wrote:
> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>> But this patch series depends on heuristics based on the total XSAVE
>> register set size and the XCR0 mask to infer the layouts of the
>> various register blocks for core dumps, and hence, is not a foolproof
>> mechanism to determine the layout of the XSAVE area.
> 
> It may not be theoretically foolproof.  But I'm struggling to think of a
> case where it would matter in practice.  Is there any CPU from any
> vendor where this is actually _needed_?
> 
> Sure, it's ugly as hell, but these notes aren't going to be available
> universally _ever_, so it's not like the crummy heuristic code gets to
> go away.

I forgot to mention one other use case for this note.

Today (and before my earlier patch series to add the ugly heuristic),
when the NT_X86_XSTATE core dump note grows because a CPU vendor adds
a new xfeature and OS's which just dump the entire XSAVE state start
including that, GDB fails to parse the entire note.

Having a note describing the layout (whichever format is chosen),
allows GDB to still pull registers for features it understands from
the larger note and ignoring the parts of the XSAVE block it doesn't
understand.
  
Dave Hansen March 14, 2024, 5:10 p.m. UTC | #9
On 3/14/24 09:45, John Baldwin wrote:
> On 3/14/24 8:37 AM, Dave Hansen wrote:
>> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>>> Add a new .note section containing type, size, offset and flags of
>>> every xfeature that is present.
>>
>> Mechanically, I'd much rather have all of that info in the cover letter
>> in the actual changelog instead.
>>
>> I'd also love to see a practical example of what an actual example core
>> dump looks like on two conflicting systems:
>>
>>     * Total XSAVE size
>>     * XCR0 value
>>     * XSTATE_BV from the core dump
>>     * XFEATURE offsets for each feature
> 
> I noticed this when I bought an AMD Ryzen 9 5900X based system for
> my desktop running FreeBSD and found that the XSAVE core dump notes
> were not recognized by GDB (FreeBSD dumps an XSAVE register set note
> that matches the same layout of NT_X86_XSTATE used by Linux).

I just want to make sure that you heard what I asked.  I'd like to see a
practical example of how the real-world enumeration changes between two
real world systems.

Is that possible?

Here's the raw CPUID data from the XSAVE region on my laptop:

>    0x0000000d 0x00: eax=0x000002e7 ebx=0x00000a88 ecx=0x00000a88 edx=0x00000000
>    0x0000000d 0x01: eax=0x0000000f ebx=0x00000998 ecx=0x00003900 edx=0x00000000
>    0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
>    0x0000000d 0x05: eax=0x00000040 ebx=0x00000440 ecx=0x00000000 edx=0x00000000
>    0x0000000d 0x06: eax=0x00000200 ebx=0x00000480 ecx=0x00000000 edx=0x00000000
>    0x0000000d 0x07: eax=0x00000400 ebx=0x00000680 ecx=0x00000000 edx=0x00000000
>    0x0000000d 0x08: eax=0x00000080 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>    0x0000000d 0x09: eax=0x00000008 ebx=0x00000a80 ecx=0x00000000 edx=0x00000000
>    0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>    0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>    0x0000000d 0x0d: eax=0x00000008 ebx=0x00000000 ecx=0x00000001 edx=0x00000000

Could we get that for an impacted AMD system, please?

	cpuid -1 --raw | grep "   0x0000000d "

should do it.

> In particular, as the cover letter notes, on this AMD processor,
> there is no "gap" for MPX, so the PKRU registers it provides are at a
> different offset than on Intel CPUs.  Furthermore, my reading of the
> SDM is that there is no guarantee of architectural offsets of a given
> XSAVE feature and that software should be querying CPUID to determine
> the layout.

I'd say the SDM is an aspirational document.  Intel _aspires_ to be able
to change the layouts whenever it wants.  That doesn't mean that it
could actually pull it off in practice.

In practice, the offset are fixed and Intel can't change them.

> FWIW, the relevant CPUID leaves for my AMD system:
> 
>    XSAVE features (0xd/0):
>       XCR0 valid bit field mask               = 0x0000000000000207>          x87 state                            = true
...

So, those actually aren't the relevant ones.  We need EAX (size) and EBX
(user offset) from all of the sub-leaves.

>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>> should just bite the bullet and dump out (some of) the raw CPUID space.
> 
> So the current note I initially proposed and implemented for FreeBSD
> (https://reviews.freebsd.org/D42136) and an initial patch set for GDB
> (https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
> do indeed dump a raw set of CPUID leaves.  The version I have for FreeBSD
> only dumps the raw leaf values for leaf 0x0d though the note format is
> extensible should additional leaves be needed in the future.  One of the
> questions if we wanted to use a CPUID leaf note is which leaves to dump
> (e.g. do you dump all of them, or do you just dump the subset that is
> currently needed).

You dump what is needed and add to the dump over time.

> Another quirky question is what to do about systems with hetergeneous
> cores (E vs P for example).
That's irrelevant for now.  The cores may be heterogeneous but the
userspace ISA and (and thus XSAVE formats) are identical.  If they're
not, then we have bigger problems on our hands.

> Currently those systems use the same XSAVE layout across all cores,
> but other CPUID leaves do already vary across cores on those systems.

There shouldn't be any CPUID leaves that differ _and_ matter to
userspace and thus core dumps.

> However, there are other wrinkles with the leaf approach.  Namely, one
> of the use cases that I currently have an ugly hack for in GDB is if
> you are using gdb against a remote host running gdbserver and then use
> 'gcore' to generate a core dump.  GDB needs to write out a NT_X86_XSTATE
> note, but that note requires a layout.  What GDB does today is just pick
> a known Intel layout based on the XCR0 mask.  However, GDB should ideally
> start writing out whatever new note we adopt here, so if we dump raw
> CPUID leaves it means extending the GDB remote protocol so we can query
> the CPUID leaves from the remote host.  On the other hand, if we choose a
> more abstract format as proposed in this patch, the local GDB (or LLDB
> or whatever) can generate whatever synthetic layout it wants to write
> the local NT_X86_XSTATE.  (NB: A relevant detail here is that the GDB
> remote protocol does not pass the entire XSAVE state across as a block,
> instead gdbserver parses individual register values for AVX, etc.
> registers and those decoded register values are passed over the
> protocol.)

So the gdb side says, "Give me PKRU" and the remote side parses the
XSAVE image, finds PKRU, and sends it over the wire?

> Another question is potentially supporting compact XSAVE format in
> for NT_X86_XSTATE.  Today Linux has some complicated code to re-expand
> the compat XSAVE format back out to the standard layout for ptrace() and
> process core dumps.

Yeah, but supporting the compacted format in NT_X86_XSTATE doesn't help
us at all.  We still intermingle user and supervisor state and that
needs to get repacked _anyway_.

In other words, no matter what we do, it's going to be complicated
because the userspace buffer can't have supervisor state and the kernel
buffer does have it.  The compacted format mismatch is the least of our
problems.

>  (FreeBSD doesn't yet make use of XSAVEC so we
> haven't yet dealt with that problem.) 

... or XSAVES, which is actually the most relevant here.

Backing up... there are two approaches here:

 1. Dump out raw x86-specific gunk, aka. CPUID contents itself.  There
    are a billion ways to do this and lots of complications, including
    the remote protocol implications
or
 2. Define an abstract format that works anywhere, not just on x86 and
    not just for XSAVE.

There's no (sane) middle ground.  The implementation here (in this
patch) is fundamentally x86-specific and pretends to be some kind of
abstracted x86-independent format.
  
John Baldwin March 14, 2024, 5:36 p.m. UTC | #10
On 3/14/24 10:10 AM, Dave Hansen wrote:
> On 3/14/24 09:45, John Baldwin wrote:
>> On 3/14/24 8:37 AM, Dave Hansen wrote:
>>> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>>>> Add a new .note section containing type, size, offset and flags of
>>>> every xfeature that is present.
>>>
>>> Mechanically, I'd much rather have all of that info in the cover letter
>>> in the actual changelog instead.
>>>
>>> I'd also love to see a practical example of what an actual example core
>>> dump looks like on two conflicting systems:
>>>
>>>      * Total XSAVE size
>>>      * XCR0 value
>>>      * XSTATE_BV from the core dump
>>>      * XFEATURE offsets for each feature
>>
>> I noticed this when I bought an AMD Ryzen 9 5900X based system for
>> my desktop running FreeBSD and found that the XSAVE core dump notes
>> were not recognized by GDB (FreeBSD dumps an XSAVE register set note
>> that matches the same layout of NT_X86_XSTATE used by Linux).
> 
> I just want to make sure that you heard what I asked.  I'd like to see a
> practical example of how the real-world enumeration changes between two
> real world systems.
> 
> Is that possible?
> 
> Here's the raw CPUID data from the XSAVE region on my laptop:
> 
>>     0x0000000d 0x00: eax=0x000002e7 ebx=0x00000a88 ecx=0x00000a88 edx=0x00000000
>>     0x0000000d 0x01: eax=0x0000000f ebx=0x00000998 ecx=0x00003900 edx=0x00000000
>>     0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
>>     0x0000000d 0x05: eax=0x00000040 ebx=0x00000440 ecx=0x00000000 edx=0x00000000
>>     0x0000000d 0x06: eax=0x00000200 ebx=0x00000480 ecx=0x00000000 edx=0x00000000
>>     0x0000000d 0x07: eax=0x00000400 ebx=0x00000680 ecx=0x00000000 edx=0x00000000
>>     0x0000000d 0x08: eax=0x00000080 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>>     0x0000000d 0x09: eax=0x00000008 ebx=0x00000a80 ecx=0x00000000 edx=0x00000000
>>     0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>>     0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>>     0x0000000d 0x0d: eax=0x00000008 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
> 
> Could we get that for an impacted AMD system, please?
> 
> 	cpuid -1 --raw | grep "   0x0000000d "
> 
> should do it.

    0x0000000d 0x00: eax=0x00000207 ebx=0x00000988 ecx=0x00000988 edx=0x00000000
    0x0000000d 0x01: eax=0x0000000f ebx=0x00000348 ecx=0x00001800 edx=0x00000000
    0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
    0x0000000d 0x09: eax=0x00000008 ebx=0x00000980 ecx=0x00000000 edx=0x00000000
    0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
    0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000

Here, I think the ebx value for the 0x09 leaf (PKRU) is the relevant difference
here, it is 0xa80 on your laptop and 0x980 on the AMD CPU.  (This is the
missing MPX gap on AMD.)

>>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>>> should just bite the bullet and dump out (some of) the raw CPUID space.
>>
>> So the current note I initially proposed and implemented for FreeBSD
>> (https://reviews.freebsd.org/D42136) and an initial patch set for GDB
>> (https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
>> do indeed dump a raw set of CPUID leaves.  The version I have for FreeBSD
>> only dumps the raw leaf values for leaf 0x0d though the note format is
>> extensible should additional leaves be needed in the future.  One of the
>> questions if we wanted to use a CPUID leaf note is which leaves to dump
>> (e.g. do you dump all of them, or do you just dump the subset that is
>> currently needed).
> 
> You dump what is needed and add to the dump over time.

That is what I started with, yes, but am attempting to anticipate future
problems in my list of caveats.

>> Another quirky question is what to do about systems with hetergeneous
>> cores (E vs P for example).
> That's irrelevant for now.  The cores may be heterogeneous but the
> userspace ISA and (and thus XSAVE formats) are identical.  If they're
> not, then we have bigger problems on our hands.

Yes, I agree on the bigger problems and hope we don't have to solve
them.

>> Currently those systems use the same XSAVE layout across all cores,
>> but other CPUID leaves do already vary across cores on those systems.
> 
> There shouldn't be any CPUID leaves that differ _and_ matter to
> userspace and thus core dumps.

Today that is true, yes.  I'm fine with making that tradeoff (along
with only dumping a subset of leaves) so long as the consensus is that
is an acceptable tradeoff to make.

>> However, there are other wrinkles with the leaf approach.  Namely, one
>> of the use cases that I currently have an ugly hack for in GDB is if
>> you are using gdb against a remote host running gdbserver and then use
>> 'gcore' to generate a core dump.  GDB needs to write out a NT_X86_XSTATE
>> note, but that note requires a layout.  What GDB does today is just pick
>> a known Intel layout based on the XCR0 mask.  However, GDB should ideally
>> start writing out whatever new note we adopt here, so if we dump raw
>> CPUID leaves it means extending the GDB remote protocol so we can query
>> the CPUID leaves from the remote host.  On the other hand, if we choose a
>> more abstract format as proposed in this patch, the local GDB (or LLDB
>> or whatever) can generate whatever synthetic layout it wants to write
>> the local NT_X86_XSTATE.  (NB: A relevant detail here is that the GDB
>> remote protocol does not pass the entire XSAVE state across as a block,
>> instead gdbserver parses individual register values for AVX, etc.
>> registers and those decoded register values are passed over the
>> protocol.)
> 
> So the gdb side says, "Give me PKRU" and the remote side parses the
> XSAVE image, finds PKRU, and sends it over the wire?

Yes.

>> Another question is potentially supporting compact XSAVE format in
>> for NT_X86_XSTATE.  Today Linux has some complicated code to re-expand
>> the compat XSAVE format back out to the standard layout for ptrace() and
>> process core dumps.
> 
> Yeah, but supporting the compacted format in NT_X86_XSTATE doesn't help
> us at all.  We still intermingle user and supervisor state and that
> needs to get repacked _anyway_.

Fair enough.

> In other words, no matter what we do, it's going to be complicated
> because the userspace buffer can't have supervisor state and the kernel
> buffer does have it.  The compacted format mismatch is the least of our
> problems.
> 
>>    (FreeBSD doesn't yet make use of XSAVEC so we
>> haven't yet dealt with that problem.)
> 
> ... or XSAVES, which is actually the most relevant here.
> 
> Backing up... there are two approaches here:
> 
>   1. Dump out raw x86-specific gunk, aka. CPUID contents itself.  There
>      are a billion ways to do this and lots of complications, including
>      the remote protocol implications
> or
>   2. Define an abstract format that works anywhere, not just on x86 and
>      not just for XSAVE.
> 
> There's no (sane) middle ground.  The implementation here (in this
> patch) is fundamentally x86-specific and pretends to be some kind of
> abstracted x86-independent format.

Well, are there other register notes that could benefit from an approach
like this?  Most other register notes I'm aware of on various architectures
either have a fixed layout (like the typical general purpose register notes),
or they have a fixed set of registers but the size of individual registers
can vary (thinks like SME or RISC-V's vector extension).  XSAVE is the only
one I'm aware of that packs multiple register sets into a single note.

To step back a bit, another approach that could be taken (and I'm not sure
it is worth it at this point) would to stop dumping a single XSAVE note
and dump a separate register note for each feature.  That is, dump a note
for AVX (the upper bits of ymmX), a note for PKRU, etc.  I think if I had
to pick a strategy at the very beginning that's what I would choose now,
but this isn't the very beginning and that sort of change is likely too
disruptive.  (This approach is what happens on other arches today in effect,
e.g. on AArch64.)
  
Michael Ellerman March 14, 2024, 10:13 p.m. UTC | #11
Hi Vignesh,

Vignesh Balasubramanian <vigbalas@amd.com> writes:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.
>
> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.
>
> Co-developed-by: Jini Susan George <jinisusan.george@amd.com>
> Signed-off-by: Jini Susan George <jinisusan.george@amd.com>
> Signed-off-by: Vignesh Balasubramanian <vigbalas@amd.com>
> ---
>  arch/Kconfig                   |   9 +++
>  arch/powerpc/Kconfig           |   1 +
>  arch/powerpc/include/asm/elf.h |   2 -
>  arch/x86/Kconfig               |   1 +
>  arch/x86/include/asm/elf.h     |   7 +++
>  arch/x86/kernel/fpu/xstate.c   | 101 +++++++++++++++++++++++++++++++++
>  include/linux/elf.h            |   2 +-
>  include/uapi/linux/elf.h       |   1 +
>  8 files changed, 121 insertions(+), 3 deletions(-)

IMHO you should split the changes to replace ARCH_HAVE_EXTRA_ELF_NOTES
with CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES into a lead-up patch.

cheers
  
kernel test robot March 15, 2024, 9:59 a.m. UTC | #12
Hi Vignesh,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.8 next-20240315]
[cannot apply to kees/for-next/execve tip/x86/core powerpc/next powerpc/fixes]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vignesh-Balasubramanian/x86-elf-Add-a-new-note-section-containing-Xfeatures-information-to-x86-core-files/20240314-192650
base:   linus/master
patch link:    https://lore.kernel.org/r/20240314112359.50713-2-vigbalas%40amd.com
patch subject: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files
config: i386-buildonly-randconfig-001-20240315 (https://download.01.org/0day-ci/archive/20240315/202403151742.VEz04MqR-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240315/202403151742.VEz04MqR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403151742.VEz04MqR-lkp@intel.com/

All errors (new ones prefixed by >>):

>> arch/x86/kernel/fpu/xstate.c:1858:8: error: call to undeclared function 'dump_emit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1858 |                 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
         |                      ^
   arch/x86/kernel/fpu/xstate.c:1869:8: error: call to undeclared function 'dump_emit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1869 |                 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
         |                      ^
   arch/x86/kernel/fpu/xstate.c:1899:7: error: call to undeclared function 'dump_emit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1899 |         if (!dump_emit(cprm, &en, sizeof(en)))
         |              ^
>> arch/x86/kernel/fpu/xstate.c:1903:7: error: call to undeclared function 'dump_align'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1903 |         if (!dump_align(cprm, 4))
         |              ^
   4 errors generated.


vim +/dump_emit +1858 arch/x86/kernel/fpu/xstate.c

  1846	
  1847		struct xfeat_component xc;
  1848		int num_records = 0;
  1849		int i;
  1850	
  1851		/* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
  1852		for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
  1853			xc.xfeat_type = i;
  1854			xc.xfeat_sz = xstate_sizes[i];
  1855			xc.xfeat_off = xstate_offsets[i];
  1856			xc.xfeat_flags = xstate_flags[i];
  1857	
> 1858			if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
  1859				return 0;
  1860			num_records++;
  1861		}
  1862	
  1863		for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
  1864			xc.xfeat_type = i;
  1865			xc.xfeat_sz = xstate_sizes[i];
  1866			xc.xfeat_off = xstate_offsets[i];
  1867			xc.xfeat_flags = xstate_flags[i];
  1868	
  1869			if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
  1870				return 0;
  1871			num_records++;
  1872		}
  1873	
  1874		return num_records;
  1875	}
  1876	
  1877	static int get_xsave_desc_size(void)
  1878	{
  1879		/* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
  1880		int xfeatures_count = 2;
  1881		int i;
  1882	
  1883		for_each_extended_xfeature(i, fpu_user_cfg.max_features)
  1884			xfeatures_count++;
  1885	
  1886		return xfeatures_count * (sizeof(struct xfeat_component));
  1887	}
  1888	
  1889	int elf_coredump_extra_notes_write(struct coredump_params *cprm)
  1890	{
  1891		const char *owner_name = "LINUX";
  1892		int num_records = 0;
  1893		struct elf_note en;
  1894	
  1895		en.n_namesz = strlen(owner_name) + 1;
  1896		en.n_descsz = get_xsave_desc_size();
  1897		en.n_type = NT_X86_XSAVE_LAYOUT;
  1898	
  1899		if (!dump_emit(cprm, &en, sizeof(en)))
  1900			return 1;
  1901		if (!dump_emit(cprm, owner_name, en.n_namesz))
  1902			return 1;
> 1903		if (!dump_align(cprm, 4))
  1904			return 1;
  1905	
  1906		num_records = dump_xsave_layout_desc(cprm);
  1907		if (!num_records) {
  1908			pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
  1909			return 1;
  1910		}
  1911	
  1912		/* Total size should be equal to the number of records */
  1913		if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
  1914			pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
  1915			return 1;
  1916		}
  1917	
  1918		if (!dump_align(cprm, 4))
  1919			return 1;
  1920	
  1921		return 0;
  1922	}
  1923
  
kernel test robot March 15, 2024, 12:59 p.m. UTC | #13
Hi Vignesh,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.8 next-20240315]
[cannot apply to kees/for-next/execve tip/x86/core powerpc/next powerpc/fixes]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vignesh-Balasubramanian/x86-elf-Add-a-new-note-section-containing-Xfeatures-information-to-x86-core-files/20240314-192650
base:   linus/master
patch link:    https://lore.kernel.org/r/20240314112359.50713-2-vigbalas%40amd.com
patch subject: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files
config: x86_64-randconfig-122-20240315 (https://download.01.org/0day-ci/archive/20240315/202403152037.J6Fn7uiP-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240315/202403152037.J6Fn7uiP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403152037.J6Fn7uiP-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/kernel/fpu/xstate.c: In function 'dump_xsave_layout_desc':
>> arch/x86/kernel/fpu/xstate.c:1858:22: error: implicit declaration of function 'dump_emit'; did you mean 'dir_emit'? [-Werror=implicit-function-declaration]
    1858 |                 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
         |                      ^~~~~~~~~
         |                      dir_emit
   arch/x86/kernel/fpu/xstate.c: In function 'elf_coredump_extra_notes_write':
>> arch/x86/kernel/fpu/xstate.c:1903:14: error: implicit declaration of function 'dump_align'; did you mean 'dump_mapping'? [-Werror=implicit-function-declaration]
    1903 |         if (!dump_align(cprm, 4))
         |              ^~~~~~~~~~
         |              dump_mapping
   cc1: some warnings being treated as errors


vim +1858 arch/x86/kernel/fpu/xstate.c

  1846	
  1847		struct xfeat_component xc;
  1848		int num_records = 0;
  1849		int i;
  1850	
  1851		/* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
  1852		for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
  1853			xc.xfeat_type = i;
  1854			xc.xfeat_sz = xstate_sizes[i];
  1855			xc.xfeat_off = xstate_offsets[i];
  1856			xc.xfeat_flags = xstate_flags[i];
  1857	
> 1858			if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
  1859				return 0;
  1860			num_records++;
  1861		}
  1862	
  1863		for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
  1864			xc.xfeat_type = i;
  1865			xc.xfeat_sz = xstate_sizes[i];
  1866			xc.xfeat_off = xstate_offsets[i];
  1867			xc.xfeat_flags = xstate_flags[i];
  1868	
  1869			if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
  1870				return 0;
  1871			num_records++;
  1872		}
  1873	
  1874		return num_records;
  1875	}
  1876	
  1877	static int get_xsave_desc_size(void)
  1878	{
  1879		/* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
  1880		int xfeatures_count = 2;
  1881		int i;
  1882	
  1883		for_each_extended_xfeature(i, fpu_user_cfg.max_features)
  1884			xfeatures_count++;
  1885	
  1886		return xfeatures_count * (sizeof(struct xfeat_component));
  1887	}
  1888	
  1889	int elf_coredump_extra_notes_write(struct coredump_params *cprm)
  1890	{
  1891		const char *owner_name = "LINUX";
  1892		int num_records = 0;
  1893		struct elf_note en;
  1894	
  1895		en.n_namesz = strlen(owner_name) + 1;
  1896		en.n_descsz = get_xsave_desc_size();
  1897		en.n_type = NT_X86_XSAVE_LAYOUT;
  1898	
  1899		if (!dump_emit(cprm, &en, sizeof(en)))
  1900			return 1;
  1901		if (!dump_emit(cprm, owner_name, en.n_namesz))
  1902			return 1;
> 1903		if (!dump_align(cprm, 4))
  1904			return 1;
  1905	
  1906		num_records = dump_xsave_layout_desc(cprm);
  1907		if (!num_records) {
  1908			pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
  1909			return 1;
  1910		}
  1911	
  1912		/* Total size should be equal to the number of records */
  1913		if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
  1914			pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
  1915			return 1;
  1916		}
  1917	
  1918		if (!dump_align(cprm, 4))
  1919			return 1;
  1920	
  1921		return 0;
  1922	}
  1923
  
Thomas Gleixner March 15, 2024, 11:51 p.m. UTC | #14
On Thu, Mar 14 2024 at 17:29, Borislav Petkov wrote:
> On Thu, Mar 14, 2024 at 09:19:15AM -0700, Dave Hansen wrote:
>> Are you envisioning an *XSAVE* state component that's not described by
>> CPUID?
>
> I want to be prepared. You can imagine some of the short cuts and
> corners cutting hw guys would do so I'd want to be prepared there and
> not tie this to CPUID.

Anything which is not enumerated in CPUID does not exist in
XSTATE. Period and end of story.

Stop paving the way for hardware people to have an excuse for being
stupid.

Aside of that XSTATE is a dead end as we all know.

Thanks,

        tglx
  
Borislav Petkov March 16, 2024, 10:29 a.m. UTC | #15
On Sat, Mar 16, 2024 at 12:51:28AM +0100, Thomas Gleixner wrote:
> Anything which is not enumerated in CPUID does not exist in
> XSTATE. Period and end of story.

But why not have a simple buffer definition which doesn't need CPUID?

Also, doing the CPUID thing would need extending the gdb remote protocol
as explained here:

https://lore.kernel.org/r/971d21b7-0309-439e-91b6-234f84da959d@FreeBSD.org

The simple buffer layout won't.

So regardless of where hw is going, I think a simple buffer definition
is always better.

Thx.
  
Balasubrmanian, Vignesh March 26, 2024, 9:59 a.m. UTC | #16
On 3/14/2024 10:09 PM, Dave Hansen wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On 3/14/24 09:29, Borislav Petkov wrote:
>>> That argument breaks down a bit on the flags though:
>>>
>>>       xc.xfeat_flags = xstate_flags[i];
>>>
>>> Because it comes _directly_ from CPUID with zero filtering:
>>>
>>>       cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
>>>       ...
>>>       xstate_flags[i] = ecx;
>>>
>>> So this layout is quite dependent on what's in x86's CPUID.
>> Yeah, no, this should not be copying CPUID flags - those flags should be
>> *translated* to independently defined flags which describe those
>> buffers.
> Ditto for:
>
>          xc.xfeat_type = i;
>
> Right now, that's bound to CPUID and XSAVE.  "feat_type==10" can only
> ever be PKRU and that's derived from the XSAVE architecture.
>
> If you want this to be extensible to things outside of the XSAVE
> architecture, it needs to be something actually extensible and not
> entangled with XSAVE.
>
> In other words "xc.xfeat_type" can enumerate XSAVE state components
> being in the dump, but it should not be limited to XSAVE.  Just as an
> example:
>
> enum feat_type {
>          FEATURE_XSAVE_PKRU,
>          FEATURE_XSAVE__YMM,
>          FEATURE_XSAVE_BNDREGS,
>          FEATURE_XSAVE_BNDCSR,
>          ...
>          RANDOM_STATE_NOT_XSAVE
> };
>
> See how feat_type==1 is PKRU and *NOT* feat_type==10?  That opens the
> door to RANDOM_STATE_NOT_XSAVE or anything else you want.  This would be
> _actually_ extensible.


Thanks for the review.
I will add new enum, instead of using "enum xfeature".
Currently we are retaining the flags field. The value will be set to 
zero at this point, and the field will be reserved for future use.
GDB / LLDB would not require this field at this point. Do let us know if 
this is not OK.

-thanks,
vigneshbalu.
  
Balasubrmanian, Vignesh March 26, 2024, 10:06 a.m. UTC | #17
> Otherwise looks reasonable, though I see Dave has feedback to address
> too. :)
>
> Thanks for working on this!
>
> -Kees
Thank you for the review.
I will address all this on next version.

thanks,
vigneshbalu.

> --
> Kees Cook
  
Balasubrmanian, Vignesh March 26, 2024, 10:09 a.m. UTC | #18
> IMHO you should split the changes to replace ARCH_HAVE_EXTRA_ELF_NOTES
> with CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES into a lead-up patch.
>
> cheers
Thanks for the input and i will take care in next version.

regards,
vigneshbalu.
  

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index fd18b7db2c77..3bd8a0b2bba1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -502,6 +502,15 @@  config MMU_LAZY_TLB_SHOOTDOWN
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
+config ARCH_HAVE_EXTRA_ELF_NOTES
+	bool
+	help
+	  An architecture should select this in order to enable adding an
+	  arch-specific ELF note section to core files. It must provide two
+	  functions: elf_coredump_extra_notes_size() and
+	  elf_coredump_extra_notes_write() which are invoked by the ELF core
+	  dumper.
+
 config ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a91cb070ca4a..3b31bd7490e2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -156,6 +156,7 @@  config PPC
 	select ARCH_HAS_UACCESS_FLUSHCACHE
 	select ARCH_HAS_UBSAN
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_EXTRA_ELF_NOTES        if SPU_BASE
 	select ARCH_KEEP_MEMBLOCK
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE	if PPC_RADIX_MMU
 	select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 79f1c480b5eb..bb4b94444d3e 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -127,8 +127,6 @@  extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 /* Notes used in ET_CORE. Note name is "SPU/<fd>/<filename>". */
 #define NT_SPU		1
 
-#define ARCH_HAVE_EXTRA_ELF_NOTES
-
 #endif /* CONFIG_SPU_BASE */
 
 #ifdef CONFIG_PPC64
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 78050d5d7fac..35e8d1201099 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -104,6 +104,7 @@  config X86
 	select ARCH_HAS_DEBUG_WX
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_EXTRA_ELF_NOTES
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 1fb83d47711f..1b9f0b4bf6bc 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -13,6 +13,13 @@ 
 #include <asm/auxvec.h>
 #include <asm/fsgsbase.h>
 
+struct xfeat_component {
+	u32 xfeat_type;
+	u32 xfeat_sz;
+	u32 xfeat_off;
+	u32 xfeat_flags;
+} __packed;
+
 typedef unsigned long elf_greg_t;
 
 #define ELF_NGREG (sizeof(struct user_regs_struct) / sizeof(elf_greg_t))
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 117e74c44e75..6e5ea483ec1d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -13,6 +13,7 @@ 
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
 #include <linux/vmalloc.h>
+#include <linux/coredump.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/regset.h>
@@ -1836,3 +1837,103 @@  int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
+/*
+ * Dump type, size, offset and flag values for every xfeature that is present.
+ */
+static int dump_xsave_layout_desc(struct coredump_params *cprm)
+{
+
+	struct xfeat_component xc;
+	int num_records = 0;
+	int i;
+
+	/* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
+	for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
+		xc.xfeat_type = i;
+		xc.xfeat_sz = xstate_sizes[i];
+		xc.xfeat_off = xstate_offsets[i];
+		xc.xfeat_flags = xstate_flags[i];
+
+		if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
+			return 0;
+		num_records++;
+	}
+
+	for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
+		xc.xfeat_type = i;
+		xc.xfeat_sz = xstate_sizes[i];
+		xc.xfeat_off = xstate_offsets[i];
+		xc.xfeat_flags = xstate_flags[i];
+
+		if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
+			return 0;
+		num_records++;
+	}
+
+	return num_records;
+}
+
+static int get_xsave_desc_size(void)
+{
+	/* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
+	int xfeatures_count = 2;
+	int i;
+
+	for_each_extended_xfeature(i, fpu_user_cfg.max_features)
+		xfeatures_count++;
+
+	return xfeatures_count * (sizeof(struct xfeat_component));
+}
+
+int elf_coredump_extra_notes_write(struct coredump_params *cprm)
+{
+	const char *owner_name = "LINUX";
+	int num_records = 0;
+	struct elf_note en;
+
+	en.n_namesz = strlen(owner_name) + 1;
+	en.n_descsz = get_xsave_desc_size();
+	en.n_type = NT_X86_XSAVE_LAYOUT;
+
+	if (!dump_emit(cprm, &en, sizeof(en)))
+		return 1;
+	if (!dump_emit(cprm, owner_name, en.n_namesz))
+		return 1;
+	if (!dump_align(cprm, 4))
+		return 1;
+
+	num_records = dump_xsave_layout_desc(cprm);
+	if (!num_records) {
+		pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
+		return 1;
+	}
+
+	/* Total size should be equal to the number of records */
+	if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
+		pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
+		return 1;
+	}
+
+	if (!dump_align(cprm, 4))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Return the size of new note.
+ */
+int elf_coredump_extra_notes_size(void)
+{
+	const char *fullname = "LINUX";
+	int size = 0;
+
+	/* NOTE Header */
+	size += sizeof(struct elf_note);
+	/* name + align */
+	size += roundup(strlen(fullname) + 1, 4);
+	size += get_xsave_desc_size();
+
+	return size;
+}
diff --git a/include/linux/elf.h b/include/linux/elf.h
index c9a46c4e183b..5c402788da19 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -65,7 +65,7 @@  extern Elf64_Dyn _DYNAMIC [];
 struct file;
 struct coredump_params;
 
-#ifndef ARCH_HAVE_EXTRA_ELF_NOTES
+#ifndef CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES
 static inline int elf_coredump_extra_notes_size(void) { return 0; }
 static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) { return 0; }
 #else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 9417309b7230..3325488cb39b 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -411,6 +411,7 @@  typedef struct elf64_shdr {
 #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
 /* Old binutils treats 0x203 as a CET state */
 #define NT_X86_SHSTK	0x204		/* x86 SHSTK state */
+#define NT_X86_XSAVE_LAYOUT	0x205	/* XSAVE layout description */
 #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
 #define NT_S390_TIMER	0x301		/* s390 timer register */
 #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */