[ping,0/8,RFC] Support BTF decl_tag and type_tag annotations

Message ID b29e8319-3a63-5821-4475-9243e71cb11f@oracle.com

David Faust April 18, 2022, 7:36 p.m. UTC
  Gentle ping :)

Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-April/592685.html

The series adds support for new attributes btf_type_tag and btf_decl_tag,
for recording arbitrary string tags in DWARF and BTF debug info. The
feature is intended to support kernel use cases.

Thanks,
David

On 4/1/22 12:42, David Faust via Gcc-patches wrote:
> Hello,
> 
> This patch series is a first attempt at adding support for:
> 
> - Two new C-language-level attributes that allow particular declarations and
>    types to be associated with ("tagged" with) arbitrary strings. As explained
>    below, this is intended to be used to, for example, characterize certain
>    pointer types.
> 
> - The conveyance of that information in the DWARF output in the form of a new
>    DIE: DW_TAG_GNU_annotation.
> 
> - The conveyance of that information in the BTF output in the form of two new
>    kinds of BTF objects: BTF_KIND_DECL_TAG and BTF_KIND_TYPE_TAG.
> 
> All of these facilities are being added to the eBPF ecosystem, and support for
> them exists in some form in LLVM. However, as we shall see, we have found some
> problems implementing them so some discussion is in order.
> 
> Purpose
> =======
> 
> 1)  Addition of C-family language constructs (attributes) to specify free-text
>      tags on certain language elements, such as struct fields.
> 
>      The purpose of these annotations is to provide additional information about
>      types, variables, and function parameters of interest to the kernel. A
>      driving use case is to tag pointer types within the linux kernel and eBPF
>      programs with additional semantic information, such as '__user' or '__rcu'.
> 
>      For example, consider the linux kernel function do_execve with the
>      following declaration:
> 
>        static int do_execve(struct filename *filename,
>           const char __user *const __user *__argv,
>           const char __user *const __user *__envp);
> 
>      Here, __user could be defined with these annotations to record semantic
>      information about the pointer parameters (e.g., they are user-provided) in
>      DWARF and BTF information. Other kernel facilities such as the eBPF verifier
>      can read the tags and make use of the information.
> 
> 2)  Conveying the tags in the generated DWARF debug info.
> 
>      The main motivation for emitting the tags in DWARF is that the Linux kernel
>      generates its BTF information via pahole, using DWARF as a source:
> 
>          +--------+  BTF                  BTF   +----------+
>          | pahole |-------> vmlinux.btf ------->| verifier |
>          +--------+                             +----------+
>              ^                                        ^
>              |                                        |
>        DWARF |                                    BTF |
>              |                                        |
>           vmlinux                              +-------------+
>           module1.ko                           | BPF program |
>           module2.ko                           +-------------+
>             ...
> 
>      This is because:
> 
>      a)  Unlike GCC, LLVM will only generate BTF for BPF programs.
> 
>      b)  GCC can generate BTF for whatever target with -gbtf, but there is no
>          support for linking/deduplicating BTF in the linker.
> 
>      In the scenario above, the verifier needs access to the pointer tags of
>      both the kernel types/declarations (conveyed in the DWARF and translated
>      to BTF by pahole) and those of the BPF program (available directly in BTF).
> 
>      Another motivation for having the tag information in DWARF, unrelated to
>      BPF and BTF, is that the drgn project (another DWARF consumer) also wants
>      to benefit from these tags in order to differentiate between different
>      kinds of pointers in the kernel.
> 
> 3)  Conveying the tags in the generated BTF debug info.
> 
>      This is easy: the main purpose of having this info in BTF is for the
>      compiled eBPF programs. The kernel verifier can then access the tags
>      of pointers used by the eBPF programs.
> 
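
As a minimal sketch of item 1) above, and only under stated assumptions, this
is roughly how a __user marker could be defined on top of the new type tag
attribute. The TYPE_TAG helper macro, the tag string "user" and the count_args
function are hypothetical names chosen for illustration; the kernel's actual
definition may differ. The macro expands to nothing when the compiler does not
know the attribute, so the code still builds and the tag is simply not
recorded.

```c
#include <assert.h>
#include <stddef.h>

/* Use the attribute only when the compiler reports support for it. */
#if defined(__has_attribute)
# if __has_attribute(btf_type_tag)
#  define TYPE_TAG(s) __attribute__((btf_type_tag(s)))
# endif
#endif
#ifndef TYPE_TAG
# define TYPE_TAG(s)
#endif

/* Hypothetical definition; the tag string "user" is an assumption. */
#define __user TYPE_TAG("user")

/* Same parameter shape as __argv in do_execve: count the entries of a
   NULL-terminated vector. The tags have no effect on code generation. */
static size_t count_args(const char __user *const __user *argv)
{
    size_t n = 0;

    assert(argv != NULL);
    while (argv[n] != NULL)
        n++;
    return n;
}
```

Calling count_args on a vector holding "ls", "-l" and a terminating NULL
returns 2, with or without the attribute being available.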
> 
> For more information about these tags and the motivation behind them, please
> refer to the following linux kernel discussions:
> 
>    https://lore.kernel.org/bpf/20210914223004.244411-1-yhs@fb.com/
>    https://lore.kernel.org/bpf/20211012164838.3345699-1-yhs@fb.com/
>    https://lore.kernel.org/bpf/20211112012604.1504583-1-yhs@fb.com/
> 
> 
> What is in this patch series
> ============================
> 
> This patch series adds support for these annotations in GCC. The implementation
> is largely complete. However, in some cases the produced debug info (both DWARF
> and BTF) differs significantly from that produced by LLVM. This issue is
> discussed in detail below, along with a few specific questions for both GCC and
> LLVM. Any input would be much appreciated.
> 
> 
> Implementation Overview
> =======================
> 
> To enable these annotations, two new C language attributes are added:
> __attribute__((btf_decl_tag("foo"))) and __attribute__((btf_type_tag("bar"))).
> Both attributes accept a single arbitrary string constant argument, which will
> be recorded in the generated DWARF and/or BTF debugging information. They have
> no effect on code generation.
> 
> Note that we are using the same attribute names as LLVM, which include "btf"
> in the name. This may be controversial, as these tags are not really
> BTF-specific. A different name may be more appropriate. There was much
> discussion about naming in the proposal for the functionality in LLVM; the
> full thread can be found here:
> 
>    https://lists.llvm.org/pipermail/llvm-dev/2021-June/151023.html
> 
> The name debug_info_annotate, suggested here, might better suit the attribute:
> 
>    https://lists.llvm.org/pipermail/llvm-dev/2021-June/151042.html
> 
> DWARF support is enabled via a new DW_TAG_GNU_annotation. When generating DWARF,
> declarations and types will be checked for the corresponding attributes. If
> present, a DW_TAG_GNU_annotation DIE will be created as a child of the DIE for
> the annotated type or declaration, one for each tag. These DIEs link the
> arbitrary tag value to the item they annotate.
> 
> For example, the following variable declaration:
> 
>      #define __typetag1 __attribute__((btf_type_tag("type-tag-1")))
>      #define __decltag1 __attribute__((btf_decl_tag("decl-tag-1")))
>      #define __decltag2 __attribute__((btf_decl_tag("decl-tag-2")))
> 
>      int __typetag1 * x __decltag1 __decltag2;
> 
> Produces the following DIEs:
> 
>   <1><1e>: Abbrev Number: 3 (DW_TAG_variable)
>      <1f>   DW_AT_name        : x
>      <21>   DW_AT_decl_file   : 1
>      <22>   DW_AT_decl_line   : 6
>      <23>   DW_AT_decl_column : 18
>      <24>   DW_AT_type        : <0x49>
>      <28>   DW_AT_external    : 1
>      <28>   DW_AT_location    : 9 byte block: 3 0 0 0 0 0 0 0 0 	(DW_OP_addr: 0)
>      <32>   DW_AT_sibling     : <0x49>
>   <2><36>: Abbrev Number: 1 (User TAG value: 0x6000)
>      <37>   DW_AT_name        : (indirect string, offset: 0x10): btf_decl_tag
>      <3b>   DW_AT_const_value : (indirect string, offset: 0x0): decl-tag-2
>   <2><3f>: Abbrev Number: 1 (User TAG value: 0x6000)
>      <40>   DW_AT_name        : (indirect string, offset: 0x10): btf_decl_tag
>      <44>   DW_AT_const_value : (indirect string, offset: 0x1d): decl-tag-1
>   <2><48>: Abbrev Number: 0
>   <1><49>: Abbrev Number: 4 (DW_TAG_pointer_type)
>      <4a>   DW_AT_byte_size   : 8
>      <4b>   DW_AT_type        : <0x5d>
>      <4f>   DW_AT_sibling     : <0x5d>
>   <2><53>: Abbrev Number: 1 (User TAG value: 0x6000)
>      <54>   DW_AT_name        : (indirect string, offset: 0x28): btf_type_tag
>      <58>   DW_AT_const_value : (indirect string, offset: 0xd7): type-tag-1
>   <2><5c>: Abbrev Number: 0
>   <1><5d>: Abbrev Number: 5 (DW_TAG_base_type)
>      <5e>   DW_AT_byte_size   : 4
>      <5f>   DW_AT_encoding    : 5	(signed)
>      <60>   DW_AT_name        : int
>   <1><64>: Abbrev Number: 0
> 
> Please note that currently, the annotation DWARF DIEs will be generated only if
> BTF debug information is requested (via -gbtf). Therefore, the annotation DIEs
> will only be output if both BTF and DWARF are requested (e.g. -gbtf -gdwarf).
> This will change, since these tags are needed even when not generating BTF,
> for example in a GCC-built Linux kernel.
> 
> In the case of BTF, the annotations are recorded in two type kinds recently
> added to the BTF specification: BTF_KIND_DECL_TAG and BTF_KIND_TYPE_TAG.
> The above example declaration produces the following BTF information:
> 
>      [1] int 'int'(1U#B) size=4U#B offset=0UB#b bits=32UB#b SIGNED
>      [2] ptr <anonymous> type=3
>      [3] type_tag 'type-tag-1'(5U#B) type=1
>      [4] decl_tag 'decl-tag-1'(18U#B) type=6 component_idx=-1
>      [5] decl_tag 'decl-tag-2'(29U#B) type=6 component_idx=-1
>      [6] var 'x'(16U#B) type=2 linkage=1 (global)
> 
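
To make the reference structure of the BTF dump above easier to follow, here
is a small sketch that models the six records as a C table and walks the type
chain starting at the variable. The struct rec layout and the helper names
format_chain and count_decl_tags are illustrative assumptions; this is not the
real struct btf_type encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>   /* strcmp, handy when checking the formatted chain */

/* Simplified stand-in for BTF records, not the on-disk format. */
enum kind { K_NONE, K_INT, K_PTR, K_TYPE_TAG, K_DECL_TAG, K_VAR };

struct rec {
    enum kind kind;
    const char *name;   /* type name, tag string, or variable name */
    int type;           /* ID of the referenced record, 0 if none */
};

/* Records [1]..[6] from the dump above; index 0 is an unused placeholder. */
static const struct rec btf_example[] = {
    { K_NONE,     "",           0 },
    { K_INT,      "int",        0 },
    { K_PTR,      "",           3 },
    { K_TYPE_TAG, "type-tag-1", 1 },
    { K_DECL_TAG, "decl-tag-1", 6 },
    { K_DECL_TAG, "decl-tag-2", 6 },
    { K_VAR,      "x",          2 },
};

/* Follow the 'type' references from record 'id' and write the chain to buf. */
static void format_chain(const struct rec *tab, int id, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    while (id != 0 && used < len) {
        const struct rec *r = &tab[id];
        const char *sep = used ? " -> " : "";
        int n;

        switch (r->kind) {
        case K_VAR:
            n = snprintf(buf + used, len - used, "%sVAR(%s)", sep, r->name);
            break;
        case K_PTR:
            n = snprintf(buf + used, len - used, "%sptr", sep);
            break;
        case K_TYPE_TAG:
            n = snprintf(buf + used, len - used, "%stype_tag(%s)", sep, r->name);
            break;
        default:
            n = snprintf(buf + used, len - used, "%s%s", sep, r->name);
            break;
        }
        if (n < 0)
            break;
        used += (size_t)n;
        id = r->type;
    }
}

/* Count the decl_tag records that point at record 'target'. */
static int count_decl_tags(const struct rec *tab, int nrecs, int target)
{
    int i, count = 0;

    for (i = 1; i < nrecs; i++)
        if (tab[i].kind == K_DECL_TAG && tab[i].type == target)
            count++;
    return count;
}
```

Starting at record 6 the walk visits the var, the pointer, the type tag and
finally the int, while both decl tags point back at the var rather than being
part of its type chain.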
> 
> Current issues in the implementation
> ====================================
> 
> The __attribute__((btf_type_tag ("foo"))) syntax does not work correctly for
> types involving multiple pointers.
> 
> Consider the following example:
> 
>    #define __typetag1 __attribute__((btf_type_tag("type-tag-1")))
>    #define __typetag2 __attribute__((btf_type_tag("type-tag-2")))
>    #define __typetag3 __attribute__((btf_type_tag("type-tag-3")))
> 
>    int __typetag1 * __typetag2 __typetag3 * g;
> 
> The current implementation produces the following DWARF:
> 
>   <1><1e>: Abbrev Number: 4 (DW_TAG_variable)
>      <1f>   DW_AT_name        : g
>      <21>   DW_AT_decl_file   : 1
>      <22>   DW_AT_decl_line   : 6
>      <23>   DW_AT_decl_column : 42
>      <24>   DW_AT_type        : <0x32>
>      <28>   DW_AT_external    : 1
>      <28>   DW_AT_location    : 9 byte block: 3 0 0 0 0 0 0 0 0 	(DW_OP_addr: 0)
>   <1><32>: Abbrev Number: 2 (DW_TAG_pointer_type)
>      <33>   DW_AT_byte_size   : 8
>      <33>   DW_AT_type        : <0x45>
>      <37>   DW_AT_sibling     : <0x45>
>   <2><3b>: Abbrev Number: 1 (User TAG value: 0x6000)
>      <3c>   DW_AT_name        : (indirect string, offset: 0x18): btf_type_tag
>      <40>   DW_AT_const_value : (indirect string, offset: 0xc7): type-tag-1
>   <2><44>: Abbrev Number: 0
>   <1><45>: Abbrev Number: 2 (DW_TAG_pointer_type)
>      <46>   DW_AT_byte_size   : 8
>      <46>   DW_AT_type        : <0x61>
>      <4a>   DW_AT_sibling     : <0x61>
>   <2><4e>: Abbrev Number: 1 (User TAG value: 0x6000)
>      <4f>   DW_AT_name        : (indirect string, offset: 0x18): btf_type_tag
>      <53>   DW_AT_const_value : (indirect string, offset: 0xd): type-tag-3
>   <2><57>: Abbrev Number: 1 (User TAG value: 0x6000)
>      <58>   DW_AT_name        : (indirect string, offset: 0x18): btf_type_tag
>      <5c>   DW_AT_const_value : (indirect string, offset: 0xd2): type-tag-2
>   <2><60>: Abbrev Number: 0
>   <1><61>: Abbrev Number: 5 (DW_TAG_base_type)
>      <62>   DW_AT_byte_size   : 4
>      <63>   DW_AT_encoding    : 5	(signed)
>      <64>   DW_AT_name        : int
>   <1><68>: Abbrev Number: 0
> 
> This does not agree with the DWARF produced by LLVM/clang for the same case:
> (clang 15.0.0 git 142501117a78080d2615074d3986fa42aa6a0734)
> 
>   <1><1e>: Abbrev Number: 2 (DW_TAG_variable)
>      <1f>   DW_AT_name        : (indexed string: 0x3): g
>      <20>   DW_AT_type        : <0x29>
>      <24>   DW_AT_external    : 1
>      <24>   DW_AT_decl_file   : 0
>      <25>   DW_AT_decl_line   : 6
>      <26>   DW_AT_location    : 2 byte block: a1 0 	((Unknown location op 0xa1))
>   <1><29>: Abbrev Number: 3 (DW_TAG_pointer_type)
>      <2a>   DW_AT_type        : <0x35>
>   <2><2e>: Abbrev Number: 4 (User TAG value: 0x6000)
>      <2f>   DW_AT_name        : (indexed string: 0x5): btf_type_tag
>      <30>   DW_AT_const_value : (indexed string: 0x7): type-tag-2
>   <2><31>: Abbrev Number: 4 (User TAG value: 0x6000)
>      <32>   DW_AT_name        : (indexed string: 0x5): btf_type_tag
>      <33>   DW_AT_const_value : (indexed string: 0x8): type-tag-3
>   <2><34>: Abbrev Number: 0
>   <1><35>: Abbrev Number: 3 (DW_TAG_pointer_type)
>      <36>   DW_AT_type        : <0x3e>
>   <2><3a>: Abbrev Number: 4 (User TAG value: 0x6000)
>      <3b>   DW_AT_name        : (indexed string: 0x5): btf_type_tag
>      <3c>   DW_AT_const_value : (indexed string: 0x6): type-tag-1
>   <2><3d>: Abbrev Number: 0
>   <1><3e>: Abbrev Number: 5 (DW_TAG_base_type)
>      <3f>   DW_AT_name        : (indexed string: 0x4): int
>      <40>   DW_AT_encoding    : 5	(signed)
>      <41>   DW_AT_byte_size   : 4
>   <1><42>: Abbrev Number: 0
> 
> Notice the structural difference. From the DWARF produced by GCC (i.e. this
> patch series), variable 'g' is a pointer with tag 'type-tag-1' to a pointer
> with tags 'type-tag-2' and 'type-tag-3' to an int. But from the LLVM DWARF,
> variable 'g' is a pointer with tags 'type-tag-2' and 'type-tag-3' to a pointer
> with tag 'type-tag-1' to an int.
> 
> Because GCC produces BTF from the internal DWARF DIE tree, the BTF also differs.
> This can be seen most obviously in the BTF type reference chains:
> 
>    GCC
>      VAR (g) -> ptr -> tag1 -> ptr -> tag3 -> tag2 -> int
> 
>    LLVM
>      VAR (g) -> ptr -> tag3 -> tag2 -> ptr -> tag1 -> int
> 
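
The two chains above differ only in which pointer level carries which tags.
The following sketch illustrates that with a toy model: each pointer level
holds a list of tags, and the chain is built by emitting "ptr" followed by
that level's tags, outermost level first. The names ptr_level, append and
build_chain are hypothetical, and the order of tags within a level is simply
copied from the chains quoted above rather than derived from any rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>   /* strcmp, handy when comparing the two chains */

#define MAX_TAGS 4

/* One pointer level and the type tags attached to it (toy model). */
struct ptr_level {
    const char *tags[MAX_TAGS];
    int ntags;
};

/* Tag placement as seen in the GCC output: outer pointer carries tag1. */
static const struct ptr_level gcc_levels[] = {
    { { "tag1" }, 1 },
    { { "tag3", "tag2" }, 2 },
};

/* Tag placement as seen in the LLVM output: outer pointer carries tag3, tag2. */
static const struct ptr_level llvm_levels[] = {
    { { "tag3", "tag2" }, 2 },
    { { "tag1" }, 1 },
};

/* Append 'text' to buf, preceded by " -> " unless buf is still empty. */
static size_t append(char *buf, size_t len, size_t used, const char *text)
{
    int n = snprintf(buf + used, len - used, "%s%s", used ? " -> " : "", text);

    assert(n >= 0 && (size_t)n < len - used);
    return used + (size_t)n;
}

/* Emit: var, then for each level "ptr" plus its tags, then the base type. */
static void build_chain(const char *var, const struct ptr_level *levels,
                        int nlevels, const char *base, char *buf, size_t len)
{
    size_t used;
    int i, j;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    used = append(buf, len, 0, var);
    for (i = 0; i < nlevels; i++) {
        used = append(buf, len, used, "ptr");
        for (j = 0; j < levels[i].ntags; j++)
            used = append(buf, len, used, levels[i].tags[j]);
    }
    append(buf, len, used, base);
}
```

Feeding gcc_levels and llvm_levels to build_chain with "VAR(g)" and "int"
gives two chains with the same shape as the GCC and LLVM ones listed above.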
> It seems that the ultimate cause for this is the structure of the TREE
> produced by the C frontend parsing and attribute handling. I believe this may
> be due to differences in __attribute__ syntax parsing between GCC and LLVM.
> 
> This is the TREE for variable 'g':
>    int __typetag1 * __typetag2 __typetag3 * g;
> 
>   <var_decl 0x7ffff7547090 g
>      type <pointer_type 0x7ffff7548000
>          type <pointer_type 0x7ffff75097e0 type <integer_type 0x7ffff74495e8 int>
>              asm_written unsigned DI
>              size <integer_cst 0x7ffff743c450 constant 64>
>              unit-size <integer_cst 0x7ffff743c468 constant 8>
>              align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7ffff7450888
>              attributes <tree_list 0x7ffff75275c8
>                  purpose <identifier_node 0x7ffff753a1e0 btf_type_tag>
>                  value <tree_list 0x7ffff7527550
>                      value <string_cst 0x7ffff75292e0 type <array_type 0x7ffff7509738>
>                          readonly constant static "type-tag-3\000">>
>                  chain <tree_list 0x7ffff75275a0 purpose <identifier_node 0x7ffff753a1e0 btf_type_tag>
>                      value <tree_list 0x7ffff75274d8
>                          value <string_cst 0x7ffff75292c0 type <array_type 0x7ffff7509738>
>                              readonly constant static "type-tag-2\000">>>>
>              pointer_to_this <pointer_type 0x7ffff7509888>>
>          asm_written unsigned DI size <integer_cst 0x7ffff743c450 64> unit-size <integer_cst 0x7ffff743c468 8>
>          align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7ffff7509930
>          attributes <tree_list 0x7ffff75275f0 purpose <identifier_node 0x7ffff753a1e0 btf_type_tag>
>              value <tree_list 0x7ffff7527438
>                  value <string_cst 0x7ffff75292a0 type <array_type 0x7ffff7509738>
>                      readonly constant static "type-tag-1\000">>>>
>      public static unsigned DI defer-output /home/dfaust/playpen/btf/annotate.c:29:42 size <integer_cst 0x7ffff743c450 64> unit-size <integer_cst 0x7ffff743c468 8>
>      align:64 warn_if_not_align:0>
> 
> To me this is surprising. I would have expected the int** type of "g" to have
> the tags 'type-tag-2' and 'type-tag-3', and the inner (int*) pointer type to
> have the 'type-tag-1' tag. So far my attempts at resolving this difference in
> the new attribute handlers for the tag attributes have not been successful.
> 
> I do not understand why exactly the attributes are attached in this way. I think
> that it may be related to the pointer cases discussed in the "All other
> attributes" section here:
> 
>    https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html
> 
> In particular it seems similar to this example:
> 
>      char *__attribute__((aligned(8))) *f;
> 
>    specifies the type “pointer to 8-byte-aligned pointer to char”. Note again
>    that this does not work with most attributes; for example, the usage of
>    ‘aligned’ and ‘noreturn’ attributes given above is not yet supported.
> 
> I am not sure if this section of the documentation is outdated, if scenarios
> like this one have not been an issue before now, or if there is a way to
> resolve this within the attribute handler. I am by no means an expert in the C
> frontend or attribute handling. If someone with more knowledge could help me
> understand this case I would be very grateful. :)
> 
> Questions for GCC
> =================
> 
> 1)  How can this issue with the type tags be resolved? Is this a bug or
>      limitation in the attribute parsing that hasn't been an issue until now?
>      Or is it that the above case is somehow a "weird" usage of attributes?
> 
> 2)  Are attributes the right tool for this? Is there some other mechanism that
>      would better fit the design of these tags? In some ways the type tags seem
>      more similar to const/volatile/restrict qualifiers than to most other
>      attributes.
> 
> 
> Questions for LLVM / kernel BPF
> ===============================
> 
> 1)  What special handling does the LLVM frontend/clang do for these attributes?
>      Is there anything specific? Or does it simply follow whatever is default?
> 
> 2)  What is the correct BTF representation for type tags? The documentation for
>      BTF_KIND_TYPE_TAG in linux/Documentation/bpf/btf.rst seems to conflict with
>      the output of clang, and the format change that was discussed here:
>        https://reviews.llvm.org/D113496
>      I assume the kernel btf.rst might simply be outdated, but I want to be sure.
> 
> 3)  Is the ordering of multiple type tags on the same type important?
>      e.g. for this variable:
>          int __tag1 __tag2 __tag3 * b;
> 
>      would it be "correct" (or at least, acceptable) to produce:
>          VAR(b) -> ptr -> tag2 -> tag3 -> tag1 -> int
> 
>      or _must_ it be:
>          VAR(b) -> ptr -> tag3 -> tag2 -> tag1 -> int
> 
>      In the DWARF representation, all tags are equal sibling children of the type
>      they annotate, so this 'ordering' problem seems like it only arises because of
>      the BTF format for type tags.
> 
> 4)  Are types with the same tags in different orders considered distinct types?
>      I think the answer is "no", but given the format of the tags in BTF we get
>      distinct chains for the types, so I am curious.
>      e.g.
>          int __tag1 __tag2 * x;
>          int __tag2 __tag1 * y;
> 
>      produces
>          VAR(x) -> ptr -> tag2 -> tag1 -> int
>          VAR(y) -> ptr -> tag1 -> tag2 -> int
> 
>      but would
>          VAR(y) -> ptr -> tag2 -> tag1 -> int
> 
>      be just as correct?
> 
> 5)  According to the clang docs, type tags are currently ignored for non-pointer
>      types. Is pointer tagging e.g. '__user' the only use case so far?
> 
>      This GCC implementation allows type tags on non-pointer types. Such tags
>      can be represented in the DWARF but don't make much sense in BTF output,
>      e.g.
> 
>          struct __typetag1 S {
>              int a;
>              int b;
>          } __decltag1;
> 
>          struct S my_s;
> 
>      This will produce a type tag child DIE of S. In the current implementation,
>      it will also produce a BTF type tag type, which refers to the __decltag1 BTF
>      decl tag, which in turn refers to the struct type.  But nothing refers to
>      the type tag type; currently variable my_s in BTF refers to the struct type
>      directly.
> 
>      In my opinion, the DWARF here is useful but the BTF seems odd. What would be
>      "correct" BTF in such a case?
> 
> 6)  Would LLVM be open to changing the name of the attribute, for example to
>      'debug_info_annotate' (or any other suggestion)? The use cases for these
>      tags have grown (e.g. drgn) since they were originally proposed, and the
>      scope is no longer limited to BTF.
> 
>      The kernel eBPF developers have said they can accommodate whatever name we
>      would like to use. So although we in GCC are not tied to the name LLVM
>      uses, it would be ideal for everyone to use the same attribute name.
> 
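
Regarding question 4) above: if the answer is indeed "no", a consumer could
treat the tags on a type as an unordered set when comparing two types. The
sketch below shows only that idea. The names same_tag_set and MAX_TAGS are
hypothetical, and this is not a description of how the kernel, libbpf or
pahole compare or deduplicate types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TAGS 16

/* Return 1 if a and b hold the same tag strings regardless of order,
   0 otherwise. Duplicates are counted, so this compares multisets. */
static int same_tag_set(const char *const *a, size_t na,
                        const char *const *b, size_t nb)
{
    int matched[MAX_TAGS] = { 0 };
    size_t i, j;

    assert(na <= MAX_TAGS && nb <= MAX_TAGS);
    if (na != nb)
        return 0;
    for (i = 0; i < na; i++) {
        /* Look for a not-yet-matched entry of b equal to a[i]. */
        for (j = 0; j < nb; j++) {
            if (!matched[j] && strcmp(a[i], b[j]) == 0) {
                matched[j] = 1;
                break;
            }
        }
        if (j == nb)
            return 0;
    }
    return 1;
}
```

With this comparison the tag lists of x and y from question 4) would be
considered equal, whereas a direct comparison of the two BTF chains would not.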
> Thanks!
> 
> David
> 
> David Faust (8):
>    dwarf: Add dw_get_die_parent function
>    include: Add BTF tag defines to dwarf2 and btf
>    c-family: Add BTF tag attribute handlers
>    dwarf: create BTF decl and type tag DIEs
>    ctfc: Add support to pass through BTF annotations
>    dwarf2ctf: convert tag DIEs to CTF types
>    Output BTF DECL_TAG and TYPE_TAG types
>    testsuite: Add tests for BTF tags
> 
>   gcc/btfout.cc                                 |  28 +++++
>   gcc/c-family/c-attribs.cc                     |  45 +++++++
>   gcc/ctf-int.h                                 |  29 +++++
>   gcc/ctfc.cc                                   |  11 +-
>   gcc/ctfc.h                                    |  17 ++-
>   gcc/dwarf2ctf.cc                              | 115 +++++++++++++++++-
>   gcc/dwarf2out.cc                              | 110 +++++++++++++++++
>   gcc/dwarf2out.h                               |   1 +
>   .../gcc.dg/debug/btf/btf-decltag-func.c       |  18 +++
>   .../gcc.dg/debug/btf/btf-decltag-sou.c        |  34 ++++++
>   .../gcc.dg/debug/btf/btf-decltag-typedef.c    |  15 +++
>   .../gcc.dg/debug/btf/btf-typetag-1.c          |  20 +++
>   .../gcc.dg/debug/dwarf2/annotation-1.c        |  29 +++++
>   include/btf.h                                 |  17 ++-
>   include/dwarf2.def                            |   4 +
>   15 files changed, 482 insertions(+), 11 deletions(-)
>   create mode 100644 gcc/ctf-int.h
>   create mode 100644 gcc/testsuite/gcc.dg/debug/btf/btf-decltag-func.c
>   create mode 100644 gcc/testsuite/gcc.dg/debug/btf/btf-decltag-sou.c
>   create mode 100644 gcc/testsuite/gcc.dg/debug/btf/btf-decltag-typedef.c
>   create mode 100644 gcc/testsuite/gcc.dg/debug/btf/btf-typetag-1.c
>   create mode 100644 gcc/testsuite/gcc.dg/debug/dwarf2/annotation-1.c
>