[v2] Create script to convert old tests into Dwarf::assemble calls.
Checks
Context |
Check |
Description |
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_gdb_build--master-arm |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_gdb_check--master-arm |
success
|
Test passed
|
Commit Message
PR testsuite/32261 requests a script that could convert old .S-based
tests (that were made before dwarf.exp existed) into the new
Dwarf::assemble calls in Tcl. This commit is an initial implementation
of such a script. Python was chosen for convenience, and only relies on
a single external library.
Usage: the script operates not on .S files, but on ELF files with DWARF
information. To convert an old test, one must run said test via
`make check-gdb TESTS=testname` in their build directory. This will
produce, as a side effect, an ELF file the test used as an inferior, at
gdb/testsuite/outputs/testname/testname. This ELF file is this script's
input.
Reliability: not counting the limitations listed below, the script seems
functional enough to be worthy of discussion in the mailing list. I have
been testing it with different tests that already use Dwarf::assemble,
to see if the script can produce a similar call to it. Indeed, in most
cases that I've tested (including some more complex ones, marked with an
asterisk below) it is able to produce comparable output to the original
exp file. Of course, it can't reproduce the complex code *before* the
Dwarf::assemble call. Values calculated then are simply inlined.
The following .exp files have been tried in this way and their outputs
highly resemble the original:
- gdb.dwarf2/dynarr-ptr
- gdb.dwarf2/void-type
- gdb.dwarf2/ada-thick-pointer
- gdb.dwarf2/atomic-type
- gdb.dwarf2/dw2-entry-points (*)
- gdb.dwarf2/main-subprogram
The following .exp files work, with caveats addressed in the limitations
section below.
- gdb.dwarf2/cpp-linkage-name
- Works correctly except for one attribute of the form SPECIAL_expr.
- gdb.dwarf2/dw2-unusual-field-names
- Same as above, with two instances of SPECIAL_expr.
- gdb.dwarf2/implptr-optimized-out
- Same as above, with two instances of SPECIAL_expr.
- gdb.dwarf2/negative-data-member-location
- Same as above, with one instance of SPECIAL_expr.
The following .exp files FAIL, but that is expected:
- gdb.dwarf2/staticvirtual.exp
- high_pc and low_pc of subprogram "~S" are hardcoded in the original
.exp file, but they get replaced with a get_func_info call. Since
the function S::~S is not present in the original, get_func_info
will fail.
The following .exp files DO NOT WORK with this script:
- gdb.dwarf2/cu-no-addrs
- `aranges` not supported.
- Compile unit high_pc and low_pc hardcoded, prone to user error
due to forgetting to replace with variables.
- gdb.dwarf2/dw2-inline-stepping
- Same as above.
- gdb.dwarf2/fission-relative-dwo
- `fission` not supported.
- gdb.dwarf2/dw2-prologue-end and gdb.dwarf2/dw2-prologue-end-2
- Line tables not supported.
Currently the script has the following known limitations:
- It does not support line tables.
- It does not use $srcfile and other variables in the call to
Dwarf::assemble (since it can't know where it is safe to substitute).
- It does not support "fission" type DWARFs (in fact I still have no
clue what those are).
- It does not support cu {label LABEL} {} CUs, mostly because I couldn't
find the information using pyelftools.
- It sometimes outputs empty CUs at the start and end of the call. This
might be a problem with my machine, but I've checked with DWARF dumps
and they are indeed in the input ELF files generated by
`make check-gdb`.
- It does not support attributes with the forms DW_FORM_block* and
DW_FORM_exprloc. This is mostly not a concern of the difficulty of
the implementation, but of it being an incomplete feature and, thus,
more susceptible to users forgetting to correct its mistakes or
unfinished values (please see discussion started by Guinevere at
comment 23 https://sourceware.org/bugzilla/show_bug.cgi?id=32261#c23).
The incompleteness of this feature is easy to demonstrate: any call to
gdb_target_symbol, a common tool used in these attributes, needs a
symbol name that is erased after compilation. There is no way to guess
where that address being referenced in a DW_OP_addr comes from, and it
can't be hard coded since it can change depending on the machine
compiling it.
Also of note is that there is a large chunk of code left out in
comments. This is a partial implementation of SPECIAL_expr attribute
decoding. Since it is not complete nor functional, I've opted to leave
it in the commit in hopes that it will be useful for a future
implementation.
Please bring up any further shortcomings this script may have with your
expectations.
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=32261
---
gdb/contrib/dwarf-to-dwarf-assembler.py | 841 ++++++++++++++++++++++++
1 file changed, 841 insertions(+)
create mode 100755 gdb/contrib/dwarf-to-dwarf-assembler.py
Comments
On 1/2/25 5:23 PM, William Ferreira wrote:
> PR testsuite/32261 requests a script that could convert old .S-based
> tests (that were made before dwarf.exp existed) into the new
> Dwarf::assemble calls in Tcl. This commit is an initial implementation
> of such a script. Python was chosen for convenience, and only relies on
> a single external library.
>
> Usage: the script operates not on .S files, but on ELF files with DWARF
> information. To convert an old test, one must run said test via
> `make check-gdb TESTS=testname` in their build directory. This will
> produce, as a side effect, an ELF file the test used as an inferior, at
> gdb/testsuite/outputs/testname/testname. This ELF file is this script's
> input.
>
> Reliability: not counting the limitations listed below, the script seems
> functional enough to be worthy of discussion in the mailing list. I have
> been testing it with different tests that already use Dwarf::assemble,
> to see if the script can produce a similar call to it. Indeed, in most
> cases that I've tested (including some more complex ones, marked with an
> asterisk below) it is able to produce comparable output to the original
> exp file. Of course, it can't reproduce the complex code *before* the
> Dwarf::assemble call. Values calculated then are simply inlined.
I just thought of something. Have you tried this with any test that
actually uses one of those .S tests, and seeing if it still works?
Having them work on existing dwarf assembler is good, but the idea is
converting old stuff and maybe making new tests, so I think it'd be nice
to have one or two tests where you know it works. You don't need to
necessarily submit the updated tests, but just as a smoke test, you know?
Apart from this, I have a comment that is mostly personal preference,
inlined.
>
> The following .exp files have been tried in this way and their outputs
> highly resemble the original:
> - gdb.dwarf2/dynarr-ptr
> - gdb.dwarf2/void-type
> - gdb.dwarf2/ada-thick-pointer
> - gdb.dwarf2/atomic-type
> - gdb.dwarf2/dw2-entry-points (*)
> - gdb.dwarf2/main-subprogram
>
> The following .exp files work, with caveats addressed in the limitations
> section below.
> - gdb.dwarf2/cpp-linkage-name
> - Works correctly except for one attribute of the form SPECIAL_expr.
> - gdb.dwarf2/dw2-unusual-field-names
> - Same as above, with two instances of SPECIAL_expr.
> - gdb.dwarf2/implptr-optimized-out
> - Same as above, with two instances of SPECIAL_expr.
> - gdb.dwarf2/negative-data-member-location
> - Same as above, with one instance of SPECIAL_expr.
>
> The following .exp files FAIL, but that is expected:
> - gdb.dwarf2/staticvirtual.exp
> - high_pc and low_pc of subprogram "~S" are hardcoded in the original
> .exp file, but they get replaced with a get_func_info call. Since
> the function S::~S is not present in the original, get_func_info
> will fail.
>
> The following .exp files DO NOT WORK with this script:
> - gdb.dwarf2/cu-no-addrs
> - `aranges` not supported.
> - Compile unit high_pc and low_pc hardcoded, prone to user error
> due to forgetting to replace with variables.
> - gdb.dwarf2/dw2-inline-stepping
> - Same as above.
> - gdb.dwarf2/fission-relative-dwo
> - `fission` not supported.
> - gdb.dwarf2/dw2-prologue-end and gdb.dwarf2/dw2-prologue-end-2
> - Line tables not supported.
>
> Currently the script has the following known limitations:
> - It does not support line tables.
> - It does not use $srcfile and other variables in the call to
> Dwarf::assemble (since it can't know where it is safe to substitute).
> - It does not support "fission" type DWARFs (in fact I still have no
> clue what those are).
> - It does not support cu {label LABEL} {} CUs, mostly because I couldn't
> find the information using pyelftools.
> - It sometimes outputs empty CUs at the start and end of the call. This
> might be a problem with my machine, but I've checked with DWARF dumps
> and they are indeed in the input ELF files generated by
> `make check-gdb`.
> - It does not support attributes with the forms DW_FORM_block* and
> DW_FORM_exprloc. This is mostly not a concern of the difficulty of
> the implementation, but of it being an incomplete feature and, thus,
> more susceptible to users forgetting to correct its mistakes or
> unfinished values (please see discussion started by Guinevere at
> comment 23 https://sourceware.org/bugzilla/show_bug.cgi?id=32261#c23).
> The incompleteness of this feature is easy to demonstrate: any call to
> gdb_target_symbol, a common tool used in these attributes, needs a
> symbol name that is erased after compilation. There is no way to guess
> where that address being referenced in a DW_OP_addr comes from, and it
> can't be hard coded since it can change depending on the machine
> compiling it.
>
> Also of note is that there is a large chunk of code left out in
> comments. This is a partial implementation of SPECIAL_expr attribute
> decoding. Since it is not complete nor functional, I've opted to leave
> it in the commit in hopes that it will be useful for a future
> implementation.
>
> Please bring up any further shortcomings this script may have with your
> expectations.
>
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=32261
> ---
> gdb/contrib/dwarf-to-dwarf-assembler.py | 841 ++++++++++++++++++++++++
> 1 file changed, 841 insertions(+)
> create mode 100755 gdb/contrib/dwarf-to-dwarf-assembler.py
>
> diff --git a/gdb/contrib/dwarf-to-dwarf-assembler.py b/gdb/contrib/dwarf-to-dwarf-assembler.py
> new file mode 100755
> index 00000000000..817d6581ec8
> --- /dev/null
> +++ b/gdb/contrib/dwarf-to-dwarf-assembler.py
> @@ -0,0 +1,841 @@
> +#!/usr/bin/env python3
> +# Copyright 2024 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program. If not, see <http://www.gnu.org/licenses/>.
> +
> +# Due to the pyelftools dependency, this script requires Python version
> +# 3.10 or greater to run.
> +
> +"""A utility to convert ELF files with DWARF info to Dwarf::assemble code.
> +
> +Usage:
> + python ./asm_to_dwarf_assembler.py <path/to/elf/file>
> +
> +Dependencies:
> + Python >= 3.10
> + pyelftools >= 0.31
> +
> +Notes:
> +- Line tables are not currently supported.
> +- Non-contiguous subprograms are not currently supported.
> +- If you want to use $srcfile or similar, you must edit the references to the
> + file name manually, including DW_AT_name attributes on compile units.
> +- If run with binaries generated by make check-gdb, it may include an
> + additional compile_unit before and after the actual compile units. This is
> + an artifact of the normal compilation process, as these CUs are indeed in
> + the generated DWARF in some cases.
> +"""
> +
> +import sys
> +import re
> +import errno
> +
> +from io import BytesIO, IOBase
> +from typing import Optional, Annotated
> +from functools import cache
> +from dataclasses import dataclass
> +from copy import copy
> +from datetime import datetime
> +from logging import getLogger
> +
> +from elftools.elf.elffile import ELFFile
> +from elftools.dwarf.compileunit import CompileUnit as RawCompileUnit
> +from elftools.dwarf.die import DIE as RawDIE, AttributeValue
> +
> +
> +logger = getLogger(__file__)
> +
> +
> +# While these aren't supported, their detection is important for replacing them
> +# with SPECIAL_expr and for writing the placeholder {MANUAL} expr list.
> +EXPR_ATTRIBUTE_FORMS = [
> + "DW_FORM_exprloc",
> + "DW_FORM_block",
> + "DW_FORM_block1",
> + "DW_FORM_block2",
> + "DW_FORM_block4",
> +]
> +
> +
> +# Again, these aren't supported, but I already transcribed the relevant table
> +# from the DWARF5 spec so it felt like a waste to erase it. It will be very
> +# useful for the future if someone wants to spend the time to implement it.
> +# NOTE: this dataclass is incomplete, because it doesn't take into consideration
> +# the length of each of its operands.
> +
> +# @dataclass
> +# class DWARFOperation:
> +# """Values of the form DW_OP_<name>."""
> +# suffix: Annotated[str, 'Part of the name that succeeds "DW_OP_".']
> +# code: Annotated[int, "Numeric representation of the OP."]
> +# num_operands: Annotated[int, "Number of operands this operator expects."]
> +
> +
> +# DWARF_OPERATIONS: Annotated[
> +# dict[int, DWARFOperation],
> +# "Table of all DWARF operations as per table 7.9 of the DWARF5 spec."
> +# ] = {}
> +
> +
> +# def _register_op(suffix: str, code: int, num_operands: int) -> None:
> +# assert code not in DWARF_OPERATIONS, "Duplicate operation code found."
> +# DWARF_OPERATIONS[code] = DWARFOperation(suffix, code, num_operands)
> +
> +
> +# def _register_all_ops() -> None:
> +# """Register all operations as per table 7.9 of the DWARF5 spec.
> +#
> +# Currently these operations are not actively used by the script, as
> +# SPECIAL_expr form attributes are not implemented.
> +# """
> +#
> +# # Op codes 0x01 and 0x02 are reserved.
> +#
> +# _register_op("addr", 0x03, 1)
> +#
> +# # Op codes 0x04 and 0x05 are reserved.
> +#
> +# _register_op("deref", 0x06, 0)
> +#
> +# # Op code 0x07 is reserved.
> +#
> +# _register_op("const1u", 0x08, 1)
> +# _register_op("const1s", 0x09, 1)
> +# _register_op("const2u", 0x0a, 1)
> +# _register_op("const2s", 0x0b, 1)
> +# _register_op("const4u", 0x0c, 1)
> +# _register_op("const4s", 0x0d, 1)
> +# _register_op("const8u", 0x0e, 1)
> +# _register_op("const8s", 0x0f, 1)
> +# _register_op("constu", 0x10, 1)
> +# _register_op("consts", 0x11, 1)
> +#
> +# _register_op("dup", 0x12, 0)
> +# _register_op("drop", 0x13, 0)
> +# _register_op("over", 0x14, 0)
> +# _register_op("pick", 0x15, 1)
> +# _register_op("swap", 0x16, 0)
> +# _register_op("rot", 0x17, 0)
> +# _register_op("xderef", 0x18, 0)
> +#
> +# _register_op("abs", 0x19, 0)
> +# _register_op("and", 0x1a, 0)
> +# _register_op("div", 0x1b, 0)
> +# _register_op("minus", 0x1c, 0)
> +# _register_op("mod", 0x1d, 0)
> +# _register_op("mul", 0x1e, 0)
> +# _register_op("neg", 0x1f, 0)
> +# _register_op("not", 0x20, 0)
> +# _register_op("or", 0x21, 0)
> +# _register_op("plus", 0x22, 0)
> +# _register_op("plus_uconst", 0x23, 1)
> +# _register_op("shl", 0x24, 0)
> +# _register_op("shr", 0x25, 0)
> +# _register_op("shra", 0x26, 0)
> +# _register_op("xor", 0x27, 0)
> +# _register_op("bra", 0x28, 1)
> +# _register_op("eq", 0x29, 0)
> +# _register_op("ge", 0x2a, 0)
> +# _register_op("gt", 0x2b, 0)
> +# _register_op("le", 0x2c, 0)
> +# _register_op("lt", 0x2d, 0)
> +# _register_op("ne", 0x2e, 0)
> +# _register_op("skip", 0x2f, 0)
> +#
> +# for lit_nr in range(32):
> +# _register_op(f"lit{lit_nr}", 0x30 + lit_nr, 0)
> +#
> +# for reg_nr in range(32):
> +# _register_op(f"reg{reg_nr}", 0x50 + reg_nr, 0)
> +#
> +# for breg_nr in range(32):
> +# _register_op(f"breg{breg_nr}", 0x70 + breg_nr, 0)
> +#
> +# _register_op("regx", 0x90, 1)
> +# _register_op("fbregx", 0x91, 1)
> +# _register_op("bregx", 0x92, 2)
> +# _register_op("piece", 0x93, 1)
> +# _register_op("deref_size", 0x94, 1)
> +# _register_op("xderef_size", 0x95, 1)
> +# _register_op("nop", 0x96, 0)
> +# _register_op("push_object_address", 0x97, 0)
> +#
> +# _register_op("call2", 0x98, 1)
> +# _register_op("call4", 0x99, 1)
> +# _register_op("call_ref", 0x9a, 1)
> +# _register_op("form_tls_address", 0x9b, 0)
> +# _register_op("call_frame_cfa", 0x9c, 0)
> +#
> +# _register_op("bit_piece", 0x9d, 2)
> +# _register_op("implicit_value", 0x9e, 2)
> +# _register_op("stack_value", 0x9f, 0)
> +# _register_op("implicit_pointer", 0xa0, 2)
> +# _register_op("addrx", 0xa1, 1)
> +# _register_op("constx", 0xa2, 1)
> +#
> +# _register_op("entry_value", 0xa3, 2)
> +# _register_op("const_type", 0xa4, 3)
> +# _register_op("regval_type", 0xa5, 2)
> +# _register_op("deref_type", 0xa6, 2)
> +# _register_op("xderef_type", 0xa7, 2)
> +#
> +# _register_op("convert", 0xa8, 1)
> +# _register_op("reinterpret", 0xa9, 1)
> +
> +
> +# _register_all_ops()
> +
> +
> +# Workaround for my editor not to freak out over unclosed braces.
> +lbrace, rbrace = "{", "}"
> +
> +
> +@cache
> +def get_indent_str(indent_count: int) -> str:
> + """Get whitespace string to prepend to another for indenting."""
> + indent = (indent_count // 2) * "\t"
> + if indent_count % 2 == 1:
> + indent += " "
> + return indent
> +
> +
> +def indent(line: str, indent_count: int) -> str:
> + """Indent line by indent_count levels."""
> + return get_indent_str(indent_count) + line
> +
> +
> +def labelify_str(s: str) -> str:
> + """Make s appropriate for a label name."""
> + # Replace "*" with the literal word "ptr".
> + s = s.replace("*", "ptr")
> +
> + # Replace any non-"word" characters by "_".
> + s = re.sub(r"\W", "_", s)
> +
> + # Remove consecutive "_"s.
> + s = re.sub(r"__+", "_", s)
> +
> + return s
> +
> +
> +class DWARFAttribute:
> + """Storage unit for a single DWARF attribute.
> +
> + All its values are strings that are usually passed on
> + directly to format. The exceptions to this are attributes
> + with int values with DW_FORM_ref4 or DW_FORM_ref_addr form.
> + Their values are interpreted as the global offset of the DIE
> + being referenced, which are looked up dynamically to fetch
> + their labels.
> + """
> + def __init__(
> + self,
> + die_offset: int,
> + name: str,
> + value: str | bytes | int | bool,
> + form=None,
> + ):
> + self.die_offset = die_offset
> + self.name = name
> + self.value = value
> + self.form = form
> +
> + def _format_expr_value(self) -> str:
> + # This code is left here in case this is feature is desired in the
> + # future. It is partly functional, being able to decode operations
> + # but not operands due to their variable length in bytes.
> +
> + # self.form = "SPECIAL_expr"
> + # lines = [lbrace]
> + # curr_idx = 0
> + # while curr_idx < len(self.value):
> + # curr_op_code = self.value[curr_idx]
> + #
> + # try:
> + # curr_op: DWARFOperation = dwarf_operations[curr_op_code]
> + # except KeyError as err:
> + # raise KeyError(
> + # f"Unknown op: {curr_op_code}, item {curr_idx} of"
> + # f" attribute {self.name}, DIE at offset"
> + # f" {self.die_offset}."
> + # ) from err
> + # # Empty arg here so it inserts a space before the first argument,
> + # # only if there is indeed a first argument.
> + # args = [""]
> + # for arg_nr in range(curr_op.num_operands):
> + # curr_idx += 1
> + # # TODO decode bytes according to specific operation.
> + # # The following line only reads the next BYTE of the sequence,
> + # # and needs to be adjusted to read the correct number of
> + # # bytes.
> + # # args.append(str(self.value[curr_idx]))
> + # op_line = "DW_OP_" + curr_op.suffix + " ".join(args)
> + # lines.append(op_line)
> + # curr_idx += 1
> + #
> + # lines.append(rbrace)
> + # return "\n".join(lines)
> + self.form = "SPECIAL_expr"
> + return "{ MANUAL: Fill expr list }"
> +
> + def _needs_escaping(self, str_value: str) -> bool:
> + charset = set(str_value)
> + return bool(charset.intersection({"{", "}", " ", "\t"}))
> +
> + def _format_str(self, str_value: str) -> str:
> + if self._needs_escaping(str_value):
> + escaped_str = str(str_value)
> + # Replace single escape (which is itself escaped because of regex)
> + # with a double escape (which doesn't mean anything to regex so
> + # it doesn't need escaping).
> + escaped_str = re.sub(r"\\", r"\\", escaped_str)
> + escaped_str = re.sub("([{}])", r"\\\1", escaped_str)
> + return "{" + escaped_str + "}"
> + else:
> + return str_value
> +
> + def _format_value(
> + self,
> + offset_die_lookup: dict[int, "DWARFDIE"],
> + indent_count: int = 0
> + ) -> str:
> + if self.form in EXPR_ATTRIBUTE_FORMS:
> + return self._format_expr_value()
> + elif isinstance(self.value, bool):
> + return str(int(self.value))
> + elif isinstance(self.value, int):
> + if self.form == "DW_FORM_ref4":
> + # ref4-style referencing label.
> + die: "DWARFDIE" = offset_die_lookup[self.value]
> + return ":$" + die.tcl_label
> + elif self.form == "DW_FORM_ref_addr":
> + # ref_addr-style referencing label.
> + die: "DWARFDIE" = offset_die_lookup[self.value]
> + return "%$" + die.tcl_label
> + else:
> + return str(self.value)
> + elif isinstance(self.value, bytes):
> + return self._format_str(self.value.decode("ascii"))
> + elif isinstance(self.value, str):
> + return self._format_str(self.value)
> + else:
> + raise NotImplementedError(
> + "Unknown data type: " + str(type(self.value))
> + )
> +
> + def format(
> + self,
> + offset_die_lookup: dict[int, "DWARFDIE"],
> + indent_count: int = 0
> + ) -> str:
> + """Format the attribute in the form {name value form}.
> +
> + If form is DW_FORM_exprloc or DW_FORM_block, see next section on
> + DWARFOperations.
> +
> + If it isn't, value is formatted as follows:
> + If bool, use "1" if True, "0" if False.
> + If int:
> + If form is DW_FORM_ref4, use ":$label" where label is the
> + tcl_label of the DWARFDIE at offset "value".
> + If form is DW_FORM_ref_addr, use "%$label" where label is
> + the tcl_label of the DWARFDIE at offset "value".
> + Else, use value directly.
> + If bytes, use value.decode("ascii")
> + If str, use value directly.
> + Any other type results in a NotImplementedError being raised.
> +
> + Regarding DW_FORM_exprloc and DW_FORM_block:
> + The form is replaced with SPECIAL_expr.
> + The entries in the value are interpreted and decoded using the
> + dwarf_operations dictionary, and replaced with their names where
> + applicable.
> + """
> + s = lbrace
> + s += self.name + " "
> + s += self._format_value(offset_die_lookup)
> +
> + # Only explicitly state form if it's not a reference.
> + if self.form not in [None, "DW_FORM_ref4", "DW_FORM_ref_addr"]:
> + s += " " + self.form
> +
> + s += rbrace
> + return indent(s, indent_count)
> +
> +
> +class DWARFDIE:
> + """This script's parsed version of a RawDIE."""
> + def __init__(
> + self,
> + offset: int,
> + tag: str,
> + attrs: dict[str, DWARFAttribute],
> + tcl_label: Optional[str] = None
> + ):
> + self.offset: Annotated[
> + int,
> + "Global offset of the DIE."
> + ] = offset
> + self.tag: Annotated[
> + str,
> + "DWARF tag for this DIE."
> + ] = tag
> + self.attrs: Annotated[
> + dict[str, DWARFAttribute],
> + "Dict of attributes for this DIE."
> + ] = copy(attrs)
> + self.children: Annotated[
> + list[DWARFDIE],
> + "List of child DIEs of this DIE."
> + ] = []
> + self.tcl_label: Annotated[
> + str,
> + "Label used by the Tcl code to reference this DIE, if any. These "
> + 'take the form of "label: " before the actual DIE definition.'
> + ] = tcl_label
> +
> + def format_lines(
> + self,
> + offset_die_lookup: dict[int, "DWARFDIE"],
> + indent_count: int = 0
> + ) -> list[str]:
> + """Get the list of lines that represent this DIE in Dwarf assembler."""
> + die_lines = []
> +
> + # Prepend label to first line, if it's set.
> + if self.tcl_label:
> + first_line_start = self.tcl_label + ": "
> + else:
> + first_line_start = ""
> +
> + # First line, including label.
> + first_line = indent(
> + first_line_start + self.tag + " " + lbrace,
> + indent_count
> + )
> + die_lines.append(first_line)
> +
> + # Format attributes, if any.
> + if self.attrs:
> + for attr_name, attr in self.attrs.items():
> + attr_line = attr.format(
> + offset_die_lookup,
> + indent_count=indent_count + 1
> + )
> + die_lines.append(attr_line)
> + die_lines.append(indent(rbrace, indent_count))
> + else:
> + # Don't create a new line, just append and immediately close the
> + # brace on the last line.
> + die_lines[-1] += rbrace
> +
> + # Format children, if any.
> + if self.children:
> + # Only open a new brace if there are any children for the
> + # current DIE.
> + die_lines[-1] += " " + lbrace
> + for child in self.children:
> + child_lines = child.format_lines(
> + offset_die_lookup,
> + indent_count=indent_count + 1
> + )
> + die_lines.extend(child_lines)
> + die_lines.append(indent(rbrace, indent_count))
> +
> + return die_lines
> +
> + def format(
> + self,
> + offset_die_lookup: dict[int, "DWARFDIE"],
> + indent_count: int = 0
> + ) -> str:
> + """Join result from format_lines into a single str."""
> + return "\n".join(self.format_lines(offset_die_lookup, indent_count))
> +
> + def name(self) -> Optional[str]:
> + """Get DW_AT_name (if present) decoded as ASCII."""
> + raw_value = self.attrs.get("DW_AT_name")
> + if raw_value is None:
> + return None
> + else:
> + return raw_value.value.decode("ascii")
> +
> + def type_name(self) -> str:
> + """Name of Dwarf tag, with the "DW_TAG_" prefix removed."""
> + return re.sub("DW_TAG_", "", self.tag)
> +
> +
> +class DWARFCompileUnit(DWARFDIE):
> + """Wrapper subclass for CU DIEs.
> +
> + This is necessary due to the special format CUs take in Dwarf::assemble.
> +
> + Instead of simply:
> + DW_TAG_compile_unit {
> + <attributes>
> + } {
> + <children>
> + }
> +
> + CUs are formatted as:
> + cu { <cu_special_vars> } {
> + DW_TAG_compile_unit {
> + <attributes>
> + } {
> + <children>
> + }
> + }
> + """
> +
> + # Default value for parameter is_64 defined in dwarf.exp line 1553.
> + # This value is converted to 0/1 automatically when emitting
> + # Dwarf::assemble code.
> + default_is_64 = False
> +
> + # Default value for parameter dwarf_version defined in dwarf.exp line 1552.
> + default_dwarf_version = 4
> +
> + # Default value for parameter is_fission defined in dwarf.exp line 1556.
> + # Currently not implemented, see comment below.
> + # default_is_fission = False
> +
> + # Tag that signifies a DIE is a compile unit.
> + compile_unit_tag = "DW_TAG_compile_unit"
> +
> + def __init__(
> + self,
> + raw_die: RawDIE,
> + raw_cu: RawCompileUnit,
> + attrs: dict[str, DWARFAttribute],
> + ):
> + """Initialize additional instance variables for CU encoding.
> +
> + The additional instance variables are:
> + - is_64_bit: bool
> + Whether this CU is 64 bit or not.
> + - dwarf_version: int
> + default DWARFCompileUnit.default_dwarf_version
> + Version of DWARF this CU is using.
> + - addr_size: Optional[int]
> + default None
> + Size of an address in bytes.
> +
> + These variables are used to configure the first parameter of the cu
> + proc (which contains calls to the compile_unit proc in the body of
> + Dwarf::assemble).
> + """
> + super().__init__(
> + raw_die.offset,
> + DWARFCompileUnit.compile_unit_tag,
> + attrs
> + )
> + self.raw_cu = raw_cu
> + self.dwarf_version: int = raw_cu.header.get(
> + "version",
> + DWARFCompileUnit.default_dwarf_version
> + )
> + self.addr_size: Optional[int] = raw_cu.header.get("address_size")
> + self.is_64_bit: bool = raw_cu.dwarf_format() == 64
> +
> + # Fission is not currently implemented because I don't know where to
> + # fetch this information from.
> + # self.is_fission: bool = self.default_is_fission
> +
> + # CU labels are not currently implemented because I haven't found where
> + # pyelftools exposes this information.
> + # self.cu_label: Optional[str] = None
> +
> + def format_lines(
> + self,
> + offset_die_lookup: dict[int, DWARFDIE],
> + indent_count: int = 0,
> + ) -> list[str]:
> + lines = []
> + lines.append(self._get_header(indent_count))
> + inner_lines = super().format_lines(
> + offset_die_lookup,
> + indent_count + 1
> + )
> + lines += inner_lines
> + lines.append(indent(rbrace, indent_count))
> + return lines
> +
> + def _get_header(self, indent_count: int = 0) -> str:
> + """Assemble the first line of the surrounding 'cu {} {}' proc call."""
> + header = indent("cu " + lbrace, indent_count)
> + cu_params = []
> +
> + if self.is_64_bit != DWARFCompileUnit.default_is_64:
> + # Convert from True/False to 1/0.
> + param_value = int(self.is_64_bit)
> + cu_params += ["is_64", str(param_value)]
> +
> + if self.dwarf_version != DWARFCompileUnit.default_dwarf_version:
> + cu_params += ["version", str(self.dwarf_version)]
> +
> + if self.addr_size is not None:
> + cu_params += ["addr_size", str(self.addr_size)]
> +
> + # Fission is not currently implemented, see comment above.
> + # if self.is_fission != DWARFCompileUnit.default_is_fission:
> + # # Same as is_64_bit conversion, True/False -> 1/0.
> + # param_value = int(self.is_fission)
> + # cu_params += ["fission", str(param_value)]
> +
> + # CU labels are not currently implemented, see commend above.
> + # if self.cu_label is not None:
> + # cu_params += ["label", self.cu_label]
> +
> + if cu_params:
> + header += " ".join(cu_params)
> +
> + header += rbrace + " " + lbrace
> + return header
> +
> +
> +class DWARFParser:
> + """Converter from pyelftools's DWARF representation to this script's."""
> +
> + def __init__(self, elf_file: IOBase):
> + """Init parser with file opened in binary mode.
> +
> + File can be closed after this function is called.
> + """
> + self.raw_data = BytesIO(elf_file.read())
> + self.elf_data = ELFFile(self.raw_data)/
> + self.dwarf_info = self.elf_data.get_dwarf_info()
> + self.offset_to_die: dict[int, DWARFDIE] = {}
> + self.label_to_die: dict[str, DWARFDIE] = {}
> + self.referenced_offsets: Annotated[
> + set[int],
> + "The set of all offsets that were referenced by some DIE."
> + ] = set()
> + self.raw_cu_list: list[RawCompileUnit] = []
> + self.top_level_dies: list[DWARFDIE] = []
> + self.subprograms: list[DWARFDIE] = []
> + self.taken_labels: set[str] = set()
> +
> + self._read_all_cus()
> + self._create_necessary_labels()
> +
> + def _read_all_cus(self):
> + """Populate self.raw_cu_list with all CUs in self.dwarf_info."""
> + for cu in self.dwarf_info.iter_CUs():
> + self._read_cu(cu)
> +
> + def _read_cu(self, raw_cu: RawCompileUnit):
> + """Read a compile_unit into self.cu_list."""
> + self.raw_cu_list.append(raw_cu)
> + for raw_die in raw_cu.iter_DIEs():
> + if not raw_die.is_null():
> + self._parse_die(raw_cu, raw_die)
> +
> + def _parse_die(self, die_cu: RawCompileUnit, raw_die: RawDIE) -> DWARFDIE:
> + """Process a single DIE and add it to offset_to_die.
> +
> + Look for DW_FORM_ref4 and DWD_FORM_ref_addr form attributes and replace
> + them with the global offset of the referenced DIE, and adding the
> + referenced DIE to a set. This will be used later to assign and use
> + labels only to DIEs that need it.
> +
> + In case the DIE is a top-level DIE, add it to self.top_level_dies.
> +
> + In case the DIE is a subprogram, add it to self.subprograms and call
> + self._use_vars_for_low_and_high_pc_attr with it.
> + """
> + processed_attrs = {}
> + attr_value: AttributeValue
> + for attr_name, attr_value in raw_die.attributes.items():
> + actual_value = attr_value.value
> + if attr_value.form in ("DW_FORM_ref4", "DW_FORM_ref_addr"):
> + referenced_die = raw_die.get_DIE_from_attribute(attr_name)
> + actual_value = referenced_die.offset
> + self.referenced_offsets.add(referenced_die.offset)
> +
> + processed_attrs[attr_name] = DWARFAttribute(
> + raw_die.offset,
> + attr_name,
> + actual_value,
> + attr_value.form
> + )
> +
> + if raw_die.tag == DWARFCompileUnit.compile_unit_tag:
> + processed_die = DWARFCompileUnit(raw_die, die_cu, processed_attrs)
> + else:
> + processed_die = DWARFDIE(
> + raw_die.offset,
> + raw_die.tag,
> + processed_attrs,
> + None
> + )
> +
> + if raw_die.get_parent() is None:
> + # Top level DIE
> + self.top_level_dies.append(processed_die)
> + else:
> + # Setting the parent here assumes the parent was already processed
> + # prior to this DIE being found.
> + # As far as I'm aware, this is always true in DWARF.
> + processed_parent = self.offset_to_die[raw_die.get_parent().offset]
> + processed_parent.children.append(processed_die)
> +
> + if processed_die.tag == "DW_TAG_subprogram":
> + self.subprograms.append(processed_die)
> + self._use_vars_for_low_and_high_pc_attr(processed_die)
> +
> + self.offset_to_die[processed_die.offset] = processed_die
> + return processed_die
> +
> + def _create_necessary_labels(self):
> + """Create labels to DIEs that were referenced by others."""
> + for offset in self.referenced_offsets:
> + die = self.offset_to_die[offset]
> + self._create_label_for_die(die)
> +
> + def _use_vars_for_low_and_high_pc_attr(self, subprogram: DWARFDIE) -> None:
> + """Replace existing PC attributes with Tcl variables.
> +
> + If DW_AT_low_pc exists for this DIE, replace it with accessing the
> + variable whose name is given by self.subprogram_start_var(subprogram).
> +
> + If DW_AT_high_pc exists for this DIE, replace it with accessing the
> + variable whose name is given by self.subprogram_end_var(subprogram).
> + """
> + low_pc_attr_name = "DW_AT_low_pc"
> + if low_pc_attr_name in subprogram.attrs:
> + start = self.subprogram_start_var(subprogram)
> + subprogram.attrs[low_pc_attr_name].value = start
> +
> + high_pc_attr_name = "DW_AT_high_pc"
> + if high_pc_attr_name in subprogram.attrs:
> + end = self.subprogram_end_var(subprogram)
> + subprogram.attrs[high_pc_attr_name].value = end
> +
> + def _create_label_for_die(self, die: DWARFDIE) -> None:
> + """Set tcl_label to a unique string among other DIEs for this parser.
> +
> + As a first attempt, use labelify(die.name()). If the DIE does not have
> + a name, use labelify(die.type_name()).
> +
> + If the chosen initial label is already taken, try again appending "_2".
> + While the attempt is still taken, try again replacing it with "_3", then
> + "_4", and so on.
> +
> + This function also creates an entry on self.label_to_die.
> + """
> + if die.tcl_label is not None:
> + return
> +
> + label = labelify_str(die.name() or die.type_name())
> +
> + # Deduplicate label in case of collision
> + if label in self.taken_labels:
> + suffix_nr = 2
> +
> + # Walrus operator to prevent writing the assembled label_suffix
> + # string literal twice. This could be rewritten by copying the
> + # string literal to the line after the end of the while loop,
> + # but I deemed it would be too frail in case one of them needs
> + # to be changed and the other is forgotten.
> + while (new_label := f"{label}_{suffix_nr}") in self.taken_labels:
> + suffix_nr += 1
> + label = new_label
> +
> + die.tcl_label = label
> + self.label_to_die[label] = die
> + self.taken_labels.add(label)
> +
> + def subprogram_start_var(self, subprogram: DWARFDIE) -> str:
> + """Name of the Tcl variable that holds the low PC for a subprogram."""
> + return f"${subprogram.name()}_start"
> +
> + def subprogram_end_var(self, subprogram: DWARFDIE) -> str:
> + """Name of the Tcl variable that holds the high PC for a subprogram."""
> + return f"${subprogram.name()}_end"
> +
> + def all_labels(self) -> set[str]:
> + """Get a copy of the set of all labels known to the parser so far."""
> + return copy(self.taken_labels)
> +
> +
> +class DWARFAssemblerGenerator:
> + """Class that generates Dwarf::assemble code out of a DWARFParser."""
> +
> + def __init__(self, dwarf_parser: DWARFParser, output=sys.stdout):
I just noticed that you have the optional output stream here, but you
don't use it. I suppose this is where argparse was meant to be used,
that you included in version 1 but not here, right?
Personally I think we either should have this be useful, or directly set
self.output to sys.stdout and future improvements to the script would
make use of this.
I don't think this alone warrants sending a v3 yet though, let's give
the other people in the list some time to look at v2 :)
new file mode 100755
@@ -0,0 +1,841 @@
+#!/usr/bin/env python3
+# Copyright 2024 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+# Due to the pyelftools dependency, this script requires Python version
+# 3.10 or greater to run.
+
+"""A utility to convert ELF files with DWARF info to Dwarf::assemble code.
+
+Usage:
+ python ./asm_to_dwarf_assembler.py <path/to/elf/file>
+
+Dependencies:
+ Python >= 3.10
+ pyelftools >= 0.31
+
+Notes:
+- Line tables are not currently supported.
+- Non-contiguous subprograms are not currently supported.
+- If you want to use $srcfile or similar, you must edit the references to the
+ file name manually, including DW_AT_name attributes on compile units.
+- If run with binaries generated by make check-gdb, it may include an
+ additional compile_unit before and after the actual compile units. This is
+ an artifact of the normal compilation process, as these CUs are indeed in
+ the generated DWARF in some cases.
+"""
+
+import sys
+import re
+import errno
+
+from io import BytesIO, IOBase
+from typing import Optional, Annotated
+from functools import cache
+from dataclasses import dataclass
+from copy import copy
+from datetime import datetime
+from logging import getLogger
+
+from elftools.elf.elffile import ELFFile
+from elftools.dwarf.compileunit import CompileUnit as RawCompileUnit
+from elftools.dwarf.die import DIE as RawDIE, AttributeValue
+
+
+logger = getLogger(__file__)
+
+
+# While these aren't supported, their detection is important for replacing them
+# with SPECIAL_expr and for writing the placeholder {MANUAL} expr list.
+EXPR_ATTRIBUTE_FORMS = [
+ "DW_FORM_exprloc",
+ "DW_FORM_block",
+ "DW_FORM_block1",
+ "DW_FORM_block2",
+ "DW_FORM_block4",
+]
+
+
+# Again, these aren't supported, but I already transcribed the relevant table
+# from the DWARF5 spec so it felt like a waste to erase it. It will be very
+# useful for the future if someone wants to spend the time to implement it.
+# NOTE: this dataclass is incomplete, because it doesn't take into consideration
+# the length of each of its operands.
+
+# @dataclass
+# class DWARFOperation:
+# """Values of the form DW_OP_<name>."""
+# suffix: Annotated[str, 'Part of the name that succeeds "DW_OP_".']
+# code: Annotated[int, "Numeric representation of the OP."]
+# num_operands: Annotated[int, "Number of operands this operator expects."]
+
+
+# DWARF_OPERATIONS: Annotated[
+# dict[int, DWARFOperation],
+# "Table of all DWARF operations as per table 7.9 of the DWARF5 spec."
+# ] = {}
+
+
+# def _register_op(suffix: str, code: int, num_operands: int) -> None:
+# assert code not in DWARF_OPERATIONS, "Duplicate operation code found."
+# DWARF_OPERATIONS[code] = DWARFOperation(suffix, code, num_operands)
+
+
+# def _register_all_ops() -> None:
+# """Register all operations as per table 7.9 of the DWARF5 spec.
+#
+# Currently these operations are not actively used by the script, as
+# SPECIAL_expr form attributes are not implemented.
+# """
+#
+# # Op codes 0x01 and 0x02 are reserved.
+#
+# _register_op("addr", 0x03, 1)
+#
+# # Op codes 0x04 and 0x05 are reserved.
+#
+# _register_op("deref", 0x06, 0)
+#
+# # Op code 0x07 is reserved.
+#
+# _register_op("const1u", 0x08, 1)
+# _register_op("const1s", 0x09, 1)
+# _register_op("const2u", 0x0a, 1)
+# _register_op("const2s", 0x0b, 1)
+# _register_op("const4u", 0x0c, 1)
+# _register_op("const4s", 0x0d, 1)
+# _register_op("const8u", 0x0e, 1)
+# _register_op("const8s", 0x0f, 1)
+# _register_op("constu", 0x10, 1)
+# _register_op("consts", 0x11, 1)
+#
+# _register_op("dup", 0x12, 0)
+# _register_op("drop", 0x13, 0)
+# _register_op("over", 0x14, 0)
+# _register_op("pick", 0x15, 1)
+# _register_op("swap", 0x16, 0)
+# _register_op("rot", 0x17, 0)
+# _register_op("xderef", 0x18, 0)
+#
+# _register_op("abs", 0x19, 0)
+# _register_op("and", 0x1a, 0)
+# _register_op("div", 0x1b, 0)
+# _register_op("minus", 0x1c, 0)
+# _register_op("mod", 0x1d, 0)
+# _register_op("mul", 0x1e, 0)
+# _register_op("neg", 0x1f, 0)
+# _register_op("not", 0x20, 0)
+# _register_op("or", 0x21, 0)
+# _register_op("plus", 0x22, 0)
+# _register_op("plus_uconst", 0x23, 1)
+# _register_op("shl", 0x24, 0)
+# _register_op("shr", 0x25, 0)
+# _register_op("shra", 0x26, 0)
+# _register_op("xor", 0x27, 0)
+# _register_op("bra", 0x28, 1)
+# _register_op("eq", 0x29, 0)
+# _register_op("ge", 0x2a, 0)
+# _register_op("gt", 0x2b, 0)
+# _register_op("le", 0x2c, 0)
+# _register_op("lt", 0x2d, 0)
+# _register_op("ne", 0x2e, 0)
+# _register_op("skip", 0x2f, 0)
+#
+# for lit_nr in range(32):
+# _register_op(f"lit{lit_nr}", 0x30 + lit_nr, 0)
+#
+# for reg_nr in range(32):
+# _register_op(f"reg{reg_nr}", 0x50 + reg_nr, 0)
+#
+# for breg_nr in range(32):
+# _register_op(f"breg{breg_nr}", 0x70 + breg_nr, 0)
+#
+# _register_op("regx", 0x90, 1)
+# _register_op("fbregx", 0x91, 1)
+# _register_op("bregx", 0x92, 2)
+# _register_op("piece", 0x93, 1)
+# _register_op("deref_size", 0x94, 1)
+# _register_op("xderef_size", 0x95, 1)
+# _register_op("nop", 0x96, 0)
+# _register_op("push_object_address", 0x97, 0)
+#
+# _register_op("call2", 0x98, 1)
+# _register_op("call4", 0x99, 1)
+# _register_op("call_ref", 0x9a, 1)
+# _register_op("form_tls_address", 0x9b, 0)
+# _register_op("call_frame_cfa", 0x9c, 0)
+#
+# _register_op("bit_piece", 0x9d, 2)
+# _register_op("implicit_value", 0x9e, 2)
+# _register_op("stack_value", 0x9f, 0)
+# _register_op("implicit_pointer", 0xa0, 2)
+# _register_op("addrx", 0xa1, 1)
+# _register_op("constx", 0xa2, 1)
+#
+# _register_op("entry_value", 0xa3, 2)
+# _register_op("const_type", 0xa4, 3)
+# _register_op("regval_type", 0xa5, 2)
+# _register_op("deref_type", 0xa6, 2)
+# _register_op("xderef_type", 0xa7, 2)
+#
+# _register_op("convert", 0xa8, 1)
+# _register_op("reinterpret", 0xa9, 1)
+
+
+# _register_all_ops()
+
+
+# Workaround for my editor not to freak out over unclosed braces.
+lbrace, rbrace = "{", "}"
+
+
+@cache
+def get_indent_str(indent_count: int) -> str:
+ """Get whitespace string to prepend to another for indenting."""
+ indent = (indent_count // 2) * "\t"
+ if indent_count % 2 == 1:
+ indent += " "
+ return indent
+
+
+def indent(line: str, indent_count: int) -> str:
+ """Indent line by indent_count levels."""
+ return get_indent_str(indent_count) + line
+
+
+def labelify_str(s: str) -> str:
+ """Make s appropriate for a label name."""
+ # Replace "*" with the literal word "ptr".
+ s = s.replace("*", "ptr")
+
+ # Replace any non-"word" characters by "_".
+ s = re.sub(r"\W", "_", s)
+
+ # Remove consecutive "_"s.
+ s = re.sub(r"__+", "_", s)
+
+ return s
+
+
+class DWARFAttribute:
+ """Storage unit for a single DWARF attribute.
+
+ All its values are strings that are usually passed on
+ directly to format. The exceptions to this are attributes
+ with int values with DW_FORM_ref4 or DW_FORM_ref_addr form.
+ Their values are interpreted as the global offset of the DIE
+ being referenced, which are looked up dynamically to fetch
+ their labels.
+ """
+ def __init__(
+ self,
+ die_offset: int,
+ name: str,
+ value: str | bytes | int | bool,
+ form=None,
+ ):
+ self.die_offset = die_offset
+ self.name = name
+ self.value = value
+ self.form = form
+
+ def _format_expr_value(self) -> str:
+ # This code is left here in case this is feature is desired in the
+ # future. It is partly functional, being able to decode operations
+ # but not operands due to their variable length in bytes.
+
+ # self.form = "SPECIAL_expr"
+ # lines = [lbrace]
+ # curr_idx = 0
+ # while curr_idx < len(self.value):
+ # curr_op_code = self.value[curr_idx]
+ #
+ # try:
+ # curr_op: DWARFOperation = dwarf_operations[curr_op_code]
+ # except KeyError as err:
+ # raise KeyError(
+ # f"Unknown op: {curr_op_code}, item {curr_idx} of"
+ # f" attribute {self.name}, DIE at offset"
+ # f" {self.die_offset}."
+ # ) from err
+ # # Empty arg here so it inserts a space before the first argument,
+ # # only if there is indeed a first argument.
+ # args = [""]
+ # for arg_nr in range(curr_op.num_operands):
+ # curr_idx += 1
+ # # TODO decode bytes according to specific operation.
+ # # The following line only reads the next BYTE of the sequence,
+ # # and needs to be adjusted to read the correct number of
+ # # bytes.
+ # # args.append(str(self.value[curr_idx]))
+ # op_line = "DW_OP_" + curr_op.suffix + " ".join(args)
+ # lines.append(op_line)
+ # curr_idx += 1
+ #
+ # lines.append(rbrace)
+ # return "\n".join(lines)
+ self.form = "SPECIAL_expr"
+ return "{ MANUAL: Fill expr list }"
+
+ def _needs_escaping(self, str_value: str) -> bool:
+ charset = set(str_value)
+ return bool(charset.intersection({"{", "}", " ", "\t"}))
+
+ def _format_str(self, str_value: str) -> str:
+ if self._needs_escaping(str_value):
+ escaped_str = str(str_value)
+ # Replace single escape (which is itself escaped because of regex)
+ # with a double escape (which doesn't mean anything to regex so
+ # it doesn't need escaping).
+ escaped_str = re.sub(r"\\", r"\\", escaped_str)
+ escaped_str = re.sub("([{}])", r"\\\1", escaped_str)
+ return "{" + escaped_str + "}"
+ else:
+ return str_value
+
+ def _format_value(
+ self,
+ offset_die_lookup: dict[int, "DWARFDIE"],
+ indent_count: int = 0
+ ) -> str:
+ if self.form in EXPR_ATTRIBUTE_FORMS:
+ return self._format_expr_value()
+ elif isinstance(self.value, bool):
+ return str(int(self.value))
+ elif isinstance(self.value, int):
+ if self.form == "DW_FORM_ref4":
+ # ref4-style referencing label.
+ die: "DWARFDIE" = offset_die_lookup[self.value]
+ return ":$" + die.tcl_label
+ elif self.form == "DW_FORM_ref_addr":
+ # ref_addr-style referencing label.
+ die: "DWARFDIE" = offset_die_lookup[self.value]
+ return "%$" + die.tcl_label
+ else:
+ return str(self.value)
+ elif isinstance(self.value, bytes):
+ return self._format_str(self.value.decode("ascii"))
+ elif isinstance(self.value, str):
+ return self._format_str(self.value)
+ else:
+ raise NotImplementedError(
+ "Unknown data type: " + str(type(self.value))
+ )
+
+ def format(
+ self,
+ offset_die_lookup: dict[int, "DWARFDIE"],
+ indent_count: int = 0
+ ) -> str:
+ """Format the attribute in the form {name value form}.
+
+ If form is DW_FORM_exprloc or DW_FORM_block, see next section on
+ DWARFOperations.
+
+ If it isn't, value is formatted as follows:
+ If bool, use "1" if True, "0" if False.
+ If int:
+ If form is DW_FORM_ref4, use ":$label" where label is the
+ tcl_label of the DWARFDIE at offset "value".
+ If form is DW_FORM_ref_addr, use "%$label" where label is
+ the tcl_label of the DWARFDIE at offset "value".
+ Else, use value directly.
+ If bytes, use value.decode("ascii")
+ If str, use value directly.
+ Any other type results in a NotImplementedError being raised.
+
+ Regarding DW_FORM_exprloc and DW_FORM_block:
+ The form is replaced with SPECIAL_expr.
+ The entries in the value are interpreted and decoded using the
+ dwarf_operations dictionary, and replaced with their names where
+ applicable.
+ """
+ s = lbrace
+ s += self.name + " "
+ s += self._format_value(offset_die_lookup)
+
+ # Only explicitly state form if it's not a reference.
+ if self.form not in [None, "DW_FORM_ref4", "DW_FORM_ref_addr"]:
+ s += " " + self.form
+
+ s += rbrace
+ return indent(s, indent_count)
+
+
+class DWARFDIE:
+ """This script's parsed version of a RawDIE."""
+ def __init__(
+ self,
+ offset: int,
+ tag: str,
+ attrs: dict[str, DWARFAttribute],
+ tcl_label: Optional[str] = None
+ ):
+ self.offset: Annotated[
+ int,
+ "Global offset of the DIE."
+ ] = offset
+ self.tag: Annotated[
+ str,
+ "DWARF tag for this DIE."
+ ] = tag
+ self.attrs: Annotated[
+ dict[str, DWARFAttribute],
+ "Dict of attributes for this DIE."
+ ] = copy(attrs)
+ self.children: Annotated[
+ list[DWARFDIE],
+ "List of child DIEs of this DIE."
+ ] = []
+ self.tcl_label: Annotated[
+ str,
+ "Label used by the Tcl code to reference this DIE, if any. These "
+ 'take the form of "label: " before the actual DIE definition.'
+ ] = tcl_label
+
+ def format_lines(
+ self,
+ offset_die_lookup: dict[int, "DWARFDIE"],
+ indent_count: int = 0
+ ) -> list[str]:
+ """Get the list of lines that represent this DIE in Dwarf assembler."""
+ die_lines = []
+
+ # Prepend label to first line, if it's set.
+ if self.tcl_label:
+ first_line_start = self.tcl_label + ": "
+ else:
+ first_line_start = ""
+
+ # First line, including label.
+ first_line = indent(
+ first_line_start + self.tag + " " + lbrace,
+ indent_count
+ )
+ die_lines.append(first_line)
+
+ # Format attributes, if any.
+ if self.attrs:
+ for attr_name, attr in self.attrs.items():
+ attr_line = attr.format(
+ offset_die_lookup,
+ indent_count=indent_count + 1
+ )
+ die_lines.append(attr_line)
+ die_lines.append(indent(rbrace, indent_count))
+ else:
+ # Don't create a new line, just append and immediately close the
+ # brace on the last line.
+ die_lines[-1] += rbrace
+
+ # Format children, if any.
+ if self.children:
+ # Only open a new brace if there are any children for the
+ # current DIE.
+ die_lines[-1] += " " + lbrace
+ for child in self.children:
+ child_lines = child.format_lines(
+ offset_die_lookup,
+ indent_count=indent_count + 1
+ )
+ die_lines.extend(child_lines)
+ die_lines.append(indent(rbrace, indent_count))
+
+ return die_lines
+
+ def format(
+ self,
+ offset_die_lookup: dict[int, "DWARFDIE"],
+ indent_count: int = 0
+ ) -> str:
+ """Join result from format_lines into a single str."""
+ return "\n".join(self.format_lines(offset_die_lookup, indent_count))
+
+ def name(self) -> Optional[str]:
+ """Get DW_AT_name (if present) decoded as ASCII."""
+ raw_value = self.attrs.get("DW_AT_name")
+ if raw_value is None:
+ return None
+ else:
+ return raw_value.value.decode("ascii")
+
+ def type_name(self) -> str:
+ """Name of Dwarf tag, with the "DW_TAG_" prefix removed."""
+ return re.sub("DW_TAG_", "", self.tag)
+
+
+class DWARFCompileUnit(DWARFDIE):
+ """Wrapper subclass for CU DIEs.
+
+ This is necessary due to the special format CUs take in Dwarf::assemble.
+
+ Instead of simply:
+ DW_TAG_compile_unit {
+ <attributes>
+ } {
+ <children>
+ }
+
+ CUs are formatted as:
+ cu { <cu_special_vars> } {
+ DW_TAG_compile_unit {
+ <attributes>
+ } {
+ <children>
+ }
+ }
+ """
+
+ # Default value for parameter is_64 defined in dwarf.exp line 1553.
+ # This value is converted to 0/1 automatically when emitting
+ # Dwarf::assemble code.
+ default_is_64 = False
+
+ # Default value for parameter dwarf_version defined in dwarf.exp line 1552.
+ default_dwarf_version = 4
+
+ # Default value for parameter is_fission defined in dwarf.exp line 1556.
+ # Currently not implemented, see comment below.
+ # default_is_fission = False
+
+ # Tag that signifies a DIE is a compile unit.
+ compile_unit_tag = "DW_TAG_compile_unit"
+
+ def __init__(
+ self,
+ raw_die: RawDIE,
+ raw_cu: RawCompileUnit,
+ attrs: dict[str, DWARFAttribute],
+ ):
+ """Initialize additional instance variables for CU encoding.
+
+ The additional instance variables are:
+ - is_64_bit: bool
+ Whether this CU is 64 bit or not.
+ - dwarf_version: int
+ default DWARFCompileUnit.default_dwarf_version
+ Version of DWARF this CU is using.
+ - addr_size: Optional[int]
+ default None
+ Size of an address in bytes.
+
+ These variables are used to configure the first parameter of the cu
+ proc (which contains calls to the compile_unit proc in the body of
+ Dwarf::assemble).
+ """
+ super().__init__(
+ raw_die.offset,
+ DWARFCompileUnit.compile_unit_tag,
+ attrs
+ )
+ self.raw_cu = raw_cu
+ self.dwarf_version: int = raw_cu.header.get(
+ "version",
+ DWARFCompileUnit.default_dwarf_version
+ )
+ self.addr_size: Optional[int] = raw_cu.header.get("address_size")
+ self.is_64_bit: bool = raw_cu.dwarf_format() == 64
+
+ # Fission is not currently implemented because I don't know where to
+ # fetch this information from.
+ # self.is_fission: bool = self.default_is_fission
+
+ # CU labels are not currently implemented because I haven't found where
+ # pyelftools exposes this information.
+ # self.cu_label: Optional[str] = None
+
+ def format_lines(
+ self,
+ offset_die_lookup: dict[int, DWARFDIE],
+ indent_count: int = 0,
+ ) -> list[str]:
+ lines = []
+ lines.append(self._get_header(indent_count))
+ inner_lines = super().format_lines(
+ offset_die_lookup,
+ indent_count + 1
+ )
+ lines += inner_lines
+ lines.append(indent(rbrace, indent_count))
+ return lines
+
+ def _get_header(self, indent_count: int = 0) -> str:
+ """Assemble the first line of the surrounding 'cu {} {}' proc call."""
+ header = indent("cu " + lbrace, indent_count)
+ cu_params = []
+
+ if self.is_64_bit != DWARFCompileUnit.default_is_64:
+ # Convert from True/False to 1/0.
+ param_value = int(self.is_64_bit)
+ cu_params += ["is_64", str(param_value)]
+
+ if self.dwarf_version != DWARFCompileUnit.default_dwarf_version:
+ cu_params += ["version", str(self.dwarf_version)]
+
+ if self.addr_size is not None:
+ cu_params += ["addr_size", str(self.addr_size)]
+
+ # Fission is not currently implemented, see comment above.
+ # if self.is_fission != DWARFCompileUnit.default_is_fission:
+ # # Same as is_64_bit conversion, True/False -> 1/0.
+ # param_value = int(self.is_fission)
+ # cu_params += ["fission", str(param_value)]
+
+ # CU labels are not currently implemented, see commend above.
+ # if self.cu_label is not None:
+ # cu_params += ["label", self.cu_label]
+
+ if cu_params:
+ header += " ".join(cu_params)
+
+ header += rbrace + " " + lbrace
+ return header
+
+
+class DWARFParser:
+ """Converter from pyelftools's DWARF representation to this script's."""
+
+ def __init__(self, elf_file: IOBase):
+ """Init parser with file opened in binary mode.
+
+ File can be closed after this function is called.
+ """
+ self.raw_data = BytesIO(elf_file.read())
+ self.elf_data = ELFFile(self.raw_data)
+ self.dwarf_info = self.elf_data.get_dwarf_info()
+ self.offset_to_die: dict[int, DWARFDIE] = {}
+ self.label_to_die: dict[str, DWARFDIE] = {}
+ self.referenced_offsets: Annotated[
+ set[int],
+ "The set of all offsets that were referenced by some DIE."
+ ] = set()
+ self.raw_cu_list: list[RawCompileUnit] = []
+ self.top_level_dies: list[DWARFDIE] = []
+ self.subprograms: list[DWARFDIE] = []
+ self.taken_labels: set[str] = set()
+
+ self._read_all_cus()
+ self._create_necessary_labels()
+
+ def _read_all_cus(self):
+ """Populate self.raw_cu_list with all CUs in self.dwarf_info."""
+ for cu in self.dwarf_info.iter_CUs():
+ self._read_cu(cu)
+
+ def _read_cu(self, raw_cu: RawCompileUnit):
+ """Read a compile_unit into self.cu_list."""
+ self.raw_cu_list.append(raw_cu)
+ for raw_die in raw_cu.iter_DIEs():
+ if not raw_die.is_null():
+ self._parse_die(raw_cu, raw_die)
+
+ def _parse_die(self, die_cu: RawCompileUnit, raw_die: RawDIE) -> DWARFDIE:
+ """Process a single DIE and add it to offset_to_die.
+
+ Look for DW_FORM_ref4 and DWD_FORM_ref_addr form attributes and replace
+ them with the global offset of the referenced DIE, and adding the
+ referenced DIE to a set. This will be used later to assign and use
+ labels only to DIEs that need it.
+
+ In case the DIE is a top-level DIE, add it to self.top_level_dies.
+
+ In case the DIE is a subprogram, add it to self.subprograms and call
+ self._use_vars_for_low_and_high_pc_attr with it.
+ """
+ processed_attrs = {}
+ attr_value: AttributeValue
+ for attr_name, attr_value in raw_die.attributes.items():
+ actual_value = attr_value.value
+ if attr_value.form in ("DW_FORM_ref4", "DW_FORM_ref_addr"):
+ referenced_die = raw_die.get_DIE_from_attribute(attr_name)
+ actual_value = referenced_die.offset
+ self.referenced_offsets.add(referenced_die.offset)
+
+ processed_attrs[attr_name] = DWARFAttribute(
+ raw_die.offset,
+ attr_name,
+ actual_value,
+ attr_value.form
+ )
+
+ if raw_die.tag == DWARFCompileUnit.compile_unit_tag:
+ processed_die = DWARFCompileUnit(raw_die, die_cu, processed_attrs)
+ else:
+ processed_die = DWARFDIE(
+ raw_die.offset,
+ raw_die.tag,
+ processed_attrs,
+ None
+ )
+
+ if raw_die.get_parent() is None:
+ # Top level DIE
+ self.top_level_dies.append(processed_die)
+ else:
+ # Setting the parent here assumes the parent was already processed
+ # prior to this DIE being found.
+ # As far as I'm aware, this is always true in DWARF.
+ processed_parent = self.offset_to_die[raw_die.get_parent().offset]
+ processed_parent.children.append(processed_die)
+
+ if processed_die.tag == "DW_TAG_subprogram":
+ self.subprograms.append(processed_die)
+ self._use_vars_for_low_and_high_pc_attr(processed_die)
+
+ self.offset_to_die[processed_die.offset] = processed_die
+ return processed_die
+
+ def _create_necessary_labels(self):
+ """Create labels to DIEs that were referenced by others."""
+ for offset in self.referenced_offsets:
+ die = self.offset_to_die[offset]
+ self._create_label_for_die(die)
+
+ def _use_vars_for_low_and_high_pc_attr(self, subprogram: DWARFDIE) -> None:
+ """Replace existing PC attributes with Tcl variables.
+
+ If DW_AT_low_pc exists for this DIE, replace it with accessing the
+ variable whose name is given by self.subprogram_start_var(subprogram).
+
+ If DW_AT_high_pc exists for this DIE, replace it with accessing the
+ variable whose name is given by self.subprogram_end_var(subprogram).
+ """
+ low_pc_attr_name = "DW_AT_low_pc"
+ if low_pc_attr_name in subprogram.attrs:
+ start = self.subprogram_start_var(subprogram)
+ subprogram.attrs[low_pc_attr_name].value = start
+
+ high_pc_attr_name = "DW_AT_high_pc"
+ if high_pc_attr_name in subprogram.attrs:
+ end = self.subprogram_end_var(subprogram)
+ subprogram.attrs[high_pc_attr_name].value = end
+
+ def _create_label_for_die(self, die: DWARFDIE) -> None:
+ """Set tcl_label to a unique string among other DIEs for this parser.
+
+ As a first attempt, use labelify(die.name()). If the DIE does not have
+ a name, use labelify(die.type_name()).
+
+ If the chosen initial label is already taken, try again appending "_2".
+ While the attempt is still taken, try again replacing it with "_3", then
+ "_4", and so on.
+
+ This function also creates an entry on self.label_to_die.
+ """
+ if die.tcl_label is not None:
+ return
+
+ label = labelify_str(die.name() or die.type_name())
+
+ # Deduplicate label in case of collision
+ if label in self.taken_labels:
+ suffix_nr = 2
+
+ # Walrus operator to prevent writing the assembled label_suffix
+ # string literal twice. This could be rewritten by copying the
+ # string literal to the line after the end of the while loop,
+ # but I deemed it would be too frail in case one of them needs
+ # to be changed and the other is forgotten.
+ while (new_label := f"{label}_{suffix_nr}") in self.taken_labels:
+ suffix_nr += 1
+ label = new_label
+
+ die.tcl_label = label
+ self.label_to_die[label] = die
+ self.taken_labels.add(label)
+
+ def subprogram_start_var(self, subprogram: DWARFDIE) -> str:
+ """Name of the Tcl variable that holds the low PC for a subprogram."""
+ return f"${subprogram.name()}_start"
+
+ def subprogram_end_var(self, subprogram: DWARFDIE) -> str:
+ """Name of the Tcl variable that holds the high PC for a subprogram."""
+ return f"${subprogram.name()}_end"
+
+ def all_labels(self) -> set[str]:
+ """Get a copy of the set of all labels known to the parser so far."""
+ return copy(self.taken_labels)
+
+
+class DWARFAssemblerGenerator:
+ """Class that generates Dwarf::assemble code out of a DWARFParser."""
+
+ def __init__(self, dwarf_parser: DWARFParser, output=sys.stdout):
+ self.dwarf_parser = dwarf_parser
+ self.output = output
+
+ def emit(self, line: str, indent_count: int) -> None:
+ """Print a single line indented indent_count times to self.output.
+
+ If line is empty, it will always print an empty line, even with nonzero
+ indent_count.
+ """
+ if line:
+ line = get_indent_str(indent_count) + line
+ print(line, file=self.output)
+
+ def generate_die(self, die: DWARFDIE, indent_count: int):
+ """Generate the lines that represent a DIE."""
+ die_lines = die.format(self.dwarf_parser.offset_to_die, indent_count)
+ self.emit(die_lines, 0)
+
+ def generate(self):
+ indent_count = 0
+
+ self.emit("Dwarf::assemble $asm_file {", indent_count)
+
+ # Begin Dwarf::assemble body.
+ indent_count += 1
+ self.emit("global srcdir subdir srcfile", indent_count)
+
+ all_labels = self.dwarf_parser.all_labels()
+ if all_labels:
+ self.emit("declare_labels " + " ".join(all_labels), indent_count)
+
+ self.emit("", 0)
+ for subprogram in self.dwarf_parser.subprograms:
+ self.emit(f"get_func_info {subprogram.name()}", indent_count)
+
+ for die in self.dwarf_parser.top_level_dies:
+ self.generate_die(die, indent_count)
+
+ # TODO: line table, if it's within scope (it probably isn't).
+
+ # End Dwarf::assemble body.
+ indent_count -= 1
+ self.emit(rbrace, indent_count)
+
+
+def main(argv):
+ try:
+ filename = argv[1]
+ except IndexError:
+ print("Usage:", file=sys.stderr)
+ print("python ./asm_to_dwarf_assembler.py <path/to/elf/file>", file=sys.stderr)
+ sys.exit(errno.EOPNOTSUP)
+
+ try:
+ with open(filename, "rb") as elf_file:
+ parser = DWARFParser(elf_file)
+ except Exception as e:
+ print("Error parsing ELF file. Does it contain DWARF information?", file=sys.stderr)
+ print(str(e), file=sys.stderr)
+ sys.exit(errno.ENODATA)
+ generator = DWARFAssemblerGenerator(parser)
+ generator.generate()
+
+
+if __name__ == "__main__":
+ main(sys.argv)