RFC: AArch64 Disassembler: Annotate undefined instructions

Message ID 87h5owxv94.fsf@redhat.com
State New
Series RFC: AArch64 Disassembler: Annotate undefined instructions

Commit Message

Nick Clifton April 27, 2026, 3:45 p.m. UTC
  Hi Guys,

  Attached is a proposed patch to enhance the AArch64 disassembler with a
  "-M annotate" command line option.  When enabled the option tells the
  disassembler that undefined instructions might actually be the
  addresses of objects and that it should look them up in the symbol
  table.

  So for example without the option being enabled, disassembling
  non-code sections might produce results like this:

    000000000041fdd8 <__frame_dummy_init_array_entry>:
      41fdd8:	00400764 	.inst	0x00400764 ; undefined
      41fddc:	00000000 	udf	#0

  Whereas with the option enabled this changes to:
  
    000000000041fdd8 <__frame_dummy_init_array_entry>:
      41fdd8:	00400764 	.inst	0x00400764 ; [frame_dummy]
      41fddc:	00000000 	udf	#0

  There are limitations to the patch.  If a piece of data just happens
  to decode to a valid instruction, then the annotation will not take
  place.  Plus the patch assumes that there is no point in annotating
  static object files as their symbols will not have been resolved.
  (Plus you can always disassemble with relocation display enabled to
  see where symbols are being referenced).
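
  As a rough illustration of the lookup described above, here is a
  stand-alone sketch.  It does not use the BFD API: `struct sym`,
  `symbol_at_address` and `format_undefined` are hypothetical names for
  this example only, and the `other_func` table entry is invented.  Only
  the frame_dummy address comes from the sample output.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a resolved symbol: a name and an address.  */
struct sym
{
  const char *name;
  uint64_t value;
};

/* Toy symbol table.  frame_dummy's address is taken from the sample
   output; other_func is made up for illustration.  */
static const struct sym symtab[] =
{
  { "other_func",  0x00400640 },
  { "frame_dummy", 0x00400764 },
};

/* Return the symbol whose value exactly matches ADDR, or NULL.  */
static const struct sym *
symbol_at_address (uint64_t addr)
{
  for (size_t i = 0; i < sizeof symtab / sizeof symtab[0]; i++)
    if (symtab[i].value == addr)
      return &symtab[i];
  return NULL;
}

/* Write the text for an undecodable 32-bit WORD into BUF.  When
   ANNOTATE is non-zero and WORD matches a symbol's address, the
   symbol name replaces the "undefined" comment.  */
static void
format_undefined (uint32_t word, int annotate, char *buf, size_t len)
{
  const struct sym *s = annotate ? symbol_at_address (word) : NULL;

  if (s != NULL)
    snprintf (buf, len, ".inst\t0x%08x ; [%s]", (unsigned int) word, s->name);
  else
    snprintf (buf, len, ".inst\t0x%08x ; undefined", (unsigned int) word);
}
```

  Calling format_undefined (0x00400764, 1, ...) selects the
  "[frame_dummy]" form, while a word that matches no symbol keeps the
  "undefined" comment.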

  Comments, thoughts, suggestions ?

Cheers
  Nick
  

Comments

Richard Earnshaw (foss) April 28, 2026, 11:54 a.m. UTC | #1
[Apologies if you get two copies of this reply; I thought I sent one earlier, but I can't see it anywhere in my mail system, so perhaps I just managed to delete it without sending.]

On 27/04/2026 16:45, Nick Clifton wrote:
> Hi Guys,
> 
>   Attached is proposed patch to enhance the AArch64 disassembler with a
>   "-M annotate" command line option.  When enabled the option tell the
>   disassembler that undefined instructions might actually be the
>   addresses of objects and that it should look them up in the symbol
>   table.
> 
>   So for example without the option being enabled, disassembling
>   non-code sections might produce results like this:
> 
>     000000000041fdd8 <__frame_dummy_init_array_entry>:
>       41fdd8:	00400764 	.inst	0x00400764 ; undefined
>       41fddc:	00000000 	udf	#0
> 
>   Whereas with the option enabled this changes to:
>   
>     000000000041fdd8 <__frame_dummy_init_array_entry>:
>       41fdd8:	00400764 	.inst	0x00400764 ; [frame_dummy]
>       41fddc:	00000000 	udf	#0
> 
>   There are limitations to the patch.  If a piece of data just happens
>   to decode to a valid instruction, then the annotation will not take
>   place.  Plus the patch assumes that there is no point in annotating
>   static object files as their symbols will not have been resolved.
>   (Plus you can always disassembler with relocation display enabled to
>   see where symbols are being referenced).
> 

This is what mapping symbols are supposed to address.  If the user writes '.word' in a code section then the value will be tagged with $d and the output should then disassemble as data.  '.inst' means 'this is an instruction'.

R.

>   Comments, thoughts, suggestions ?
> 
> Cheers
>   Nick
>
  
Richard Earnshaw (foss) April 28, 2026, 12:13 p.m. UTC | #2
On 28/04/2026 12:54, Richard Earnshaw (foss) wrote:
> [Apologies if you get two copies of this reply; I thought I sent one earlier, but I can't see it anywhere in my mail system, so perhaps I just managed to delete it without sending.]
> 
> On 27/04/2026 16:45, Nick Clifton wrote:
>> Hi Guys,
>>
>>   Attached is proposed patch to enhance the AArch64 disassembler with a
>>   "-M annotate" command line option.  When enabled the option tell the
>>   disassembler that undefined instructions might actually be the
>>   addresses of objects and that it should look them up in the symbol
>>   table.
>>
>>   So for example without the option being enabled, disassembling
>>   non-code sections might produce results like this:
>>
>>     000000000041fdd8 <__frame_dummy_init_array_entry>:
>>       41fdd8:	00400764 	.inst	0x00400764 ; undefined
>>       41fddc:	00000000 	udf	#0
>>
>>   Whereas with the option enabled this changes to:
>>   
>>     000000000041fdd8 <__frame_dummy_init_array_entry>:
>>       41fdd8:	00400764 	.inst	0x00400764 ; [frame_dummy]
>>       41fddc:	00000000 	udf	#0
>>
>>   There are limitations to the patch.  If a piece of data just happens
>>   to decode to a valid instruction, then the annotation will not take
>>   place.  Plus the patch assumes that there is no point in annotating
>>   static object files as their symbols will not have been resolved.
>>   (Plus you can always disassembler with relocation display enabled to
>>   see where symbols are being referenced).
>>
> 
> This is what mapping symbols are supposed to address.  If the user writes '.word' in a code section then the value will be tagged with $d and the output should then disassemble as data.  '.inst' means 'this is an instruction'.
> 
> R.
> 
>>   Comments, thoughts, suggestions ?
>>
>> Cheers
>>   Nick
>>
> 

For example:

        .text
        .inst 0x12345678
        .word 0x12345678

gives:

 objdump -d test.o

test.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <.text>:
   0:   12345678        and     w24, w19, #0xfffff003
   4:   12345678        .word   0x12345678

Maybe .word should be annotated with a suitable label, but I don't think .inst should.
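
To make the mapping-symbol idea concrete, here is a small stand-alone sketch of how a tool might decide whether an offset is code or data.  It is not the code in aarch64-dis.c: `classify_mapping_symbol`, `type_at`, `struct map_sym` and the `MAP_NONE` value are hypothetical names for this example, and it assumes the symbols are sorted by address and that the assembler placed $x at offset 0 and $d at offset 4 for the source above.

```c
#include <stddef.h>
#include <stdint.h>

/* MAP_NONE is an addition for this sketch: "no mapping symbol seen".  */
enum map_type { MAP_INSN, MAP_DATA, MAP_NONE };

/* Hypothetical record for one symbol table entry.  */
struct map_sym
{
  uint64_t addr;
  const char *name;
};

/* Classify NAME as a mapping symbol.  "$x" marks code and "$d" marks
   data; either may be followed by ".<anything>".  Other names are
   not mapping symbols.  */
static enum map_type
classify_mapping_symbol (const char *name)
{
  if (name == NULL || name[0] != '$')
    return MAP_NONE;
  if (name[1] != 'x' && name[1] != 'd')
    return MAP_NONE;
  if (name[2] != '\0' && name[2] != '.')
    return MAP_NONE;
  return name[1] == 'x' ? MAP_INSN : MAP_DATA;
}

/* Return the type given by the closest mapping symbol at or before
   ADDR.  SYMS must be sorted by ascending address.  */
static enum map_type
type_at (const struct map_sym *syms, size_t n, uint64_t addr)
{
  enum map_type t = MAP_NONE;

  for (size_t i = 0; i < n && syms[i].addr <= addr; i++)
    {
      enum map_type m = classify_mapping_symbol (syms[i].name);
      if (m != MAP_NONE)
        t = m;
    }
  return t;
}
```

With the table { {0, "$x"}, {4, "$d"} }, type_at reports code for offset 0 and data for offset 4, which matches the disassembly shown.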

R.
  
Nick Clifton April 28, 2026, 1:51 p.m. UTC | #3
Hi Richard,

>>>    So for example without the option being enabled, disassembling
>>>    non-code sections might produce results like this:
>>>
>>>      000000000041fdd8 <__frame_dummy_init_array_entry>:
>>>        41fdd8:	00400764 	.inst	0x00400764 ; undefined
>>>        41fddc:	00000000 	udf	#0
>>>
>>>    Whereas with the option enabled this changes to:
>>>    
>>>      000000000041fdd8 <__frame_dummy_init_array_entry>:
>>>        41fdd8:	00400764 	.inst	0x00400764 ; [frame_dummy]
>>>        41fddc:	00000000 	udf	#0

>> This is what mapping symbols are supposed to address.  If the user writes '.word' in a code section then the value will be tagged with $d and the output should then disassemble as data.  '.inst' means 'this is an instruction'.

True.

> For example:
> 
>          .text
>          .inst 0x12345678
>          .word 0x12345678
> 
> gives:
> 
>   objdump -d test.o
> 
> test.o:     file format elf64-littleaarch64
> 
> 
> Disassembly of section .text:
> 
> 0000000000000000 <.text>:
>     0:   12345678        and     w24, w19, #0xfffff003
>     4:   12345678        .word   0x12345678
> 
> Maybe .word should be annotated with a suitable label, but I don't think .inst should.

OK, that makes sense.  So the disassembler should check to see if the closest
previous mapping symbol is $d and only proceed with the annotation in this
case, yes ?

Cheers
   Nick
  
Michael Matz April 28, 2026, 2:14 p.m. UTC | #4
Hey,

On Tue, 28 Apr 2026, Nick Clifton wrote:

> >>>        41fdd8:	00400764 	.inst	0x00400764 ; undefined
> >>>        41fddc:	00000000 	udf	#0
> >>>
> >>>    Whereas with the option enabled this changes to:
> >>>    
> >>>      000000000041fdd8 <__frame_dummy_init_array_entry>:
> >>>        41fdd8:	00400764 	.inst	0x00400764 ; [frame_dummy]
> >>>        41fddc:	00000000 	udf	#0
> 
> >> This is what mapping symbols are supposed to address.  If the user writes
> >> '.word' in a code section then the value will be tagged with $d and the
> >> output should then disassemble as data.  '.inst' means 'this is an
> >> instruction'.
...
> >   objdump -d test.o
> > 
> > Maybe .word should be annotated with a suitable label, but I don't think
> > .inst should.
> 
> OK, that makes sense.  So the disassembler should check to see if closest
> previous mapping symbol is $d and only proceed with the annotation in this
> case, yes ?

But we're talking about doing the annotations for a final-linked file, not 
for relocatable files.  Are the $d symbols even part of them?

(And it's an annotation only, I think it'd be acceptable to say "[foobar]" 
even when an undefined instruction really was meant as undefined 
instruction, not as .data, but just so happens to match the address of 
foobar.  It certainly would be a very interesting coincidence, and also 
certainly much more interesting than "; undefined". )


Ciao,
Michael.
  
Richard Earnshaw (foss) April 28, 2026, 3:15 p.m. UTC | #5
On 28/04/2026 15:14, Michael Matz wrote:
> Hey,
> 
> On Tue, 28 Apr 2026, Nick Clifton wrote:
> 
>>>>>        41fdd8:	00400764 	.inst	0x00400764 ; undefined
>>>>>        41fddc:	00000000 	udf	#0
>>>>>
>>>>>    Whereas with the option enabled this changes to:
>>>>>    
>>>>>      000000000041fdd8 <__frame_dummy_init_array_entry>:
>>>>>        41fdd8:	00400764 	.inst	0x00400764 ; [frame_dummy]
>>>>>        41fddc:	00000000 	udf	#0
>>
>>>> This is what mapping symbols are supposed to address.  If the user writes
>>>> '.word' in a code section then the value will be tagged with $d and the
>>>> output should then disassemble as data.  '.inst' means 'this is an
>>>> instruction'.
> ...
>>>   objdump -d test.o
>>>
>>> Maybe .word should be annotated with a suitable label, but I don't think
>>> .inst should.
>>
>> OK, that makes sense.  So the disassembler should check to see if closest
>> previous mapping symbol is $d and only proceed with the annotation in this
>> case, yes ?
> 
> But we're talking about doing the annotations for a final-linked file, not 
> for relocatable files.  Are the $d symbols even part of them?
> 
> (And it's an annotation only, I think it'd be acceptable to say "[foobar]" 
> even when an undefined instruction really was meant as undefined 
> instruction, not as .data, but just so happens to match the address of 
> foobar.  It certainly would be a very interesting coincidence, and also 
> certainly much more interesting than "; undefined". )
> 
> 
> Ciao,
> Michael.

They're in the symbol table.  If you strip the symbols you'll lose the information.  But then you'd lose the symbols that Nick is proposing to annotate as well.

I guess you could selectively strip some symbols, but then it's a matter of caveat emptor.
  
Richard Earnshaw (foss) April 28, 2026, 3:18 p.m. UTC | #6
On 28/04/2026 14:51, Nick Clifton wrote:
> Hi Richard,
> 
>>>>    So for example without the option being enabled, disassembling
>>>>    non-code sections might produce results like this:
>>>>
>>>>      000000000041fdd8 <__frame_dummy_init_array_entry>:
>>>>        41fdd8:    00400764     .inst    0x00400764 ; undefined
>>>>        41fddc:    00000000     udf    #0
>>>>
>>>>    Whereas with the option enabled this changes to:
>>>>         000000000041fdd8 <__frame_dummy_init_array_entry>:
>>>>        41fdd8:    00400764     .inst    0x00400764 ; [frame_dummy]
>>>>        41fddc:    00000000     udf    #0
> 
>>> This is what mapping symbols are supposed to address.  If the user writes '.word' in a code section then the value will be tagged with $d and the output should then disassemble as data.  '.inst' means 'this is an instruction'.
> 
> True.
> 
>> For example:
>>
>>          .text
>>          .inst 0x12345678
>>          .word 0x12345678
>>
>> gives:
>>
>>   objdump -d test.o
>>
>> test.o:     file format elf64-littleaarch64
>>
>>
>> Disassembly of section .text:
>>
>> 0000000000000000 <.text>:
>>     0:   12345678        and     w24, w19, #0xfffff003
>>     4:   12345678        .word   0x12345678
>>
>> Maybe .word should be annotated with a suitable label, but I don't think .inst should.
> 
> OK, that makes sense.  So the disassembler should check to see if closest
> previous mapping symbol is $d and only proceed with the annotation in this
> case, yes ?
> 

Do you even need to do that?  Just annotate the disassembly of .word values.  Though I suspect you'll need to look at 64-bit quantities on a 64-bit machine.  32-bit values aren't addresses.
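
A sketch of what looking at 64-bit quantities could mean in practice, assuming a little-endian target (the sample output earlier in the thread is elf64-littleaarch64).  `combine_words` and `read_le64` are hypothetical helper names for this example, not existing opcodes or BFD functions.

```c
#include <stdint.h>

/* Combine two consecutive 32-bit words from a little-endian image
   into one 64-bit value.  LOW is the word at the lower address.  */
static uint64_t
combine_words (uint32_t low, uint32_t high)
{
  return ((uint64_t) high << 32) | low;
}

/* Read a 64-bit little-endian value from the eight bytes at P.  */
static uint64_t
read_le64 (const unsigned char *p)
{
  uint64_t v = 0;

  /* Start from the most significant byte, which is stored last.  */
  for (int i = 7; i >= 0; i--)
    v = (v << 8) | p[i];
  return v;
}
```

In the __frame_dummy_init_array_entry example the two words are 0x00400764 and 0x00000000, so the combined value is 0x400764.  A lookup on the pair would then treat both words as one pointer rather than annotating only the first.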

R.

> Cheers
>   Nick
> 
>
  
Nick Clifton April 29, 2026, 11:03 a.m. UTC | #7
Hi Richard,

>> OK, that makes sense.  So the disassembler should check to see if closest
>> previous mapping symbol is $d and only proceed with the annotation in this
>> case, yes ?

> Do you even need to do that?  Just annotate the disassembly of .word values.

But the disassembler only knows if an "undefined" instruction is actually
meant to be data by looking at the mapping symbols.

This is the test that I am now using in the v2 version of the patch (not
posted yet as I am waiting to see if there are more comments):

      /* See if this "undefined instruction" is actually the address of something.  */
      if (annotate_undefined_insns
          /* Skip values that have been explicitly tagged as code.  */
          && last_type == MAP_DATA
          /* Skip static object files as symbol values have not been resolved yet.  */
          && info->section != NULL
          && info->section->owner != NULL
          && (info->section->owner->flags & (EXEC_P | DYNAMIC)))
        { ... do the annotation ... }
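
For anyone who wants to poke at the logic without building binutils, the test above can be modelled as a stand-alone predicate.  This is only a sketch: `should_annotate`, `struct sk_bfd` and `struct sk_section` are hypothetical stand-ins, and the SK_ flag values are placeholders rather than the definitions from bfd.h.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag bits for this sketch; the real EXEC_P and DYNAMIC
   definitions live in bfd.h.  */
#define SK_EXEC_P   0x02u
#define SK_DYNAMIC  0x40u

enum map_type { MAP_INSN, MAP_DATA };

/* Minimal stand-ins for a BFD and one of its sections.  */
struct sk_bfd
{
  unsigned int flags;
};

struct sk_section
{
  const struct sk_bfd *owner;
};

/* Return true if an undecodable word should be looked up in the
   symbol table: the option is on, the word is covered by a data
   mapping symbol, and the file is an executable or shared library.  */
static bool
should_annotate (bool annotate_undefined_insns,
                 enum map_type last_type,
                 const struct sk_section *section)
{
  return annotate_undefined_insns
         && last_type == MAP_DATA
         && section != NULL
         && section->owner != NULL
         && (section->owner->flags & (SK_EXEC_P | SK_DYNAMIC)) != 0;
}
```

The predicate is false for a word tagged as code, for a file with neither flag set, and whenever the section or its owner is missing.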

> Though I suspect you'll need to look at 64-bit quantities on a 64-bit machine.
> 32-bit values aren't addresses.

Can you get 32-bit undefined AArch64 instructions ?  The patch only annotates
the disassembler's output for undefined instructions, so if it is displaying something
else then nothing new will happen.

Cheers
   Nick
  
Richard Earnshaw (foss) April 30, 2026, 1:08 p.m. UTC | #8
On 29/04/2026 12:03, Nick Clifton wrote:
> Hi Richard,
> 
>>> OK, that makes sense.  So the disassembler should check to see if closest
>>> previous mapping symbol is $d and only proceed with the annotation in this
>>> case, yes ?
> 
>> Do you even need to do that?  Just annotate the disassembly of .word values.
> 
> But the disassembler only knows if an "undefined" instruction is actually
> meant to be data by looking at the mapping symbols.
> 
> This is the test that I am now using in the v2 version of the patch (not
> posted yet as I am waiting to see if there are more comments):
> 
>      /* See if this "undefined instruction" is actually the address of something.  */
>       if (annotate_undefined_insns
>       /* Skip values that have been explicitly tagged as code.  */
>       && last_type == MAP_DATA
>       /* Skip static object files as symbol values have not be resolved yet.  */
>       && info->section != NULL
>       && info->section->owner != NULL
>       && (info->section->owner->flags & (EXEC_P | DYNAMIC)))
>        { ... do the annotation ... }
> 
>> Though I suspect you'll need to look at 64-bit quantities on a 64-bit machine.  
>> 32-bit values aren't addresses.
> 
> Can you get 32-bit undefined AArch64 instructions ?

I'm not sure I follow.  All aarch64 instructions are 32 bits in size.

> The patch only annotates
> the disassembler's output for undefined instructions, so if it displaying something
> else then nothing new will happen.

Also, .inst does not permit the use of a label - it has no meaning in that context.  So for

        .text
        .inst 0x12345678
        .word 0x12345678
        .word bar  // truncated to 32 bits, but OK
        .inst bar  // Error, not a constant expression
bar:

you get

as test.s
test.s: Assembler messages:
test.s:5: Error: constant expression required

So to even trigger this case you have to contort your sources in ways I really can't see users trying to do.

Is there some user reported case driving this?

R.

> 
> Cheers
>   Nick
> 
>
  
Nick Clifton May 4, 2026, 9:41 a.m. UTC | #9
Hi Richard,

>> Can you get 32-bit undefined AArch64 instructions ?
> I'm not sure I follow.  All aarch64 instructions are 32 bits in size.

Doh!  Please excuse my brain fart.  Yes, I knew that.  I was getting
confused with the "64" in the "AArch64" name...

> Also, .inst does not permit the use of a label - it has no meaning in that context.  So for
> 
>          .text
>          .inst 0x12345678
>          .word 0x12345678
>          .word bar  // truncated to 32 bits, but OK
>          .inst bar  // Error, not a constant expression
> bar:
> 
> you get
> 
> as test.s
> test.s: Assembler messages:
> test.s:5: Error: constant expression required
> 
> So to even trigger this case you have to contort your sources in ways I really can't see users trying to do.

Agreed.  I tried to create a test case to make sure that .inst values
were not annotated (but could have been because their value matched a
legitimate symbol value in the linked executable) but I could not find
any way to do this.


> Is there some user reported case driving this?

No, it is just me, trying to make the disassembler output slightly
more informative.

Cheers
   Nick
  

Patch

diff --git a/opcodes/aarch64-dis.c b/opcodes/aarch64-dis.c
index 8544ce4b6d4..2613ea9aeed 100644
--- a/opcodes/aarch64-dis.c
+++ b/opcodes/aarch64-dis.c
@@ -49,6 +49,7 @@  static enum map_type last_type;
 static int last_mapping_sym = -1;
 static bfd_vma last_stop_offset = 0;
 static bfd_vma last_mapping_addr = 0;
+static bool annotate_undefined_insns = false;
 
 /* Other options */
 static int no_aliases = 0;	/* If set disassemble as most general inst.  */
@@ -91,6 +92,18 @@  parse_aarch64_dis_option (const char *option, unsigned int len ATTRIBUTE_UNUSED)
       return;
     }
 
+  if (startswith (option, "annotate"))
+    {
+      annotate_undefined_insns = true;
+      return;
+    }
+
+  if (startswith (option, "no-annotate"))
+    {
+      annotate_undefined_insns = false;
+      return;
+    }
+
 #ifdef DEBUG_AARCH64
   if (startswith (option, "debug_dump"))
     {
@@ -4260,8 +4273,22 @@  print_insn_aarch64_word (bfd_vma pc,
 				    ".inst\t");
       (*info->fprintf_styled_func) (info->stream, dis_style_immediate,
 				    "0x%08x", word);
-      (*info->fprintf_styled_func) (info->stream, dis_style_comment_start,
-				    " ; %s", err_msg[ret]);
+      asymbol * sym = NULL;
+      /* See if this "instruction" is actually the address of something.  */
+      if (annotate_undefined_insns
+	  /* Skip static object files as symbol values have not been resolved yet.  */
+	  && info->section != NULL
+	  && info->section->owner != NULL
+	  && (info->section->owner->flags & (EXEC_P | DYNAMIC)))
+	{
+	  sym = info->symbol_at_address_func (word, info);
+	  if (sym != NULL)
+	    info->fprintf_styled_func (info->stream, dis_style_symbol,
+				       " ; [%s]", sym->name);
+	}
+      if (sym == NULL)
+	info->fprintf_styled_func (info->stream, dis_style_comment_start,
+				   " ; %s", err_msg[ret]);
       break;
     case ERR_OK:
       user_friendly_fixup (&inst);
@@ -4595,6 +4622,12 @@  with the -M switch (multiple options should be separated by commas):\n"));
   fprintf (stream, _("\n\
   notes            Do print instruction notes.\n"));
 
+  fprintf (stream, _("\n\
+  annotate         Display symbol names for undefined instructions.\n"));
+
+  fprintf (stream, _("\n\
+  no-annotate      Do not display symbol names for undefined instructions.\n"));
+
 #ifdef DEBUG_AARCH64
   fprintf (stream, _("\n\
   debug_dump         Temp switch for debug trace.\n"));
diff --git a/binutils/NEWS b/binutils/NEWS
index 29821c96d0f..5532670b4a2 100644
--- a/binutils/NEWS
+++ b/binutils/NEWS
@@ -1,5 +1,9 @@ 
 -*- text -*-
 
+* The AArch64 disassembler now accepts a command line option of "-M annotate"
+  which displays the symbol associated with undefined instructions, should
+  there be one.
+  
 * The x86 and x86_64 disassemblers now accept a command line option of
   "-M annotate-immediates" which displays the symbol associated with immediate
   values, should there be one.
diff --git a/binutils/doc/binutils.texi b/binutils/doc/binutils.texi
index 45df92394b6..198ae8244ef 100644
--- a/binutils/doc/binutils.texi
+++ b/binutils/doc/binutils.texi
@@ -2691,7 +2691,15 @@  compilers.
 For AArch64 targets this switch can be used to set whether instructions are
 disassembled as the most general instruction using the @option{-M no-aliases}
 option or whether instruction notes should be generated as comments in the
-disasssembly using @option{-M notes}.
+disassembly using @option{-M notes}.  In addition the @option{-M
+annotate} option can be used to customise the handling of undefined
+instructions when disassembling executables and shared libraries.
+Normally the disassembler will just show the hexadecimal value of the
+undefined instruction, but if annotation is enabled it will first try
+to find a symbol whose value matches the instruction's encoding.  If
+there is a match then the symbol name is shown in the trailing comment.
+This can be useful when disassembling non-code sections which may
+contain function and data addresses.
 
 For the x86, some of the options duplicate functions of the @option{-m}
 switch, but allow finer grained control.
diff --git a/binutils/testsuite/binutils-all/aarch64/aarch64.exp b/binutils/testsuite/binutils-all/aarch64/aarch64.exp
index 05edcf53203..6a100a32954 100644
--- a/binutils/testsuite/binutils-all/aarch64/aarch64.exp
+++ b/binutils/testsuite/binutils-all/aarch64/aarch64.exp
@@ -28,3 +28,34 @@  foreach t $test_list {
     verbose [file rootname $t]
     run_dump_test [file rootname $t]
 }
+
+# Test objdump -M annotate
+
+proc test_objdump_M_annotate { } {
+    global srcdir
+    global subdir
+    global OBJDUMP
+    global OBJDUMPFLAGS
+
+    set test "objdump -M annotate"
+
+    set result [target_compile "$srcdir/$subdir/objdumpM1.c $srcdir/$subdir/objdumpM2.c" tmpdir/testprog executable debug]
+    if { $result != "" } {
+	unsupported "$test (build): compile result: $result"
+	return
+    }
+
+    set got [binutils_run $OBJDUMP "$OBJDUMPFLAGS -D -M annotate tmpdir/testprog"]
+
+    # Look for something like this in the disassembly:
+    # 420020:	00400780 	.inst	0x00400780 ; [func1]
+    set want {; \[func1\]}
+
+    if [regexp $want $got] then {
+	pass $test
+    } else {
+	fail $test
+    }
+}
+
+test_objdump_M_annotate
--- /dev/null	2026-04-27 09:01:42.369431460 +0100
+++ current/binutils/testsuite/binutils-all/aarch64/objdumpM1.c	2026-04-27 15:53:03.311967477 +0100
@@ -0,0 +1,26 @@ 
+int datum = 22;
+
+extern int func1 (int);
+extern int func2 (int);
+extern int func3 (int);
+extern int func4 (int);
+extern int func5 (int);
+
+struct ptrs
+{
+  int (* fptr)(int);
+  int field;
+}
+fred [5] =
+{
+  { func1, 1 },
+  { func2, 2 },
+  { func3, 3 },
+  { func4, 4 },
+  { func5, 5 }
+};
+
+int main (int arg)
+{
+  return fred[arg].fptr (fred[arg].field * datum);
+}
--- /dev/null	2026-04-27 09:01:42.369431460 +0100
+++ current/binutils/testsuite/binutils-all/aarch64/objdumpM2.c	2026-04-27 15:53:13.969016278 +0100
@@ -0,0 +1,5 @@ 
+int func1 (int arg) { return arg * 2; }
+int func2 (int arg) { return arg * 3; }
+int func3 (int arg) { return arg * 4; }
+int func4 (int arg) { return arg * 5; }
+int func5 (int arg) { return arg * 6; }