[v3] Fix missing watchpoint hits/endless loop

Message ID 20240105150734.5287-1-tdevries@suse.de
State Dropped
Series [v3] Fix missing watchpoint hits/endless loop

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Tom de Vries Jan. 5, 2024, 3:07 p.m. UTC
  From: Luis Machado <luis.machado@linaro.org>

Updates on v3:

- Rebased to current trunk (v2 applied cleanly on gdb-12-branch).
- Should also work on fbsd, though I haven't tested it.
- Dropped gdbserver part (should be possible, I'm just not sure yet where to
  move aarch64_stopped_data_address such that it's accessible from gdbserver).
- No attempt made to address any review remarks on v2.
  https://inbox.sourceware.org/gdb-patches/D2AC3C31-5EEE-4BCA-8D75-7CEDFA75F540@arm.com/

Updates on v2:

- Handle more types of load/store instructions, including a catch all
  for SVE load/store instructions.

--

I ran into a situation where a hardware watchpoint hit is not detected
correctly, misleading GDB into thinking it was a delayed breakpoint hit.

The problem is that hardware watchpoints are not skippable on AArch64, which
makes GDB loop endlessly trying to run past the instruction.

The most obvious case where this happens is when a load/store pair
instruction accesses 16 bytes of memory.

Suppose we have a stp instruction that will write a couple of 64-bit registers
to address 0x10 (stp x3, x4, [x2]). It will write data from 0x10 up to 0x20.

Now suppose a write watchpoint is created to monitor memory address 0x18,
which is the start of the second register write. It can have whatever length,
but let's assume it has length 8.

When we execute that stp instruction, it will trap and the reported stopped
data address from the kernel will be 0x10 (the beginning of the memory range
accessed by the instruction).
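
To make this concrete, here is a minimal reproducer sketch (an illustration
only; it is not part of this patch or of the existing testsuite):

#include <stdint.h>

int
main (void)
{
  /* 16-byte aligned buffer; in GDB: "watch buf[1]" (the upper 8 bytes).  */
  static uint64_t buf[2] __attribute__ ((aligned (16)));
  uint64_t val = 42;

  /* A single 16-byte store-pair access covering both elements.  The kernel
     reports the trap address as &buf[0], while the watchpoint covers
     &buf[1] up to &buf[1] + 8.  */
  asm volatile ("stp %1, %1, [%0]" : : "r" (buf), "r" (val) : "memory");

  return 0;
}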

The current code won't be able to detect a valid trigger because it assumes an
alignment of 8 bytes for the watchpoint address. Forcing that kind of alignment
won't be enough to detect a 16-byte access if the trap address falls outside of
the 8-byte alignment window. We need to know how many bytes the instruction
will access, but we won't have that data unless we go parsing instructions.

Another issue with the current code seems to be that it assumes the accesses
will always be 8 bytes in size, since it wants to align the watchpoint address
to that particular boundary. This leads to problems when we have unaligned
addresses and unaligned watchpoints.

For example, suppose we have a str instruction storing 8 bytes to memory
address 0xf. Now suppose we have a write watchpoint at address 0x16,
monitoring 8 bytes.

The trap address will be 0xf, but forcing 0x16 to 8-byte alignment yields
0x10, and so GDB doesn't think this is a watchpoint hit.
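
For reference, the overlap check introduced by this patch handles both
examples.  Below is a standalone sketch (not the patch code itself) using the
numbers from the str example:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Overlap of [mem_addr, mem_addr + mem_len) and
   [watch_addr, watch_addr + watch_len), same rule as the patch's
   hw_watch_regions_overlap.  */
static int
regions_overlap (uint64_t mem_addr, size_t mem_len,
                 uint64_t watch_addr, size_t watch_len)
{
  return mem_addr < watch_addr + watch_len
         && watch_addr < mem_addr + mem_len;
}

int
main (void)
{
  /* str writes [0xf, 0x17); the watchpoint covers [0x16, 0x1e).  */
  printf ("%d\n", regions_overlap (0xf, 8, 0x16, 8));  /* 1: hit detected.  */
  /* The old check compared addr_trap (0xf) against align_down (0x16, 8)
     == 0x10, which fails, so the hit used to be missed.  */
  return 0;
}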

I believe you can trigger the same problem with smaller memory accesses,
except for one that accesses a single byte.

In order to cover most of the cases correctly, we parse the load/store
instructions and detect how many bytes they're accessing. That will give us
enough information to tell if a hardware watchpoint triggered or not.
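
As a concrete example (hand-decoded, so treat the bit values as illustrative):
for the stp x3, x4, [x2] case above, bits 31:30 of the encoding are 0b10 and
the load/store pair pattern matches, so the access size works out to 16 bytes,
which is exactly the [x2, x2 + 0x10) range the instruction writes.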

The SVE load/store support is only a placeholder for now, as we need to
parse the instructions and use the variable length to determine the memory
access size (planned for the future).

The patch also fixes these two failures in the testsuite:

FAIL: gdb.base/watchpoint-unaligned.exp: continue (timeout)
FAIL: gdb.base/watchpoint-unaligned.exp: size8twice write

Regression tested (v2) on aarch64-linux Ubuntu/20.04 and Ubuntu/18.04.
Regression tested (v3) on aarch64-linux Fedora 39.

I also used a very slow aarch64 hardware watchpoint stress test to validate
the v2 patch, but I don't think that particular test should be included in the
testsuite. It takes quite a while to run (20+ minutes), and shouldn't be
required unless we know there are problems with hardware watchpoint handling.

PR tdep/29423
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29423
---
 gdb/aarch64-fbsd-nat.c  |   7 +-
 gdb/aarch64-linux-nat.c |   7 +-
 gdb/aarch64-nat.c       | 268 ++++++++++++++++++++++++++++++++++++++--
 gdb/aarch64-nat.h       |   3 +-
 4 files changed, 273 insertions(+), 12 deletions(-)


base-commit: e78337d5780c9d837e186c22c11eb8f9ed5bac62
  

Comments

Luis Machado Jan. 8, 2024, 12:27 p.m. UTC | #1
Hi Tom,

Thanks for refreshing this patch.

As I hinted on IRC, back then I ended up deciding not to pursue this approach further. Though it does
address some of the potential failures, it is a bit convoluted, poses a fairly significant
maintenance burden (we need to explicitly identify instructions, so we will always be behind) and carries
the risk of producing false positives.

For SVE/SME, there are also predicated accesses, which means an instruction may read/write a discontiguous range
of bytes depending on the predicate mask. That is trickier to handle.
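
(As a hand-written illustration, not taken from the patch: a predicated SVE load such as
"ld1d { z0.d }, p0/z, [x0]" only accesses the doublewords whose lanes are active in p0, so with an
arbitrary predicate the bytes actually touched need not even be contiguous.)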

The nature of hardware watchpoint detection on aarch64 means the triggering behavior is dependent on a particular
CPU spec, in a way that is hard for userspace to predict reliably. Hopefully in the future this situation will
improve for userspace. The kernel folks are aware of it.

Meanwhile, given the above restrictions (and potentially restrictions of other architectures), we could teach gdb
about imprecise hardware watchpoint triggers: something that would let gdb know that a hardware watchpoint has
triggered, even though we don't know exactly for what range or what happened.

At least this would prevent certain situations where gdb is simply stuck trying to skip over unskippable
hardware watchpoints, because it can't tell what happened and ends up with a generic SIGTRAP.

Thoughts?

Tom de Vries Feb. 20, 2024, 9:07 p.m. UTC | #2
On 1/8/24 13:27, Luis Machado wrote:
> Hi Tom,
> 
> Thanks for refreshing this patch.
> 
> As I hinted on IRC, back then I ended up deciding not to pursue this approach further. Though it does
> address some of the potential failures, it is a bit convoluted and poses a fairly significant
> maintenance burden (we need to explicitly identify instructions, so we will always be behind), including
> the risk of producing false positives.
> 
> For SVE/SME, there are also predicated accesses, which means an instruction may read/write a discontiguous range
> of bytes depending on the predicate mask. That is trickier to handle.
> 
> The nature of hardware watchpoint detection on aarch64 means the triggering behavior is dependent on a particular
> CPU spec, in a way that is hard for userspace to predict reliably. Hopefully in the future this situation will
> improve for userspace. The kernel folks are aware of it.
> 

Hi Luis,

ack, that makes sense.

> Meanwhile, given the above restrictions (and potentially restrictions of other architectures), we could teach gdb
> about imprecise hardware watchpoint triggers. Something that would allow gdb to tell userspace that we know
> a hardware watchpoint has triggered, but we don't know exactly for what range and what happened.
> 
> At least this would prevent certain situations where gdb is simply stuck trying to skip over unskippable
> hardware watchpoints, because it can't tell what happened and ended up with a generic SIGTRAP.
> 
> Thoughts?
> 

I've submitted a two-patch series here
(https://sourceware.org/pipermail/gdb-patches/2024-February/206706.html),
where the second patch does something a bit like this: the aarch64
tdep tells gdb that we know that a hardware watchpoint has triggered,
but doesn't know exactly for what range, and since it's a regular write
watchpoint, gdb handles this fine.

With this patch series, I'm getting rid of the last consistent FAILs I 
see on m1 (not counting libcc1 stuff).

Thanks,
- Tom

Luis Machado March 6, 2024, 1:36 p.m. UTC | #3
Hi Tom,

On 2/20/24 21:07, Tom de Vries wrote:
> Hi Luis,
> 
> ack, that makes sense.
> 
> I've submitted a two-patch series here ( https://sourceware.org/pipermail/gdb-patches/2024-February/206706.html ), where the second patch does something a bit like this: the aarch64 tdep tells gdb that we know that a hardware watchpoint has triggered, but doesn't know exactly for what range, and since it's a regular write watchpoint, gdb handles this fine.
> 
> With this patch series, I'm getting rid of the last consistent FAILs I see on m1 (not counting libcc1 stuff).

Sorry it took me a bit to get to this. Let me take a look at those two patches.
  

Patch

diff --git a/gdb/aarch64-fbsd-nat.c b/gdb/aarch64-fbsd-nat.c
index 38fb093f139..39162fe3bb0 100644
--- a/gdb/aarch64-fbsd-nat.c
+++ b/gdb/aarch64-fbsd-nat.c
@@ -154,9 +154,14 @@  aarch64_fbsd_nat_target::stopped_data_address (CORE_ADDR *addr_p)
 
   const CORE_ADDR addr_trap = (CORE_ADDR) siginfo.si_addr;
 
+  struct regcache *regs = get_thread_regcache (this, inferior_ptid);
+  CORE_ADDR trigger_pc = regcache_read_pc (regs);
+  uint32_t insn;
+  read_memory (trigger_pc, (gdb_byte *) &insn, 4);
+
   /* Check if the address matches any watched address.  */
   state = aarch64_get_debug_reg_state (inferior_ptid.pid ());
-  return aarch64_stopped_data_address (state, addr_trap, addr_p);
+  return aarch64_stopped_data_address (state, insn, addr_trap, addr_p);
 }
 
 /* Implement the "stopped_by_watchpoint" target_ops method.  */
diff --git a/gdb/aarch64-linux-nat.c b/gdb/aarch64-linux-nat.c
index 5b4e3c2bde1..5494bb07517 100644
--- a/gdb/aarch64-linux-nat.c
+++ b/gdb/aarch64-linux-nat.c
@@ -962,9 +962,14 @@  aarch64_linux_nat_target::stopped_data_address (CORE_ADDR *addr_p)
   const CORE_ADDR addr_trap
     = gdbarch_remove_non_address_bits (gdbarch, (CORE_ADDR) siginfo.si_addr);
 
+  struct regcache *regs = get_thread_regcache (this, inferior_ptid);
+  CORE_ADDR trigger_pc = regcache_read_pc (regs);
+  uint32_t insn;
+  read_memory (trigger_pc, (gdb_byte *) &insn, 4);
+
   /* Check if the address matches any watched address.  */
   state = aarch64_get_debug_reg_state (inferior_ptid.pid ());
-  return aarch64_stopped_data_address (state, addr_trap, addr_p);
+  return aarch64_stopped_data_address (state, insn, addr_trap, addr_p);
 }
 
 /* Implement the "stopped_by_watchpoint" target_ops method.  */
diff --git a/gdb/aarch64-nat.c b/gdb/aarch64-nat.c
index ee8c5a1e21d..928bb70d644 100644
--- a/gdb/aarch64-nat.c
+++ b/gdb/aarch64-nat.c
@@ -22,6 +22,7 @@ 
 #include "inferior.h"
 #include "cli/cli-cmds.h"
 #include "aarch64-nat.h"
+#include "arch/aarch64-insn.h"
 
 #include <unordered_map>
 
@@ -225,27 +226,275 @@  aarch64_remove_watchpoint (CORE_ADDR addr, int len, enum target_hw_bp_type type,
   return ret;
 }
 
-/* See aarch64-nat.h.  */
+/* Macros to distinguish between various types of load/store instructions.  */
+
+/* Regular and Advanced SIMD load/store instructions.  */
+#define GENERAL_LOAD_STORE_P(insn) (bit (insn, 25) == 0 && bit (insn, 27) == 1)
+
+/* SVE load/store instruction.  */
+#define SVE_LOAD_STORE_P(insn) (bits (insn, 25, 28) == 0x2		      \
+				&& (bits (insn, 29, 31) == 0x4		      \
+				    || bits (insn, 29, 31) == 0x5	      \
+				    || bits (insn, 29, 31) == 0x6	      \
+				    || (bits (insn, 29, 31) == 0x7	      \
+					&& ((bit (insn, 15) == 0x0	      \
+					     && (bit (insn, 13) == 0x0	      \
+						 || bit (insn, 13) == 0x1))   \
+					     || (bit (insn, 15) == 0x1	      \
+						 && bit (insn, 13) == 0x0)    \
+					     || bits (insn, 13, 15) == 0x6    \
+					     || bits (insn, 13, 15) == 0x7))))
+
+/* Any load/store instruction (regular, Advanced SIMD or SVE).  */
+#define LOAD_STORE_P(insn) (GENERAL_LOAD_STORE_P (insn)			      \
+			    || SVE_LOAD_STORE_P (insn))
+
+/* Load/Store pair (regular or vector).  */
+#define LOAD_STORE_PAIR_P(insn) (bit (insn, 28) == 0 && bit (insn, 29) == 1)
+
+#define COMPARE_SWAP_PAIR_P(insn) (bits (insn, 30, 31) == 0		      \
+				   && bits (insn, 28, 29) == 0		      \
+				   && bit (insn, 26) == 0		      \
+				   && bits (insn, 23, 24) == 0		      \
+				   && bit (insn, 11) == 1)
+#define LOAD_STORE_EXCLUSIVE_PAIR_P(insn) (bit (insn, 31) == 1		      \
+					   && bits (insn, 28, 29) == 0	      \
+					   && bit (insn, 26) == 0	      \
+					   && bits (insn, 23, 24) == 0	      \
+					   && bit (insn, 11) == 1)
+
+#define LOAD_STORE_LITERAL_P(insn) (bit (insn, 28) == 1			      \
+				    && bit (insn, 29) == 0		      \
+				    && bit (insn, 24) == 0)
+
+/* Vector Load/Store multiple structures.  */
+#define LOAD_STORE_MS(insn) (bits (insn, 28, 29) == 0x0			      \
+			     && bit (insn, 31) == 0x0			      \
+			     && bit (insn, 26) == 0x1			      \
+			     && ((bits (insn, 23, 24) == 0x0		      \
+				  && bits (insn, 16, 21) == 0x0)	      \
+				 || (bits (insn, 23, 24) == 0x1		      \
+				     && bit (insn, 21) == 0x0)))
+
+/* Vector Load/Store single structure.  */
+#define LOAD_STORE_SS(insn) (bits (insn, 28, 29) == 0x0			      \
+			     && bit (insn, 31) == 0x0			      \
+			     && bit (insn, 26) == 0x1			      \
+			     && ((bits (insn, 23, 24) == 0x2		      \
+				  && bits (insn, 16, 20) == 0x0)	      \
+				 || (bits (insn, 23, 24) == 0x3)))
+
+/* Assuming INSN is a load/store instruction, return the size of the
+   memory access.  The patterns are documented in the ARM Architecture
+   Reference Manual - Index by encoding.  */
+
+static unsigned int
+get_load_store_access_size (CORE_ADDR insn)
+{
+  if (SVE_LOAD_STORE_P (insn))
+    {
+      /* SVE load/store instructions.  */
+
+      /* This is not always correct for SVE instructions, but it is a reasonable
+	 default for now.  Calculating the exact memory access size for SVE
+	 load/store instructions requires parsing instructions and evaluating
+	 the vector length.  */
+      return 16;
+    }
+
+  /* Non-SVE instructions.  */
+
+  bool vector = (bit (insn, 26) == 1);
+  bool ldst_pair = LOAD_STORE_PAIR_P (insn);
+
+  /* Is this an Advanced SIMD load/store instruction?  */
+  if (vector)
+    {
+      unsigned int size = bits (insn, 30, 31);
+
+      if (LOAD_STORE_LITERAL_P (insn))
+	{
+	  /* Advanced SIMD load/store literal */
+	  /* Sizes 4, 8 and 16 bytes.  */
+	  return 4 << size;
+	}
+      else if (LOAD_STORE_MS (insn))
+	{
+	  size = 8 << bit (insn, 30);
+
+	  unsigned char opcode = bits (insn, 12, 15);
+
+	  if (opcode == 0x0 || opcode == 0x2)
+	    size *= 4;
+	  else if (opcode == 0x4 || opcode == 0x6)
+	    size *= 3;
+	  else if (opcode == 0x8 || opcode == 0xa)
+	    size *= 2;
+
+	  return size;
+	}
+      else if (LOAD_STORE_SS (insn))
+	{
+	  size = 8 << bit (insn, 30);
+	  return size;
+	}
+      /* Advanced SIMD load/store instructions.  */
+      else if (ldst_pair)
+	{
+	  /* Advanced SIMD load/store pair.  */
+	  /* Sizes 8, 16 and 32 bytes.  */
+	  return (8 << size);
+	}
+      else
+	{
+	  /* Regular Advanced SIMD load/store instructions.  */
+	  size = size | (bit (insn, 23) << 2);
+	  return 1 << size;
+	}
+    }
+
+  /* This must be a regular GPR load/store instruction.  */
+
+  unsigned int size = bits (insn, 30, 31);
+  bool cs_pair = COMPARE_SWAP_PAIR_P (insn);
+  bool ldstx_pair = LOAD_STORE_EXCLUSIVE_PAIR_P (insn);
+
+  if (ldst_pair)
+    {
+      /* Load/Store pair.  */
+      if (size == 0x1)
+	{
+	  if (bit (insn, 22) == 0)
+	    /* STGP - 16 bytes.  */
+	    return 16;
+	  else
+	    /* LDPSW - 8 bytes.  */
+	    return 8;
+	}
+      /* Other Load/Store pair.  */
+      return (size == 0)? 8 : 16;
+    }
+  else if (cs_pair || ldstx_pair)
+    {
+      /* Compare Swap Pair or Load Store Exclusive Pair.  */
+      /* Sizes 8 and 16 bytes.  */
+      size = bit (insn, 30);
+      return (8 << size);
+    }
+  else if (LOAD_STORE_LITERAL_P (insn))
+    {
+      /* Load/Store literal.  */
+      /* Sizes between 4 and 8 bytes.  */
+      if (size == 0x2)
+	return 4;
+
+      return 4 << size;
+    }
+  else
+    {
+      /* General load/store instructions, with memory access sizes between
+	 1 and 8 bytes.  */
+      return (1 << size);
+    }
+}
+
+/* Return true if the regions [mem_addr, mem_addr + mem_len) and
+   [watch_addr, watch_addr + watch_len) overlap.  False otherwise.  */
+
+static bool
+hw_watch_regions_overlap (CORE_ADDR mem_addr, size_t mem_len,
+			  CORE_ADDR watch_addr, size_t watch_len)
+{
+  /* Quick check for non-overlapping case.  */
+  if (watch_addr >= (mem_addr + mem_len)
+      || mem_addr >= (watch_addr + watch_len))
+    return false;
+
+  CORE_ADDR start = std::max (mem_addr, watch_addr);
+  CORE_ADDR end = std::min (mem_addr + mem_len, watch_addr + watch_len);
+
+  return ((end - start) > 0);
+}
+
+/* Check if a hardware watchpoint has triggered.  If a trigger is detected,
+   return true and update ADDR_P with the stopped data address.
+   Otherwise return false.
+
+   STATE is the debug register's state, INSN is the instruction the inferior
+   stopped at and ADDR_TRAP is the reported stopped data address.  */
 
 bool
 aarch64_stopped_data_address (const struct aarch64_debug_reg_state *state,
-			      CORE_ADDR addr_trap, CORE_ADDR *addr_p)
+			      CORE_ADDR insn, CORE_ADDR addr_trap,
+			      CORE_ADDR *addr_p)
 {
-  int i;
+  /* There are 6 variations of watchpoint range and memory access
+     range positioning:
+
+     - W is the byte in the watchpoint range only.
+
+     - B is the byte in the memory access range only.
+
+     - O is the byte in the overlapping region of the watchpoint range and
+       the memory access range.
+
+     1 - Non-overlapping, no triggers.
+
+     [WWWWWWWW]...[BBBBBBBB]
+
+     2 - Non-overlapping, no triggers.
+
+     [BBBBBBBB]...[WWWWWWWW]
+
+     3 - Overlapping partially, triggers.
+
+     [BBBBOOOOWWWW]
+
+     4 - Overlapping partially, triggers.
+
+     [WWWWOOOOBBBB]
+
+     5 - Memory access contained in watchpoint range, triggers.
+
+     [WWWWOOOOOOOOWWWW]
+
+     6 - Memory access containing watchpoint range, triggers.
+
+     [BBBBOOOOOOOOBBBB]
+  */
+
+  /* If there are no load/store instructions at PC, this must be a hardware
+     breakpoint hit.  Return early.
+
+     If a hardware breakpoint is placed at a PC containing a load/store
+     instruction, we will go through the memory access size check anyway, but
+     will eventually figure out there are no watchpoints matching such an
+     address.
+
+     There is one corner case though, which is having a hardware breakpoint and
+     a hardware watchpoint at PC, when PC contains a load/store
+     instruction.  That is an ambiguous case that is hard to differentiate.
+
+     There's not much we can do about that unless the kernel sends us enough
+     information to tell them apart.  */
+  if (!LOAD_STORE_P (insn))
+    return false;
+
+  /* Get the memory access size for the instruction at PC.  */
+  unsigned int memory_access_size = get_load_store_access_size (insn);
 
-  for (i = aarch64_num_wp_regs - 1; i >= 0; --i)
+  for (int i = aarch64_num_wp_regs - 1; i >= 0; --i)
     {
       const unsigned int offset
 	= aarch64_watchpoint_offset (state->dr_ctrl_wp[i]);
       const unsigned int len = aarch64_watchpoint_length (state->dr_ctrl_wp[i]);
       const CORE_ADDR addr_watch = state->dr_addr_wp[i] + offset;
-      const CORE_ADDR addr_watch_aligned = align_down (state->dr_addr_wp[i], 8);
       const CORE_ADDR addr_orig = state->dr_addr_orig_wp[i];
 
-      if (state->dr_ref_count_wp[i]
-	  && DR_CONTROL_ENABLED (state->dr_ctrl_wp[i])
-	  && addr_trap >= addr_watch_aligned
-	  && addr_trap < addr_watch + len)
+      if ((state->dr_ref_count_wp[i]
+	  && DR_CONTROL_ENABLED (state->dr_ctrl_wp[i]))
+	  && hw_watch_regions_overlap (addr_trap, memory_access_size,
+				       addr_watch, len))
 	{
 	  /* ADDR_TRAP reports the first address of the memory range
 	     accessed by the CPU, regardless of what was the memory
@@ -270,6 +519,7 @@  aarch64_stopped_data_address (const struct aarch64_debug_reg_state *state,
 	}
     }
 
+  /* No hardware watchpoint hits detected.  */
   return false;
 }
 
diff --git a/gdb/aarch64-nat.h b/gdb/aarch64-nat.h
index fee6bda2577..bc0e6297f40 100644
--- a/gdb/aarch64-nat.h
+++ b/gdb/aarch64-nat.h
@@ -51,7 +51,8 @@  void aarch64_remove_debug_reg_state (pid_t pid);
    *ADDR_P.  */
 
 bool aarch64_stopped_data_address (const struct aarch64_debug_reg_state *state,
-				   CORE_ADDR addr_trap, CORE_ADDR *addr_p);
+				   CORE_ADDR insn, CORE_ADDR addr_trap,
+				   CORE_ADDR *addr_p);
 
 /* Helper functions used by aarch64_nat_target below.  See their
    definitions.  */