[25/31] Ignore failure to read PC when resuming

Message ID 20221212203101.1034916-26-pedro@palves.net
State New
Series Step over thread clone and thread exit

Commit Message

Pedro Alves Dec. 12, 2022, 8:30 p.m. UTC
If GDB sets a GDB_THREAD_OPTION_EXIT option on a thread, and the
thread exits, the server reports the corresponding thread exit event,
and forgets about the thread, i.e., removes the exited thread from its
thread list.

On the GDB side, if GDB sets the GDB_THREAD_OPTION_EXIT option on a
thread, GDB delays deleting the thread from its thread list until it
sees the corresponding thread exit event, as that event needs special
handling in infrun.
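
To make that window concrete, here is a toy model, a sketch only: the `Server`, `Gdb`, `ThreadExitEvent` and `stale_in_gdb` names are hypothetical and are not GDB or GDBserver code.  The server drops the thread as soon as it queues the exit event, while GDB drops it only when it processes that event; in between, GDB still lists a thread the server no longer knows about.

```cpp
#include <cassert>
#include <deque>
#include <set>

// Toy model (hypothetical, not GDB/GDBserver code) of the window in which
// GDB's thread list still contains a thread the server has already
// forgotten.

struct ThreadExitEvent
{
  int tid;
};

struct Server
{
  std::set<int> threads;               // threads the server still tracks
  std::deque<ThreadExitEvent> pending; // exit events not yet sent to GDB

  // With the exit option set, the server queues the exit event and
  // immediately forgets the thread.
  void thread_exited (int tid)
  {
    assert (threads.count (tid) == 1);
    threads.erase (tid);
    pending.push_back ({tid});
  }
};

struct Gdb
{
  std::set<int> threads; // GDB's view, updated only as events arrive

  // GDB deletes the thread only once it handles the exit event.
  void process_event (const ThreadExitEvent &ev)
  {
    threads.erase (ev.tid);
  }
};

// True while GDB still lists TID but the server has dropped it, i.e.
// while an attempt to resume TID can fail.
bool
stale_in_gdb (const Server &server, const Gdb &gdb, int tid)
{
  return gdb.threads.count (tid) == 1 && server.threads.count (tid) == 0;
}
```

For example, after the server runs `thread_exited (2)`, `stale_in_gdb` stays true for thread 2 until GDB processes the queued event.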

When a thread disappears from the target, but it still exists on GDB's
thread list, in all-stop RSP mode, it can happen that GDB ends up
trying to resume such an already-exited thread that GDB doesn't yet
know is gone.  When that happens, against GDBserver, typically the
ongoing execution command fails with this error:

 ...
 PC register is not available
 (gdb)

At the remote protocol level, we may see e.g., this:

      [remote] Packet received: w0;p97479.978d2
    [remote] wait: exit
    [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
    [infrun] print_target_wait_results:   619641.620754.0 [Thread 619641.620754],
    [infrun] print_target_wait_results:   status->kind = THREAD_EXITED, exit_status = 0
    [infrun] handle_inferior_event: status->kind = THREAD_EXITED, exit_status = 0
    [infrun] context_switch: Switching context from 0.0.0 to 619641.620754.0
    [infrun] clear_proceed_status_thread: 619641.620754.0

GDB saw an exit event for thread 619641.620754.  After processing it,
infrun decides to resume the target again.  To do that, infrun
picks some other thread that hasn't exited yet from GDB's perspective,
switches to it, and calls keep_going.  Below, infrun happens to pick
thread p97479.97479, the leader, which has also exited, but GDB doesn't
know that yet:

...
    [remote] Sending packet: $Hgp97479.97479#75
    [remote] Packet received: OK
    [remote] Sending packet: $g#67
    [remote] Packet received: xxxxxxxxxxxxxxxxx (...snip...) [1120 bytes omitted]
    [infrun] reset: reason=handling event
    [infrun] maybe_set_commit_resumed_all_targets: not requesting commit-resumed for target remote, no resumed threads
  [infrun] fetch_inferior_event: exit
  PC register is not available
  (gdb)
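
The `$g` reply above consists of literal `x` characters, which, as far as we can tell, is how GDBserver marks register bytes whose contents are unavailable; that is what surfaces as "PC register is not available".  A minimal sketch of decoding one register from such a reply, assuming a little-endian target; `read_reg_from_g_reply` and its offset/size parameters are hypothetical and not taken from any real target description:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Value of one hex digit, or -1 if C is not a hex digit (e.g. 'x').
static int
hex_nibble (char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

// Hypothetical helper: decode SIZE bytes starting at byte OFFSET of a `g`
// reply as a little-endian integer.  Returns std::nullopt if the reply is
// too short or if any digit is not hex; in particular, 'x' digits mark
// the register as unavailable.
std::optional<std::uint64_t>
read_reg_from_g_reply (const std::string &reply, std::size_t offset,
                       std::size_t size)
{
  if (size == 0 || size > 8 || (offset + size) * 2 > reply.size ())
    return std::nullopt;

  std::uint64_t value = 0;
  for (std::size_t i = 0; i < size; ++i)
    {
      int hi = hex_nibble (reply[(offset + i) * 2]);
      int lo = hex_nibble (reply[(offset + i) * 2 + 1]);
      if (hi < 0 || lo < 0)
        return std::nullopt;
      // Little-endian: byte I supplies bits [8*I, 8*I + 8) of the value.
      value |= static_cast<std::uint64_t> (hi * 16 + lo) << (8 * i);
    }
  return value;
}
```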

The Linux backends, both in GDB and in GDBserver, already silently
ignore failures to resume, with the understanding that we'll see an
exit event soon.  The core of GDB doesn't do that yet, though.

This patch is a small step in that direction.  It swallows the error
when thrown from within resume_1.  There are likely other spots where we
will need similar treatment, but we can tackle them as we find them.
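
Stripped of GDB's internals, the approach taken here looks roughly like the following.  This is a sketch only: `error_kind`, `target_error` and `read_pc_or_zero` are hypothetical stand-ins for GDB's `enum errors`, `gdb_exception_error` and the code in resume_1.  Like the patch, it defaults the PC to 0 and swallows every error except a target-close one.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for GDB's `enum errors` and gdb_exception_error.
enum class error_kind
{
  generic,
  target_close,
};

struct target_error : std::runtime_error
{
  error_kind kind;

  target_error (error_kind k, const std::string &msg)
    : std::runtime_error (msg), kind (k)
  {}
};

// Mirrors the patch: default the PC to 0, let a target-close error
// propagate (the connection is gone), and swallow anything else on the
// assumption that the thread exited and its exit event will arrive soon.
template<typename ReadPc>
std::uint64_t
read_pc_or_zero (ReadPc read_pc, bool debug)
{
  std::uint64_t pc = 0;
  try
    {
      pc = read_pc ();
    }
  catch (const target_error &err)
    {
      if (err.kind == error_kind::target_close)
        throw;

      if (debug)
        std::fprintf (stderr, "resume: swallowing error: %s\n", err.what ());
    }
  return pc;
}
```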

After this patch, we'll see something like this instead:

    [infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [640478.640478.0] at 0x0
    [infrun] do_target_resume: resume_ptid=640478.0.0, step=0, sig=GDB_SIGNAL_0
    [remote] Sending packet: $vCont;c:p9c5de.-1#78
    [infrun] prepare_to_wait: prepare_to_wait
    [infrun] reset: reason=handling event
    [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target remote
    [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target remote
  [infrun] fetch_inferior_event: exit
  [infrun] fetch_inferior_event: enter
    [infrun] scoped_disable_commit_resumed: reason=handling event
    [infrun] random_pending_event_thread: None found.
    [remote] wait: enter
      [remote] Packet received: W0;process:9c5de
    [remote] wait: exit
    [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
    [infrun] print_target_wait_results:   640478.0.0 [process 640478],
    [infrun] print_target_wait_results:   status->kind = EXITED, exit_status = 0
    [infrun] handle_inferior_event: status->kind = EXITED, exit_status = 0
  [Inferior 1 (process 640478) exited normally]
    [infrun] stop_waiting: stop_waiting
    [infrun] reset: reason=handling event
  (gdb) [infrun] fetch_inferior_event: exit

Change-Id: I7f1c7610923435c4e98e70acc5ebe5ebbac581e2
---
 gdb/infrun.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)
  

Comments

Andrew Burgess June 10, 2023, 10:33 a.m. UTC | #1
Pedro Alves <pedro@palves.net> writes:

> If GDB sets a GDB_THREAD_OPTION_EXIT option on a thread, and the
> thread exits, the server reports the corresponding thread exit event,
> and forgets about the thread, i.e., removes the exited thread from its
> thread list.
>
> On the GDB side, if GDB sets the GDB_THREAD_OPTION_EXIT option on a
> thread, GDB delays deleting the thread from its thread list until it
> sees the corresponding thread exit event, as that event needs special
> handling in infrun.
>
> When a thread disappears from the target, but it still exists on GDB's
> thread list, in all-stop RSP mode, it can happen that GDB ends up
> trying to resume such an already-exited thread that GDB doesn't yet
> know is gone.  When that happens, against GDBserver, typically the
> ongoing execution command fails with this error:

I'm slightly confused here.  If GDB doesn't know the thread has exited,
doesn't that mean the server hasn't yet reported the exit, and so should
be holding onto the thread?

I wanted to investigate this a bit more to try and understand more about
what's going on, but I couldn't find a test that was triggering the code
added in this patch.  Do you know if there's a test I can run to see
this issue?

>
>  ...
>  PC register is not available
>  (gdb)
>
> At the remote protocol level, we may see e.g., this:
>
>       [remote] Packet received: w0;p97479.978d2
>     [remote] wait: exit
>     [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
>     [infrun] print_target_wait_results:   619641.620754.0 [Thread 619641.620754],
>     [infrun] print_target_wait_results:   status->kind = THREAD_EXITED, exit_status = 0
>     [infrun] handle_inferior_event: status->kind = THREAD_EXITED, exit_status = 0
>     [infrun] context_switch: Switching context from 0.0.0 to 619641.620754.0
>     [infrun] clear_proceed_status_thread: 619641.620754.0
>
> GDB saw an exit event for thread 619641.620754.  After processing it,
> infrun decides to resume the target again.  To do that, infrun
> picks some other thread that hasn't exited yet from GDB's perspective,
> switches to it, and calls keep_going.  Below, infrun happens to pick
> thread p97479.97479, the leader, which has also exited, but GDB doesn't
> know that yet:
>
> ...
>     [remote] Sending packet: $Hgp97479.97479#75
>     [remote] Packet received: OK
>     [remote] Sending packet: $g#67
>     [remote] Packet received: xxxxxxxxxxxxxxxxx (...snip...) [1120 bytes omitted]
>     [infrun] reset: reason=handling event
>     [infrun] maybe_set_commit_resumed_all_targets: not requesting commit-resumed for target remote, no resumed threads
>   [infrun] fetch_inferior_event: exit
>   PC register is not available
>   (gdb)
>
> The Linux backends, both in GDB and in GDBserver, already silently
> ignore failures to resume, with the understanding that we'll see an
> exit event soon.  The core of GDB doesn't do that yet, though.
>
> This patch is a small step in that direction.  It swallows the error
> when thrown from within resume_1.  There are likely other spots where we
> will need similar treatment, but we can tackle them as we find them.
>
> After this patch, we'll see something like this instead:
>
>     [infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [640478.640478.0] at 0x0
>     [infrun] do_target_resume: resume_ptid=640478.0.0, step=0, sig=GDB_SIGNAL_0
>     [remote] Sending packet: $vCont;c:p9c5de.-1#78

I'm confused by this example.  I would have expected it to start off with
the same intro as the above, that is, send the '$g#67' packet, get back
the xxxx...etc... but then do things differently.

>     [infrun] prepare_to_wait: prepare_to_wait
>     [infrun] reset: reason=handling event
>     [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target remote
>     [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target remote
>   [infrun] fetch_inferior_event: exit
>   [infrun] fetch_inferior_event: enter
>     [infrun] scoped_disable_commit_resumed: reason=handling event
>     [infrun] random_pending_event_thread: None found.
>     [remote] wait: enter
>       [remote] Packet received: W0;process:9c5de
>     [remote] wait: exit
>     [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
>     [infrun] print_target_wait_results:   640478.0.0 [process 640478],
>     [infrun] print_target_wait_results:   status->kind = EXITED, exit_status = 0
>     [infrun] handle_inferior_event: status->kind = EXITED, exit_status = 0
>   [Inferior 1 (process 640478) exited normally]
>     [infrun] stop_waiting: stop_waiting
>     [infrun] reset: reason=handling event
>   (gdb) [infrun] fetch_inferior_event: exit
>
> Change-Id: I7f1c7610923435c4e98e70acc5ebe5ebbac581e2
> ---
>  gdb/infrun.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/gdb/infrun.c b/gdb/infrun.c
> index 09391d85256..21e5aa0f50e 100644
> --- a/gdb/infrun.c
> +++ b/gdb/infrun.c
> @@ -2595,7 +2595,28 @@ resume_1 (enum gdb_signal sig)
>        step = false;
>      }
>  
> -  CORE_ADDR pc = regcache_read_pc (regcache);
> +  CORE_ADDR pc = 0;

I don't think we should be picking some arbitrary $pc value (0 in this
case) and just using that as a default.  Instead, I think it would be
better to change the type of pc to gdb::optional<CORE_ADDR>, and then
update the rest of this function to only do the $pc-relevant parts if we
have a $pc value.
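
A sketch of that suggestion, using std::optional as a stand-in for gdb::optional; the `describe_resume` function and its breakpoint check are hypothetical placeholders for the PC-dependent parts of resume_1, not GDB code:

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <string>

using core_addr = std::uint64_t;  // stand-in for GDB's CORE_ADDR

// Hypothetical: PC-dependent decisions are made only when the PC is
// known; otherwise they are skipped instead of being run against a
// made-up address.
std::string
describe_resume (std::optional<core_addr> pc,
                 const std::set<core_addr> &breakpoints)
{
  if (!pc.has_value ())
    return "PC unavailable, skipping PC-dependent checks";

  if (breakpoints.count (*pc) != 0)
    return "step over breakpoint at PC";

  return "plain continue";
}
```

With a 0 default instead, an unreadable PC in this sketch would be checked as if it were address 0; the optional keeps "PC is 0" and "PC unknown" distinguishable.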

> +  try
> +    {
> +      pc = regcache_read_pc (regcache);
> +    }
> +  catch (const gdb_exception_error &err)
> +    {
> +      /* Swallow errors as it may be that the current thread exited
> +	 and we haven't seen its exit status yet.  Let the
> +	 resumption continue and we'll collect the exit event
> +	 shortly.  */
> +      if (err.error == TARGET_CLOSE_ERROR)
> +	throw;
> +
> +      if (debug_infrun)
> +	{
> +	  string_file buf;
> +	  exception_print (&buf, err);
> +	  infrun_debug_printf ("resume: swallowing error: %s",
> +			       buf.string ().c_str ());
> +	}

I guess this is probably the best we can do without changing the remote
protocol.  My worry would be that there could be other reasons that the
read of $pc fails, which we are now just ignoring.  It looks like you
already ran into one such case with TARGET_CLOSE_ERROR, but maybe
there are others?

It almost feels like the ideal solution would invert the logic, so we
could write:

  catch (const gdb_exception_error &err)
    {
      /* I just invent a new error type here...  */
      if (err.error != INFERIOR_EXITED_ERROR)
        throw;

      // ... etc ...
    }

To use something like this, we could have the H packet send back
something other than "OK" when GDB asks to switch to a thread that has
already exited; maybe sending back the stop reply could be made to work?
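
The inverted (allow-list) version might look roughly like this.  `error_kind::inferior_exited` is invented, just like INFERIOR_EXITED_ERROR above, and the other names are hypothetical too.  Only the expected "thread already exited" error is swallowed; anything else, including a not-available error with some other cause or a target close, still propagates:

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical error kinds; inferior_exited does not correspond to any
// existing GDB error code.
enum class error_kind
{
  generic,
  not_available,
  target_close,
  inferior_exited,
};

struct target_error : std::runtime_error
{
  error_kind kind;

  target_error (error_kind k, const std::string &msg)
    : std::runtime_error (msg), kind (k)
  {}
};

// Allow-list: swallow only the error that says the thread is gone, and
// report that as "no PC"; every other failure is rethrown.
template<typename ReadPc>
std::optional<std::uint64_t>
read_pc_unless_exited (ReadPc read_pc)
{
  try
    {
      return read_pc ();
    }
  catch (const target_error &err)
    {
      if (err.kind != error_kind::inferior_exited)
        throw;
      return std::nullopt;
    }
}
```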

I say all that really just to check if you agree or not.  I think for
now I'd be happy to go with what you present here, I think the gains
this series brings to GDB are worth some rough edges that we might want
to address in the future.

Would love to hear your thoughts,

Thanks,
Andrew

> +    }
>  
>    infrun_debug_printf ("step=%d, signal=%s, trap_expected=%d, "
>  		       "current thread [%s] at %s",
> -- 
> 2.36.0
  
Pedro Alves Nov. 13, 2023, 2:13 p.m. UTC | #2
On 2023-06-10 11:33, Andrew Burgess wrote:
> Pedro Alves <pedro@palves.net> writes:
> 
>> If GDB sets a GDB_THREAD_OPTION_EXIT option on a thread, and the
>> thread exits, the server reports the corresponding thread exit event,
>> and forgets about the thread, i.e., removes the exited thread from its
>> thread list.
>>
>> On the GDB side, if GDB sets the GDB_THREAD_OPTION_EXIT option on a
>> thread, GDB delays deleting the thread from its thread list until it
>> sees the corresponding thread exit event, as that event needs special
>> handling in infrun.
>>
>> When a thread disappears from the target, but it still exists on GDB's
>> thread list, in all-stop RSP mode, it can happen that GDB ends up
>> trying to resume such an already-exited thread that GDB doesn't yet
>> know is gone.  When that happens, against GDBserver, typically the
>> ongoing execution command fails with this error:
> 
> I'm slightly confused here.  If GDB doesn't know the thread has exited,
> doesn't that mean the server hasn't yet reported the exit, and so should
> be holding onto the thread?
> 
> I wanted to investigate this a bit more to try and understand more about
> what's going on, but I couldn't find a test that was triggering the code
> added in this patch.  Do you know if there's a test I can run to see
> this issue?

I think there was an existing testcase that would sometimes fail due to this
problem, but it looks like I didn't write that down anywhere, and now I don't
remember... :-/  Sorry about this.  I wasn't able to reproduce the
problem in a few test runs, so I will drop this patch from the series
for now until I find a better rationale, and we can discuss how to
fix it then.

Sorry again, and many thanks for the review and ideas.

Pedro Alves
  

Patch

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 09391d85256..21e5aa0f50e 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -2595,7 +2595,28 @@  resume_1 (enum gdb_signal sig)
       step = false;
     }
 
-  CORE_ADDR pc = regcache_read_pc (regcache);
+  CORE_ADDR pc = 0;
+  try
+    {
+      pc = regcache_read_pc (regcache);
+    }
+  catch (const gdb_exception_error &err)
+    {
+      /* Swallow errors as it may be that the current thread exited
+	 and we haven't seen its exit status yet.  Let the
+	 resumption continue and we'll collect the exit event
+	 shortly.  */
+      if (err.error == TARGET_CLOSE_ERROR)
+	throw;
+
+      if (debug_infrun)
+	{
+	  string_file buf;
+	  exception_print (&buf, err);
+	  infrun_debug_printf ("resume: swallowing error: %s",
+			       buf.string ().c_str ());
+	}
+    }
 
   infrun_debug_printf ("step=%d, signal=%s, trap_expected=%d, "
 		       "current thread [%s] at %s",