[6/8] Fix handling of vanishing threads that were stepping/stopping

Message ID 20231214202238.1065676-7-pedro@palves.net
State New
Headers
Series Step over thread exit improvements/fixes + AMD GPU |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Pedro Alves Dec. 14, 2023, 8:22 p.m. UTC
  Downstream, AMD is carrying a testcase
(gdb.rocm/continue-over-kernel-exit.exp) that exposes a couple issues
with the amd-dbgapi target's handling of exited threads.  The test
can't be added upstream yet, unfortunately, due to dependency on DWARF
extensions that can't be upstreamed yet.  However, it can be found on
the mailing list on the same series as this patch.

The test spawns a kernel with a number of waves.  The waves do nothing
but exit.  There is a breakpoint on the s_endpgm instruction.  Once
that breakpoint is hit, the test issues a "continue" command.  We
should see one breakpoint hit per wave, and then the whole program
exiting.  We do see that, however we also see this:

 [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
 [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
 *repeat for other waves*
 ...
 [Thread 0x7ffff626f640 (LWP 3048491) exited]
 [Thread 0x7fffeb7ff640 (LWP 3048488) exited]
 [Inferior 1 (process 3048475) exited normally]

That "New AMDGPU Wave" output comes from infrun.c itself adding the
thread to the GDB thread list, because it got an event for a thread
not on the thread list yet.  The output shows "?"s instead of proper
coordinates, because the event was a TARGET_WAITKIND_THREAD_EXITED,
i.e., the wave was already gone when infrun.c added the thread to the
thread list.

That shouldn't ever happen for the amd-dbgapi target, threads should
only ever be added by the backend.

Note "New AMDGPU Wave ?:?:?:1" is for wave 1.  What happened was that
wave 1 terminated previously, and a previous call to
amd_dbgapi_target::update_thread_list() noticed the wave had vanished
and removed it from the GDB thread list.  However, because the wave
was stepping when it terminated (due to the displaced step over the
s_endpgm) instruction, it is guaranteed that the amd-dbgapi library
queues a WAVE_COMMAND_TERMINATED event for the exit.

When we process that WAVE_COMMAND_TERMINATED event, in
amd-dbgapi-target.c:process_one_event, we return it to the core as a
TARGET_WAITKIND_THREAD_EXITED event:

 static void
 process_one_event (amd_dbgapi_event_id_t event_id,
		    amd_dbgapi_event_kind_t event_kind)
 {
 ...
	 if (status == AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID
	     && event_kind == AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED)
	   ws.set_thread_exited (0);
 ...
 }

Recall the wave is already gone from the GDB thread list.  So when GDB
sees that TARGET_WAITKIND_THREAD_EXITED event for a thread it doesn't
know about, it adds the thread to the thread list, resulting in that:

 [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]

and then, because it was a TARGET_WAITKIND_THREAD_EXITED event, GDB
marks the thread exited right afterwards:

 [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]

The fix is to make amd_dbgapi_target::update_thread_list() _not_
delete vanishing waves iff they were stepping or in progress of being
stopped.  These two cases are the ones dbgapi guarantees will result
in a WAVE_COMMAND_TERMINATED event if the wave terminates:

  /**
   * A command for a wave was not able to complete because the wave has
   * terminated.
   *
   * Commands that can result in this event are ::amd_dbgapi_wave_stop and
   * ::amd_dbgapi_wave_resume in single step mode.  Since the wave terminated
   * before stopping, this event will be reported instead of
   * ::AMD_DBGAPI_EVENT_KIND_WAVE_STOP.
   *
   * The wave that terminated is available by the ::AMD_DBGAPI_EVENT_INFO_WAVE
   * query.  However, the wave will be invalid since it has already terminated.
   * It is the client's responsibility to know what command was being performed
   * and was unable to complete due to the wave terminating.
   */
  AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED = 2,

As the comment says, it's GDB's responsability to know whether the
wave was stepping or being stopped.  Since we now have a wave_info map
with one entry for each wave, that seems like the place to store that
information.  However, I still decided to put all the coordinate
information in its own structure.  I.e., basically renamed the
existing wave_info to wave_coordinates, and then added a new wave_info
structure that holds the new state, plus a wave_coordinates object.
This seemer cleaner as there are places where we only need to
instantiate a wave_coordinates object.

There's an extra twist.  The testcase also exercises stopping at a new
kernel right after the first kernel fully exits.  In that scenario, we
were hitting this assertion after the first kernel fully exits and the
hit of the breakpoint at the second kernel is handled:

 [amd-dbgapi] process_event_queue: Pulled event from dbgapi: event_id.handle = 26, event_kind = WAVE_STOP
 [amd-dbgapi-lib] suspending queue_3, queue_2, queue_1 (refresh wave list)
 ../../src/gdb/amd-dbgapi-target.c:1625: internal-error: amd_dbgapi_thread_deleted: Assertion `it != info->wave_info_map.end ()' failed.
 A problem internal to GDB has been detected,
 further debugging may prove unreliable.

This is the exact same problem as above, just a different
manifestation.  In this scenario, we end up in update_thread_list
successfully deleting the exited thread (because it was no longer the
current thread) that was incorrectly added by infrun.c.  Because it
was added by infrun.c and not by amd-dbgapi-target.c:add_gpu_thread,
it doesn't have an entry in the wave_info map, so
amd_dbgapi_thread_deleted trips on this assertion:

      gdb_assert (it != info->wave_info_map.end ());

here:

  ...
  -> stop_all_threads
   -> update_thread_list
    -> target_update_thread_list
     -> amd_dbgapi_target::update_thread_list
      -> thread_db_target::update_thread_list
       -> linux_nat_target::update_thread_list
	-> delete_exited_threads
	 -> delete_thread
	  -> delete_thread_1
	   -> gdb::observers::observable<thread_info*>::notify
	    -> amd_dbgapi_thread_deleted
	     -> internal_error_loc

The testcase thus tries both running to exit after the first kernel
exits, and running to a breakpoint in a second kernel after the first
kernel exits.

Change-Id: I43a66f060c35aad1fe0d9ff022ce2afd0537f028
---
 gdb/amd-dbgapi-target.c | 197 ++++++++++++++++++++++++++++++----------
 1 file changed, 149 insertions(+), 48 deletions(-)
  

Comments

Lancelot SIX Dec. 15, 2023, 10:51 a.m. UTC | #1
Hi Pedro,

I have a minor comment below.

On Thu, Dec 14, 2023 at 08:22:36PM +0000, Pedro Alves wrote:
> Downstream, AMD is carrying a testcase
> (gdb.rocm/continue-over-kernel-exit.exp) that exposes a couple issues
> with the amd-dbgapi target's handling of exited threads.  The test
> can't be added upstream yet, unfortunately, due to dependency on DWARF
> extensions that can't be upstreamed yet.  However, it can be found on
> the mailing list on the same series as this patch.
> 
> The test spawns a kernel with a number of waves.  The waves do nothing
> but exit.  There is a breakpoint on the s_endpgm instruction.  Once
> that breakpoint is hit, the test issues a "continue" command.  We
> should see one breakpoint hit per wave, and then the whole program
> exiting.  We do see that, however we also see this:
> 
>  [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
>  [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
>  *repeat for other waves*
>  ...
>  [Thread 0x7ffff626f640 (LWP 3048491) exited]
>  [Thread 0x7fffeb7ff640 (LWP 3048488) exited]
>  [Inferior 1 (process 3048475) exited normally]
> 
> That "New AMDGPU Wave" output comes from infrun.c itself adding the
> thread to the GDB thread list, because it got an event for a thread
> not on the thread list yet.  The output shows "?"s instead of proper
> coordinates, because the event was a TARGET_WAITKIND_THREAD_EXITED,
> i.e., the wave was already gone when infrun.c added the thread to the
> thread list.
> 
> That shouldn't ever happen for the amd-dbgapi target, threads should
> only ever be added by the backend.
> 
> Note "New AMDGPU Wave ?:?:?:1" is for wave 1.  What happened was that
> wave 1 terminated previously, and a previous call to
> amd_dbgapi_target::update_thread_list() noticed the wave had vanished
> and removed it from the GDB thread list.  However, because the wave
> was stepping when it terminated (due to the displaced step over the
> s_endpgm) instruction, it is guaranteed that the amd-dbgapi library
> queues a WAVE_COMMAND_TERMINATED event for the exit.
> 
> When we process that WAVE_COMMAND_TERMINATED event, in
> amd-dbgapi-target.c:process_one_event, we return it to the core as a
> TARGET_WAITKIND_THREAD_EXITED event:
> 
>  static void
>  process_one_event (amd_dbgapi_event_id_t event_id,
> 		    amd_dbgapi_event_kind_t event_kind)
>  {
>  ...
> 	 if (status == AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID
> 	     && event_kind == AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED)
> 	   ws.set_thread_exited (0);
>  ...
>  }
> 
> Recall the wave is already gone from the GDB thread list.  So when GDB
> sees that TARGET_WAITKIND_THREAD_EXITED event for a thread it doesn't
> know about, it adds the thread to the thread list, resulting in that:
> 
>  [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
> 
> and then, because it was a TARGET_WAITKIND_THREAD_EXITED event, GDB
> marks the thread exited right afterwards:
> 
>  [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
> 
> The fix is to make amd_dbgapi_target::update_thread_list() _not_
> delete vanishing waves iff they were stepping or in progress of being
> stopped.  These two cases are the ones dbgapi guarantees will result
> in a WAVE_COMMAND_TERMINATED event if the wave terminates:
> 
>   /**
>    * A command for a wave was not able to complete because the wave has
>    * terminated.
>    *
>    * Commands that can result in this event are ::amd_dbgapi_wave_stop and
>    * ::amd_dbgapi_wave_resume in single step mode.  Since the wave terminated
>    * before stopping, this event will be reported instead of
>    * ::AMD_DBGAPI_EVENT_KIND_WAVE_STOP.
>    *
>    * The wave that terminated is available by the ::AMD_DBGAPI_EVENT_INFO_WAVE
>    * query.  However, the wave will be invalid since it has already terminated.
>    * It is the client's responsibility to know what command was being performed
>    * and was unable to complete due to the wave terminating.
>    */
>   AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED = 2,
> 
> As the comment says, it's GDB's responsability to know whether the
> wave was stepping or being stopped.  Since we now have a wave_info map
> with one entry for each wave, that seems like the place to store that
> information.  However, I still decided to put all the coordinate
> information in its own structure.  I.e., basically renamed the
> existing wave_info to wave_coordinates, and then added a new wave_info
> structure that holds the new state, plus a wave_coordinates object.
> This seemer cleaner as there are places where we only need to

s/seemer/seemed/ maybe?

> instantiate a wave_coordinates object.
> 
> There's an extra twist.  The testcase also exercises stopping at a new
> kernel right after the first kernel fully exits.  In that scenario, we
> were hitting this assertion after the first kernel fully exits and the
> hit of the breakpoint at the second kernel is handled:
> 
>  [amd-dbgapi] process_event_queue: Pulled event from dbgapi: event_id.handle = 26, event_kind = WAVE_STOP
>  [amd-dbgapi-lib] suspending queue_3, queue_2, queue_1 (refresh wave list)
>  ../../src/gdb/amd-dbgapi-target.c:1625: internal-error: amd_dbgapi_thread_deleted: Assertion `it != info->wave_info_map.end ()' failed.
>  A problem internal to GDB has been detected,
>  further debugging may prove unreliable.
> 
> This is the exact same problem as above, just a different
> manifestation.  In this scenario, we end up in update_thread_list
> successfully deleting the exited thread (because it was no longer the
> current thread) that was incorrectly added by infrun.c.  Because it
> was added by infrun.c and not by amd-dbgapi-target.c:add_gpu_thread,
> it doesn't have an entry in the wave_info map, so
> amd_dbgapi_thread_deleted trips on this assertion:
> 
>       gdb_assert (it != info->wave_info_map.end ());
> 
> here:
> 
>   ...
>   -> stop_all_threads
>    -> update_thread_list
>     -> target_update_thread_list
>      -> amd_dbgapi_target::update_thread_list
>       -> thread_db_target::update_thread_list
>        -> linux_nat_target::update_thread_list
> 	-> delete_exited_threads
> 	 -> delete_thread
> 	  -> delete_thread_1
> 	   -> gdb::observers::observable<thread_info*>::notify
> 	    -> amd_dbgapi_thread_deleted
> 	     -> internal_error_loc
> 
> The testcase thus tries both running to exit after the first kernel
> exits, and running to a breakpoint in a second kernel after the first
> kernel exits.
> 
> Change-Id: I43a66f060c35aad1fe0d9ff022ce2afd0537f028
> ---
>  gdb/amd-dbgapi-target.c | 197 ++++++++++++++++++++++++++++++----------
>  1 file changed, 149 insertions(+), 48 deletions(-)
> 
> diff --git a/gdb/amd-dbgapi-target.c b/gdb/amd-dbgapi-target.c
> index 86102b7fb03..06f9e8c5f9c 100644
> --- a/gdb/amd-dbgapi-target.c
> +++ b/gdb/amd-dbgapi-target.c
> @@ -109,10 +109,9 @@ get_amd_dbgapi_target_inferior_created_observer_token ()
>    return amd_dbgapi_target_inferior_created_observer_token;
>  }
>  
> -/* A type holding coordinate, etc. info for a given wave.  We cache
> -   this because we need this information after a wave exits.  */
> +/* A type holding coordinates, etc. info for a given wave.  */
>  
> -struct wave_info
> +struct wave_coordinates
>  {
>    /* The wave.  Set by the ctor.  */
>    amd_dbgapi_wave_id_t wave_id;
> @@ -125,11 +124,44 @@ struct wave_info
>    uint32_t group_ids[3] {UINT32_MAX, UINT32_MAX, UINT32_MAX};
>    uint32_t wave_in_group = UINT32_MAX;
>  
> -  explicit wave_info (amd_dbgapi_wave_id_t wave_id);
> +  explicit wave_coordinates (amd_dbgapi_wave_id_t wave_id)
> +    : wave_id (wave_id)
> +  {}
>  
> -  /* Return the target ID string for the wave this wave_info is
> +  /* Return the target ID string for the wave this wave_coordinates is
>       for.  */
>    std::string to_string () const;
> +
> +  /* Pull out coordinates info from the amd-dbgapi library.  */
> +  void fetch ();
> +};
> +
> +/* A type holding info about a given wave.  */
> +
> +struct wave_info
> +{
> +  /* We cache the coordinates info because we need it after a wave
> +     exits.  The wave's ID is here.  */
> +  wave_coordinates coords;
> +
> +  /* The last resume_mode passed to amd_dbgapi_wave_resume for this
> +     wave.  We track this because we are guaranteed to see a
> +     WAVE_COMMAND_TERMINATED event if a stepping wave terminates, and
> +     we need to know to not delete such a wave until we process that
> +     event.  */
> +  amd_dbgapi_resume_mode_t last_resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
> +
> +  /* Whether we've called amd_dbgapi_wave_stop for this wave and are
> +     waiting for its stop event.  Similarly, we track this because
> +     we're guaranteed to get a WAVE_COMMAND_TERMINATED event if the
> +     wave terminates while being stopped.  */
> +  bool stopping = false;
> +
> +  explicit wave_info (amd_dbgapi_wave_id_t wave_id)
> +    : coords (wave_id)
> +  {
> +    coords.fetch ();
> +  }
>  };
>  
>  /* Big enough to hold the size of the largest register in bytes.  */
> @@ -275,6 +307,19 @@ static struct amd_dbgapi_target the_amd_dbgapi_target;
>  static const registry<inferior>::key<amd_dbgapi_inferior_info>
>    amd_dbgapi_inferior_data;
>  
> +/* Fetch the amd_dbgapi_inferior_info data for the given inferior.  */
> +
> +static struct amd_dbgapi_inferior_info *
> +get_amd_dbgapi_inferior_info (struct inferior *inferior)
> +{
> +  amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);
> +
> +  if (info == nullptr)
> +    info = amd_dbgapi_inferior_data.emplace (inferior, inferior);
> +
> +  return info;
> +}
> +
>  /* The async event handler registered with the event loop, indicating that we
>     might have events to report to the core and that we'd like our wait method
>     to be called.
> @@ -289,7 +334,7 @@ static const registry<inferior>::key<amd_dbgapi_inferior_info>
>  static async_event_handler *amd_dbgapi_async_event_handler = nullptr;
>  
>  std::string
> -wave_info::to_string () const
> +wave_coordinates::to_string () const
>  {
>    std::string str = "AMDGPU Wave";
>  
> @@ -319,37 +364,41 @@ wave_info::to_string () const
>    return str;
>  }
>  
> -wave_info::wave_info (amd_dbgapi_wave_id_t wave_id)
> -  : wave_id (wave_id)
> -{
> -}
> -
> -/* Read in wave_info for WAVE_ID.  */
> -
> -static wave_info
> -get_wave_info (amd_dbgapi_wave_id_t wave_id)
> +void
> +wave_coordinates::fetch ()
>  {
> -  wave_info res (wave_id);
> -
>    /* Any field that fails to be read is left with its in-class
>       initialized value, which is printed as "?".  */
>  
>    amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_AGENT,
> -			    sizeof (res.agent_id), &res.agent_id);
> +			    sizeof (agent_id), &agent_id);
>    amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_QUEUE,
> -			    sizeof (res.queue_id), &res.queue_id);
> +			    sizeof (queue_id), &queue_id);
>    amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_DISPATCH,
> -			    sizeof (res.dispatch_id), &res.dispatch_id);
> +			    sizeof (dispatch_id), &dispatch_id);
>  
>    amd_dbgapi_wave_get_info (wave_id,
>  			    AMD_DBGAPI_WAVE_INFO_WORKGROUP_COORD,
> -			    sizeof (res.group_ids), &res.group_ids);
> +			    sizeof (group_ids), &group_ids);
>  
>    amd_dbgapi_wave_get_info (wave_id,
>  			    AMD_DBGAPI_WAVE_INFO_WAVE_NUMBER_IN_WORKGROUP,
> -			    sizeof (res.wave_in_group), &res.wave_in_group);
> +			    sizeof (wave_in_group), &wave_in_group);
> +}
> +
> +/* Get the wave_info object for TP, from the wave_info map.  It is
> +   assumed that the wave is in the map.  */
> +
> +static wave_info &
> +get_thread_wave_info (thread_info *tp)
> +{
> +  amd_dbgapi_inferior_info *info = get_amd_dbgapi_inferior_info (tp->inf);
> +  amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (tp->ptid);
> +
> +  auto it = info->wave_info_map.find (wave_id.handle);
> +  gdb_assert (it != info->wave_info_map.end ());
>  
> -  return res;
> +  return it->second;
>  }
>  
>  /* Clear our async event handler.  */
> @@ -370,19 +419,6 @@ async_event_handler_mark ()
>    mark_async_event_handler (amd_dbgapi_async_event_handler);
>  }
>  
> -/* Fetch the amd_dbgapi_inferior_info data for the given inferior.  */
> -
> -static struct amd_dbgapi_inferior_info *
> -get_amd_dbgapi_inferior_info (struct inferior *inferior)
> -{
> -  amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);
> -
> -  if (info == nullptr)
> -    info = amd_dbgapi_inferior_data.emplace (inferior, inferior);
> -
> -  return info;
> -}
> -
>  /* Set forward progress requirement to REQUIRE for all processes of PROC_TARGET
>     matching PTID.  */
>  
> @@ -565,12 +601,12 @@ amd_dbgapi_target::pid_to_str (ptid_t ptid)
>  
>    auto it = info->wave_info_map.find (wave_id.handle);
>    if (it != info->wave_info_map.end ())
> -    return it->second.to_string ();
> +    return it->second.coords.to_string ();
>  
>    /* A wave we don't know about.  Shouldn't usually happen, but
>       asserting and bringing down the session is a bit too harsh.  Just
>       print all unknown info as "?"s.  */
> -  return wave_info (wave_id).to_string ();
> +  return wave_coordinates (wave_id).to_string ();
>  }
>  
>  const char *
> @@ -694,16 +730,24 @@ amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)
>  
>        amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (thread->ptid);
>        amd_dbgapi_status_t status;
> +
> +      wave_info &wi = get_thread_wave_info (thread);
> +      amd_dbgapi_resume_mode_t &resume_mode = wi.last_resume_mode;
> +      amd_dbgapi_exceptions_t wave_exception;
>        if (thread->ptid == inferior_ptid)
> -	status = amd_dbgapi_wave_resume (wave_id,
> -					 (step
> -					  ? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> -					  : AMD_DBGAPI_RESUME_MODE_NORMAL),
> -					 exception);
> +	{
> +	  resume_mode = (step
> +			 ? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> +			 : AMD_DBGAPI_RESUME_MODE_NORMAL);
> +	  wave_exception = exception;
> +	}
>        else
> -	status = amd_dbgapi_wave_resume (wave_id, AMD_DBGAPI_RESUME_MODE_NORMAL,
> -					 AMD_DBGAPI_EXCEPTION_NONE);
> +	{
> +	  resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
> +	  wave_exception = AMD_DBGAPI_EXCEPTION_NONE;
> +	}
>  
> +      status = amd_dbgapi_wave_resume (wave_id, resume_mode, wave_exception);
>        if (status != AMD_DBGAPI_STATUS_SUCCESS
>  	  /* Ignore the error that wave is no longer valid as that could
>  	     indicate that the process has exited.  GDB treats resuming a
> @@ -711,6 +755,8 @@ amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)
>  	  && status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
>  	error (_("wave_resume for wave_%ld failed (%s)"), wave_id.handle,
>  	       get_status_string (status));
> +
> +      wi.stopping = false;
>      }
>  }
>  
> @@ -725,6 +771,15 @@ amd_dbgapi_target::commit_resumed ()
>    require_forward_progress (minus_one_ptid, proc_target, true);
>  }
>  
> +/* Return a string version of RESUME_MODE, for debug log purposes.  */
> +static const char *
> +resume_mode_to_string (amd_dbgapi_resume_mode_t resume_mode)
> +{
> +  return (resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> +	  ? "step"
> +	  : "normal");

Pedantically, amd_dbgapi_resume_mode_t is an enum and could in theory
get more enumerators.  I don't expect any, but the following construct
would make the compiler complain if that ever happens:

   static const char *
   resume_mode_to_string (amd_dbgapi_resume_mode_t resume_mode)
   {
     switch (resume_mode)
       {
       case AMD_DBGAPI_RESUME_MODE_NORMAL:
         return "normal";
       case AMD_DBGAPI_RESUME_MODE_SINGLE_STEP:
         return "step";
       }
     gdb_assert_not_reached ("invalid amd_dbgapi_resume_mode_t");
   }

> +}
> +
>  void
>  amd_dbgapi_target::stop (ptid_t ptid)
>  {
> @@ -758,7 +813,11 @@ amd_dbgapi_target::stop (ptid_t ptid)
>  
>  	  status = amd_dbgapi_wave_stop (wave_id);
>  	  if (status == AMD_DBGAPI_STATUS_SUCCESS)
> -	    return;
> +	    {
> +	      wave_info &wi = get_thread_wave_info (thread);
> +	      wi.stopping = true;
> +	      return;
> +	    }
>  
>  	  if (status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
>  	    error (_("wave_stop for wave_%ld failed (%s)"), wave_id.handle,
> @@ -772,6 +831,23 @@ amd_dbgapi_target::stop (ptid_t ptid)
>  	 could have terminated since the last time the wave list was
>  	 refreshed.  */
>  
> +      wave_info &wi = get_thread_wave_info (thread);
> +      wi.stopping = true;
> +
> +      amd_dbgapi_debug_printf ("got AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID "
> +			       "for wave_%ld, last_resume_mode=%s, "
> +			       "report_thread_events=%d",
> +			       wave_id.handle,
> +			       resume_mode_to_string (wi.last_resume_mode),
> +			       m_report_thread_events);
> +
> +      /* If the wave was stepping when it terminated, then it is
> +	 guaranteed that we will see a WAVE_COMMAND_TERMINATED event
> +	 for it.  Don't report a thread exit event or delete the
> +	 thread yet, until we see such event.  */
> +      if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP)
> +	return;
> +
>        if (m_report_thread_events)
>  	{
>  	  get_amd_dbgapi_inferior_info (thread->inf)->wave_events.emplace_back
> @@ -1018,7 +1094,7 @@ add_gpu_thread (inferior *inf, ptid_t wave_ptid)
>    auto wave_id = get_amd_dbgapi_wave_id (wave_ptid);
>  
>    if (!info->wave_info_map.try_emplace (wave_id.handle,
> -					get_wave_info (wave_id)).second)
> +					wave_info (wave_id)).second)
>      internal_error ("wave ID %ld already in map", wave_id.handle);
>  
>    /* Create new GPU threads silently to avoid spamming the terminal
> @@ -1770,7 +1846,32 @@ amd_dbgapi_target::update_thread_list ()
>  	    auto it = threads.find (tp->ptid.tid ());
>  
>  	    if (it == threads.end ())
> -	      delete_thread_silent (tp);
> +	      {
> +		auto wave_id = get_amd_dbgapi_wave_id (tp->ptid);
> +		wave_info &wi = get_thread_wave_info (tp);
> +
> +		/* Waves that were stepping or in progress of being
> +		   stopped are guaranteed to report a
> +		   WAVE_COMMAND_TERMINATED event if they terminate.
> +		   Don't delete such threads until we see the
> +		   event.  */
> +		if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> +		    || wi.stopping)
> +		  {
> +		    amd_dbgapi_debug_printf
> +		      ("wave_%ld disappeared, keeping it"
> +		       " (last_resume_mode=%s, stopping=%d)",
> +		       wave_id.handle,
> +		       resume_mode_to_string (wi.last_resume_mode),
> +		       wi.stopping);
> +		  }
> +		else
> +		  {
> +		    amd_dbgapi_debug_printf ("wave_%ld disappeared, deleting it",
> +					     wave_id.handle);
> +		    delete_thread_silent (tp);
> +		  }
> +	      }
>  	    else
>  	      threads.erase (it);
>  	  }
> 
> -- 
> 2.43.0
> 

I happy with the patch as it is if you prefer to not change
resume_mode_to_string.  Either way

Approved-By: Lancelot Six <laneclot.six@amd.com> (amdgpu)

Best,
Lancelot.
  

Patch

diff --git a/gdb/amd-dbgapi-target.c b/gdb/amd-dbgapi-target.c
index 86102b7fb03..06f9e8c5f9c 100644
--- a/gdb/amd-dbgapi-target.c
+++ b/gdb/amd-dbgapi-target.c
@@ -109,10 +109,9 @@  get_amd_dbgapi_target_inferior_created_observer_token ()
   return amd_dbgapi_target_inferior_created_observer_token;
 }
 
-/* A type holding coordinate, etc. info for a given wave.  We cache
-   this because we need this information after a wave exits.  */
+/* A type holding coordinates, etc. info for a given wave.  */
 
-struct wave_info
+struct wave_coordinates
 {
   /* The wave.  Set by the ctor.  */
   amd_dbgapi_wave_id_t wave_id;
@@ -125,11 +124,44 @@  struct wave_info
   uint32_t group_ids[3] {UINT32_MAX, UINT32_MAX, UINT32_MAX};
   uint32_t wave_in_group = UINT32_MAX;
 
-  explicit wave_info (amd_dbgapi_wave_id_t wave_id);
+  explicit wave_coordinates (amd_dbgapi_wave_id_t wave_id)
+    : wave_id (wave_id)
+  {}
 
-  /* Return the target ID string for the wave this wave_info is
+  /* Return the target ID string for the wave this wave_coordinates is
      for.  */
   std::string to_string () const;
+
+  /* Pull out coordinates info from the amd-dbgapi library.  */
+  void fetch ();
+};
+
+/* A type holding info about a given wave.  */
+
+struct wave_info
+{
+  /* We cache the coordinates info because we need it after a wave
+     exits.  The wave's ID is here.  */
+  wave_coordinates coords;
+
+  /* The last resume_mode passed to amd_dbgapi_wave_resume for this
+     wave.  We track this because we are guaranteed to see a
+     WAVE_COMMAND_TERMINATED event if a stepping wave terminates, and
+     we need to know to not delete such a wave until we process that
+     event.  */
+  amd_dbgapi_resume_mode_t last_resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
+
+  /* Whether we've called amd_dbgapi_wave_stop for this wave and are
+     waiting for its stop event.  Similarly, we track this because
+     we're guaranteed to get a WAVE_COMMAND_TERMINATED event if the
+     wave terminates while being stopped.  */
+  bool stopping = false;
+
+  explicit wave_info (amd_dbgapi_wave_id_t wave_id)
+    : coords (wave_id)
+  {
+    coords.fetch ();
+  }
 };
 
 /* Big enough to hold the size of the largest register in bytes.  */
@@ -275,6 +307,19 @@  static struct amd_dbgapi_target the_amd_dbgapi_target;
 static const registry<inferior>::key<amd_dbgapi_inferior_info>
   amd_dbgapi_inferior_data;
 
+/* Fetch the amd_dbgapi_inferior_info data for the given inferior.  */
+
+static struct amd_dbgapi_inferior_info *
+get_amd_dbgapi_inferior_info (struct inferior *inferior)
+{
+  amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);
+
+  if (info == nullptr)
+    info = amd_dbgapi_inferior_data.emplace (inferior, inferior);
+
+  return info;
+}
+
 /* The async event handler registered with the event loop, indicating that we
    might have events to report to the core and that we'd like our wait method
    to be called.
@@ -289,7 +334,7 @@  static const registry<inferior>::key<amd_dbgapi_inferior_info>
 static async_event_handler *amd_dbgapi_async_event_handler = nullptr;
 
 std::string
-wave_info::to_string () const
+wave_coordinates::to_string () const
 {
   std::string str = "AMDGPU Wave";
 
@@ -319,37 +364,41 @@  wave_info::to_string () const
   return str;
 }
 
-wave_info::wave_info (amd_dbgapi_wave_id_t wave_id)
-  : wave_id (wave_id)
-{
-}
-
-/* Read in wave_info for WAVE_ID.  */
-
-static wave_info
-get_wave_info (amd_dbgapi_wave_id_t wave_id)
+void
+wave_coordinates::fetch ()
 {
-  wave_info res (wave_id);
-
   /* Any field that fails to be read is left with its in-class
      initialized value, which is printed as "?".  */
 
   amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_AGENT,
-			    sizeof (res.agent_id), &res.agent_id);
+			    sizeof (agent_id), &agent_id);
   amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_QUEUE,
-			    sizeof (res.queue_id), &res.queue_id);
+			    sizeof (queue_id), &queue_id);
   amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_DISPATCH,
-			    sizeof (res.dispatch_id), &res.dispatch_id);
+			    sizeof (dispatch_id), &dispatch_id);
 
   amd_dbgapi_wave_get_info (wave_id,
 			    AMD_DBGAPI_WAVE_INFO_WORKGROUP_COORD,
-			    sizeof (res.group_ids), &res.group_ids);
+			    sizeof (group_ids), &group_ids);
 
   amd_dbgapi_wave_get_info (wave_id,
 			    AMD_DBGAPI_WAVE_INFO_WAVE_NUMBER_IN_WORKGROUP,
-			    sizeof (res.wave_in_group), &res.wave_in_group);
+			    sizeof (wave_in_group), &wave_in_group);
+}
+
+/* Get the wave_info object for TP, from the wave_info map.  It is
+   assumed that the wave is in the map.  */
+
+static wave_info &
+get_thread_wave_info (thread_info *tp)
+{
+  amd_dbgapi_inferior_info *info = get_amd_dbgapi_inferior_info (tp->inf);
+  amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (tp->ptid);
+
+  auto it = info->wave_info_map.find (wave_id.handle);
+  gdb_assert (it != info->wave_info_map.end ());
 
-  return res;
+  return it->second;
 }
 
 /* Clear our async event handler.  */
@@ -370,19 +419,6 @@  async_event_handler_mark ()
   mark_async_event_handler (amd_dbgapi_async_event_handler);
 }
 
-/* Fetch the amd_dbgapi_inferior_info data for the given inferior.  */
-
-static struct amd_dbgapi_inferior_info *
-get_amd_dbgapi_inferior_info (struct inferior *inferior)
-{
-  amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);
-
-  if (info == nullptr)
-    info = amd_dbgapi_inferior_data.emplace (inferior, inferior);
-
-  return info;
-}
-
 /* Set forward progress requirement to REQUIRE for all processes of PROC_TARGET
    matching PTID.  */
 
@@ -565,12 +601,12 @@  amd_dbgapi_target::pid_to_str (ptid_t ptid)
 
   auto it = info->wave_info_map.find (wave_id.handle);
   if (it != info->wave_info_map.end ())
-    return it->second.to_string ();
+    return it->second.coords.to_string ();
 
   /* A wave we don't know about.  Shouldn't usually happen, but
      asserting and bringing down the session is a bit too harsh.  Just
      print all unknown info as "?"s.  */
-  return wave_info (wave_id).to_string ();
+  return wave_coordinates (wave_id).to_string ();
 }
 
 const char *
@@ -694,16 +730,24 @@  amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)
 
       amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (thread->ptid);
       amd_dbgapi_status_t status;
+
+      wave_info &wi = get_thread_wave_info (thread);
+      amd_dbgapi_resume_mode_t &resume_mode = wi.last_resume_mode;
+      amd_dbgapi_exceptions_t wave_exception;
       if (thread->ptid == inferior_ptid)
-	status = amd_dbgapi_wave_resume (wave_id,
-					 (step
-					  ? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
-					  : AMD_DBGAPI_RESUME_MODE_NORMAL),
-					 exception);
+	{
+	  resume_mode = (step
+			 ? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
+			 : AMD_DBGAPI_RESUME_MODE_NORMAL);
+	  wave_exception = exception;
+	}
       else
-	status = amd_dbgapi_wave_resume (wave_id, AMD_DBGAPI_RESUME_MODE_NORMAL,
-					 AMD_DBGAPI_EXCEPTION_NONE);
+	{
+	  resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
+	  wave_exception = AMD_DBGAPI_EXCEPTION_NONE;
+	}
 
+      status = amd_dbgapi_wave_resume (wave_id, resume_mode, wave_exception);
       if (status != AMD_DBGAPI_STATUS_SUCCESS
 	  /* Ignore the error that wave is no longer valid as that could
 	     indicate that the process has exited.  GDB treats resuming a
@@ -711,6 +755,8 @@  amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)
 	  && status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
 	error (_("wave_resume for wave_%ld failed (%s)"), wave_id.handle,
 	       get_status_string (status));
+
+      wi.stopping = false;
     }
 }
 
@@ -725,6 +771,15 @@  amd_dbgapi_target::commit_resumed ()
   require_forward_progress (minus_one_ptid, proc_target, true);
 }
 
+/* Return a string version of RESUME_MODE, for debug log purposes.  */
+static const char *
+resume_mode_to_string (amd_dbgapi_resume_mode_t resume_mode)
+{
+  return (resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
+	  ? "step"
+	  : "normal");
+}
+
 void
 amd_dbgapi_target::stop (ptid_t ptid)
 {
@@ -758,7 +813,11 @@  amd_dbgapi_target::stop (ptid_t ptid)
 
 	  status = amd_dbgapi_wave_stop (wave_id);
 	  if (status == AMD_DBGAPI_STATUS_SUCCESS)
-	    return;
+	    {
+	      wave_info &wi = get_thread_wave_info (thread);
+	      wi.stopping = true;
+	      return;
+	    }
 
 	  if (status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
 	    error (_("wave_stop for wave_%ld failed (%s)"), wave_id.handle,
@@ -772,6 +831,23 @@  amd_dbgapi_target::stop (ptid_t ptid)
 	 could have terminated since the last time the wave list was
 	 refreshed.  */
 
+      wave_info &wi = get_thread_wave_info (thread);
+      wi.stopping = true;
+
+      amd_dbgapi_debug_printf ("got AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID "
+			       "for wave_%ld, last_resume_mode=%s, "
+			       "report_thread_events=%d",
+			       wave_id.handle,
+			       resume_mode_to_string (wi.last_resume_mode),
+			       m_report_thread_events);
+
+      /* If the wave was stepping when it terminated, then it is
+	 guaranteed that we will see a WAVE_COMMAND_TERMINATED event
+	 for it.  Don't report a thread exit event or delete the
+	 thread yet, until we see such event.  */
+      if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP)
+	return;
+
       if (m_report_thread_events)
 	{
 	  get_amd_dbgapi_inferior_info (thread->inf)->wave_events.emplace_back
@@ -1018,7 +1094,7 @@  add_gpu_thread (inferior *inf, ptid_t wave_ptid)
   auto wave_id = get_amd_dbgapi_wave_id (wave_ptid);
 
   if (!info->wave_info_map.try_emplace (wave_id.handle,
-					get_wave_info (wave_id)).second)
+					wave_info (wave_id)).second)
     internal_error ("wave ID %ld already in map", wave_id.handle);
 
   /* Create new GPU threads silently to avoid spamming the terminal
@@ -1770,7 +1846,32 @@  amd_dbgapi_target::update_thread_list ()
 	    auto it = threads.find (tp->ptid.tid ());
 
 	    if (it == threads.end ())
-	      delete_thread_silent (tp);
+	      {
+		auto wave_id = get_amd_dbgapi_wave_id (tp->ptid);
+		wave_info &wi = get_thread_wave_info (tp);
+
+		/* Waves that were stepping or in progress of being
+		   stopped are guaranteed to report a
+		   WAVE_COMMAND_TERMINATED event if they terminate.
+		   Don't delete such threads until we see the
+		   event.  */
+		if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
+		    || wi.stopping)
+		  {
+		    amd_dbgapi_debug_printf
+		      ("wave_%ld disappeared, keeping it"
+		       " (last_resume_mode=%s, stopping=%d)",
+		       wave_id.handle,
+		       resume_mode_to_string (wi.last_resume_mode),
+		       wi.stopping);
+		  }
+		else
+		  {
+		    amd_dbgapi_debug_printf ("wave_%ld disappeared, deleting it",
+					     wave_id.handle);
+		    delete_thread_silent (tp);
+		  }
+	      }
 	    else
 	      threads.erase (it);
 	  }