[03/31] gdb/linux: Delete all other LWPs immediately on ptrace exec event

Message ID 20221212203101.1034916-4-pedro@palves.net
State New
Series Step over thread clone and thread exit

Commit Message

Pedro Alves Dec. 12, 2022, 8:30 p.m. UTC
  I noticed that after a following patch ("Step over clone syscall w/
breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
end up with four new unexpected GDB core dumps:

		 === gdb Summary ===

 # of unexpected core files      4
 # of expected passes            48

That said patch is making the pre-existing
gdb.threads/step-over-exec.exp testcase (almost silently) expose a
latent problem in gdb/linux-nat.c, resulting in a GDB crash when:

 #1 - a non-leader thread execs
 #2 - the post-exec program stops somewhere
 #3 - you kill the inferior

Instead of #3 directly, the testcase just returns, which ends up in
gdb_exit, tearing down GDB, which kills the inferior, and is thus
equivalent to #3 above.

Vis:

 $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
 ...
 (top-gdb) r
 ...
 (gdb) b main
 ...
 (gdb) r
 ...
 Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
 69        argv0 = argv[0];
 (gdb) c
 Continuing.
 [New Thread 0x7ffff7d89700 (LWP 2506975)]
 Other going in exec.
 Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
 process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd

 Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
 28        foo ();
 (gdb) k
 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 393         return m_suspend.waitstatus_pending_p;
 (top-gdb) bt
 #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
 #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
 #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
 #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
 #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
 #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
 #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
 #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
 ...

The root of the problem is that when a non-leader LWP execs, it just
changes its tid to the tgid, replacing the pre-exec leader thread,
becoming the new leader.  There's no thread exit event for the execing
thread.  It's as if the old pre-exec LWP vanishes without trace.  The
ptrace man page says:

"PTRACE_O_TRACEEXEC (since Linux 2.5.46)
	Stop the tracee at the next execve(2).  A waitpid(2) by the
	tracer will return a status value such that

	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

	If the execing thread is not a thread group leader, the thread
	ID is reset to thread group leader's ID before this stop.
	Since Linux 3.0, the former thread ID can be retrieved with
	PTRACE_GETEVENTMSG."

When the core of GDB processes an exec event, it deletes all the
threads of the inferior.  But, that is too late -- deleting the thread
does not delete the corresponding LWP, so we end up leaving the
pre-exec non-leader LWP stale in the LWP list.  That's what leads to
the crash above -- linux_nat_target::kill iterates over all LWPs, and
after the patch in question, that code will look for the corresponding
thread_info for each LWP.  For the pre-exec non-leader LWP still
listed, it won't find one.

This patch fixes it, by deleting the pre-exec non-leader LWP (and
thread) from the LWP/thread lists as soon as we get an exec event out
of ptrace.

GDBserver does not need an equivalent fix, because it is already doing
this, as side effect of mourning the pre-exec process, in
gdbserver/linux-low.cc:

  else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
    {
...
      /* Delete the execing process and all its threads.  */
      mourn (proc);
      switch_to_thread (nullptr);

Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
---
 gdb/linux-nat.c                              | 15 +++++++++++++++
 gdb/testsuite/gdb.threads/step-over-exec.exp |  6 ++++++
 2 files changed, 21 insertions(+)
  

Comments

Andrew Burgess March 21, 2023, 2:50 p.m. UTC | #1
Pedro Alves <pedro@palves.net> writes:

> I noticed that after a following patch ("Step over clone syscall w/
> breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
> gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
> end up with four new unexpected GDB core dumps:
>
> 		 === gdb Summary ===
>
>  # of unexpected core files      4
>  # of expected passes            48
>
> That said patch is making the pre-existing
> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>
>  #1 - a non-leader thread execs
>  #2 - the post-exec program stops somewhere
>  #3 - you kill the inferior
>
> Instead of #3 directly, the testcase just returns, which ends up in
> gdb_exit, tearing down GDB, which kills the inferior, and is thus
> equivalent to #3 above.
>
> Vis:
>
>  $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>  ...
>  (top-gdb) r
>  ...
>  (gdb) b main
>  ...
>  (gdb) r
>  ...
>  Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>  69        argv0 = argv[0];
>  (gdb) c
>  Continuing.
>  [New Thread 0x7ffff7d89700 (LWP 2506975)]
>  Other going in exec.
>  Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>  process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>
>  Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>  28        foo ();
>  (gdb) k
>  ...
>  Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>  393         return m_suspend.waitstatus_pending_p;
>  (top-gdb) bt
>  #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>  #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>  #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>  #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>  #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>  #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>  #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>  #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
>  #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>  ...

It wasn't 100% clear to me if the above session was supposed to show a
failure with GDB prior to *this* commit, or was a demonstration of what
would happen if this commit is skipped, and the later commits applied.

I thought it was the second case, but I was so unsure that I tried the
reproducer anyway.   Just in case I'm wrong, the above example doesn't
seem to fail prior to this commit.

>
> The root of the problem is that when a non-leader LWP execs, it just
> changes its tid to the tgid, replacing the pre-exec leader thread,
> becoming the new leader.  There's no thread exit event for the execing
> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
> ptrace man page says:
>
> "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
> 	Stop the tracee at the next execve(2).  A waitpid(2) by the
> 	tracer will return a status value such that
>
> 	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>
> 	If the execing thread is not a thread group leader, the thread
> 	ID is reset to thread group leader's ID before this stop.
> 	Since Linux 3.0, the former thread ID can be retrieved with
> 	PTRACE_GETEVENTMSG."
>
> When the core of GDB processes an exec event, it deletes all the
> threads of the inferior.  But, that is too late -- deleting the thread
> does not delete the corresponding LWP, so we end up leaving the
> pre-exec non-leader LWP stale in the LWP list.  That's what leads to
> the crash above -- linux_nat_target::kill iterates over all LWPs, and
> after the patch in question, that code will look for the corresponding
> thread_info for each LWP.  For the pre-exec non-leader LWP still
> listed, it won't find one.
>
> This patch fixes it, by deleting the pre-exec non-leader LWP (and
> thread) from the LWP/thread lists as soon as we get an exec event out
> of ptrace.

Given that we don't have a test *right now* for this issue, and instead
rely on a future patch not failing, I wondered if there was any way
that we could trigger a failure.

So I was poking around looking for places where we iterate over the
all_lwps() list wondering which we could trigger that might cause a
failure...

... and then I thought: why not just have GDB tell us that the
all_lwps() list is broken.

So I hacked up a new 'maint info linux-lwps' command.  It's not very
interesting right now, here's the output in a multi-threaded inferior
prior to the exec:

  (gdb) maintenance info linux-lwps 
  LWP Ptid          Thread ID
  1707218.1707239.0 2           
  1707218.1707218.0 1           

And in your failure case (after the exec):

  (gdb) maintenance info linux-lwps 
  LWP Ptid          Thread ID 
  1708883.1708895.0 None      
  1708883.1708883.0 1         

And then we can check this from the testscript, and now we have a test
that fails before this commit, and passes afterwards.

And in the future we might find other information we want to add in the
new maintenance command.

What are your thoughts on including this, or something like this with
this commit?  My patch, which applies on top of this commit, is included
at the end of this email.  Please feel free to take any changes that you
feel add value.

>
> GDBserver does not need an equivalent fix, because it is already doing
> this, as side effect of mourning the pre-exec process, in
> gdbserver/linux-low.cc:
>
>   else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
>     {
> ...
>       /* Delete the execing process and all its threads.  */
>       mourn (proc);
>       switch_to_thread (nullptr);
>
> Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
> ---
>  gdb/linux-nat.c                              | 15 +++++++++++++++
>  gdb/testsuite/gdb.threads/step-over-exec.exp |  6 ++++++
>  2 files changed, 21 insertions(+)
>
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 9b78fd1f8e8..5ee3227f1b9 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -1986,6 +1986,21 @@ linux_handle_extended_wait (struct lwp_info *lp, int status)
>  	 thread execs, it changes its tid to the tgid, and the old
>  	 tgid thread might have not been resumed.  */
>        lp->resumed = 1;
> +
> +      /* All other LWPs are gone now.  We'll have received a thread
> +	 exit notification for all threads other than the execing one.
> +	 That one, if it wasn't the leader, just silently changes its
> +	 tid to the tgid, and the previous leader vanishes.  Since
> +	 Linux 3.0, the former thread ID can be retrieved with
> +	 PTRACE_GETEVENTMSG, but since we support older kernels, don't
> +	 bother with it, and just walk the LWP list.  Even with
> +	 PTRACE_GETEVENTMSG, we'd still need to lookup the
> +	 corresponding LWP object, and it would be an extra ptrace
> +	 syscall, so this way may even be more efficient.  */
> +      for (lwp_info *other_lp : all_lwps_safe ())
> +	if (other_lp != lp && other_lp->ptid.pid () == lp->ptid.pid ())
> +	  exit_lwp (other_lp);
> +
>        return 0;
>      }
>  
> diff --git a/gdb/testsuite/gdb.threads/step-over-exec.exp b/gdb/testsuite/gdb.threads/step-over-exec.exp
> index 783f865585c..a8b01f8aeda 100644
> --- a/gdb/testsuite/gdb.threads/step-over-exec.exp
> +++ b/gdb/testsuite/gdb.threads/step-over-exec.exp
> @@ -102,6 +102,12 @@ proc do_test { execr_thread different_text_segments displaced_stepping } {
>      gdb_breakpoint foo
>      gdb_test "continue" "Breakpoint $decimal, foo .*" \
>  	"continue to foo"
> +
> +    # Test that GDB is able to kill the inferior.  This may fail if
> +    # e.g., GDB does not dispose of the pre-exec threads properly.
> +    gdb_test "with confirm off -- kill" \
> +	"\\\[Inferior 1 (.*) killed\\\]" \
> +	"kill inferior"
>  }
>

These changes all look good.

Reviewed-By: Andrew Burgess <aburgess@redhat.com>

Thanks,
Andrew

>  foreach_with_prefix displaced_stepping {auto off} {
> -- 
> 2.36.0


---

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 5f67bcbcb4f..9b1e071b5f6 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -4482,6 +4485,49 @@ current_lwp_ptid (void)
   return inferior_ptid;
 }
 
+/* Implement 'maintenance info linux-lwps'.  Displays some basic
+   information about all the current lwp_info objects.  */
+
+static void
+maintenance_info_lwps (const char *arg, int from_tty)
+{
+  if (all_lwps ().size () == 0)
+    {
+      gdb_printf ("No Linux LWPs\n");
+      return;
+    }
+
+  /* Start the width at 8 to match the column heading below, then figure
+     out the widest ptid string.  We'll use this to build our output table
+     below.  */
+  size_t ptid_width = 8;
+  for (lwp_info *lp : all_lwps ())
+    ptid_width = std::max (ptid_width, lp->ptid.to_string ().size ());
+
+  /* Setup the table headers.  */
+  struct ui_out *uiout = current_uiout;
+  ui_out_emit_table table_emitter (uiout, 2, -1, "linux-lwps");
+  uiout->table_header (ptid_width, ui_left, "lwp-ptid", _("LWP Ptid"));
+  uiout->table_header (9, ui_left, "thread-info", _("Thread ID"));
+  uiout->table_body ();
+
+  /* Display one table row for each lwp_info.  */
+  for (lwp_info *lp : all_lwps ())
+    {
+      ui_out_emit_tuple tuple_emitter (uiout, "lwp-entry");
+
+      struct thread_info *th = find_thread_ptid (linux_target, lp->ptid);
+
+      uiout->field_string ("lwp-ptid", lp->ptid.to_string ().c_str ());
+      if (th == nullptr)
+	uiout->field_string ("thread-info", "None");
+      else
+	uiout->field_string ("thread-info", print_thread_id (th));
+
+      uiout->message ("\n");
+    }
+}
+
 void _initialize_linux_nat ();
 void
 _initialize_linux_nat ()
@@ -4519,6 +4565,9 @@ Enables printf debugging output."),
   sigemptyset (&blocked_mask);
 
   lwp_lwpid_htab_create ();
+
+  add_cmd ("linux-lwps", class_maintenance, maintenance_info_lwps,
+	 _("List the Linux LWPS."), &maintenanceinfolist);
 }
 
 
diff --git a/gdb/testsuite/gdb.threads/step-over-exec.exp b/gdb/testsuite/gdb.threads/step-over-exec.exp
index c9a067b23aa..8ab027f6f08 100644
--- a/gdb/testsuite/gdb.threads/step-over-exec.exp
+++ b/gdb/testsuite/gdb.threads/step-over-exec.exp
@@ -103,6 +103,49 @@ proc do_test { execr_thread different_text_segments displaced_stepping } {
     gdb_test "continue" "Breakpoint $decimal, foo .*" \
 	"continue to foo"
 
+    # If we have a linux target then there used to be a bug that in
+    # some situations we'd leave an orphaned lwp object around.  Check
+    # the 'maint info linux-lwp' output to spot any orphans.
+    #
+    # If linux native support is not built in then we'll get an
+    # undefined maintenance command error, which is fine.  The bug
+    # we're checking for was in linux native code, so we know we're
+    # fine.
+    #
+    # Alternatively, linux native support might be built in, but we
+    # might be using an alternative target (e.g. a remote target), in
+    # which case we'll get a message about 'No Linux LWPs'.  Again
+    # there's nothing that needs testing in this case.
+    gdb_test_multiple "maint info linux-lwp" "" {
+	-re "^maint info linux-lwp\r\n" {
+	    exp_continue
+	}
+
+	-re "^Undefined maintenance info command: \"linux-lwp\"\\.  Try \"help maintenance info\"\\.\r\n$::gdb_prompt $" {
+	    unsupported $gdb_test_name
+	}
+
+	-re "^LWP Ptid\\s+Thread Info\\s*\r\n" {
+	    exp_continue
+	}
+
+	-re "^\\d+\\.\\d+\\.\\d+\\s+\\d+(?:\\.\\d+)?\\s*\r\n" {
+	    exp_continue
+	}
+
+	-re "^\\d+\\.\\d+\\.\\d+\\s+None\\s*\r\n" {
+	    fail $gdb_test_name
+	}
+
+	-re "^No Linux LWPs\r\n$::gdb_prompt" {
+	    unsupported $gdb_test_name
+	}
+
+	-re "^$::gdb_prompt $" {
+	    pass $gdb_test_name
+	}
+    }
+
     # Test that GDB is able to kill the inferior.  This may fail if
     # e.g., GDB does not dispose of the pre-exec threads properly.
     gdb_test "with confirm off -- kill" \
  
Pedro Alves April 4, 2023, 1:57 p.m. UTC | #2
Hi Andrew,

Took me a bit to find time to investigate this.  See below.

On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
> Pedro Alves <pedro@palves.net> writes:
> 
>> I noticed that after a following patch ("Step over clone syscall w/
>> breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
>> gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
>> end up with four new unexpected GDB core dumps:
>>
>> 		 === gdb Summary ===
>>
>>  # of unexpected core files      4
>>  # of expected passes            48
>>
>> That said patch is making the pre-existing
>> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
>> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>>
>>  #1 - a non-leader thread execs
>>  #2 - the post-exec program stops somewhere
>>  #3 - you kill the inferior
>>
>> Instead of #3 directly, the testcase just returns, which ends up in
>> gdb_exit, tearing down GDB, which kills the inferior, and is thus
>> equivalent to #3 above.
>>
>> Vis:
>>
>>  $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>>  ...
>>  (top-gdb) r
>>  ...
>>  (gdb) b main
>>  ...
>>  (gdb) r
>>  ...
>>  Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>>  69        argv0 = argv[0];
>>  (gdb) c
>>  Continuing.
>>  [New Thread 0x7ffff7d89700 (LWP 2506975)]
>>  Other going in exec.
>>  Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>  process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>
>>  Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>>  28        foo ();
>>  (gdb) k
>>  ...
>>  Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>>  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>  393         return m_suspend.waitstatus_pending_p;
>>  (top-gdb) bt
>>  #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>  #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>>  #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>>  #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>>  #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>>  #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>>  #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>>  #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
>>  #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>>  ...
> 
> It wasn't 100% clear to me if the above session was supposed to show a
> failure with GDB prior to *this* commit, or was a demonstration of what
> would happen if this commit is skipped, and the later commits applied.

Yes, prior to this commit.

> 
> I thought it was the second case, but I was so unsure that I tried the
> reproducer anyway.   Just in case I'm wrong, the above example doesn't
> seem to fail prior to this commit.

This surprised me, and when I tried it myself, I was even more surprised,
for I couldn't reproduce it either!

But I figured it out.

I'm usually using Ubuntu 22.04 for development nowadays, and in that system, indeed I can't
reproduce it.  Right after the exec, GDB traps a load event for "libc.so.6", which leads to
gdb trying to open libthread_db for the post-exec inferior, and, it succeeds.  When we load
libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
and then waits to see their stops.  While doing this, GDB detects that the pre-exec stale
LWP is gone, and deletes it.

The logs show:

[linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
[linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
[linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
[linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
[linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
[linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
[linux-nat] linux_nat_wait_1: exit
[linux-nat] stop_callback: kill 1725529.1725658.0 **<SIGSTOP>**
[linux-nat] stop_callback: lwp kill -1 No such process
[linux-nat] wait_lwp: 1725529.1725658.0 vanished.

And the backtrace is:

(top-gdb) bt
#0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
#1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
#2  0x0000555555ab12b3 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
#3  0x0000555555ab12e2 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
#4  0x0000555555ab0610 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
#5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
#6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
#7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
#8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
#9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
#10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
#11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
#12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
#13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
#14 0x000055555569e2d2 in std::__invoke_impl<void, void (*&)(objfile*), objfile*> (__f=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:61
#15 0x000055555569c44a in std::__invoke_r<void, void (*&)(objfile*), objfile*> (__fn=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:111
#16 0x0000555555699d69 in std::_Function_handler<void (objfile*), void (*)(objfile*)>::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
#17 0x0000555555b5f48b in std::function<void (objfile*)>::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
#18 0x0000555555b5eba4 in gdb::observers::observable<objfile*>::notify (this=0x5555565b5680 <gdb::observers::new_objfile>, args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
#19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
#20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
#21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
#22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
#23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
#24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
#25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
#26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
#27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
#28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
#29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231


Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
LWP still in the LWP list with no corresponding thread_info.  On this
machine, glibc predates the changes that make it possible to use libthread_db with
non-threaded processes, so try_thread_db_load doesn't manage to open a connection
to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
the stale lwp is not deleted.  And so a subsequent "kill" command crashes.

I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
and it must be that that version also predates the glibc change, and thus behaves like
this Fedora 32 box.  You are very likely using a newer Fedora which has the glibc change.

> 
>>
>> The root of the problem is that when a non-leader LWP execs, it just
>> changes its tid to the tgid, replacing the pre-exec leader thread,
>> becoming the new leader.  There's no thread exit event for the execing
>> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
>> ptrace man page says:
>>
>> "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
>> 	Stop the tracee at the next execve(2).  A waitpid(2) by the
>> 	tracer will return a status value such that
>>
>> 	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>>
>> 	If the execing thread is not a thread group leader, the thread
>> 	ID is reset to thread group leader's ID before this stop.
>> 	Since Linux 3.0, the former thread ID can be retrieved with
>> 	PTRACE_GETEVENTMSG."
>>
>> When the core of GDB processes an exec event, it deletes all the
>> threads of the inferior.  But, that is too late -- deleting the thread
>> does not delete the corresponding LWP, so we end up leaving the
>> pre-exec non-leader LWP stale in the LWP list.  That's what leads to
>> the crash above -- linux_nat_target::kill iterates over all LWPs, and
>> after the patch in question, that code will look for the corresponding
>> thread_info for each LWP.  For the pre-exec non-leader LWP still
>> listed, it won't find one.
>>
>> This patch fixes it, by deleting the pre-exec non-leader LWP (and
>> thread) from the LWP/thread lists as soon as we get an exec event out
>> of ptrace.
> 
> Given that we don't have a test *right now* for this issue, and instead
> rely on a future patch not failing, I wondered if there was any way
> that we could trigger a failure.
> 
> So I was poking around looking for places where we iterate over the
> all_lwps() list wondering which we could trigger that might cause a
> failure...
> 
> ... and then I thought: why not just have GDB tell us that the
> all_lwps() list is broken.
> 
> So I hacked up a new 'maint info linux-lwps' command.  It's not very
> interesting right now, here's the output in a multi-threaded inferior
> prior to the exec:
> 
>   (gdb) maintenance info linux-lwps 
>   LWP Ptid          Thread ID
>   1707218.1707239.0 2           
>   1707218.1707218.0 1           
> 
> And in your failure case (after the exec):
> 
>   (gdb) maintenance info linux-lwps 
>   LWP Ptid          Thread ID 
>   1708883.1708895.0 None      
>   1708883.1708883.0 1         
> 
> And then we can check this from the testscript, and now we have a test
> that fails before this commit, and passes afterwards.
> 
> And in the future we might find other information we want to add in the
> new maintenance command.
> 
> What are your thoughts on including this, or something like this with
> this commit?  My patch, which applies on top of this commit, is included
> at the end of this email.  Please feel free to take any changes that you
> feel add value.

I'm totally fine with such a command, though the test I had added covers
as much as it would, as the "kill" command fails when the maint command
would fail, and passes when the maint command passes.  But I'll incorporate
it.
  
Pedro Alves April 14, 2023, 7:29 p.m. UTC | #3
Hi!

On 2023-04-04 2:57 p.m., Pedro Alves wrote:
> On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
>>
>> I thought it was the second case, but I was so unsure that I tried the
>> reproducer anyway.   Just in case I'm wrong, the above example doesn't
>> seem to fail prior to this commit.
> 
> This surprised me, and when I tried it myself, I was even more surprised,
> for I couldn't reproduce it either!
> 
> But I figured it out.
> 
> I'm usually using Ubuntu 22.04 for development nowadays, and in that system, indeed I can't
> reproduce it.  Right after the exec, GDB traps a load event for "libc.so.6", which leads to
> gdb trying to open libthread_db for the post-exec inferior, and, it succeeds.  When we load
> libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
> and then waits to see their stops.  While doing this, GDB detects that the pre-exec stale
> LWP is gone, and deletes it.
> 
> The logs show:
> 
> [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
> [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
> [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
> [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
> [linux-nat] linux_nat_wait_1: exit
> [linux-nat] stop_callback: kill 1725529.1725658.0 **<SIGSTOP>**
> [linux-nat] stop_callback: lwp kill -1 No such process
> [linux-nat] wait_lwp: 1725529.1725658.0 vanished.
> 
> And the backtrace is:
> 
> (top-gdb) bt
> #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
> #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
> #2  0x0000555555ab12b3 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
> #3  0x0000555555ab12e2 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
> #4  0x0000555555ab0610 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
> #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
> #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
> #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
> #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
> #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
> #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
> #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
> #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
> #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
> #14 0x000055555569e2d2 in std::__invoke_impl<void, void (*&)(objfile*), objfile*> (__f=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:61
> #15 0x000055555569c44a in std::__invoke_r<void, void (*&)(objfile*), objfile*> (__fn=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:111
> #16 0x0000555555699d69 in std::_Function_handler<void (objfile*), void (*)(objfile*)>::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
> #17 0x0000555555b5f48b in std::function<void (objfile*)>::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
> #18 0x0000555555b5eba4 in gdb::observers::observable<objfile*>::notify (this=0x5555565b5680 <gdb::observers::new_objfile>, args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
> #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
> #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
> #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
> #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
> #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
> #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
> #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
> #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
> #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
> #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
> #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231
> 
> 
> Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
> LWP still in the LWP list with no corresponding thread_info.  On this
> machine, glibc predates the changes that make it possible to use libthread_db with
> non-threaded processes, so try_thread_db_load doesn't manage to open a connection
> to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
> the stale lwp is not deleted.  And so a subsequent "kill" command crashes.
> 
> I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
> and it must be that that version also predates the glibc change, and thus behaves like
> this Fedora 32 box.  You are very likely using a newer Fedora which has the glibc change.

...

>> What are your thoughts on including this, or something like this with
>> this commit?  My patch, which applies on top of this commit, is included
>> at the end of this email.  Please feel free to take any changes that you
>> feel add value.
> 
> I'm totally fine with such a command, though the test I had added covers
> as much as it would, as the "kill" command fails when the maint command
> would fail, and passes when the maint command passes.  But I'll incorporate
> it.
> 

I realized that my description of the problem above practically
suggests a way to expose the crash everywhere -- just catch the exec
event with "catch exec", so that the post-exec program doesn't even
get to the libc.so.6 load event, and issue "kill" there, or use "maint info linux-lwps".
So I've adjusted the patch to add a new testcase doing that.  I've attached two
patches, one adding your "maint info linux-lwps", now with NEWS/docs, and
the updated version of the crash fix and testcase.

WDYT?

Pedro Alves
  
Andrew Burgess May 26, 2023, 2:45 p.m. UTC | #4
Hi Pedro,

Sorry for the delay in looking at this again.  I had to find some time
to investigate this a little more as my results were still not aligning
with what you reported, but I think I understand what's going on now...

Pedro Alves <pedro@palves.net> writes:

> Hi Andrew,
>
> Took me a bit to find time to investigate this.  See below.
>
> On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
>> Pedro Alves <pedro@palves.net> writes:
>> 
>>> I noticed that after a following patch ("Step over clone syscall w/
>>> breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
>>> gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
>>> end up with four new unexpected GDB core dumps:
>>>
>>> 		 === gdb Summary ===
>>>
>>>  # of unexpected core files      4
>>>  # of expected passes            48
>>>
>>> That said patch is making the pre-existing
>>> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
>>> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>>>
>>>  #1 - a non-leader thread execs
>>>  #2 - the post-exec program stops somewhere
>>>  #3 - you kill the inferior
>>>
>>> Instead of #3 directly, the testcase just returns, which ends up in
>>> gdb_exit, tearing down GDB, which kills the inferior, and is thus
>>> equivalent to #3 above.
>>>
>>> Vis:
>>>
>>>  $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>>>  ...
>>>  (top-gdb) r
>>>  ...
>>>  (gdb) b main
>>>  ...
>>>  (gdb) r
>>>  ...
>>>  Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>>>  69        argv0 = argv[0];
>>>  (gdb) c
>>>  Continuing.
>>>  [New Thread 0x7ffff7d89700 (LWP 2506975)]
>>>  Other going in exec.
>>>  Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>>  process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>>
>>>  Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>>>  28        foo ();
>>>  (gdb) k
>>>  ...
>>>  Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>>>  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>>  393         return m_suspend.waitstatus_pending_p;
>>>  (top-gdb) bt
>>>  #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>>  #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>>>  #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>>>  #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>>>  #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>>>  #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>>>  #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>>>  #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
>>>  #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>>>  ...
>> 
>> It wasn't 100% clear to me if the above session was supposed to show a
>> failure with GDB prior to *this* commit, or was a demonstration of what
>> would happen if this commit is skipped, and the later commits applied.
>
> Yes, prior to this commit.

OK, but check your backtrace: notice that frame #2 is in
kill_unfollowed_child_callback.  This is only added in a later patch in
this series.  If we roll back to just this patch then the crash doesn't
reproduce!  But I can explain that...

>
>> 
>> I thought it was the second case, but I was so unsure that I tried the
>> reproducer anyway.   Just in case I'm wrong, the above example doesn't
>> seem to fail prior to this commit.
>
> This surprised me, and when I tried it myself, I was even more surprised,
> for I couldn't reproduce it either!
>
> But I figured it out.
>
> I'm usually using Ubuntu 22.04 for development nowadays, and in that system, indeed I can't
> reproduce it.  Right after the exec, GDB traps a load event for "libc.so.6", which leads to
> gdb trying to open libthread_db for the post-exec inferior, and, it succeeds.  When we load
> libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
> and then waits to see their stops.  While doing this, GDB detects that the pre-exec stale
> LWP is gone, and deletes it.
>
> The logs show:
>
> [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
> [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
> [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
> [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
> [linux-nat] linux_nat_wait_1: exit
> [linux-nat] stop_callback: kill 1725529.1725658.0 **<SIGSTOP>**
> [linux-nat] stop_callback: lwp kill -1 No such process
> [linux-nat] wait_lwp: 1725529.1725658.0 vanished.

When we look at just this patch, the linux_nat_target::kill function
calls kill_unfollowed_fork_children, which, unlike your later updated
kill_unfollowed_child_callback, doesn't require us to look up a
thread_info object.  It is the lookup (and dereference) of the
thread_info object that causes the segfault you are seeing.

Once we skip that code, the current linux_nat_target::kill function
actually starts with exactly the same function calls as
linux_stop_and_wait_all_lwps (maybe we should be calling that
function?)

And so, for _me_ when I 'kill' I end up calling the stop_callback
followed by the stop_wait_callback, which cleans up the rogue thread
just like you see.

>
> And the backtrace is:
>
> (top-gdb) bt
> #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
> #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
> #2  0x0000555555ab12b3 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
> #3  0x0000555555ab12e2 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
> #4  0x0000555555ab0610 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
> #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
> #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
> #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
> #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
> #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
> #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
> #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
> #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
> #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
> #14 0x000055555569e2d2 in std::__invoke_impl<void, void (*&)(objfile*), objfile*> (__f=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:61
> #15 0x000055555569c44a in std::__invoke_r<void, void (*&)(objfile*), objfile*> (__fn=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:111
> #16 0x0000555555699d69 in std::_Function_handler<void (objfile*), void (*)(objfile*)>::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
> #17 0x0000555555b5f48b in std::function<void (objfile*)>::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
> #18 0x0000555555b5eba4 in gdb::observers::observable<objfile*>::notify (this=0x5555565b5680 <gdb::observers::new_objfile>, args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
> #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
> #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
> #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
> #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
> #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
> #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
> #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
> #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
> #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
> #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
> #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231
>
>
> Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
> LWP still in the LWP list with no corresponding thread_info.  On this
> machine, glibc predates the changes that make it possible to use libthread_db with
> non-threaded processes, so try_thread_db_load doesn't manage to open a connection
> to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
> the stale lwp is not deleted.  And so a subsequent "kill" command crashes.
>
> I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
> and it must be that that version also predates the glibc change, and thus behaves like
> this Fedora 32 box.  You are very likely using a newer Fedora which has the glibc change.

To my shame I'm actually running an even older Fedora install.  One day
GDB will be finished, and then I'll find time to reinstall :)  So I'm not
cleaning up the thread in the libthread_db code.

>
>> 
>>>
>>> The root of the problem is that when a non-leader LWP execs, it just
>>> changes its tid to the tgid, replacing the pre-exec leader thread,
>>> becoming the new leader.  There's no thread exit event for the execing
>>> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
>>> ptrace man page says:
>>>
>>> "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
>>> 	Stop the tracee at the next execve(2).  A waitpid(2) by the
>>> 	tracer will return a status value such that
>>>
>>> 	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>>>
>>> 	If the execing thread is not a thread group leader, the thread
>>> 	ID is reset to thread group leader's ID before this stop.
>>> 	Since Linux 3.0, the former thread ID can be retrieved with
>>> 	PTRACE_GETEVENTMSG."
>>>
>>> When the core of GDB processes an exec event, it deletes all the
>>> threads of the inferior.  But, that is too late -- deleting the thread
>>> does not delete the corresponding LWP, so we end up leaving the
>>> pre-exec non-leader LWP stale in the LWP list.  That's what leads to
>>> the crash above -- linux_nat_target::kill iterates over all LWPs, and
>>> after the patch in question, that code will look for the corresponding
>>> thread_info for each LWP.  For the pre-exec non-leader LWP still
>>> listed, it won't find one.
>>>
>>> This patch fixes it, by deleting the pre-exec non-leader LWP (and
>>> thread) from the LWP/thread lists as soon as we get an exec event out
>>> of ptrace.
>> 
>> Given that we don't have a test *right now* for this issue, and instead
>> rely on a future patch not failing, I wondered if there was any way
>> that we could trigger a failure.
>> 
>> So I was poking around looking for places where we iterate over the
>> all_lwps() list wondering which we could trigger that might cause a
>> failure...
>> 
>> ... and then I thought: why not just have GDB tell us that the
>> all_lwps() list is broken.
>> 
>> So I hacked up a new 'maint info linux-lwps' command.  It's not very
>> interesting right now, here's the output in a multi-threaded inferior
>> prior to the exec:
>> 
>>   (gdb) maintenance info linux-lwps 
>>   LWP Ptid          Thread ID
>>   1707218.1707239.0 2           
>>   1707218.1707218.0 1           
>> 
>> And in your failure case (after the exec):
>> 
>>   (gdb) maintenance info linux-lwps 
>>   LWP Ptid          Thread ID 
>>   1708883.1708895.0 None      
>>   1708883.1708883.0 1         
>> 
>> And then we can check this from the testscript, and now we have a test
>> that fails before this commit, and passes afterwards.
>> 
>> And in the future we might find other information we want to add in the
>> new maintenance command.
>> 
>> What are your thoughts on including this, or something like this with
>> this commit?  My patch, which applies on top of this commit, is included
>> at the end of this email.  Please feel free to take any changes that you
>> feel add value.
>
> I'm totally fine with such a command, though the test I had added covers
> as much as it would, as the "kill" command fails when the maint command
> would fail, and passes when the maint command passes.  But I'll incorporate
> it.

Given the above, this isn't quite right.  It would appear that someone
using the libthread_db code would, as you point out, see both tests
fail.  But without the libthread_db code the 'maint' command will show
the problem, while the 'kill' command isn't going to, even if you
force the 'kill' earlier like you propose in your new patch[1].

This doesn't actually change anything -- you did add the new 'maint'
command in your next patch; it was just that the behaviour I was seeing
didn't align with what you saw, and I think we now know why.

I'll reply separately to [1] with my feedback on that patch.

[1] https://inbox.sourceware.org/gdb-patches/5b80a2c3-3679-fb86-27f3-0dcc9c019562@palves.net/#t


Thanks,
Andrew
  
Andrew Burgess May 26, 2023, 3:04 p.m. UTC | #5
Pedro Alves <pedro@palves.net> writes:

> Hi!
>
> On 2023-04-04 2:57 p.m., Pedro Alves wrote:
>> On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
>>>
>>> I thought it was the second case, but I was so unsure that I tried the
>>> reproducer anyway.   Just in case I'm wrong, the above example doesn't
>>> seem to fail prior to this commit.
>> 
>> This surprised me, and when I tried it myself, I was even more surprised,
>> for I couldn't reproduce it either!
>> 
>> But I figured it out.
>> 
>> I'm usually using Ubuntu 22.04 for development nowadays, and on that system, indeed I can't
>> reproduce it.  Right after the exec, GDB traps a load event for "libc.so.6", which leads to
>> GDB trying to open libthread_db for the post-exec inferior, and it succeeds.  When we load
>> libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
>> and then waits to see their stops.  While doing this, GDB detects that the pre-exec stale
>> LWP is gone, and deletes it.
>> 
>> The logs show:
>> 
>> [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
>> [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
>> [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
>> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
>> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
>> [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
>> [linux-nat] linux_nat_wait_1: exit
>> [linux-nat] stop_callback: kill 1725529.1725658.0 **<SIGSTOP>**
>> [linux-nat] stop_callback: lwp kill -1 No such process
>> [linux-nat] wait_lwp: 1725529.1725658.0 vanished.
>> 
>> And the backtrace is:
>> 
>> (top-gdb) bt
>> #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
>> #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
>> #2  0x0000555555ab12b3 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
>> #3  0x0000555555ab12e2 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
>> #4  0x0000555555ab0610 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
>> #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
>> #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
>> #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
>> #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
>> #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
>> #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
>> #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
>> #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
>> #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
>> #14 0x000055555569e2d2 in std::__invoke_impl<void, void (*&)(objfile*), objfile*> (__f=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:61
>> #15 0x000055555569c44a in std::__invoke_r<void, void (*&)(objfile*), objfile*> (__fn=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:111
>> #16 0x0000555555699d69 in std::_Function_handler<void (objfile*), void (*)(objfile*)>::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
>> #17 0x0000555555b5f48b in std::function<void (objfile*)>::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
>> #18 0x0000555555b5eba4 in gdb::observers::observable<objfile*>::notify (this=0x5555565b5680 <gdb::observers::new_objfile>, args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
>> #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
>> #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
>> #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
>> #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
>> #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
>> #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
>> #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
>> #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
>> #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
>> #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
>> #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231
>> 
>> 
>> Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
>> LWP still in the LWP list with no corresponding thread_info.  On this
>> machine, glibc predates the changes that make it possible to use libthread_db with
>> non-threaded processes, so try_thread_db_load doesn't manage to open a connection
>> to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
>> the stale lwp is not deleted.  And so a subsequent "kill" command crashes.
>> 
>> I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
>> and it must be that that version also predates the glibc change, and thus behaves like
>> this Fedora 32 box.  You are very likely using a newer Fedora which has the glibc change.
>
> ...
>
>>> What are your thoughts on including this, or something like this with
>>> this commit?  My patch, which applies on top of this commit, is included
>>> at the end of this email.  Please feel free to take any changes that you
>>> feel add value.
>> 
>> I'm totally fine with such a command, though the test I had added covers
>> as much as it would, as the "kill" command fails when the maint command
>> would fail, and passes when the maint command passes.  But I'll incorporate
>> it.
>> 
>
> I realized that my description of the problem above practically
> suggests a way to expose the crash everywhere -- just catch the exec
> event with "catch exec", so that the post-exec program doesn't even
> get to the libc.so.6 load event, and issue "kill" there, or use "maint info linux-lwps".
> So I've adjusted the patch to add a new testcase doing that.  I've attached two
> patches, one adding your "maint info linux-lwps", now with NEWS/docs, and
> the updated version of the crash fix and testcase.
>
> WDYT?
>
> Pedro Alves
> From 450e0133fc884f027cce4ae65378ea5560f6464d Mon Sep 17 00:00:00 2001
> From: Andrew Burgess <aburgess@redhat.com>
> Date: Tue, 4 Apr 2023 14:50:35 +0100
> Subject: [PATCH 1/2] Add "maint info linux-lwps" command
>
> This adds a maintenance command that lets you list all the LWPs under
> control of the linux-nat target.
>
> For example:
>
>  (gdb) maint info linux-lwps
>  LWP Ptid        Thread ID
>  560948.561047.0 None
>  560948.560948.0 1.1
>
> This shows that the "560948.561047.0" LWP doesn't map to any thread_info
> object, which is bogus.  We'll be using this in a testcase in a
> following patch.
>
> Co-Authored-By: Pedro Alves <pedro@palves.net>
> Change-Id: Ic4e9e123385976e5cd054391990124b7a20fb3f5
> ---
>  gdb/NEWS            |  3 +++
>  gdb/doc/gdb.texinfo |  4 ++++
>  gdb/linux-nat.c     | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+)
>
> diff --git a/gdb/NEWS b/gdb/NEWS
> index d729aa24056..3747e7d52c1 100644
> --- a/gdb/NEWS
> +++ b/gdb/NEWS
> @@ -78,6 +78,9 @@ maintenance info frame-unwinders
>  maintenance wait-for-index-cache
>    Wait until all pending writes to the index cache have completed.
>  
> +maintenance info linux-lwps
> +  List all LWPs under control of the linux-nat target.
> +
>  set always-read-ctf on|off
>  show always-read-ctf
>    When off, CTF is only read if DWARF is not present.  When on, CTF is
> diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
> index 6c811b8be2e..398bbb88af6 100644
> --- a/gdb/doc/gdb.texinfo
> +++ b/gdb/doc/gdb.texinfo
> @@ -40605,6 +40605,10 @@ module (@pxref{Disassembly In Python}), and will only be present after
>  that module has been imported.  To force the module to be imported do
>  the following:
>  
> +@kindex maint info linux-lwps
> +@item maint info linux-lwps
> +Print information about LWPs under control of the Linux native target.
> +
>  @smallexample
>  (@value{GDBP}) python import gdb.disassembler
>  @end smallexample
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 944f23de01a..68816ddc999 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -4479,6 +4479,49 @@ current_lwp_ptid (void)
>    return inferior_ptid;
>  }
>  
> +/* Implement 'maintenance info linux-lwps'.  Displays some basic
> +   information about all the current lwp_info objects.  */
> +
> +static void
> +maintenance_info_lwps (const char *arg, int from_tty)
> +{
> +  if (all_lwps ().size () == 0)
> +    {
> +      gdb_printf ("No Linux LWPs\n");
> +      return;
> +    }
> +
> +  /* Start the width at 8 to match the column heading below, then
> +     figure out the widest ptid string.  We'll use this to build our
> +     output table below.  */
> +  size_t ptid_width = 8;
> +  for (lwp_info *lp : all_lwps ())
> +    ptid_width = std::max (ptid_width, lp->ptid.to_string ().size ());
> +
> +  /* Setup the table headers.  */
> +  struct ui_out *uiout = current_uiout;
> +  ui_out_emit_table table_emitter (uiout, 2, -1, "linux-lwps");
> +  uiout->table_header (ptid_width, ui_left, "lwp-ptid", _("LWP Ptid"));
> +  uiout->table_header (9, ui_left, "thread-info", _("Thread ID"));
> +  uiout->table_body ();
> +
> +  /* Display one table row for each lwp_info.  */
> +  for (lwp_info *lp : all_lwps ())
> +    {
> +      ui_out_emit_tuple tuple_emitter (uiout, "lwp-entry");
> +
> +      struct thread_info *th = find_thread_ptid (linux_target, lp->ptid);

After recent changes this line becomes:

  struct thread_info *th = linux_target->find_thread (lp->ptid);

> +
> +      uiout->field_string ("lwp-ptid", lp->ptid.to_string ().c_str ());
> +      if (th == nullptr)
> +	uiout->field_string ("thread-info", "None");
> +      else
> +	uiout->field_string ("thread-info", print_full_thread_id (th));
> +
> +      uiout->message ("\n");
> +    }
> +}
> +
>  void _initialize_linux_nat ();
>  void
>  _initialize_linux_nat ()
> @@ -4516,6 +4559,9 @@ Enables printf debugging output."),
>    sigemptyset (&blocked_mask);
>  
>    lwp_lwpid_htab_create ();
> +
> +  add_cmd ("linux-lwps", class_maintenance, maintenance_info_lwps,
> +	 _("List the Linux LWPs."), &maintenanceinfolist);
>  }
>  
>  
>
> base-commit: 57573e54afb9f7ed957eec43dfd2830f2384c970
> prerequisite-patch-id: 3a896bfe4b7c66a2e3a6aa668c5ae8395e5d8a52
> -- 
> 2.36.0
>
> From ee0a276c08b829ae504fe0eba5badc4f7faf3676 Mon Sep 17 00:00:00 2001
> From: Pedro Alves <pedro@palves.net>
> Date: Wed, 13 Jul 2022 17:16:38 +0100
> Subject: [PATCH 2/2] gdb/linux: Delete all other LWPs immediately on ptrace
>  exec event
>
> I noticed that on an Ubuntu 20.04 system, after a following patch
> ("Step over clone syscall w/ breakpoint,
> TARGET_WAITKIND_THREAD_CLONED"), the gdb.threads/step-over-exec.exp
> was passing cleanly, but still, we'd end up with four new unexpected
> GDB core dumps:
>
> 		 === gdb Summary ===
>
>  # of unexpected core files      4
>  # of expected passes            48
>
> That said patch is making the pre-existing
> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>
>  #1 - a non-leader thread execs
>  #2 - the post-exec program stops somewhere
>  #3 - you kill the inferior
>
> Instead of #3 directly, the testcase just returns, which ends up in
> gdb_exit, tearing down GDB, which kills the inferior, and is thus
> equivalent to #3 above.
>
> Vis:
>
>  $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>  ...
>  (top-gdb) r
>  ...
>  (gdb) b main
>  ...
>  (gdb) r
>  ...
>  Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>  69        argv0 = argv[0];
>  (gdb) c
>  Continuing.
>  [New Thread 0x7ffff7d89700 (LWP 2506975)]
>  Other going in exec.
>  Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>  process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>
>  Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>  28        foo ();
>  (gdb) k
>  ...
>  Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>  393         return m_suspend.waitstatus_pending_p;
>  (top-gdb) bt
>  #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>  #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>  #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>  #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>  #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>  #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>  #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>  #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
>  #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>  ...

As I mentioned in my other message, this backtrace includes
kill_unfollowed_child_callback, which doesn't exist yet!  I think that's
OK though, the text before the backtrace does make it clear that you saw
this problem only after applying a later patch.

>
> The root of the problem is that when a non-leader LWP execs, it just
> changes its tid to the tgid, replacing the pre-exec leader thread,
> becoming the new leader.  There's no thread exit event for the execing
> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
> ptrace man page says:
>
> "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
> 	Stop the tracee at the next execve(2).  A waitpid(2) by the
> 	tracer will return a status value such that
>
> 	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>
> 	If the execing thread is not a thread group leader, the thread
> 	ID is reset to thread group leader's ID before this stop.
> 	Since Linux 3.0, the former thread ID can be retrieved with
> 	PTRACE_GETEVENTMSG."
>
> When the core of GDB processes an exec event, it deletes all the
> threads of the inferior.  But that is too late -- deleting the thread
> does not delete the corresponding LWP, so we end up leaving the pre-exec
> non-leader LWP stale in the LWP list.  That's what leads to the crash
> above -- linux_nat_target::kill iterates over all LWPs, and after the
> patch in question, that code will look for the corresponding
> thread_info for each LWP.  For the pre-exec non-leader LWP still
> listed, it won't find one.
>
> This patch fixes it, by deleting the pre-exec non-leader LWP (and
> thread) from the LWP/thread lists as soon as we get an exec event out
> of ptrace.
>
> GDBserver does not need an equivalent fix, because it is already doing
> this, as a side effect of mourning the pre-exec process, in
> gdbserver/linux-low.cc:
>
>   else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
>     {
> ...
>       /* Delete the execing process and all its threads.  */
>       mourn (proc);
>       switch_to_thread (nullptr);
>
>
> The crash with gdb.threads/step-over-exec.exp is not observable on
> newer systems, which postdate the glibc change to move "libpthread.so"
> internals to "libc.so.6", because right after the exec, GDB traps a
> load event for "libc.so.6", which leads to GDB trying to open
> libthread_db for the post-exec inferior, and, on such systems, that
> succeeds.  When we load libthread_db, we call
> linux_stop_and_wait_all_lwps, which, as the name suggests, stops all
> lwps, and then waits to see their stops.  While doing this, GDB
> detects that the pre-exec stale LWP is gone, and deletes it.
>
> If we use "catch exec" to stop right at the exec before the
> "libc.so.6" load event ever happens, and issue "kill" right there,
> then GDB crashes on newer systems as well.  So instead of tweaking
> gdb.threads/step-over-exec.exp to cover the fix, add a new
> gdb.threads/threads-after-exec.exp testcase that uses "catch exec".

Maybe it's worth mentioning that, because the crash itself only happens
once a later patch is applied, we use 'maint info linux-lwps' to reveal
the issue for now?

>
> Also tweak a comment in infrun.c:follow_exec referring to how
> linux-nat.c used to behave, as it would become stale otherwise.
>
> Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
> ---
>  gdb/infrun.c                                  |  8 +--
>  gdb/linux-nat.c                               | 15 ++++
>  .../gdb.threads/threads-after-exec.exp        | 70 +++++++++++++++++++

Oops, this diff is missing the two source files for this test (.c and
-execd.c).  I was able to figure something out though so I could test
the rest of this patch :)

>  3 files changed, 88 insertions(+), 5 deletions(-)
>  create mode 100644 gdb/testsuite/gdb.threads/threads-after-exec.exp
>
> diff --git a/gdb/infrun.c b/gdb/infrun.c
> index abe49ae0f2f..93edc224622 100644
> --- a/gdb/infrun.c
> +++ b/gdb/infrun.c
> @@ -1224,13 +1224,11 @@ follow_exec (ptid_t ptid, const char *exec_file_target)
>       some other thread does the exec, and even if the main thread was
>       stopped or already gone.  We may still have non-leader threads of
>       the process on our list.  E.g., on targets that don't have thread
> -     exit events (like remote); or on native Linux in non-stop mode if
> -     there were only two threads in the inferior and the non-leader
> -     one is the one that execs (and nothing forces an update of the
> -     thread list up to here).  When debugging remotely, it's best to
> +     exit events (like remote) and nothing forces an update of the
> +     thread list up to here.  When debugging remotely, it's best to
>       avoid extra traffic, when possible, so avoid syncing the thread
>       list with the target, and instead go ahead and delete all threads
> -     of the process but one that reported the event.  Note this must
> +     of the process but the one that reported the event.  Note this must
>       be done before calling update_breakpoints_after_exec, as
>       otherwise clearing the threads' resources would reference stale
>       thread breakpoints -- it may have been one of these threads that
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 68816ddc999..90ac94440b8 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2001,6 +2001,21 @@ linux_handle_extended_wait (struct lwp_info *lp, int status)
>  	 thread execs, it changes its tid to the tgid, and the old
>  	 tgid thread might have not been resumed.  */
>        lp->resumed = 1;
> +
> +      /* All other LWPs are gone now.  We'll have received a thread
> +	 exit notification for all threads other than the execing one.
> +	 That one, if it wasn't the leader, just silently changes its
> +	 tid to the tgid, and the previous leader vanishes.  Since
> +	 Linux 3.0, the former thread ID can be retrieved with
> +	 PTRACE_GETEVENTMSG, but since we support older kernels, don't
> +	 bother with it, and just walk the LWP list.  Even with
> +	 PTRACE_GETEVENTMSG, we'd still need to lookup the
> +	 corresponding LWP object, and it would be an extra ptrace
> +	 syscall, so this way may even be more efficient.  */
> +      for (lwp_info *other_lp : all_lwps_safe ())
> +	if (other_lp != lp && other_lp->ptid.pid () == lp->ptid.pid ())
> +	  exit_lwp (other_lp);
> +
>        return 0;
>      }
>  
> diff --git a/gdb/testsuite/gdb.threads/threads-after-exec.exp b/gdb/testsuite/gdb.threads/threads-after-exec.exp
> new file mode 100644
> index 00000000000..824dda349a6
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/threads-after-exec.exp
> @@ -0,0 +1,70 @@
> +# Copyright 2023 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# Test that after an exec of a non-leader thread, we don't leave the
> +# non-leader thread listed in internal thread lists, causing problems.
> +
> +standard_testfile .c -execd.c
> +
> +proc do_test { } {
> +    global srcdir subdir srcfile srcfile2 binfile testfile
> +    global decimal
> +
> +    # Compile main binary (the one that does the exec).
> +    if {[gdb_compile_pthreads $srcdir/$subdir/$srcfile $binfile \
> +	     executable {debug}] != "" } {
> +	return -1
> +    }

You can do:

    if {[build_executable "failed to build main executable" \
             $binfile $srcfile {debug pthread}] == -1} {
	return -1
    }

> +
> +    # Compile the second binary (the one that gets exec'd).
> +    if {[gdb_compile $srcdir/$subdir/$srcfile2 $binfile-execd \
> +	     executable {debug}] != "" } {
> +	return -1
> +    }

And:

    if {[build_executable "failed to build execd executable" \
             $binfile-execd $srcfile2 {debug}] == -1} {
	return -1
    }

I thought we were moving away from calling the gdb_compile* functions
directly.

Assuming the missing source files are added, this all looks great.

Reviewed-By: Andrew Burgess <aburgess@redhat.com>

Thanks,
Andrew

> +
> +    clean_restart $binfile
> +
> +    if ![runto_main] {
> +	return
> +    }
> +
> +    gdb_test "catch exec" "Catchpoint $decimal \\(exec\\)"
> +
> +    gdb_test "continue" "Catchpoint $decimal .*" "continue until exec"
> +
> +    # Confirm we only have one thread in the thread list.
> +    gdb_test "info threads" "\\* 1\[ \t\]+\[^\r\n\]+.*"
> +
> +    if {[istarget *-*-linux*] && [gdb_is_target_native]} {
> +	# Confirm there's only one LWP in the list as well, and that
> +	# it is bound to thread 1.1.
> +	set inf_pid [get_inferior_pid]
> +	gdb_test_multiple "maint info linux-lwps" "" {
> +	    -wrap -re "Thread ID *\r\n$inf_pid\.$inf_pid\.0\[ \t\]+1\.1 *" {
> +		pass $gdb_test_name
> +	    }
> +	}
> +    }
> +
> +    # Test that GDB is able to kill the inferior.  This used to crash
> +    # on native Linux as GDB did not dispose of the pre-exec LWP for
> +    # the non-leader (and that LWP did not have a matching thread in
> +    # the core thread list).
> +    gdb_test "with confirm off -- kill" \
> +	"\\\[Inferior 1 (.*) killed\\\]" \
> +	"kill inferior"
> +}
> +
> +do_test
> -- 
> 2.36.0
  
Pedro Alves Nov. 13, 2023, 2:04 p.m. UTC | #6
Hi!

At long last, I've resumed work on this again.  A belated thank you so much for
the reviews.

On 2023-05-26 16:04, Andrew Burgess wrote:
> Pedro Alves <pedro@palves.net> writes:

>> +  /* Display one table row for each lwp_info.  */
>> +  for (lwp_info *lp : all_lwps ())
>> +    {
>> +      ui_out_emit_tuple tuple_emitter (uiout, "lwp-entry");
>> +
>> +      struct thread_info *th = find_thread_ptid (linux_target, lp->ptid);
> 
> After recent changes this line becomes:
> 
>   struct thread_info *th = linux_target->find_thread (lp->ptid);

Thanks, done.

>> From ee0a276c08b829ae504fe0eba5badc4f7faf3676 Mon Sep 17 00:00:00 2001
>> From: Pedro Alves <pedro@palves.net>
>> Date: Wed, 13 Jul 2022 17:16:38 +0100
>> Subject: [PATCH 2/2] gdb/linux: Delete all other LWPs immediately on ptrace
>>  exec event
>>
>> I noticed that on an Ubuntu 20.04 system, after a following patch
>> ("Step over clone syscall w/ breakpoint,
>> TARGET_WAITKIND_THREAD_CLONED"), the gdb.threads/step-over-exec.exp
>> was passing cleanly, but still, we'd end up with four new unexpected
>> GDB core dumps:
>>
>> 		 === gdb Summary ===
>>
>>  # of unexpected core files      4
>>  # of expected passes            48
>>
>> That said patch is making the pre-existing
>> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
>> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>>
>>  #1 - a non-leader thread execs
>>  #2 - the post-exec program stops somewhere
>>  #3 - you kill the inferior
>>
>> Instead of #3 directly, the testcase just returns, which ends up in
>> gdb_exit, tearing down GDB, which kills the inferior, and is thus
>> equivalent to #3 above.
>>
>> Vis:
>>
>>  $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>>  ...
>>  (top-gdb) r
>>  ...
>>  (gdb) b main
>>  ...
>>  (gdb) r
>>  ...
>>  Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>>  69        argv0 = argv[0];
>>  (gdb) c
>>  Continuing.
>>  [New Thread 0x7ffff7d89700 (LWP 2506975)]
>>  Other going in exec.
>>  Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>  process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>
>>  Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>>  28        foo ();
>>  (gdb) k
>>  ...
>>  Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>>  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>  393         return m_suspend.waitstatus_pending_p;
>>  (top-gdb) bt
>>  #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>  #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>>  #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>>  #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>>  #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>>  #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>>  #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>>  #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
>>  #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>>  ...
> 
> As I mentioned in my other message, this backtrace includes
> kill_unfollowed_child_callback, which doesn't exist yet!  I think that's
> OK though, the text before the backtrace does make it clear that you saw
> this problem only after applying a later patch.

I've tweaked the commit log to make that more explicit.

> 
>>
>> The root of the problem is that when a non-leader LWP execs, it just
>> changes its tid to the tgid, replacing the pre-exec leader thread,
>> becoming the new leader.  There's no thread exit event for the execing
>> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
>> ptrace man page says:
>>
>> "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
>> 	Stop the tracee at the next execve(2).  A waitpid(2) by the
>> 	tracer will return a status value such that
>>
>> 	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>>
>> 	If the execing thread is not a thread group leader, the thread
>> 	ID is reset to thread group leader's ID before this stop.
>> 	Since Linux 3.0, the former thread ID can be retrieved with
>> 	PTRACE_GETEVENTMSG."
>>
>> When the core of GDB processes an exec event, it deletes all the
>> threads of the inferior.  But that is too late -- deleting the thread
>> does not delete the corresponding LWP, so we end up leaving the pre-exec
>> non-leader LWP stale in the LWP list.  That's what leads to the crash
>> above -- linux_nat_target::kill iterates over all LWPs, and after the
>> patch in question, that code will look for the corresponding
>> thread_info for each LWP.  For the pre-exec non-leader LWP still
>> listed, it won't find one.
>>
>> This patch fixes it, by deleting the pre-exec non-leader LWP (and
>> thread) from the LWP/thread lists as soon as we get an exec event out
>> of ptrace.
>>
>> GDBserver does not need an equivalent fix, because it is already doing
>> this, as a side effect of mourning the pre-exec process, in
>> gdbserver/linux-low.cc:
>>
>>   else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
>>     {
>> ...
>>       /* Delete the execing process and all its threads.  */
>>       mourn (proc);
>>       switch_to_thread (nullptr);
>>
>>
>> The crash with gdb.threads/step-over-exec.exp is not observable on
>> newer systems, which postdate the glibc change to move "libpthread.so"
>> internals to "libc.so.6", because right after the exec, GDB traps a
>> load event for "libc.so.6", which leads to GDB trying to open
>> libthread_db for the post-exec inferior, and, on such systems, that
>> succeeds.  When we load libthread_db, we call
>> linux_stop_and_wait_all_lwps, which, as the name suggests, stops all
>> lwps, and then waits to see their stops.  While doing this, GDB
>> detects that the pre-exec stale LWP is gone, and deletes it.
>>
>> If we use "catch exec" to stop right at the exec before the
>> "libc.so.6" load event ever happens, and issue "kill" right there,
>> then GDB crashes on newer systems as well.  So instead of tweaking
>> gdb.threads/step-over-exec.exp to cover the fix, add a new
>> gdb.threads/threads-after-exec.exp testcase that uses "catch exec".
> 
> Maybe it's worth mentioning that, because the crash itself only happens
> once a later patch is applied, we use 'maint info linux-lwps' to reveal
> the issue for now?

I've done something like that.

>>
>> Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
>> ---
>>  gdb/infrun.c                                  |  8 +--
>>  gdb/linux-nat.c                               | 15 ++++
>>  .../gdb.threads/threads-after-exec.exp        | 70 +++++++++++++++++++
> 
> Oops, this diff is missing the two source files for this test (.c and
> -execd.c).  I was able to figure something out though so I could test
> the rest of this patch :)
> 

Ouch.  And enough time has passed that I completely lost those files.  But as you
discovered, they're trivial enough to rewrite.  Actually, I simplified, and
only use one .c file this time.


>>  3 files changed, 88 insertions(+), 5 deletions(-)
>>  create mode 100644 gdb/testsuite/gdb.threads/threads-after-exec.exp
>>

...

>> +
>> +standard_testfile .c -execd.c
>> +
>> +proc do_test { } {
>> +    global srcdir subdir srcfile srcfile2 binfile testfile
>> +    global decimal
>> +
>> +    # Compile main binary (the one that does the exec).
>> +    if {[gdb_compile_pthreads $srcdir/$subdir/$srcfile $binfile \
>> +	     executable {debug}] != "" } {
>> +	return -1
>> +    }
> 
> You can do:
> 
>     if {[build_executable "failed to build main executable" \
>              $binfile $srcfile {debug pthread}] == -1} {
> 	return -1
>     }
> 
>> +
>> +    # Compile the second binary (the one that gets exec'd).
>> +    if {[gdb_compile $srcdir/$subdir/$srcfile2 $binfile-execd \
>> +	     executable {debug}] != "" } {
>> +	return -1
>> +    }
> 
> And:
> 
>     if {[build_executable "failed to build execd executable" \
>              $binfile-execd $srcfile2 {debug}] == -1} {
> 	return -1
>     }
> 
> I thought we were moving away from calling the gdb_compile* functions
> directly.
> 

Done-ish -- since I only have one test program this time, I can use
prepare_for_testing instead.

> Assuming the missing source files are added, this all looks great.
> 
> Reviewed-By: Andrew Burgess <aburgess@redhat.com>
  

Patch

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 9b78fd1f8e8..5ee3227f1b9 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -1986,6 +1986,21 @@  linux_handle_extended_wait (struct lwp_info *lp, int status)
 	 thread execs, it changes its tid to the tgid, and the old
 	 tgid thread might have not been resumed.  */
       lp->resumed = 1;
+
+      /* All other LWPs are gone now.  We'll have received a thread
+	 exit notification for all threads other than the execing one.
+	 That one, if it wasn't the leader, just silently changes its
+	 tid to the tgid, and the previous leader vanishes.  Since
+	 Linux 3.0, the former thread ID can be retrieved with
+	 PTRACE_GETEVENTMSG, but since we support older kernels, don't
+	 bother with it, and just walk the LWP list.  Even with
+	 PTRACE_GETEVENTMSG, we'd still need to lookup the
+	 corresponding LWP object, and it would be an extra ptrace
+	 syscall, so this way may even be more efficient.  */
+      for (lwp_info *other_lp : all_lwps_safe ())
+	if (other_lp != lp && other_lp->ptid.pid () == lp->ptid.pid ())
+	  exit_lwp (other_lp);
+
       return 0;
     }
 
diff --git a/gdb/testsuite/gdb.threads/step-over-exec.exp b/gdb/testsuite/gdb.threads/step-over-exec.exp
index 783f865585c..a8b01f8aeda 100644
--- a/gdb/testsuite/gdb.threads/step-over-exec.exp
+++ b/gdb/testsuite/gdb.threads/step-over-exec.exp
@@ -102,6 +102,12 @@  proc do_test { execr_thread different_text_segments displaced_stepping } {
     gdb_breakpoint foo
     gdb_test "continue" "Breakpoint $decimal, foo .*" \
 	"continue to foo"
+
+    # Test that GDB is able to kill the inferior.  This may fail if
+    # e.g., GDB does not dispose of the pre-exec threads properly.
+    gdb_test "with confirm off -- kill" \
+	"\\\[Inferior 1 (.*) killed\\\]" \
+	"kill inferior"
 }
 
 foreach_with_prefix displaced_stepping {auto off} {