[v2] Fix heap-use-after-free in select_event_lwp

Message ID 20240123114830.20253-1-tdevries@suse.de
State Superseded
Headers
Series [v2] Fix heap-use-after-free in select_event_lwp |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Tom de Vries Jan. 23, 2024, 11:48 a.m. UTC
  When building gdb with -O0 -fsanitize=thread, and running test-case
gdb.base/vfork-follow-parent.exp, 5 times out of 10 I run into:
...
WARNING: ThreadSanitizer: heap-use-after-free (pid=249653)
  Write of size 4 at 0xffffee83055c by main thread:
    #0 select_event_lwp gdb/linux-nat.c:2809 (gdb+0xb0a65c)
    #1 linux_nat_wait_1 gdb/linux-nat.c:3389 (gdb+0xb0c470)
    #2 linux_nat_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-nat.c:3560 (gdb+0xb0cfc8)
    #3 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-thread-db.c:1402 (gdb+0xb35958)
    #4 target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/target.c:2571 (gdb+0xfb6c34)
    #5 do_target_wait_1 gdb/infrun.c:4120 (gdb+0xa99dc4)
    #6 operator() gdb/infrun.c:4179 (gdb+0xa99f70)
    #7 do_target_wait gdb/infrun.c:4198 (gdb+0xa9a2bc)
    #8 fetch_inferior_event() gdb/infrun.c:4629 (gdb+0xa9b658)
    #9 inferior_event_handler(inferior_event_type) gdb/inf-loop.c:42 (gdb+0xa6b0c8)
    #10 handle_target_event gdb/linux-nat.c:4357 (gdb+0xb0f694)
    #11 handle_file_event gdbsupport/event-loop.cc:573 (gdb+0x1cfc03c)
    #12 gdb_wait_for_event gdbsupport/event-loop.cc:694 (gdb+0x1cfc700)
    #13 gdb_do_one_event(int) gdbsupport/event-loop.cc:217 (gdb+0x1cfa8ac)
    #14 start_event_loop gdb/main.c:408 (gdb+0xb7be9c)
    #15 captured_command_loop gdb/main.c:472 (gdb+0xb7c0cc)
    #16 captured_main gdb/main.c:1342 (gdb+0xb7e4e4)
    #17 gdb_main(captured_main_args*) gdb/main.c:1361 (gdb+0xb7e594)
    #18 main gdb/gdb.c:39 (gdb+0x423ce8)

  Previous write of size 8 at 0xffffee830558 by main thread:
    #0 operator delete(void*, unsigned long) <null> (libtsan.so.2+0x8fb14)
    #1 delete_lwp gdb/linux-nat.c:849 (gdb+0xb0384c)
    #2 exit_lwp gdb/linux-nat.c:924 (gdb+0xb03c4c)
    #3 wait_lwp gdb/linux-nat.c:2224 (gdb+0xb08404)
    #4 stop_wait_callback gdb/linux-nat.c:2458 (gdb+0xb092a8)
    #5 gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const gdb/../gdbsupport/function-view.h:326 (gdb+0xb15ab0)
    #6 gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) gdb/../gdbsupport/function-view.h:320 (gdb+0xb15b18)
    #7 gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const gdb/../gdbsupport/function-view.h:289 (gdb+0xb13e90)
    #8 iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) gdb/linux-nat.c:879 (gdb+0xb03a18)
    #9 linux_nat_wait_1 gdb/linux-nat.c:3382 (gdb+0xb0c3f8)
    #10 linux_nat_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-nat.c:3560 (gdb+0xb0cfc8)
    #11 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-thread-db.c:1402 (gdb+0xb35958)
    #12 target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/target.c:2571 (gdb+0xfb6c34)
    #13 do_target_wait_1 gdb/infrun.c:4120 (gdb+0xa99dc4)
    #14 operator() gdb/infrun.c:4179 (gdb+0xa99f70)
    #15 do_target_wait gdb/infrun.c:4198 (gdb+0xa9a2bc)
    #16 fetch_inferior_event() gdb/infrun.c:4629 (gdb+0xa9b658)
    #17 inferior_event_handler(inferior_event_type) gdb/inf-loop.c:42 (gdb+0xa6b0c8)
    #18 handle_target_event gdb/linux-nat.c:4357 (gdb+0xb0f694)
    #19 handle_file_event gdbsupport/event-loop.cc:573 (gdb+0x1cfc03c)
    #20 gdb_wait_for_event gdbsupport/event-loop.cc:694 (gdb+0x1cfc700)
    #21 gdb_do_one_event(int) gdbsupport/event-loop.cc:217 (gdb+0x1cfa8ac)
    #22 start_event_loop gdb/main.c:408 (gdb+0xb7be9c)
    #23 captured_command_loop gdb/main.c:472 (gdb+0xb7c0cc)
    #24 captured_main gdb/main.c:1342 (gdb+0xb7e4e4)
    #25 gdb_main(captured_main_args*) gdb/main.c:1361 (gdb+0xb7e594)
    #26 main gdb/gdb.c:39 (gdb+0x423ce8)

SUMMARY: ThreadSanitizer: heap-use-after-free gdb/linux-nat.c:2809 in select_event_lwp
...

Since heap-use-after-free is essentially an address sanitizer complaint, I
also tried building gdb with -O0 -fsanitize=address, but with this setup it
doesn't seem to trigger (0 times out of 10).

The heap-use-after-free happens during the following scenario:
- linux_nat_wait_1 selects an LWP thread T1 with a status to report.
- it sets variable lp to point to the corresponding lwp_info.
- it calls stop_callback and stop_wait_callback for all threads
  (because !target_is_non_stop_p ()).
- it calls select_event_lwp to maybe pick another thread than T1, to prevent
  starvation.

The problem seems to be the following:
- while calling stop_wait_callback for all threads, it also does this for T1.
  While doing so, the corresponding lwp_info is deleted (callstack
  stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
  lp as a dangling pointer.
- variable lp is passed to select_event_lwp, which derefences it, which causes
  the heap-use-after-free.

Note that the comment here mentions "all other LWP's":
...
      /* Now stop all other LWP's ...  */
      iterate_over_lwps (minus_one_ptid, stop_callback);
      /* ... and wait until all of them have reported back that
        they're no longer running.  */
      iterate_over_lwps (minus_one_ptid, stop_wait_callback);
...
which presumably means other than the one in lp, but the iterators
don't skip lp.

Fix this by making the code match the comment, and skipping stop_callback and
stop_wait_callback for lp.

Tested on aarch64-linux.

PR gdb/31259
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31259
---
 gdb/linux-nat.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)


base-commit: 047fa8cc1cc534f19428b18d3d0c50e8139d3335
  

Comments

Simon Marchi Jan. 23, 2024, 4:08 p.m. UTC | #1
On 2024-01-23 06:48, Tom de Vries wrote:
> When building gdb with -O0 -fsanitize=thread, and running test-case
> gdb.base/vfork-follow-parent.exp, 5 times out of 10 I run into:
> ...
> WARNING: ThreadSanitizer: heap-use-after-free (pid=249653)
>   Write of size 4 at 0xffffee83055c by main thread:
>     #0 select_event_lwp gdb/linux-nat.c:2809 (gdb+0xb0a65c)
>     #1 linux_nat_wait_1 gdb/linux-nat.c:3389 (gdb+0xb0c470)
>     #2 linux_nat_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-nat.c:3560 (gdb+0xb0cfc8)
>     #3 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-thread-db.c:1402 (gdb+0xb35958)
>     #4 target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/target.c:2571 (gdb+0xfb6c34)
>     #5 do_target_wait_1 gdb/infrun.c:4120 (gdb+0xa99dc4)
>     #6 operator() gdb/infrun.c:4179 (gdb+0xa99f70)
>     #7 do_target_wait gdb/infrun.c:4198 (gdb+0xa9a2bc)
>     #8 fetch_inferior_event() gdb/infrun.c:4629 (gdb+0xa9b658)
>     #9 inferior_event_handler(inferior_event_type) gdb/inf-loop.c:42 (gdb+0xa6b0c8)
>     #10 handle_target_event gdb/linux-nat.c:4357 (gdb+0xb0f694)
>     #11 handle_file_event gdbsupport/event-loop.cc:573 (gdb+0x1cfc03c)
>     #12 gdb_wait_for_event gdbsupport/event-loop.cc:694 (gdb+0x1cfc700)
>     #13 gdb_do_one_event(int) gdbsupport/event-loop.cc:217 (gdb+0x1cfa8ac)
>     #14 start_event_loop gdb/main.c:408 (gdb+0xb7be9c)
>     #15 captured_command_loop gdb/main.c:472 (gdb+0xb7c0cc)
>     #16 captured_main gdb/main.c:1342 (gdb+0xb7e4e4)
>     #17 gdb_main(captured_main_args*) gdb/main.c:1361 (gdb+0xb7e594)
>     #18 main gdb/gdb.c:39 (gdb+0x423ce8)
> 
>   Previous write of size 8 at 0xffffee830558 by main thread:
>     #0 operator delete(void*, unsigned long) <null> (libtsan.so.2+0x8fb14)
>     #1 delete_lwp gdb/linux-nat.c:849 (gdb+0xb0384c)
>     #2 exit_lwp gdb/linux-nat.c:924 (gdb+0xb03c4c)
>     #3 wait_lwp gdb/linux-nat.c:2224 (gdb+0xb08404)
>     #4 stop_wait_callback gdb/linux-nat.c:2458 (gdb+0xb092a8)
>     #5 gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const gdb/../gdbsupport/function-view.h:326 (gdb+0xb15ab0)
>     #6 gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) gdb/../gdbsupport/function-view.h:320 (gdb+0xb15b18)
>     #7 gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const gdb/../gdbsupport/function-view.h:289 (gdb+0xb13e90)
>     #8 iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) gdb/linux-nat.c:879 (gdb+0xb03a18)
>     #9 linux_nat_wait_1 gdb/linux-nat.c:3382 (gdb+0xb0c3f8)
>     #10 linux_nat_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-nat.c:3560 (gdb+0xb0cfc8)
>     #11 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-thread-db.c:1402 (gdb+0xb35958)
>     #12 target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/target.c:2571 (gdb+0xfb6c34)
>     #13 do_target_wait_1 gdb/infrun.c:4120 (gdb+0xa99dc4)
>     #14 operator() gdb/infrun.c:4179 (gdb+0xa99f70)
>     #15 do_target_wait gdb/infrun.c:4198 (gdb+0xa9a2bc)
>     #16 fetch_inferior_event() gdb/infrun.c:4629 (gdb+0xa9b658)
>     #17 inferior_event_handler(inferior_event_type) gdb/inf-loop.c:42 (gdb+0xa6b0c8)
>     #18 handle_target_event gdb/linux-nat.c:4357 (gdb+0xb0f694)
>     #19 handle_file_event gdbsupport/event-loop.cc:573 (gdb+0x1cfc03c)
>     #20 gdb_wait_for_event gdbsupport/event-loop.cc:694 (gdb+0x1cfc700)
>     #21 gdb_do_one_event(int) gdbsupport/event-loop.cc:217 (gdb+0x1cfa8ac)
>     #22 start_event_loop gdb/main.c:408 (gdb+0xb7be9c)
>     #23 captured_command_loop gdb/main.c:472 (gdb+0xb7c0cc)
>     #24 captured_main gdb/main.c:1342 (gdb+0xb7e4e4)
>     #25 gdb_main(captured_main_args*) gdb/main.c:1361 (gdb+0xb7e594)
>     #26 main gdb/gdb.c:39 (gdb+0x423ce8)
> 
> SUMMARY: ThreadSanitizer: heap-use-after-free gdb/linux-nat.c:2809 in select_event_lwp
> ...
> 
> Since heap-use-after-free is essentially an address sanitizer complaint, I
> also tried building gdb with -O0 -fsanitize=address, but with this setup it
> doesn't seem to trigger (0 times out of 10).
> 
> The heap-use-after-free happens during the following scenario:
> - linux_nat_wait_1 selects an LWP thread T1 with a status to report.
> - it sets variable lp to point to the corresponding lwp_info.
> - it calls stop_callback and stop_wait_callback for all threads
>   (because !target_is_non_stop_p ()).
> - it calls select_event_lwp to maybe pick another thread than T1, to prevent
>   starvation.
> 
> The problem seems to be the following:
> - while calling stop_wait_callback for all threads, it also does this for T1.
>   While doing so, the corresponding lwp_info is deleted (callstack
>   stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
>   lp as a dangling pointer.
> - variable lp is passed to select_event_lwp, which derefences it, which causes
>   the heap-use-after-free.
> 
> Note that the comment here mentions "all other LWP's":
> ...
>       /* Now stop all other LWP's ...  */
>       iterate_over_lwps (minus_one_ptid, stop_callback);
>       /* ... and wait until all of them have reported back that
>         they're no longer running.  */
>       iterate_over_lwps (minus_one_ptid, stop_wait_callback);
> ...
> which presumably means other than the one in lp, but the iterators
> don't skip lp.
> 
> Fix this by making the code match the comment, and skipping stop_callback and
> stop_wait_callback for lp.
> 
> Tested on aarch64-linux.
> 
> PR gdb/31259
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31259
> ---
>  gdb/linux-nat.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index e91c57ba239..8bfae8555fc 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -3375,11 +3375,23 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>    if (!target_is_non_stop_p ())
>      {
>        /* Now stop all other LWP's ...  */
> -      iterate_over_lwps (minus_one_ptid, stop_callback);
> +      for (lwp_info *other_lp : all_lwps_safe ())
> +	{
> +	  if (other_lp == lp)
> +	    continue;
> +
> +	  stop_callback (other_lp);
> +	}
>  
>        /* ... and wait until all of them have reported back that
>  	 they're no longer running.  */
> -      iterate_over_lwps (minus_one_ptid, stop_wait_callback);
> +      for (lwp_info *other_lp : all_lwps_safe ())
> +	{
> +	  if (other_lp == lp)
> +	    continue;
> +
> +	  stop_wait_callback (other_lp);
> +	}

I did a bit of archeology to see how this code evolved, and I noticed
that this change in commit 9c02b52532 ("linux-nat.c: better starvation
avoidance, handle non-stop mode too"):

https://gitlab.com/gnutools/binutils-gdb/-/commit/9c02b52532ac?view=parallel#a360e5f37ff035d1ed6814cb60de9f2826b55788_3373_3362

Previously, we had:

  lp->stopped = 1;

just before the snippet you modify.  Both stop_callback and
stop_wait_callback are no-ops if `lp->stopped` is true, so that would
make them skip over the event thread.  The commit removed that, so I
suppose that the problem was introduced in that commit.

Ideally, Pedro should look at this, but in the mean time:

Reviewed-By: Simon Marchi <simon.marchi@efficios.com>

Simon
  
Tom de Vries Jan. 23, 2024, 5:52 p.m. UTC | #2
On 1/23/24 17:08, Simon Marchi wrote:
> 
> 
> On 2024-01-23 06:48, Tom de Vries wrote:
>> When building gdb with -O0 -fsanitize=thread, and running test-case
>> gdb.base/vfork-follow-parent.exp, 5 times out of 10 I run into:
>> ...
>> WARNING: ThreadSanitizer: heap-use-after-free (pid=249653)
>>    Write of size 4 at 0xffffee83055c by main thread:
>>      #0 select_event_lwp gdb/linux-nat.c:2809 (gdb+0xb0a65c)
>>      #1 linux_nat_wait_1 gdb/linux-nat.c:3389 (gdb+0xb0c470)
>>      #2 linux_nat_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-nat.c:3560 (gdb+0xb0cfc8)
>>      #3 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-thread-db.c:1402 (gdb+0xb35958)
>>      #4 target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/target.c:2571 (gdb+0xfb6c34)
>>      #5 do_target_wait_1 gdb/infrun.c:4120 (gdb+0xa99dc4)
>>      #6 operator() gdb/infrun.c:4179 (gdb+0xa99f70)
>>      #7 do_target_wait gdb/infrun.c:4198 (gdb+0xa9a2bc)
>>      #8 fetch_inferior_event() gdb/infrun.c:4629 (gdb+0xa9b658)
>>      #9 inferior_event_handler(inferior_event_type) gdb/inf-loop.c:42 (gdb+0xa6b0c8)
>>      #10 handle_target_event gdb/linux-nat.c:4357 (gdb+0xb0f694)
>>      #11 handle_file_event gdbsupport/event-loop.cc:573 (gdb+0x1cfc03c)
>>      #12 gdb_wait_for_event gdbsupport/event-loop.cc:694 (gdb+0x1cfc700)
>>      #13 gdb_do_one_event(int) gdbsupport/event-loop.cc:217 (gdb+0x1cfa8ac)
>>      #14 start_event_loop gdb/main.c:408 (gdb+0xb7be9c)
>>      #15 captured_command_loop gdb/main.c:472 (gdb+0xb7c0cc)
>>      #16 captured_main gdb/main.c:1342 (gdb+0xb7e4e4)
>>      #17 gdb_main(captured_main_args*) gdb/main.c:1361 (gdb+0xb7e594)
>>      #18 main gdb/gdb.c:39 (gdb+0x423ce8)
>>
>>    Previous write of size 8 at 0xffffee830558 by main thread:
>>      #0 operator delete(void*, unsigned long) <null> (libtsan.so.2+0x8fb14)
>>      #1 delete_lwp gdb/linux-nat.c:849 (gdb+0xb0384c)
>>      #2 exit_lwp gdb/linux-nat.c:924 (gdb+0xb03c4c)
>>      #3 wait_lwp gdb/linux-nat.c:2224 (gdb+0xb08404)
>>      #4 stop_wait_callback gdb/linux-nat.c:2458 (gdb+0xb092a8)
>>      #5 gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const gdb/../gdbsupport/function-view.h:326 (gdb+0xb15ab0)
>>      #6 gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) gdb/../gdbsupport/function-view.h:320 (gdb+0xb15b18)
>>      #7 gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const gdb/../gdbsupport/function-view.h:289 (gdb+0xb13e90)
>>      #8 iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) gdb/linux-nat.c:879 (gdb+0xb03a18)
>>      #9 linux_nat_wait_1 gdb/linux-nat.c:3382 (gdb+0xb0c3f8)
>>      #10 linux_nat_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-nat.c:3560 (gdb+0xb0cfc8)
>>      #11 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/linux-thread-db.c:1402 (gdb+0xb35958)
>>      #12 target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>) gdb/target.c:2571 (gdb+0xfb6c34)
>>      #13 do_target_wait_1 gdb/infrun.c:4120 (gdb+0xa99dc4)
>>      #14 operator() gdb/infrun.c:4179 (gdb+0xa99f70)
>>      #15 do_target_wait gdb/infrun.c:4198 (gdb+0xa9a2bc)
>>      #16 fetch_inferior_event() gdb/infrun.c:4629 (gdb+0xa9b658)
>>      #17 inferior_event_handler(inferior_event_type) gdb/inf-loop.c:42 (gdb+0xa6b0c8)
>>      #18 handle_target_event gdb/linux-nat.c:4357 (gdb+0xb0f694)
>>      #19 handle_file_event gdbsupport/event-loop.cc:573 (gdb+0x1cfc03c)
>>      #20 gdb_wait_for_event gdbsupport/event-loop.cc:694 (gdb+0x1cfc700)
>>      #21 gdb_do_one_event(int) gdbsupport/event-loop.cc:217 (gdb+0x1cfa8ac)
>>      #22 start_event_loop gdb/main.c:408 (gdb+0xb7be9c)
>>      #23 captured_command_loop gdb/main.c:472 (gdb+0xb7c0cc)
>>      #24 captured_main gdb/main.c:1342 (gdb+0xb7e4e4)
>>      #25 gdb_main(captured_main_args*) gdb/main.c:1361 (gdb+0xb7e594)
>>      #26 main gdb/gdb.c:39 (gdb+0x423ce8)
>>
>> SUMMARY: ThreadSanitizer: heap-use-after-free gdb/linux-nat.c:2809 in select_event_lwp
>> ...
>>
>> Since heap-use-after-free is essentially an address sanitizer complaint, I
>> also tried building gdb with -O0 -fsanitize=address, but with this setup it
>> doesn't seem to trigger (0 times out of 10).
>>
>> The heap-use-after-free happens during the following scenario:
>> - linux_nat_wait_1 selects an LWP thread T1 with a status to report.
>> - it sets variable lp to point to the corresponding lwp_info.
>> - it calls stop_callback and stop_wait_callback for all threads
>>    (because !target_is_non_stop_p ()).
>> - it calls select_event_lwp to maybe pick another thread than T1, to prevent
>>    starvation.
>>
>> The problem seems to be the following:
>> - while calling stop_wait_callback for all threads, it also does this for T1.
>>    While doing so, the corresponding lwp_info is deleted (callstack
>>    stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
>>    lp as a dangling pointer.
>> - variable lp is passed to select_event_lwp, which derefences it, which causes
>>    the heap-use-after-free.
>>
>> Note that the comment here mentions "all other LWP's":
>> ...
>>        /* Now stop all other LWP's ...  */
>>        iterate_over_lwps (minus_one_ptid, stop_callback);
>>        /* ... and wait until all of them have reported back that
>>          they're no longer running.  */
>>        iterate_over_lwps (minus_one_ptid, stop_wait_callback);
>> ...
>> which presumably means other than the one in lp, but the iterators
>> don't skip lp.
>>
>> Fix this by making the code match the comment, and skipping stop_callback and
>> stop_wait_callback for lp.
>>
>> Tested on aarch64-linux.
>>
>> PR gdb/31259
>> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31259
>> ---
>>   gdb/linux-nat.c | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
>> index e91c57ba239..8bfae8555fc 100644
>> --- a/gdb/linux-nat.c
>> +++ b/gdb/linux-nat.c
>> @@ -3375,11 +3375,23 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>>     if (!target_is_non_stop_p ())
>>       {
>>         /* Now stop all other LWP's ...  */
>> -      iterate_over_lwps (minus_one_ptid, stop_callback);
>> +      for (lwp_info *other_lp : all_lwps_safe ())
>> +	{
>> +	  if (other_lp == lp)
>> +	    continue;
>> +
>> +	  stop_callback (other_lp);
>> +	}
>>   
>>         /* ... and wait until all of them have reported back that
>>   	 they're no longer running.  */
>> -      iterate_over_lwps (minus_one_ptid, stop_wait_callback);
>> +      for (lwp_info *other_lp : all_lwps_safe ())
>> +	{
>> +	  if (other_lp == lp)
>> +	    continue;
>> +
>> +	  stop_wait_callback (other_lp);
>> +	}
> 
> I did a bit of archeology to see how this code evolved, and I noticed
> that this change in commit 9c02b52532 ("linux-nat.c: better starvation
> avoidance, handle non-stop mode too"):
> 
> https://gitlab.com/gnutools/binutils-gdb/-/commit/9c02b52532ac?view=parallel#a360e5f37ff035d1ed6814cb60de9f2826b55788_3373_3362
> 
> Previously, we had:
> 
>    lp->stopped = 1;
> 
> just before the snippet you modify.  Both stop_callback and
> stop_wait_callback are no-ops if `lp->stopped` is true, so that would
> make them skip over the event thread.  The commit removed that, so I
> suppose that the problem was introduced in that commit.
> 

Hi Simon,

thanks for the review.

> Ideally, Pedro should look at this, but in the mean time:
> 

cc-ing Pedro.

Thanks,
- Tom

> Reviewed-By: Simon Marchi <simon.marchi@efficios.com>
> 
> Simon
  
Pedro Alves Feb. 9, 2024, 3:46 p.m. UTC | #3
On 2024-01-23 11:48, Tom de Vries wrote:

> Since heap-use-after-free is essentially an address sanitizer complaint, I
> also tried building gdb with -O0 -fsanitize=address, but with this setup it
> doesn't seem to trigger (0 times out of 10).
> 
> The heap-use-after-free happens during the following scenario:
> - linux_nat_wait_1 selects an LWP thread T1 with a status to report.
> - it sets variable lp to point to the corresponding lwp_info.
> - it calls stop_callback and stop_wait_callback for all threads
>   (because !target_is_non_stop_p ()).
> - it calls select_event_lwp to maybe pick another thread than T1, to prevent
>   starvation.
> 
> The problem seems to be the following:
> - while calling stop_wait_callback for all threads, it also does this for T1.
>   While doing so, the corresponding lwp_info is deleted (callstack
>   stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
>   lp as a dangling pointer.
> - variable lp is passed to select_event_lwp, which derefences it, which causes
>   the heap-use-after-free.
> 
> Note that the comment here mentions "all other LWP's":
> ...
>       /* Now stop all other LWP's ...  */
>       iterate_over_lwps (minus_one_ptid, stop_callback);
>       /* ... and wait until all of them have reported back that
>         they're no longer running.  */
>       iterate_over_lwps (minus_one_ptid, stop_wait_callback);
> ...
> which presumably means other than the one in lp, but the iterators
> don't skip lp.

I think I'm missing something here.

The reason the comments say "all other LWP's", and don't bother filtering out LP is that
lp->stopped should be true at this point, and the callbacks (both stop_callback and stop_wait_callback)
check that flag, and do nothing if set.  I.e., they skip already-stopped threads, so they should
skip LP.

It sounds like we were about to report a stop for a thread that isn't marked as stopped?
Now it looks to me that _that_ would be the bug to fix.

Pedro Alves
  
Tom de Vries Feb. 19, 2024, 3:04 p.m. UTC | #4
On 2/9/24 16:46, Pedro Alves wrote:
> On 2024-01-23 11:48, Tom de Vries wrote:
> 
>> Since heap-use-after-free is essentially an address sanitizer complaint, I
>> also tried building gdb with -O0 -fsanitize=address, but with this setup it
>> doesn't seem to trigger (0 times out of 10).
>>
>> The heap-use-after-free happens during the following scenario:
>> - linux_nat_wait_1 selects an LWP thread T1 with a status to report.
>> - it sets variable lp to point to the corresponding lwp_info.
>> - it calls stop_callback and stop_wait_callback for all threads
>>    (because !target_is_non_stop_p ()).
>> - it calls select_event_lwp to maybe pick another thread than T1, to prevent
>>    starvation.
>>
>> The problem seems to be the following:
>> - while calling stop_wait_callback for all threads, it also does this for T1.
>>    While doing so, the corresponding lwp_info is deleted (callstack
>>    stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
>>    lp as a dangling pointer.
>> - variable lp is passed to select_event_lwp, which derefences it, which causes
>>    the heap-use-after-free.
>>
>> Note that the comment here mentions "all other LWP's":
>> ...
>>        /* Now stop all other LWP's ...  */
>>        iterate_over_lwps (minus_one_ptid, stop_callback);
>>        /* ... and wait until all of them have reported back that
>>          they're no longer running.  */
>>        iterate_over_lwps (minus_one_ptid, stop_wait_callback);
>> ...
>> which presumably means other than the one in lp, but the iterators
>> don't skip lp.
> 
> I think I'm missing something here.
> 
> The reason the comments say "all other LWP's", and don't bother filtering out LP is that
> lp->stopped should be true at this point, and the callbacks (both stop_callback and stop_wait_callback)
> check that flag, and do nothing if set.  I.e., they skip already-stopped threads, so they should
> skip LP.
> 
> It sounds like we were about to report a stop for a thread that isn't marked as stopped?
> Now it looks to me that _that_ would be the bug to fix.

Hi Pedro,

thanks for the review.

This patch adds an assert to catch the bug you mention, and a fix in 
wait_lwp:
...
diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index e91c57ba239..5022da9abd2 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2210,6 +2210,7 @@ wait_lwp (struct lwp_info *lp)
  		 core.  Store it in lp->waitstatus, because lp->status
  		 would be ambiguous (W_EXITCODE(0,0) == 0).  */
  	      lp->waitstatus = host_status_to_waitstatus (status);
+	      lp->stopped = 1;
  	      return 0;
  	    }

@@ -3368,6 +3369,7 @@ linux_nat_wait_1 (ptid_t ptid, struct 
target_waitstatus *ourstatus,
      }

    gdb_assert (lp);
+  gdb_assert (lp->stopped);

    status = lp->status;
    lp->status = 0;
...

This fixes the problem observed in the PR, and passes testing on 
x86_64-linux and aarch64-linux.

WDYT?

Thanks,
- Tom

> 
> Pedro Alves
>
  
Pedro Alves Feb. 21, 2024, 5:42 p.m. UTC | #5
On 2024-02-19 15:04, Tom de Vries wrote:
> On 2/9/24 16:46, Pedro Alves wrote:
>> On 2024-01-23 11:48, Tom de Vries wrote:
>> It sounds like we were about to report a stop for a thread that isn't marked as stopped?
>> Now it looks to me that _that_ would be the bug to fix.
> 
> This patch adds an assert to catch the bug you mention, and a fix in wait_lwp:
> ...
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index e91c57ba239..5022da9abd2 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2210,6 +2210,7 @@ wait_lwp (struct lwp_info *lp)
>           core.  Store it in lp->waitstatus, because lp->status
>           would be ambiguous (W_EXITCODE(0,0) == 0).  */
>            lp->waitstatus = host_status_to_waitstatus (status);
> +          lp->stopped = 1;
>            return 0;
>          }
> 
> @@ -3368,6 +3369,7 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>      }
> 
>    gdb_assert (lp);
> +  gdb_assert (lp->stopped);
> 
>    status = lp->status;
>    lp->status = 0;
> ...
> 
> This fixes the problem observed in the PR, and passes testing on x86_64-linux and aarch64-linux.
> 
> WDYT?

Ah, thanks, so it was an exiting lwp that we're going to report the exit for.

Yes, this is the right thing to do.  We do the same in linux_nat_filter_event:

  /* This LWP is stopped now.  (And if dead, this prevents it from
     ever being continued.)  */
  lp->stopped = 1;


gdbserver has a mark_lwp_dead function to centralize setting up various lwp flags such that the
rest of the code doesn't mishandle them, and it seems like a good idea to do a similar thing in
gdb as well.  Like below.  I ran the testsuite and saw no regressions.

I tried to reproduce the gdb.base/vfork-follow-parent.exp issue described in the bug, but when I
run that with a gdb built with -O0 -fsanitize=thread, the testcase just seemingly hangs starting
up until I ctrl-c:

 Running /home/pedro/rocm/gdb/build-master-sanitize-thread/gdb/testsuite/../../../src/gdb/testsuite/gdb.base/vfork-follow-parent.exp ...
 WARNING: Couldn't set the height to 0
 WARNING: Couldn't set the width to 0.
 ^Cgot a INT signal, interrupted by user

and when I look at gdb.log, I see it is full of the same report over and over:

...
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=3086741)
    #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:711 (libtsan.so.0+0x37ab8)
    #1 __gconv_close_transform iconv/gconv_db.c:800 (libc.so.6+0x2be3a)
    #2 ext_lang_initialization() ../../src/gdb/extension.c:339 (gdb+0x5c6a3f)
    #3 captured_main_1 ../../src/gdb/main.c:1058 (gdb+0x7c94da)
    #4 captured_main ../../src/gdb/main.c:1329 (gdb+0x7ca7be)
    #5 gdb_main(captured_main_args*) ../../src/gdb/main.c:1358 (gdb+0x7ca886)
    #6 main ../../src/gdb/gdb.c:39 (gdb+0xf98b3)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal iconv/gconv_db.c:800 in __gconv_close_transform
...


Does this patch work for you?

From 8cbc62be0dab8bb4b09fe05a52a348ddbfb2e33c Mon Sep 17 00:00:00 2001
From: Pedro Alves <pedro@palves.net>
Date: Wed, 21 Feb 2024 16:23:55 +0000
Subject: [PATCH] mark_lwp_dead

Change-Id: I4a6169976f89bf714c478cbb2b7d4c32365e62a9
---
 gdb/linux-nat.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index e91c57ba239..b3d319e0a82 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2123,6 +2123,27 @@ wait_for_signal ()
     }
 }
 
+/* Mark LWP dead, with STATUS as exit status pending to report
+   later.  */
+
+static void
+mark_lwp_dead (lwp_info *lp, int status)
+{
+  /* Store the exit status lp->waitstatus, because lp->status would be
+     ambiguous (W_EXITCODE(0,0) == 0).  */
+  lp->waitstatus = host_status_to_waitstatus (status);
+
+  /* If we're processing LP's status, there should be no other event
+     already recorded as pending.  */
+  gdb_assert (lp->status == 0);
+
+  /* Dead LWP's aren't expected to reported a pending sigstop.  */
+  lp->signalled = 0;
+
+  /* Prevent trying to stop it.  */
+  lp->stopped = 1;
+}
+
 /* Wait for LP to stop.  Returns the wait status, or 0 if the LWP has
    exited.  */
 
@@ -2207,9 +2228,8 @@ wait_lwp (struct lwp_info *lp)
 
 	      /* If this is the leader exiting, it means the whole
 		 process is gone.  Store the status to report to the
-		 core.  Store it in lp->waitstatus, because lp->status
-		 would be ambiguous (W_EXITCODE(0,0) == 0).  */
-	      lp->waitstatus = host_status_to_waitstatus (status);
+		 core.  */
+	      mark_lwp_dead (lp, status);
 	      return 0;
 	    }
 
@@ -3013,12 +3033,7 @@ linux_nat_filter_event (int lwpid, int status)
       linux_nat_debug_printf ("LWP %ld exited (resumed=%d)",
 			      lp->ptid.lwp (), lp->resumed);
 
-      /* Dead LWP's aren't expected to reported a pending sigstop.  */
-      lp->signalled = 0;
-
-      /* Store the pending event in the waitstatus, because
-	 W_EXITCODE(0,0) == 0.  */
-      lp->waitstatus = host_status_to_waitstatus (status);
+      mark_lwp_dead (lp, status);
       return;
     }
 
@@ -3368,6 +3383,7 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
     }
 
   gdb_assert (lp);
+  gdb_assert (lp->stopped);
 
   status = lp->status;
   lp->status = 0;

base-commit: 99eeecc8d276e5af745e48825d66efff693a7678
  
Tom de Vries Feb. 22, 2024, 11:43 a.m. UTC | #6
On 2/21/24 18:42, Pedro Alves wrote:
> On 2024-02-19 15:04, Tom de Vries wrote:
>> On 2/9/24 16:46, Pedro Alves wrote:
>>> On 2024-01-23 11:48, Tom de Vries wrote:
>>> It sounds like we were about to report a stop for a thread that isn't marked as stopped?
>>> Now it looks to me that _that_ would be the bug to fix.
>>
>> This patch adds an assert to catch the bug you mention, and a fix in wait_lwp:
>> ...
>> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
>> index e91c57ba239..5022da9abd2 100644
>> --- a/gdb/linux-nat.c
>> +++ b/gdb/linux-nat.c
>> @@ -2210,6 +2210,7 @@ wait_lwp (struct lwp_info *lp)
>>            core.  Store it in lp->waitstatus, because lp->status
>>            would be ambiguous (W_EXITCODE(0,0) == 0).  */
>>             lp->waitstatus = host_status_to_waitstatus (status);
>> +          lp->stopped = 1;
>>             return 0;
>>           }
>>
>> @@ -3368,6 +3369,7 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>>       }
>>
>>     gdb_assert (lp);
>> +  gdb_assert (lp->stopped);
>>
>>     status = lp->status;
>>     lp->status = 0;
>> ...
>>
>> This fixes the problem observed in the PR, and passes testing on x86_64-linux and aarch64-linux.
>>
>> WDYT?
> 
> Ah, thanks, so it was an exiting lwp that we're going to report the exit for.
> 
> Yes, this is the right thing to do.  We do the same in linux_nat_filter_event:
> 
>    /* This LWP is stopped now.  (And if dead, this prevents it from
>       ever being continued.)  */
>    lp->stopped = 1;
> 
> 
> gdbserver has a mark_lwp_dead function to centralize setting up various lwp flags such that the
> rest of the code doesn't mishandle them, and it seems like a good idea to do a similar thing in
> gdb as well.  Like below.  I ran the testsuite and saw no regressions.
> 

I agree that it's a good idea, and also ran the test-suite and saw no 
regressions.

> I tried to reproduce the gdb.base/vfork-follow-parent.exp issue described in the bug, but when I
> run that with a gdb built with -O0 -fsanitize=thread, the testcase just seemingly hangs starting
> up until I ctrl-c:
> 
>   Running /home/pedro/rocm/gdb/build-master-sanitize-thread/gdb/testsuite/../../../src/gdb/testsuite/gdb.base/vfork-follow-parent.exp ...
>   WARNING: Couldn't set the height to 0
>   WARNING: Couldn't set the width to 0.
>   ^Cgot a INT signal, interrupted by user
> 
> and when I look at gdb.log, I see it is full of the same report over and over:
> 
> ...
> WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=3086741)
>      #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:711 (libtsan.so.0+0x37ab8)
>      #1 __gconv_close_transform iconv/gconv_db.c:800 (libc.so.6+0x2be3a)
>      #2 ext_lang_initialization() ../../src/gdb/extension.c:339 (gdb+0x5c6a3f)
>      #3 captured_main_1 ../../src/gdb/main.c:1058 (gdb+0x7c94da)
>      #4 captured_main ../../src/gdb/main.c:1329 (gdb+0x7ca7be)
>      #5 gdb_main(captured_main_args*) ../../src/gdb/main.c:1358 (gdb+0x7ca886)
>      #6 main ../../src/gdb/gdb.c:39 (gdb+0xf98b3)
> 
> SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal iconv/gconv_db.c:800 in __gconv_close_transform
> ...
> 
> 

Hmm, I can't reproduce that, not on aarch64-linux f39, nor on Leap 15.4 
x86_64-linux.

At first glance this could be a tsan problem, given that the stack trace 
doesn't show entry into a signal handler.

> Does this patch work for you?
> 

It does.

As mentioned in the PR, on aarch64-linux this reproduces for me 5/10 
times, and with your patch 0/10 times.

Given that pretty much the entire patch is yours, do you want to proceed 
with this, or do you want me to integrate this into my patch with a 
Co-Authored-By tag?

Thanks,
- Tom

>  From 8cbc62be0dab8bb4b09fe05a52a348ddbfb2e33c Mon Sep 17 00:00:00 2001
> From: Pedro Alves <pedro@palves.net>
> Date: Wed, 21 Feb 2024 16:23:55 +0000
> Subject: [PATCH] mark_lwp_dead
> 
> Change-Id: I4a6169976f89bf714c478cbb2b7d4c32365e62a9
> ---
>   gdb/linux-nat.c | 34 +++++++++++++++++++++++++---------
>   1 file changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index e91c57ba239..b3d319e0a82 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2123,6 +2123,27 @@ wait_for_signal ()
>       }
>   }
>   
> +/* Mark LWP dead, with STATUS as exit status pending to report
> +   later.  */
> +
> +static void
> +mark_lwp_dead (lwp_info *lp, int status)
> +{
> +  /* Store the exit status lp->waitstatus, because lp->status would be
> +     ambiguous (W_EXITCODE(0,0) == 0).  */
> +  lp->waitstatus = host_status_to_waitstatus (status);
> +
> +  /* If we're processing LP's status, there should be no other event
> +     already recorded as pending.  */
> +  gdb_assert (lp->status == 0);
> +
> +  /* Dead LWP's aren't expected to reported a pending sigstop.  */
> +  lp->signalled = 0;
> +
> +  /* Prevent trying to stop it.  */
> +  lp->stopped = 1;
> +}
> +
>   /* Wait for LP to stop.  Returns the wait status, or 0 if the LWP has
>      exited.  */
>   
> @@ -2207,9 +2228,8 @@ wait_lwp (struct lwp_info *lp)
>   
>   	      /* If this is the leader exiting, it means the whole
>   		 process is gone.  Store the status to report to the
> -		 core.  Store it in lp->waitstatus, because lp->status
> -		 would be ambiguous (W_EXITCODE(0,0) == 0).  */
> -	      lp->waitstatus = host_status_to_waitstatus (status);
> +		 core.  */
> +	      mark_lwp_dead (lp, status);
>   	      return 0;
>   	    }
>   
> @@ -3013,12 +3033,7 @@ linux_nat_filter_event (int lwpid, int status)
>         linux_nat_debug_printf ("LWP %ld exited (resumed=%d)",
>   			      lp->ptid.lwp (), lp->resumed);
>   
> -      /* Dead LWP's aren't expected to reported a pending sigstop.  */
> -      lp->signalled = 0;
> -
> -      /* Store the pending event in the waitstatus, because
> -	 W_EXITCODE(0,0) == 0.  */
> -      lp->waitstatus = host_status_to_waitstatus (status);
> +      mark_lwp_dead (lp, status);
>         return;
>       }
>   
> @@ -3368,6 +3383,7 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>       }
>   
>     gdb_assert (lp);
> +  gdb_assert (lp->stopped);
>   
>     status = lp->status;
>     lp->status = 0;
> 
> base-commit: 99eeecc8d276e5af745e48825d66efff693a7678
  
Pedro Alves Feb. 23, 2024, 2:33 p.m. UTC | #7
On 2024-02-22 11:43, Tom de Vries wrote:
> On 2/21/24 18:42, Pedro Alves wrote:

> 
> It does.
> 
> As mentioned in the PR, on aarch64-linux this reproduces for me 5/10 times, and with your patch 0/10 times.
> 
> Given that pretty much the entire patch is yours, do you want to proceed with this, or do you want me to integrate this into my patch with a Co-Authored-By tag?

I think it'd be good to fix up the commit log a bit, so I thought of proceeding with this.
I think it makes sense to add you as Co-Authored-By, as you wrote the initial commit
log and did investigation work.

On the commit log, here:

>  The problem seems to be the following:
>  - while calling stop_wait_callback for all threads, it also does this for T1.
>    While doing so, the corresponding lwp_info is deleted (callstack
>    stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
>    lp as a dangling pointer.
>  - variable lp is passed to select_event_lwp, which derefences it, which causes
>    the heap-use-after-free."

we are missing part of the scenario.  It must have been that we were going to
report an exit event for T1, because we know that the bug is that we missed setting
its lwp->stopped flag from within stop_wait_callback -> wait_lwp.  So that miss must
have happened earlier than what is described above, while reporting a previous event
for another thread.  I can see no other way that this scenario could trigger.

Also, took me a bit to realize, but here:

>  The heap-use-after-free happens during the following scenario:
>  - linux_nat_wait_1 selects an LWP thread T1 with a status to report.

I think you are saying that linux_nat_wait_1 selected a previously pending
event, not that it pulled an event out of the kernel.

With that, it makes a lot more sense to me.

So, here's the same patch but now with an adjusted commit log, with that
missing info added.  WDYT?

From 96f7005bf56ce57a8dfebcb48885342c5f9e237c Mon Sep 17 00:00:00 2001
From: Pedro Alves <pedro@palves.net>
Date: Wed, 21 Feb 2024 16:23:55 +0000
Subject: [PATCH] [gdb] Fix heap-use-after-free in select_event_lwp

PR gdb/31259 reveals one scenario where we run into a
heap-use-after-free reported by thread sanitizer, while running
gdb.base/vfork-follow-parent.exp.

The heap-use-after-free happens during the following scenario:

 - linux_nat_wait_1 is about to return an event for T2.  It stops all
   other threads, and while doing so, stop_wait_callback -> wait_lwp
   sees T1 exit, and decides to leave the exit event pending.  It
   should have set the lp->stopped flag too, but does not -- this is
   the bug.

 - The event for T2 is reported, is processed by infrun, and we're
   back at linux_nat_wait_1.

 - linux_nat_wait_1 selects LWP T1 with the pending exit status to
   report.

 - it sets variable lp to point to the corresponding lwp_info.

 - it calls stop_callback and stop_wait_callback for all threads
   (because !target_is_non_stop_p ()).

 - it calls select_event_lwp to maybe pick another thread than T1, to
   prevent starvation.

The problem is the following:

 - while calling stop_wait_callback for all threads, it also does this
   for T1.  While doing so, the corresponding lwp_info is deleted
   (callstack stop_wait_callback -> wait_lwp -> exit_lwp ->
   delete_lwp), leaving variable lp as a dangling pointer.

 - variable lp is passed to select_event_lwp, which derefences it,
   which causes the heap-use-after-free.

Note that the comment here mentions "all other LWP's":
...
      /* Now stop all other LWP's ...  */
      iterate_over_lwps (minus_one_ptid, stop_callback);
      /* ... and wait until all of them have reported back that
        they're no longer running.  */
      iterate_over_lwps (minus_one_ptid, stop_wait_callback);
...

The reason the comments say "all other LWP's", and doesn't bother
filtering out LP is that lp->stopped should be true at this point, and
the callbacks (both stop_callback and stop_wait_callback) check that
flag, and do nothing if set.  I.e., they skip already-stopped threads,
so they should skip LP.

In this particular scenario, though, we missed setting the stopped
flag right in the first step described above, so LP was iterated over
incorrectly.

The fix is to make wait_lwp set the lp->stopped flag when it decides
to leave the exit event pending.  However, going a bit further,
GDBserver has a mark_lwp_dead function to centralize setting up
various lwp flags such that the rest of the code doesn't mishandle
them, and it seems like a good idea to do a similar thing in GDB as
well.  That is what this patch does.

PR gdb/31259
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31259
Co-Authored-By: Tom de Vries <tdevries@suse.de>
Change-Id: I4a6169976f89bf714c478cbb2b7d4c32365e62a9
---
 gdb/linux-nat.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index e91c57ba239..a9984eb3221 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2123,6 +2123,27 @@ wait_for_signal ()
     }
 }
 
+/* Mark LWP dead, with STATUS as exit status pending to report
+   later.  */
+
+static void
+mark_lwp_dead (lwp_info *lp, int status)
+{
+  /* Store the exit status lp->waitstatus, because lp->status would be
+     ambiguous (W_EXITCODE(0,0) == 0).  */
+  lp->waitstatus = host_status_to_waitstatus (status);
+
+  /* If we're processing LP's status, there should be no other event
+     already recorded as pending.  */
+  gdb_assert (lp->status == 0);
+
+  /* Dead LWPs aren't expected to report a pending sigstop.  */
+  lp->signalled = 0;
+
+  /* Prevent trying to stop it.  */
+  lp->stopped = 1;
+}
+
 /* Wait for LP to stop.  Returns the wait status, or 0 if the LWP has
    exited.  */
 
@@ -2207,9 +2228,8 @@ wait_lwp (struct lwp_info *lp)
 
 	      /* If this is the leader exiting, it means the whole
 		 process is gone.  Store the status to report to the
-		 core.  Store it in lp->waitstatus, because lp->status
-		 would be ambiguous (W_EXITCODE(0,0) == 0).  */
-	      lp->waitstatus = host_status_to_waitstatus (status);
+		 core.  */
+	      mark_lwp_dead (lp, status);
 	      return 0;
 	    }
 
@@ -3013,12 +3033,7 @@ linux_nat_filter_event (int lwpid, int status)
       linux_nat_debug_printf ("LWP %ld exited (resumed=%d)",
 			      lp->ptid.lwp (), lp->resumed);
 
-      /* Dead LWP's aren't expected to reported a pending sigstop.  */
-      lp->signalled = 0;
-
-      /* Store the pending event in the waitstatus, because
-	 W_EXITCODE(0,0) == 0.  */
-      lp->waitstatus = host_status_to_waitstatus (status);
+      mark_lwp_dead (lp, status);
       return;
     }
 
@@ -3368,6 +3383,7 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
     }
 
   gdb_assert (lp);
+  gdb_assert (lp->stopped);
 
   status = lp->status;
   lp->status = 0;

base-commit: e346d50a89106a52fa1545db5eade2a25a4932f0
  
Tom de Vries Feb. 26, 2024, 2:23 p.m. UTC | #8
On 2/23/24 15:33, Pedro Alves wrote:
> On 2024-02-22 11:43, Tom de Vries wrote:
>> On 2/21/24 18:42, Pedro Alves wrote:
> 
>>
>> It does.
>>
>> As mentioned in the PR, on aarch64-linux this reproduces for me 5/10 times, and with your patch 0/10 times.
>>
>> Given that pretty much the entire patch is yours, do you want to proceed with this, or do you want me to integrate this into my patch with a Co-Authored-By tag?
> 
> I think it'd be good to fix up the commit log a bit, so I thought of proceeding with this.

Thanks.

> I think it makes sense to add you as Co-Authored-By, as you wrote the initial commit
> log and did investigation work.
> 
> On the commit log, here:
> 
>>   The problem seems to be the following:
>>   - while calling stop_wait_callback for all threads, it also does this for T1.
>>     While doing so, the corresponding lwp_info is deleted (callstack
>>     stop_wait_callback -> wait_lwp -> exit_lwp -> delete_lwp), leaving variable
>>     lp as a dangling pointer.
>>   - variable lp is passed to select_event_lwp, which derefences it, which causes
>>     the heap-use-after-free."
> 
> we are missing part of the scenario.  It must have been that we were going to
> report an exit event for T1, because we know that the bug is that we missed setting
> its lwp->stopped flag from within stop_wait_callback -> wait_lwp.  So that miss must
> have happened earlier than what is described above, while reporting a previous event
> for another thread.  I can see no other way that this scenario could trigger.
> 
> Also, took me a bit to realize, but here:
> 
>>   The heap-use-after-free happens during the following scenario:
>>   - linux_nat_wait_1 selects an LWP thread T1 with a status to report.
> 
> I think you are saying that linux_nat_wait_1 selected a previously pending
> event, not that it pulled an event out of the kernel.
> 
> With that, it makes a lot more sense to me.
> 
> So, here's the same patch but now with an adjusted commit log, with that
> missing info added.  WDYT?
> 

Hi Pedro,

I've verified that the scenario is as you describe it.

LGTM.

Thanks,
- Tom

>  From 96f7005bf56ce57a8dfebcb48885342c5f9e237c Mon Sep 17 00:00:00 2001
> From: Pedro Alves <pedro@palves.net>
> Date: Wed, 21 Feb 2024 16:23:55 +0000
> Subject: [PATCH] [gdb] Fix heap-use-after-free in select_event_lwp
> 
> PR gdb/31259 reveals one scenario where we run into a
> heap-use-after-free reported by thread sanitizer, while running
> gdb.base/vfork-follow-parent.exp.
> 
> The heap-use-after-free happens during the following scenario:
> 
>   - linux_nat_wait_1 is about to return an event for T2.  It stops all
>     other threads, and while doing so, stop_wait_callback -> wait_lwp
>     sees T1 exit, and decides to leave the exit event pending.  It
>     should have set the lp->stopped flag too, but does not -- this is
>     the bug.
> 
>   - The event for T2 is reported, is processed by infrun, and we're
>     back at linux_nat_wait_1.
> 
>   - linux_nat_wait_1 selects LWP T1 with the pending exit status to
>     report.
> 
>   - it sets variable lp to point to the corresponding lwp_info.
> 
>   - it calls stop_callback and stop_wait_callback for all threads
>     (because !target_is_non_stop_p ()).
> 
>   - it calls select_event_lwp to maybe pick another thread than T1, to
>     prevent starvation.
> 
> The problem is the following:
> 
>   - while calling stop_wait_callback for all threads, it also does this
>     for T1.  While doing so, the corresponding lwp_info is deleted
>     (callstack stop_wait_callback -> wait_lwp -> exit_lwp ->
>     delete_lwp), leaving variable lp as a dangling pointer.
> 
>   - variable lp is passed to select_event_lwp, which derefences it,
>     which causes the heap-use-after-free.
> 
> Note that the comment here mentions "all other LWP's":
> ...
>        /* Now stop all other LWP's ...  */
>        iterate_over_lwps (minus_one_ptid, stop_callback);
>        /* ... and wait until all of them have reported back that
>          they're no longer running.  */
>        iterate_over_lwps (minus_one_ptid, stop_wait_callback);
> ...
> 
> The reason the comments say "all other LWP's", and doesn't bother
> filtering out LP is that lp->stopped should be true at this point, and
> the callbacks (both stop_callback and stop_wait_callback) check that
> flag, and do nothing if set.  I.e., they skip already-stopped threads,
> so they should skip LP.
> 
> In this particular scenario, though, we missed setting the stopped
> flag right in the first step described above, so LP was iterated over
> incorrectly.
> 
> The fix is to make wait_lwp set the lp->stopped flag when it decides
> to leave the exit event pending.  However, going a bit further,
> GDBserver has a mark_lwp_dead function to centralize setting up
> various lwp flags such that the rest of the code doesn't mishandle
> them, and it seems like a good idea to do a similar thing in GDB as
> well.  That is what this patch does.
> 
> PR gdb/31259
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31259
> Co-Authored-By: Tom de Vries <tdevries@suse.de>
> Change-Id: I4a6169976f89bf714c478cbb2b7d4c32365e62a9
> ---
>   gdb/linux-nat.c | 34 +++++++++++++++++++++++++---------
>   1 file changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index e91c57ba239..a9984eb3221 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2123,6 +2123,27 @@ wait_for_signal ()
>       }
>   }
>   
> +/* Mark LWP dead, with STATUS as exit status pending to report
> +   later.  */
> +
> +static void
> +mark_lwp_dead (lwp_info *lp, int status)
> +{
> +  /* Store the exit status lp->waitstatus, because lp->status would be
> +     ambiguous (W_EXITCODE(0,0) == 0).  */
> +  lp->waitstatus = host_status_to_waitstatus (status);
> +
> +  /* If we're processing LP's status, there should be no other event
> +     already recorded as pending.  */
> +  gdb_assert (lp->status == 0);
> +
> +  /* Dead LWPs aren't expected to report a pending sigstop.  */
> +  lp->signalled = 0;
> +
> +  /* Prevent trying to stop it.  */
> +  lp->stopped = 1;
> +}
> +
>   /* Wait for LP to stop.  Returns the wait status, or 0 if the LWP has
>      exited.  */
>   
> @@ -2207,9 +2228,8 @@ wait_lwp (struct lwp_info *lp)
>   
>   	      /* If this is the leader exiting, it means the whole
>   		 process is gone.  Store the status to report to the
> -		 core.  Store it in lp->waitstatus, because lp->status
> -		 would be ambiguous (W_EXITCODE(0,0) == 0).  */
> -	      lp->waitstatus = host_status_to_waitstatus (status);
> +		 core.  */
> +	      mark_lwp_dead (lp, status);
>   	      return 0;
>   	    }
>   
> @@ -3013,12 +3033,7 @@ linux_nat_filter_event (int lwpid, int status)
>         linux_nat_debug_printf ("LWP %ld exited (resumed=%d)",
>   			      lp->ptid.lwp (), lp->resumed);
>   
> -      /* Dead LWP's aren't expected to reported a pending sigstop.  */
> -      lp->signalled = 0;
> -
> -      /* Store the pending event in the waitstatus, because
> -	 W_EXITCODE(0,0) == 0.  */
> -      lp->waitstatus = host_status_to_waitstatus (status);
> +      mark_lwp_dead (lp, status);
>         return;
>       }
>   
> @@ -3368,6 +3383,7 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>       }
>   
>     gdb_assert (lp);
> +  gdb_assert (lp->stopped);
>   
>     status = lp->status;
>     lp->status = 0;
> 
> base-commit: e346d50a89106a52fa1545db5eade2a25a4932f0
  
Pedro Alves Feb. 26, 2024, 3:28 p.m. UTC | #9
On 2024-02-26 14:23, Tom de Vries wrote:
> On 2/23/24 15:33, Pedro Alves wrote:
>> With that, it makes a lot more sense to me.
>>
>> So, here's the same patch but now with an adjusted commit log, with that
>> missing info added.  WDYT?
>>
> 
> Hi Pedro,
> 
> I've verified that the scenario is as you describe it.
> 
> LGTM.
> 

Great, thanks for the verification.  I'm merging it now.

Pedro Alves
  

Patch

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index e91c57ba239..8bfae8555fc 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -3375,11 +3375,23 @@  linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
   if (!target_is_non_stop_p ())
     {
       /* Now stop all other LWP's ...  */
-      iterate_over_lwps (minus_one_ptid, stop_callback);
+      for (lwp_info *other_lp : all_lwps_safe ())
+	{
+	  if (other_lp == lp)
+	    continue;
+
+	  stop_callback (other_lp);
+	}
 
       /* ... and wait until all of them have reported back that
 	 they're no longer running.  */
-      iterate_over_lwps (minus_one_ptid, stop_wait_callback);
+      for (lwp_info *other_lp : all_lwps_safe ())
+	{
+	  if (other_lp == lp)
+	    continue;
+
+	  stop_wait_callback (other_lp);
+	}
     }
 
   /* If we're not waiting for a specific LWP, choose an event LWP from