[gdb/dap] Fix exit race

Message ID 20240213150141.28034-1-tdevries@suse.de
State Committed
Series [gdb/dap] Fix exit race

Checks

Context                                          Check    Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64   success  Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm       success  Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm       success  Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64   success  Testing passed

Commit Message

Tom de Vries Feb. 13, 2024, 3:01 p.m. UTC
When running test-case gdb.dap/eof.exp, we're likely to get a coredump due to
a segfault in new_threadstate.

At the point of the core dump, the gdb main thread looks like:
...
 (gdb) bt
 #0  0x0000fffee30d2280 in __pthread_kill_implementation () from /lib64/libc.so.6
 #1  0x0000fffee3085800 [PAC] in raise () from /lib64/libc.so.6
 #2  0x00000000007b03e8 [PAC] in handle_fatal_signal (sig=11)
     at gdb/event-top.c:926
 #3  0x00000000007b0470 in handle_sigsegv (sig=11)
     at gdb/event-top.c:976
 #4  <signal handler called>
 #5  0x0000fffee3a4db14 in new_threadstate () from /lib64/libpython3.12.so.1.0
 #6  0x0000fffee3ab0548 [PAC] in PyGILState_Ensure () from /lib64/libpython3.12.so.1.0
 #7  0x0000000000a6d034 [PAC] in gdbpy_gil::gdbpy_gil (this=0xffffcb279738)
     at gdb/python/python-internal.h:787
 #8  0x0000000000ab87ac in gdbpy_event::~gdbpy_event (this=0xfffea8001ee0,
     __in_chrg=<optimized out>) at gdb/python/python.c:1051
 #9  0x0000000000ab9460 in std::_Function_base::_Base_manager<...>::_M_destroy
     (__victim=...) at /usr/include/c++/13/bits/std_function.h:175
 #10 0x0000000000ab92dc in std::_Function_base::_Base_manager<...>::_M_manager
     (__dest=..., __source=..., __op=std::__destroy_functor)
     at /usr/include/c++/13/bits/std_function.h:203
 #11 0x0000000000ab8f14 in std::_Function_handler<...>::_M_manager(...) (...)
     at /usr/include/c++/13/bits/std_function.h:282
 #12 0x000000000042dd9c in std::_Function_base::~_Function_base (this=0xfffea8001c10,
     __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_function.h:244
 #13 0x000000000042e654 in std::function<void ()>::~function() (this=0xfffea8001c10,
     __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_function.h:334
 #14 0x0000000000b68e60 in std::_Destroy<std::function<void ()> >(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:151
 #15 0x0000000000b68cd0 in std::_Destroy_aux<false>::__destroy<...>(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:163
 #16 0x0000000000b689d8 in std::_Destroy<...>(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:196
 #17 0x0000000000b68414 in std::_Destroy<...>(...) (...)
     at /usr/include/c++/13/bits/alloc_traits.h:948
 #18 std::vector<...>::~vector() (this=0x2a183c8 <runnables>)
     at /usr/include/c++/13/bits/stl_vector.h:732
 #19 0x0000fffee3088370 in __run_exit_handlers () from /lib64/libc.so.6
 #20 0x0000fffee3088450 [PAC] in exit () from /lib64/libc.so.6
 #21 0x0000000000c95600 [PAC] in quit_force (exit_arg=0x0, from_tty=0)
     at gdb/top.c:1822
 #22 0x0000000000609140 in quit_command (args=0x0, from_tty=0)
     at gdb/cli/cli-cmds.c:508
 #23 0x0000000000c926a4 in quit_cover () at gdb/top.c:300
 #24 0x00000000007b09d4 in async_disconnect (arg=0x0)
     at gdb/event-top.c:1230
 #25 0x0000000000548acc in invoke_async_signal_handlers ()
     at gdb/async-event.c:234
 #26 0x000000000157d2d4 in gdb_do_one_event (mstimeout=-1)
     at gdbsupport/event-loop.cc:199
 #27 0x0000000000943a84 in start_event_loop () at gdb/main.c:401
 #28 0x0000000000943bfc in captured_command_loop () at gdb/main.c:465
 #29 0x000000000094567c in captured_main (data=0xffffcb279d08)
     at gdb/main.c:1335
 #30 0x0000000000945700 in gdb_main (args=0xffffcb279d08)
     at gdb/main.c:1354
 #31 0x0000000000423ab4 in main (argc=14, argv=0xffffcb279e98)
     at gdb/gdb.c:39
...

The direct cause of the segfault is calling PyGILState_Ensure after
calling Py_Finalize.

AFAICT the problem is a race between the gdb main thread and DAP's JSON writer
thread.

On one side, we have the following events:
- DAP's JSON reader thread reads an EOF, and lets DAP's main thread know
  by writing None into read_queue
- DAP's main thread lets DAP's JSON writer thread know by writing None into
  write_queue
- DAP's JSON writer thread sees the None in its queue, and calls
  send_gdb("quit")
- a corresponding gdbpy_event is deposited in the runnables vector, to be
  run by the gdb main thread
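
As an illustration only, the chain of events above can be modelled with a
small stand-alone Python sketch.  This is not the gdb code: run_shutdown_chain,
the local send_gdb stub and the runnables list are hypothetical stand-ins, and
only the names read_queue and write_queue are taken from the description.

```python
import queue
import threading


def run_shutdown_chain():
    """Toy model: a None sentinel travels reader -> main -> writer."""
    read_queue = queue.Queue()
    write_queue = queue.Queue()
    runnables = []  # stands in for gdb's vector of pending events

    def send_gdb(cmd):
        # Hypothetical stub: only records the request for the main
        # thread; it does not execute anything.
        runnables.append(cmd)

    def json_reader():
        # Pretend the input stream hit EOF right away.
        read_queue.put(None)

    def json_writer():
        while True:
            obj = write_queue.get()
            if obj is None:
                # Exit request: ask the main thread to quit.
                send_gdb("quit")
                break

    reader = threading.Thread(target=json_reader)
    writer = threading.Thread(target=json_writer)
    reader.start()
    writer.start()

    # DAP main thread: forward the EOF marker to the writer.
    if read_queue.get() is None:
        write_queue.put(None)

    reader.join()
    writer.join()
    return runnables
```

After the call returns, the "quit" request is still sitting in the list,
waiting for a main thread that may already be on its way out.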

On the other side, we have the following events:
- the gdb main thread receives a SIGHUP
- the corresponding handler calls quit_force, which calls do_final_cleanups
- one of the final cleanups is finalize_python, which calls Py_Finalize
- quit_force calls exit, which triggers the exit handlers
- one of the exit handlers is the destructor of the runnables vector
- destruction of the vector triggers destruction of the remaining element
- the remaining element is a gdbpy_event, and the destructor (indirectly)
  calls PyGILState_Ensure
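
The ordering hazard in this second list can be shown with a toy model.  The
sketch below simulates it in plain Python and assumes nothing about gdb
internals: FakeInterpreter, PendingEvent and simulate_exit are hypothetical
names, and the drain_before_finalize option is just one possible ordering
shown for contrast, not the fix made by this patch.

```python
class FakeInterpreter:
    """Hypothetical stand-in for the embedded Python runtime."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True

    def ensure_gil(self):
        # Models acquiring the GIL: an error once finalized.
        if self.finalized:
            raise RuntimeError("GIL requested after finalize")


class PendingEvent:
    """Models an event whose cleanup needs the interpreter."""

    def __init__(self, interp):
        self.interp = interp

    def destroy(self):
        self.interp.ensure_gil()


def simulate_exit(drain_before_finalize):
    interp = FakeInterpreter()
    runnables = [PendingEvent(interp)]
    if drain_before_finalize:
        # Destroy pending events while the interpreter is alive.
        while runnables:
            runnables.pop().destroy()
    interp.finalize()
    try:
        # Exit handlers destroy whatever is still queued.
        while runnables:
            runnables.pop().destroy()
    except RuntimeError as exc:
        return str(exc)
    return "clean exit"
```

In this model simulate_exit(False) reports the error, while
simulate_exit(True) exits cleanly because nothing is left in the list once
the interpreter has been finalized.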

It's good to note that both events (EOF and SIGHUP) are caused by this line in
the test-case:
...
catch "close -i $gdb_spawn_id"
...
where "expect close" closes the stdin and stdout file descriptors, which
causes the SIGHUP to be sent.

So, for the system I'm running this on, the send_gdb("quit") is actually not
needed.

I'm not sure if we support any systems where it's actually needed.

Fix this by removing the send_gdb("quit").

Tested on aarch64-linux.

PR dap/31306
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31306
---
 gdb/python/lib/gdb/dap/io.py | 1 -
 1 file changed, 1 deletion(-)


base-commit: a16034bf6417dc2259fef43fd5bcc2dd1dac562f
  

Comments

Tom Tromey Feb. 13, 2024, 6:10 p.m. UTC | #1
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> Fix this by removing the send_gdb("quit").

I thought you were sending a different patch for this?
I think this change by itself is insufficient because it might mean that
gdb exits before all the queued responses are emitted.

Tom
  
Tom de Vries Feb. 14, 2024, 3:39 p.m. UTC | #2
On 2/13/24 19:10, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
> Tom> Fix this by removing the send_gdb("quit").
> 
> I thought you were sending a different patch for this?

This is the patch for PR31306, the assertion failure.

> I think this change by itself is insufficient because it might mean that
> gdb exits before all the queued responses are emitted.

As for PR31380, which is about ensuring responses are flushed to the client
before exiting, indeed there's a race between flushing the write queue
and gdb exiting because of a SIGHUP (which is not caused by the
send_gdb("quit")).  I've mentioned that in the PR.

Thanks,
- Tom
  
Tom Tromey Feb. 14, 2024, 3:46 p.m. UTC | #3
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> On 2/13/24 19:10, Tom Tromey wrote:
>>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
Tom> Fix this by removing the send_gdb("quit").
>> I thought you were sending a different patch for this?

Tom> This is the patch for PR31306, the assertion failure.

>> I think this change by itself is insufficient because it might mean that
>> gdb exits before all the queued responses are emitted.

Tom> As for PR31380, which is about ensuring responses are flushed to
Tom> client before exiting, indeed there's a race between flushing the
Tom> write queue and gdb exiting because of a SIGHUP (which is not caused
Tom> by the send_gdb("quit"), I've mentioned that in the PR.

Alright, it's fine by me if you want to do them separately as well.
Approved-By: Tom Tromey <tom@tromey.com>

Tom
  
Tom Tromey Feb. 23, 2024, 5:01 p.m. UTC | #4
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> When running test-case gdb.dap/eof.exp, we're likely to get a coredump due to
Tom> a segfault in new_threadstate.

Tom> -                send_gdb("quit")

I think we need a different fix for this.

This patch on its own caused a regression in the internal AdaCore test
suite -- the exit status of gdb is now wrong.  Now, I'm not 100% sure
why this is.  Like, maybe the AdaCore test suite is killing gdb if it
pauses.

However, I thought I'd try to reproduce this in the gdb test suite.  I
wrote the appended.

With this patch in place, dap_shutdown just hangs, which happens because
gdb doesn't exit of its own accord.

I tried adding send_gdb("quit") to Server.main_loop, but of course this
just reintroduces the crash here.  But I tend to think this would be the
right thing to do, and so adding some kind of special case in gdb's
Python layer would be appropriate.

Tom

diff --git a/gdb/testsuite/lib/dap-support.exp b/gdb/testsuite/lib/dap-support.exp
index 72c22d00711..54795a34e39 100644
--- a/gdb/testsuite/lib/dap-support.exp
+++ b/gdb/testsuite/lib/dap-support.exp
@@ -400,6 +400,15 @@ proc dap_check_log_file_re { re } {
 proc dap_shutdown {{terminate false}} {
     dap_check_request_and_response "shutdown" disconnect \
 	[format {o terminateDebuggee [l %s]} $terminate]
+
+    # Check gdb's exit status.
+    global gdb_spawn_id
+    set result [wait -i $gdb_spawn_id]
+    gdb_assert {[lindex $result 2] == 0}
+    gdb_assert {[lindex $result 3] == 0}
+
+    clear_gdb_spawn_id
+
     dap_check_log_file
 }
  
Tom Tromey Feb. 23, 2024, 9:08 p.m. UTC | #5
Tom> I tried adding send_gdb("quit") to Server.main_loop, but of course this
Tom> just reintroduces the crash here.  But I tend to think this would be the
Tom> right thing to do, and so adding some kind of special case in gdb's
Tom> Python layer would be appropriate.

I've got a series to clean this up that I will send in a moment.

Tom
  

Patch

diff --git a/gdb/python/lib/gdb/dap/io.py b/gdb/python/lib/gdb/dap/io.py
index 5149edae977..4edd504c727 100644
--- a/gdb/python/lib/gdb/dap/io.py
+++ b/gdb/python/lib/gdb/dap/io.py
@@ -68,7 +68,6 @@  def start_json_writer(stream, queue):
                 # This is an exit request.  The stream is already
                 # flushed, so all that's left to do is request an
                 # exit.
-                send_gdb("quit")
                 break
             obj["seq"] = seq
             seq = seq + 1