[RFC,1/3,gdb/dap] Fix exit race

Message ID 20240207090224.27521-2-tdevries@suse.de
State Superseded
Series: Fix issues triggered by gdb.dap/eof.exp

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Tom de Vries Feb. 7, 2024, 9:02 a.m. UTC
  When running test-case gdb.dap/eof.exp, we're likely to get a coredump due to
a segfault in new_threadstate.

At the point of the core dump, the gdb main thread looks like:
...
 (gdb) bt
 #0  0x0000fffee30d2280 in __pthread_kill_implementation () from /lib64/libc.so.6
 #1  0x0000fffee3085800 [PAC] in raise () from /lib64/libc.so.6
 #2  0x00000000007b03e8 [PAC] in handle_fatal_signal (sig=11)
     at gdb/event-top.c:926
 #3  0x00000000007b0470 in handle_sigsegv (sig=11)
     at gdb/event-top.c:976
 #4  <signal handler called>
 #5  0x0000fffee3a4db14 in new_threadstate () from /lib64/libpython3.12.so.1.0
 #6  0x0000fffee3ab0548 [PAC] in PyGILState_Ensure () from /lib64/libpython3.12.so.1.0
 #7  0x0000000000a6d034 [PAC] in gdbpy_gil::gdbpy_gil (this=0xffffcb279738)
     at gdb/python/python-internal.h:787
 #8  0x0000000000ab87ac in gdbpy_event::~gdbpy_event (this=0xfffea8001ee0,
     __in_chrg=<optimized out>) at gdb/python/python.c:1051
 #9  0x0000000000ab9460 in std::_Function_base::_Base_manager<...>::_M_destroy
     (__victim=...) at /usr/include/c++/13/bits/std_function.h:175
 #10 0x0000000000ab92dc in std::_Function_base::_Base_manager<...>::_M_manager
     (__dest=..., __source=..., __op=std::__destroy_functor)
     at /usr/include/c++/13/bits/std_function.h:203
 #11 0x0000000000ab8f14 in std::_Function_handler<...>::_M_manager(...) (...)
     at /usr/include/c++/13/bits/std_function.h:282
 #12 0x000000000042dd9c in std::_Function_base::~_Function_base (this=0xfffea8001c10,
     __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_function.h:244
 #13 0x000000000042e654 in std::function<void ()>::~function() (this=0xfffea8001c10,
     __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_function.h:334
 #14 0x0000000000b68e60 in std::_Destroy<std::function<void ()> >(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:151
 #15 0x0000000000b68cd0 in std::_Destroy_aux<false>::__destroy<...>(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:163
 #16 0x0000000000b689d8 in std::_Destroy<...>(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:196
 #17 0x0000000000b68414 in std::_Destroy<...>(...) (...)
     at /usr/include/c++/13/bits/alloc_traits.h:948
 #18 std::vector<...>::~vector() (this=0x2a183c8 <runnables>)
     at /usr/include/c++/13/bits/stl_vector.h:732
 #19 0x0000fffee3088370 in __run_exit_handlers () from /lib64/libc.so.6
 #20 0x0000fffee3088450 [PAC] in exit () from /lib64/libc.so.6
 #21 0x0000000000c95600 [PAC] in quit_force (exit_arg=0x0, from_tty=0)
     at gdb/top.c:1822
 #22 0x0000000000609140 in quit_command (args=0x0, from_tty=0)
     at gdb/cli/cli-cmds.c:508
 #23 0x0000000000c926a4 in quit_cover () at gdb/top.c:300
 #24 0x00000000007b09d4 in async_disconnect (arg=0x0)
     at gdb/event-top.c:1230
 #25 0x0000000000548acc in invoke_async_signal_handlers ()
     at gdb/async-event.c:234
 #26 0x000000000157d2d4 in gdb_do_one_event (mstimeout=-1)
     at gdbsupport/event-loop.cc:199
 #27 0x0000000000943a84 in start_event_loop () at gdb/main.c:401
 #28 0x0000000000943bfc in captured_command_loop () at gdb/main.c:465
 #29 0x000000000094567c in captured_main (data=0xffffcb279d08)
     at gdb/main.c:1335
 #30 0x0000000000945700 in gdb_main (args=0xffffcb279d08)
     at gdb/main.c:1354
 #31 0x0000000000423ab4 in main (argc=14, argv=0xffffcb279e98)
     at gdb/gdb.c:39
...

The direct cause of the segfault is calling PyGILState_Ensure after
calling Py_Finalize.

AFAICT the problem is a race between the gdb main thread and DAP's JSON writer
thread.

On one side, we have the following events:
- DAP's JSON reader thread reads an EOF
- it lets DAP's JSON writer thread know by writing None into its queue
- DAP's JSON writer thread sees the None in its queue, and calls
  send_gdb("quit")
- a corresponding gdbpy_event is deposited in the runnables vector, to be
  run by the gdb main thread
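The hand-off described above can be modeled with a small self-contained sketch (simplified; the queue, the None sentinel, and the recorded `events` list are illustrative stand-ins for the real io.py code, not gdb's actual implementation):

```python
import queue
import threading

write_queue = queue.SimpleQueue()
events = []  # records what the writer thread does, for illustration

def json_writer():
    # Simplified model of DAP's JSON writer loop: drain the queue,
    # treating None as the EOF signal posted by the reader thread.
    while True:
        obj = write_queue.get()
        if obj is None:
            # In the real code this point called send_gdb("quit"),
            # which deposits a gdbpy_event in the runnables vector
            # for the gdb main thread to run.
            events.append("quit posted to gdb main thread")
            break
        events.append(("wrote", obj))

t = threading.Thread(target=json_writer, daemon=True)
t.start()
write_queue.put({"seq": 1})
write_queue.put(None)  # the reader thread saw EOF
t.join()
```

The race arises because nothing orders the deposited "quit" event against the main thread's own SIGHUP-driven shutdown, which may already have run Py_Finalize.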

On the other side, we have the following events:
- the gdb main thread receives a SIGHUP
- the corresponding handler calls quit_force, which calls do_final_cleanups
- one of the final cleanups is finalize_python, which calls Py_Finalize
- quit_force calls exit, which triggers the exit handlers
- one of the exit handlers is the destructor of the runnables vector
- destruction of the vector triggers destruction of the remaining element
- the remaining element is a gdbpy_event, and the destructor (indirectly)
  calls PyGILState_Ensure

My first attempt at fixing this was to write a finalize_runnables and call it
from quit_force, similar to finalize_values, to ensure that the gdbpy_event
destructor is called before Py_Finalize.  However, I still ran into the same
problem due to the gdbpy_event being posted after finalize_runnables was
called.  I managed to handle this by ignoring run_on_main_thread after
finalize_runnables, but it made me wonder if there's a better way.

Then I tried to simply remove send_gdb("quit"), and that worked as well, and
caused no regressions.  So, either this is the easiest way to address this, or
we need to add a test-case that regresses when we remove it.

This RFC uses the latter approach.  Perhaps the former is better, perhaps
something else is needed, I'm not sure.

Tested on aarch64-linux.

PR dap/31306
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31306
---
 gdb/python/lib/gdb/dap/io.py | 1 -
 1 file changed, 1 deletion(-)
  

Comments

Tom Tromey Feb. 7, 2024, 4:01 p.m. UTC | #1
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> Then I tried to simply remove send_gdb("quit"), and that worked as well, and
Tom> caused no regressions.  So, either this is the easiest way to address this, or
Tom> we need to add a test-case that regresses when we remove it.

Right now main_loop does:

        # Got the terminate request.  This is handled by the
        # JSON-writing thread, so that we can ensure that all
        # responses are flushed to the client before exiting.
        self.write_queue.put(None)

Like the comment says, this approach tries to ensure that all responses
have been flushed to the client before quitting.  However, I suppose
this approach does not really work, because main_loop then proceeds to
return.

One way to fix this would be to have start_json_writer return the thread
object, and then have main_loop join the thread to terminate after
writing the None.  This would mean making this particular thread
non-daemon though.
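A rough sketch of that suggestion, under the stated assumptions (the function body and queue here are simplified illustrations, not the actual gdb/python/lib/gdb/dap/io.py code):

```python
import queue
import threading

def start_json_writer(q):
    # Returning the Thread object (instead of discarding it) lets
    # main_loop join it after writing the None sentinel, so all
    # responses are flushed before gdb is asked to exit.  The thread
    # must then be non-daemon, otherwise join() guarantees nothing
    # at interpreter shutdown.
    def _json_writer():
        while True:
            obj = q.get()
            if obj is None:
                break
            # ... serialize and flush obj to the client here ...
    t = threading.Thread(target=_json_writer, daemon=False)
    t.start()
    return t

q = queue.SimpleQueue()
writer = start_json_writer(q)
q.put(None)    # terminate request
writer.join()  # block until the writer has drained the queue
```

With this shape, the "quit" no longer needs to come from the writer thread at all.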

In this setup, removing the "quit" here is the right thing to do.

Tom
  
Tom de Vries Feb. 13, 2024, 3:04 p.m. UTC | #2
On 2/7/24 17:01, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
> Tom> Then I tried to simply remove send_gdb("quit"), and that worked as well, and
> Tom> caused no regressions.  So, either this is the easiest way to address this, or
> Tom> we need to add a test-case that regresses when we remove it.
> 
> Right now main_loop does:
> 
>          # Got the terminate request.  This is handled by the
>          # JSON-writing thread, so that we can ensure that all
>          # responses are flushed to the client before exiting.
>          self.write_queue.put(None)
> 
> Like the comment says, this approach tries to ensure that all responses
> have been flushed to the client before quitting.  However, I suppose
> this approach does not really work, because main_loop then proceeds to
> return.
> 
> One way to fix this would be to have start_json_writer return the thread
> object, and then have main_loop join the thread to terminate after
> writing the None.  This would mean making this particular thread
> non-daemon though.
> 

And as I understand it, the downside of that is that it could possibly 
hang the gdb process.

Btw, I think I found a simpler way of achieving this:
...
diff --git a/gdb/python/lib/gdb/dap/io.py b/gdb/python/lib/gdb/dap/io.py
index 5149edae977..491cf208ae3 100644
--- a/gdb/python/lib/gdb/dap/io.py
+++ b/gdb/python/lib/gdb/dap/io.py
@@ -65,10 +65,7 @@ def start_json_writer(stream, queue):
          while True:
              obj = queue.get()
              if obj is None:
-                # This is an exit request.  The stream is already
-                # flushed, so all that's left to do is request an
-                # exit.
-                send_gdb("quit")
+                queue.task_done()
                  break
              obj["seq"] = seq
              seq = seq + 1
@@ -79,5 +76,6 @@ def start_json_writer(stream, queue):
              stream.write(header_bytes)
              stream.write(body_bytes)
              stream.flush()
+            queue.task_done()

      start_thread("JSON writer", _json_writer)
diff --git a/gdb/python/lib/gdb/dap/server.py 
b/gdb/python/lib/gdb/dap/server.py
index 7cc5a4681ee..ca4465ff207 100644
--- a/gdb/python/lib/gdb/dap/server.py
+++ b/gdb/python/lib/gdb/dap/server.py
@@ -229,6 +229,8 @@ class Server:
          # JSON-writing thread, so that we can ensure that all
          # responses are flushed to the client before exiting.
          self.write_queue.put(None)
+        self.write_queue.join()
+        send_gdb("quit")

      @in_dap_thread
      def send_event_later(self, event, body=None):
...

Regardless, this doesn't address the root cause of the problem, the race 
and crash remain.

> In this setup, removing the "quit" here is the right thing to do.

I've submitted an updated version (commit message is updated, code is 
the same) of the patch here ( 
https://sourceware.org/pipermail/gdb-patches/2024-February/206567.html ).

Thanks,
- Tom
  
Tom Tromey Feb. 13, 2024, 6:04 p.m. UTC | #3
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

>> One way to fix this would be to have start_json_writer return the
>> thread
>> object, and then have main_loop join the thread to terminate after
>> writing the None.  This would mean making this particular thread
>> non-daemon though.

Tom> And as I understand it, the downside of that is that it could possibly
Tom> hang the gdb process.

How so?

Tom> +                queue.task_done()

I think SimpleQueue doesn't have task_done, so you have to change this
code to use Queue instead.
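For reference, a minimal sketch of the Queue-based variant (illustrative only; in the real code the queue lives in server.py and the writer loop in io.py):

```python
import queue
import threading

# queue.Queue supports task_done()/join(); queue.SimpleQueue does not,
# so the write_queue would have to switch types for this scheme.
q = queue.Queue()

def writer():
    while True:
        obj = q.get()
        if obj is None:
            q.task_done()
            break
        # ... write and flush obj to the client here ...
        q.task_done()

t = threading.Thread(target=writer, daemon=True)
t.start()
q.put({"seq": 1})
q.put(None)
q.join()  # returns only once every queued item has been task_done()'d
# Only at this point is it safe to request the gdb exit.
```

The join() on the queue gives the ordering guarantee the original comment was after: all responses flushed before quitting.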

Tom>          # JSON-writing thread, so that we can ensure that all
Tom>          # responses are flushed to the client before exiting.
Tom>          self.write_queue.put(None)
Tom> +        self.write_queue.join()
Tom> +        send_gdb("quit")

I suspect this isn't needed.

Tom
  
Tom Tromey Feb. 13, 2024, 6:11 p.m. UTC | #4
>>>>> "Tom" == Tom Tromey <tom@tromey.com> writes:

> +        self.write_queue.join()
> +        send_gdb("quit")

Tom> I suspect this isn't needed.

Sorry -- here I specifically just mean the explicit quit.
However, it's also harmless.

Tom
  
Tom de Vries Feb. 14, 2024, 3:31 p.m. UTC | #5
On 2/13/24 19:04, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
>>> One way to fix this would be to have start_json_writer return the
>>> thread
>>> object, and then have main_loop join the thread to terminate after
>>> writing the None.  This would mean making this particular thread
>>> non-daemon though.
> 
> Tom> And as I understand it, the downside of that is that it could possibly
> Tom> hang the gdb process.
> 
> How so?
> 

I don't have a concrete worry, if that is what you mean.  I'm just trying 
to reverse-engineer the decision to make the dap threads daemons, and 
this is what I could come up with.  I hope this answers your question.

> Tom> +                queue.task_done()
> 
> I think SimpleQueue doesn't have task_done, so you have to change this
> code to use Queue and instead.
> 

Ack, understood.

Anyway, I've now filed a separate PR for this issue, which I hope makes 
the discussion a bit clearer, so we have:
- a PR for the assertion failure (PR31306)
- a PR for ensuring responses are flushed to client before exiting
   (PR31380)

> Tom>          # JSON-writing thread, so that we can ensure that all
> Tom>          # responses are flushed to the client before exiting.
> Tom>          self.write_queue.put(None)
> Tom> +        self.write_queue.join()
> Tom> +        send_gdb("quit")
> 
> I suspect this isn't needed.

I assume you specifically mean the last line.  Removing that line is the 
proposed fix for PR31306, the patch quoted above is for PR31380, which 
we somehow ended up discussing in this thread on PR31306.

Thanks,
- Tom
  
Tom Tromey Feb. 14, 2024, 3:34 p.m. UTC | #6
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> Anyway, I've now filed a separate PR for this issue, which I hope
Tom> makes the discussion a bit clearer, so we have:
Tom> - a PR for the assertion failure (PR31306)
Tom> - a PR for ensuring responses are flushed to client before exiting
Tom>   (PR31380)

Ok.  FWIW I think the Queue idea seems totally fine, and combining these
patches seems natural to me.

Tom
  
Tom de Vries Feb. 14, 2024, 3:53 p.m. UTC | #7
On 2/14/24 16:34, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
> Tom> Anyway, I've now filed a separate PR for this issue, which I hope
> Tom> makes the discussion a bit clearer, so we have:
> Tom> - a PR for the assertion failure (PR31306)
> Tom> - a PR for ensuring responses are flushed to client before exiting
> Tom>   (PR31380)
> 
> Ok.  FWIW I think the Queue idea seems totally fine, and combining these
> patches seems natural to me.

Sorry for going on about this topic, but I'd like to understand why you 
think this.

From my point of view, it's not a good idea to combine the patches 
because the PRs have distinct root causes, which makes the discussion 
about fixing them together confusing.

So, it's not that I _want_ to do this, it's that I'm convinced by the 
evidence I've seen that this is the natural and obvious thing to do.

Which might mean I'm misunderstanding or overlooking something.

I start to wonder if perhaps you're thinking of a different race, which 
is actually not discussed in the two PRs.

Thanks,
- Tom
  
Tom Tromey Feb. 14, 2024, 4:18 p.m. UTC | #8
>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

>> Ok.  FWIW I think the Queue idea seems totally fine, and combining
>> these
>> patches seems natural to me.

Tom> Sorry for going on this topic, but I'd like to understand why you
Tom> think this.

My thinking is that the code is written this way to solve the flushing
issue (perhaps not well).  Removing the quit fixes one issue, but it
risks introducing another.  I'm not really a stickler for this though.

Tom
  
Tom de Vries Feb. 14, 2024, 5:16 p.m. UTC | #9
On 2/14/24 17:18, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
>>> Ok.  FWIW I think the Queue idea seems totally fine, and combining
>>> these
>>> patches seems natural to me.
> 
> Tom> Sorry for going on this topic, but I'd like to understand why you
> Tom> think this.
> 
> My thinking is that the code is written this way to solve the flushing
> issue (perhaps not well).  Removing the quit fixes one issue, but it
> risks introducing another.  I'm not really a stickler for this though.

Thanks for the explanation.

OK, I think I start to understand your position.  You've written both 
pieces of code with the intent to address one problem, which is why 
you're thinking about it as a whole.

I haven't added the code, I'm just looking at the behaviour of two 
different races and am seeing two independent problems.

[ FWIW, I understood the send_gdb("quit") as a means to ensure that 
gdb's main thread doesn't keep hanging after the dap threads have 
exited, which AFAICT is a problem unrelated to the flushing race. ]

To follow up on the risk remark, I'm not sure which risk you're 
referring to.

I've documented one risk in the commit log:
...
     So, for the system I'm running this on, the send_gdb("quit") is
     actually not needed.

     I'm not sure if we support any systems where it's actually needed.
...
There's a risk that indeed there are such systems, but I figured we can 
deal with that when we encounter them.

If you mean the risk of not flushing in time, I don't think there's any 
relation with send_gdb("quit").  The problem with not flushing in time 
is there, and AFAIU the problem is not made any better or worse by 
removing the send_gdb("quit").

I'll proceed to commit the approved patch, since I'm keen on getting rid 
of the segfault.

As for the flushing race, as I've documented in the PR, the queue 
approach doesn't completely fix the race.  I wonder if using a 
gdb_exiting hook to wait for the main DAP thread to exit is the way to 
go there.

Thanks,
- Tom
  

Patch

diff --git a/gdb/python/lib/gdb/dap/io.py b/gdb/python/lib/gdb/dap/io.py
index 5149edae977..4edd504c727 100644
--- a/gdb/python/lib/gdb/dap/io.py
+++ b/gdb/python/lib/gdb/dap/io.py
@@ -68,7 +68,6 @@  def start_json_writer(stream, queue):
                 # This is an exit request.  The stream is already
                 # flushed, so all that's left to do is request an
                 # exit.
-                send_gdb("quit")
                 break
             obj["seq"] = seq
             seq = seq + 1