I'm posting this as RFC. I started looking at this when I got a CI
failure email from Linaro about a regression on
gdb.base/gdb-sigerm.exp on ARM. Turns out is wasn't my fault, it's
just the precise failure point moves about, so it can look like a
regression.
After looking at it for a bit I realised this was PR gdb/31061, and
there have already been attempts to fix this over the last few years.
What I have here is different than any of the previous approaches
posted, but I'm still not entirely sure this is the right solution,
but I thought I'd share it. Might be nice to see if we can get this
fixed.
The patch might still need some cleanup, but it should be good enough
to discuss this approach.
Thanks,
Andrew
---
This is another attempt to fix PR gdb/31061. There have been two
previous attempts:
https://inbox.sourceware.org/gdb-patches/20231113152609.32726-1-tdevries@suse.de
https://inbox.sourceware.org/gdb-patches/20231128150121.23760-1-tdevries@suse.de
But PR gdb/31061 will only manifest on targets that make use of
software single stepping for most (or all) of their single stepping.
Some common targets where is applies are ARM and RISC-V. However,
there is an interesting series which makes it possible to reproduce
this bug on x86-64 by adding a 'maintenance' switch that has x86-64
perform software single stepping in enough cases that the bug can
manifest, see here:
https://inbox.sourceware.org/gdb-patches/20231129203326.11952-1-tdevries@suse.de
The cause of the bug is a SIGTERM arriving while GDB is within a call
to fetch_inferior_event. The SIGTERM arrives which causes
sync_quit_force_run to be set (see handle_sigerm in event-top.c).
This is then caught be a call to maybe_quit (via the QUIT macro)in
target_write_with_progress, which is called as part of
memory_remove_breakpoint.
GDB's stack at the point quit() is called:
(top-gdb) bt 25
#0 quit () at ../../src/gdb/event-top.c:1176
#1 0x0000000000925035 in maybe_quit () at ../../src/gdb/event-top.c:1206
#2 0x0000000000f0df85 in target_write_with_progress (ops=0x31db680 <the_amd64_linux_nat_target>, object=TARGET_OBJECT_RAW_MEMORY, annex=0x0, buf=0x45f7414 "\220", offset=4198802, len=1, progress=0x0, baton=0x0) at ../../src/gdb/target.c:2223
#3 0x0000000000f0dfd8 in target_write (ops=0x31db680 <the_amd64_linux_nat_target>, object=TARGET_OBJECT_RAW_MEMORY, annex=0x0, buf=0x45f7414 "\220", offset=4198802, len=1) at ../../src/gdb/target.c:2236
#4 0x0000000000f0d39a in target_write_raw_memory (memaddr=0x401192, myaddr=0x45f7414 "\220", len=1) at ../../src/gdb/target.c:1859
#5 0x0000000000b3b2dc in default_memory_remove_breakpoint (gdbarch=0x3ef4590, bp_tgt=0x45f73f8) at ../../src/gdb/mem-break.c:82
#6 0x00000000005157a5 in gdbarch_memory_remove_breakpoint (gdbarch=0x3ef4590, bp_tgt=0x45f73f8) at ../../src/gdb/gdbarch-gen.c:2961
#7 0x0000000000b3b331 in memory_remove_breakpoint (ops=0x31db680 <the_amd64_linux_nat_target>, gdbarch=0x3ef4590, bp_tgt=0x45f73f8, reason=REMOVE_BREAKPOINT) at ../../src/gdb/mem-break.c:99
#8 0x00000000004e5504 in memory_breakpoint_target<process_stratum_target>::remove_breakpoint (this=0x31db680 <the_amd64_linux_nat_target>, gdbarch=0x3ef4590, bp_tgt=0x45f73f8, reason=REMOVE_BREAKPOINT) at ../../src/gdb/target.h:2495
#9 0x0000000000f0e512 in target_remove_breakpoint (gdbarch=0x3ef4590, bp_tgt=0x45f73f8, reason=REMOVE_BREAKPOINT) at ../../src/gdb/target.c:2391
#10 0x00000000005e6e68 in code_breakpoint::remove_location (this=0x456cc30, bl=0x45f7350, reason=REMOVE_BREAKPOINT) at ../../src/gdb/breakpoint.c:12057
#11 0x00000000005d0df3 in remove_breakpoint_1 (bl=0x45f7350, reason=REMOVE_BREAKPOINT) at ../../src/gdb/breakpoint.c:4207
#12 0x00000000005d1147 in remove_breakpoint (bl=0x45f7350) at ../../src/gdb/breakpoint.c:4312
#13 0x00000000005e5c97 in update_global_location_list (insert_mode=UGLL_DONT_INSERT) at ../../src/gdb/breakpoint.c:11607
#14 0x00000000005e86b2 in delete_breakpoint (bpt=0x456cc30) at ../../src/gdb/breakpoint.c:12700
#15 0x0000000000f30ae2 in delete_thread_breakpoint (bp_p=0x394cbf8) at ../../src/gdb/thread.c:99
#16 0x0000000000f30b5f in delete_single_step_breakpoints (tp=0x394cba0) at ../../src/gdb/thread.c:124
#17 0x0000000000a35ae5 in for_each_just_stopped_thread (func=0xf30b3c <delete_single_step_breakpoints(thread_info*)>) at ../../src/gdb/infrun.c:3955
#18 0x0000000000a35bc0 in delete_just_stopped_threads_single_step_breakpoints () at ../../src/gdb/infrun.c:3980
#19 0x0000000000a3d840 in handle_signal_stop (ecs=0x7ffd3ce53060) at ../../src/gdb/infrun.c:6974
#20 0x0000000000a3c58d in handle_inferior_event (ecs=0x7ffd3ce53060) at ../../src/gdb/infrun.c:6560
#21 0x0000000000a37bf4 in fetch_inferior_event () at ../../src/gdb/infrun.c:4702
#22 0x0000000000a0bfc4 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
#23 0x0000000000a45813 in infrun_async_inferior_event_handler (data=0x0) at ../../src/gdb/infrun.c:10454
#24 0x000000000057bf3a in check_async_event_handlers () at ../../src/gdb/async-event.c:337
(More stack frames follow...)
Because sync_quit_force_run is set, quit() raises a
gdb_exception_forced_quit exception which causes GDB to unwind the
stack all the way back to before the fetch_inferior_event call.
As a consequence, the inferior breakpoints will not be fully removed,
and the delete_single_step_breakpoints and delete_thread_breakpoint
calls will not complete, which means that the thread breakpoint stored
in the thread will not be set back to nullptr.
This is a problem, because delete_breakpoint has already started to
tear down the breakpoint; importantly, the breakpoint has already been
removed from the breakpoint_chain.
As the gdb_exception_forced_quit passes through fetch_inferior_event,
the SCOPE_EXIT handlers are invoked, including the one that calls
delete_just_stopped_threads_infrun_breakpoints. This results in
another call to delete_single_step_breakpoints, which tries again to
delete the breakpoint that GDB already tried to delete. GDB's stack
at this point looks like this:
(top-gdb) bt 13
#0 delete_breakpoint (bpt=0x456cc30) at ../../src/gdb/breakpoint.c:12625
#1 0x0000000000f30ae2 in delete_thread_breakpoint (bp_p=0x394cbf8) at ../../src/gdb/thread.c:99
#2 0x0000000000f30b5f in delete_single_step_breakpoints (tp=0x394cba0) at ../../src/gdb/thread.c:124
#3 0x0000000000a35a7d in delete_thread_infrun_breakpoints (tp=0x394cba0) at ../../src/gdb/infrun.c:3936
#4 0x0000000000a35ae5 in for_each_just_stopped_thread (func=0xa35a4d <delete_thread_infrun_breakpoints(thread_info*)>) at ../../src/gdb/infrun.c:3955
#5 0x0000000000a35baf in delete_just_stopped_threads_infrun_breakpoints () at ../../src/gdb/infrun.c:3971
#6 0x000000000060358a in scope_exit<void (*)()>::on_exit (this=0x7ffd3ce52f80) at ../../src/gdb/../gdbsupport/scope-exit.h:135
#7 0x00000000005fb000 in scope_exit_base<scope_exit<void (*)()> >::~scope_exit_base (this=0x7ffd3ce52f80, __in_chrg=<optimized out>) at ../../src/gdb/../gdbsupport/scope-exit.h:68
#8 0x00000000005f64ea in scope_exit<void (*)()>::~scope_exit (this=0x7ffd3ce52f80, __in_chrg=<optimized out>) at ../../src/gdb/../gdbsupport/scope-exit.h:91
#9 0x0000000000a3806a in fetch_inferior_event () at ../../src/gdb/infrun.c:4697
#10 0x0000000000a0bfc4 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
#11 0x0000000000a45813 in infrun_async_inferior_event_handler (data=0x0) at ../../src/gdb/infrun.c:10454
#12 0x000000000057bf3a in check_async_event_handlers () at ../../src/gdb/async-event.c:337
(More stack frames follow...)
(top-gdb)
As the breakpoint is no longer in the breakpoint_chain though, this
results in an exception:
../../src/gdb/../gdbsupport/intrusive_list.h:469: internal-error: unlink_element: Assertion `elem_node->prev != INTRUSIVE_LIST_UNLINKED_VALUE' failed.
At this points GDB's stack is:
(top-gdb) bt 16
#0 internal_error_loc (file=0x18eb9b8 "../../src/gdb/../gdbsupport/intrusive_list.h", line=469, fmt=0x18e0080 "%s: Assertion `%s' failed.") at ../../src/gdbsupport/errors.cc:56
#1 0x0000000000607eb2 in intrusive_list<breakpoint, intrusive_base_node<breakpoint> >::unlink_element (this=0x31dd7e0 <breakpoint_chain>, elem=...) at ../../src/gdb/../gdbsupport/intrusive_list.h:469
#2 0x00000000005fef64 in intrusive_list<breakpoint, intrusive_base_node<breakpoint> >::erase (this=0x31dd7e0 <breakpoint_chain>, i=...) at ../../src/gdb/../gdbsupport/intrusive_list.h:508
#3 0x00000000005e868b in delete_breakpoint (bpt=0x456cc30) at ../../src/gdb/breakpoint.c:12676
#4 0x0000000000f30ae2 in delete_thread_breakpoint (bp_p=0x394cbf8) at ../../src/gdb/thread.c:99
#5 0x0000000000f30b5f in delete_single_step_breakpoints (tp=0x394cba0) at ../../src/gdb/thread.c:124
#6 0x0000000000a35a7d in delete_thread_infrun_breakpoints (tp=0x394cba0) at ../../src/gdb/infrun.c:3936
#7 0x0000000000a35ae5 in for_each_just_stopped_thread (func=0xa35a4d <delete_thread_infrun_breakpoints(thread_info*)>) at ../../src/gdb/infrun.c:3955
#8 0x0000000000a35baf in delete_just_stopped_threads_infrun_breakpoints () at ../../src/gdb/infrun.c:3971
#9 0x000000000060358a in scope_exit<void (*)()>::on_exit (this=0x7ffd3ce52f80) at ../../src/gdb/../gdbsupport/scope-exit.h:135
#10 0x00000000005fb000 in scope_exit_base<scope_exit<void (*)()> >::~scope_exit_base (this=0x7ffd3ce52f80, __in_chrg=<optimized out>) at ../../src/gdb/../gdbsupport/scope-exit.h:68
#11 0x00000000005f64ea in scope_exit<void (*)()>::~scope_exit (this=0x7ffd3ce52f80, __in_chrg=<optimized out>) at ../../src/gdb/../gdbsupport/scope-exit.h:91
#12 0x0000000000a3806a in fetch_inferior_event () at ../../src/gdb/infrun.c:4697
#13 0x0000000000a0bfc4 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
#14 0x0000000000a45813 in infrun_async_inferior_event_handler (data=0x0) at ../../src/gdb/infrun.c:10454
#15 0x000000000057bf3a in check_async_event_handlers () at ../../src/gdb/async-event.c:337
(More stack frames follow...)
(top-gdb)
The first attempt proposed solving this bug by spotting that the
breakpoint had already been removed from the breakpoint_chain, and
just skipping that part of delete_breakpoint. This certainly avoided
the assertion, but maybe wasn't ideal. Along with being removed from
breakpoint_chain, the breakpoint would have had its locations
cleared. The first call to update_global_location_list will have then
removed the deleted breakpoint's location from the global location
list, but the attempt to remove the location from the inferior was
interrupted, so the location is still in the inferior. The second
delete_breakpoint call, because of the earlier work done, isn't going
to cause the locations to actually be removed from the inferior.
The second attempt, written after some discussion on the mailing list
in the V1 thread, takes a different approach. This time a mechanism
is introduced to suppress the maybe_quit() function. This suppresses
all calls to quit(), the forced quit (SIGTERM) handling, and the
"normal" quit (SIGINT) handling. This suppression is then put in
place for the duration of delete_breakpoint.
Though this is an improvement over the first approach, I worry that
this might be going too far. If a remote target is in use then GDB
relies on the maybe_quit() function to call the quit_handler in order
to allow the remote target to be interrupted.
We could potentially improve this proposal by maybe moving where the
suppression is implemented, but I wondered if there was an alternative
approach we could try, which is what I've done here.
This approach is similar to the second approach in many ways, but
doesn't require adding a general quit suppression mechanism. Instead,
based on comments in the V1 thread, I observe that in
fetch_inferior_event, we already partially disable Ctrl-C handling
when we install infrun_quit_handler as the quit_handler. The
motivation for this is that we don't want a gdb_exception_quit to be
thrown and leave GDB in a weird state.
But I think everything that is written in the comments in
fetch_inferior_event can also apply to gdb_exception_forced_quit. So
we should be blocking that too.
However, while the quit_handler is run for SIGINT, and can be swapped
out at various points within GDB, the steps that result in a
gdb_exception_forced_quit exception being thrown are much simpler.
For this case we get SIGTERM, which sets sync_quit_force_run, which in
then converted to gdb_exception_forced_quit when a QUIT macro is
reached.
I suggest that the simplest way to prevent gdb_exception_forced_quit
is to just block SIGTERM for the duration of the critical section of
GDB, in this case, I suggest, we should block SIGTERM in
fetch_inferior_event.
We also need to be sure to temporarily reset sync_quit_force_run to
false after blocking SIGTERM so that a later QUIT macro doesn't
result in a gdb_exception_forced_quit being thrown, but this is all
handled by a new helper RAII class.
With this fix in place I now see the gdb.base/gdb-sigterm.exp pass
without issue on ARM, and gdb.base/gdb-sigerm-3.exp (from one of the
threads mentioned at the start of this message) pass without issue on
x86-64.
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31061
---
gdb/infrun.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
base-commit: 58a32fd620dc22c36b838a613e72c1083fbe2a2e
@@ -78,6 +78,7 @@
#include "extension.h"
#include "disasm.h"
#include "interps.h"
+#include "gdbsupport/scoped_ignore_signal.h"
/* Prototypes for local functions */
@@ -4574,6 +4575,31 @@ infrun_quit_handler ()
}
}
+/* Block SIGTERM and then clear sync_quit_force_run. When we go out of
+ scope, restore the previous sync_quit_force_run value, and then unblock
+ signals.
+
+ This should maybe live in a support file somewhere, but it needs to see
+ sync_quit_force_run, so likely needs to live in the gdb/ directory. */
+
+struct scoped_ignore_sigterm
+{
+ scoped_ignore_sigterm ()
+ : m_old_val (sync_quit_force_run)
+ {
+ sync_quit_force_run = false;
+ }
+
+ ~scoped_ignore_sigterm ()
+ {
+ sync_quit_force_run = m_old_val;
+ }
+
+private:
+ scoped_ignore_signal<SIGTERM, false> m_ignore_signal;
+ bool m_old_val;
+};
+
/* Asynchronous version of wait_for_inferior. It is called by the
event loop whenever a change of state is detected on the file
descriptor corresponding to the target. It can be called more than
@@ -4609,6 +4635,16 @@ fetch_inferior_event ()
scoped_restore restore_quit_handler
= make_scoped_restore (&quit_handler, infrun_quit_handler);
+ /* Similar to how the above custom quit handler ignores the quit flag
+ (thus not interrupting GDB on receipt of Ctrl-C), this arranges to
+ block SIGTERM while we are handling the inferior event. Any SIGTERM
+ will be deferred until this function is done. Usually SIGTERM is
+ converted to an exception by the QUIT macro, but doing that while
+ processing an inferior event can leave the inferior in a weird state,
+ e.g. some breakpoints not removed. Deferring SIGTERM handling until
+ after this function means the event should have been fully handled. */
+ scoped_ignore_sigterm ignore_sigterm;
+
/* Make sure a SIGINT does not interrupt an extension language while
we're handling an event. That could interrupt a Python unwinder
or a Python observer or some such. A Ctrl-C should either be