[RFA,1/2,master+8.3] (Windows) fix thr != nullptr assert failure in delete_thread_1

Message ID 1555453982-77808-2-git-send-email-brobecker@adacore.com
State New, archived
Headers

Commit Message

Joel Brobecker April 16, 2019, 10:33 p.m. UTC
  Hello,

We have observed that GDB would randomly trip the the following
assertion failure when debugging on Windows. When allowing
the program to run until the inferior exits, we occasionally see:

     (gdb) cont
     Continuing.
     [Thread 48192.0xd100 exited with code 1]
     [Thread 48192.0x10ad8 exited with code 1]
     [Thread 48192.0x36e28 exited with code 0]
     [Thread 48192.0x52be4 exited with code 0]
     [Thread 48192.0x5aa40 exited with code 0]
     ../../src/gdb/thread.c:453: internal-error: void delete_thread_1(thread_inf
o*, bool): Assertion `thr != nullptr' failed.

Running the same scenario with some additional traces enabled...

    (gdb) set verbose
    (gdb) set debugevents

... allows us to understand what the issue is. To understand, we need
to first look at the events received when starting the program, and
in particular which threads got created how. First, we get a
CREATE_PROCESS_DEBUG_EVENT for tid=0x442a8:

    gdb: kernel event for pid=317536 tid=0x442a8 code=CREATE_PROCESS_DEBUG_EVENT)

Shortly after, we get some CREATE_THREAD_DEBUG_EVENT events,
one of them being for tid=0x4010c:

    gdb: kernel event for pid=317536 tid=0x4010c code=CREATE_THREAD_DEBUG_EVENT)
Fast forward a bit of debugging, and we do a "cont" as above,
at which point the programs reaches the end, and the system reports
"exit" events. The first interesting one is the following:

    gdb: kernel event for pid=317536 tid=0x442a8 code=EXIT_THREAD_DEBUG_EVENT)

This is reporting a thread-exit event for a thread whose tid
is the TID of what we call the "main thread". That's the thread
that was created when we received the CREATE_PROCESS_DEBUG_EVENT
notification, and whose TID is actually stored in a global variable
called main_thread_id. This is not something we expected, as
the assumption we made was that the main thread would exit last,
and we would be notified of it via an EXIT_PROCESS_DEBUG_EVENT.
But apparently, this is not always true, at least on Windows Server
2012 and 2016 where this issue has been observed happening randomly.

The consequence of the above notification is that we call
windows_delete_thread for that thread, which removes it from
our list of known threads.

And a little bit later, then we then get the EXIT_PROCESS_DEBUG_EVENT,
and we can see that the associated tid is not the main_thread_id,
but rather the tid of one of the threads that was created during
the lifetime of the program, in this case tid=0x4010c:

    gdb: kernel event for pid=317536 tid=0x4010c code=EXIT_PROCESS_DEBUG_EVENT)

And the debug trace printed right after shows why we're crashing:

    [Deleting Thread 317536.0x442a8]

We are trying to delete the thread whose tid=0x442a8, which is
the main_thread_id! As we have already deleted that thread before,
the search for it returns a nullptr, which then trips the assertion
check in delete_thread_1.

This commit fixes the immediate issue; but it still ignores the open
question of what to do with the main_thread_id global, particularly
after that thread has been removed from our list of threads. The bigger
question is kept separate from this patch, because this smaller patch
seems to work well enough on its own in practice, and make the patch
less complex, thus a possible candidate for backporting to a release
branch (thinking of the gdb-8.3-branch, in this case).

Another change that this commit makes is re-instate the thread create/
exit notifications for what we thought was the main thread. We added
some filtering recently to not show those notifications. Otherwise,
with what we've discovered, it is possible that the list of threads
reported by the "new thread" notifications does not match the list
of threads from the "thread exit" notifications.

gdb/ChangeLog:

	* windows-nat.c (windows_add_thread): Remove parameter "main_thread_p".
	Adjust function accordingly.  Update all callers.
	(windows_delete_thread): Likewise.
	(get_windows_debug_event) <EXIT_PROCESS_DEBUG_EVENT>: Use
        current_event.dwThreadId instead of main_thread_id.

Tested on both x86-windows and x86_64-windows using AdaCore's testsuite.
The portion of the patch that uses current_event.dwThreadId instead of
main_thread_id has undergood about a couple of weeks of nightly testing
on various Windows versions.

OK to master?
OK for 8.3?

Thanks!

---
 gdb/windows-nat.c | 58 ++++++++++++++++++++++++-------------------------------
 1 file changed, 25 insertions(+), 33 deletions(-)
  

Comments

Pedro Alves April 18, 2019, 2:52 p.m. UTC | #1
On 4/16/19 11:33 PM, Joel Brobecker wrote:
> Hello,
> 
> We have observed that GDB would randomly trip the the following

"the the"

> assertion failure when debugging on Windows. When allowing
> the program to run until the inferior exits, we occasionally see:
> 
>      (gdb) cont
>      Continuing.
>      [Thread 48192.0xd100 exited with code 1]
>      [Thread 48192.0x10ad8 exited with code 1]
>      [Thread 48192.0x36e28 exited with code 0]
>      [Thread 48192.0x52be4 exited with code 0]
>      [Thread 48192.0x5aa40 exited with code 0]
>      ../../src/gdb/thread.c:453: internal-error: void delete_thread_1(thread_inf
> o*, bool): Assertion `thr != nullptr' failed.
> 
> Running the same scenario with some additional traces enabled...
> 
>     (gdb) set verbose
>     (gdb) set debugevents
> 
> ... allows us to understand what the issue is. To understand, we need
> to first look at the events received when starting the program, and
> in particular which threads got created how. First, we get a
> CREATE_PROCESS_DEBUG_EVENT for tid=0x442a8:
> 
>     gdb: kernel event for pid=317536 tid=0x442a8 code=CREATE_PROCESS_DEBUG_EVENT)
> 
> Shortly after, we get some CREATE_THREAD_DEBUG_EVENT events,
> one of them being for tid=0x4010c:
> 
>     gdb: kernel event for pid=317536 tid=0x4010c code=CREATE_THREAD_DEBUG_EVENT)
> Fast forward a bit of debugging, and we do a "cont" as above,
> at which point the programs reaches the end, and the system reports
> "exit" events. The first interesting one is the following:
> 
>     gdb: kernel event for pid=317536 tid=0x442a8 code=EXIT_THREAD_DEBUG_EVENT)
> 
> This is reporting a thread-exit event for a thread whose tid
> is the TID of what we call the "main thread". That's the thread
> that was created when we received the CREATE_PROCESS_DEBUG_EVENT
> notification, and whose TID is actually stored in a global variable
> called main_thread_id. This is not something we expected, as
> the assumption we made was that the main thread would exit last,
> and we would be notified of it via an EXIT_PROCESS_DEBUG_EVENT.
> But apparently, this is not always true, at least on Windows Server
> 2012 and 2016 where this issue has been observed happening randomly.
> 
> The consequence of the above notification is that we call
> windows_delete_thread for that thread, which removes it from
> our list of known threads.
> 
> And a little bit later, then we then get the EXIT_PROCESS_DEBUG_EVENT,
> and we can see that the associated tid is not the main_thread_id,
> but rather the tid of one of the threads that was created during
> the lifetime of the program, in this case tid=0x4010c:
> 
>     gdb: kernel event for pid=317536 tid=0x4010c code=EXIT_PROCESS_DEBUG_EVENT)
> 
> And the debug trace printed right after shows why we're crashing:
> 
>     [Deleting Thread 317536.0x442a8]
> 
> We are trying to delete the thread whose tid=0x442a8, which is
> the main_thread_id! As we have already deleted that thread before,
> the search for it returns a nullptr, which then trips the assertion
> check in delete_thread_1.
> 
> This commit fixes the immediate issue; but it still ignores the open
> question of what to do with the main_thread_id global, particularly
> after that thread has been removed from our list of threads. The bigger
> question is kept separate from this patch, because this smaller patch
> seems to work well enough on its own in practice, and make the patch
> less complex, thus a possible candidate for backporting to a release
> branch (thinking of the gdb-8.3-branch, in this case).
> 
> Another change that this commit makes is re-instate the thread create/
> exit notifications for what we thought was the main thread. We added
> some filtering recently to not show those notifications. Otherwise,
> with what we've discovered, it is possible that the list of threads
> reported by the "new thread" notifications does not match the list
> of threads from the "thread exit" notifications.
> 
> gdb/ChangeLog:
> 
> 	* windows-nat.c (windows_add_thread): Remove parameter "main_thread_p".
> 	Adjust function accordingly.  Update all callers.
> 	(windows_delete_thread): Likewise.
> 	(get_windows_debug_event) <EXIT_PROCESS_DEBUG_EVENT>: Use
>         current_event.dwThreadId instead of main_thread_id.
> 
> Tested on both x86-windows and x86_64-windows using AdaCore's testsuite.
> The portion of the patch that uses current_event.dwThreadId instead of
> main_thread_id has undergood about a couple of weeks of nightly testing
> on various Windows versions.

"undergood"?

> 
> OK to master?
> OK for 8.3?
> 
> Thanks!
> 
> ---
>  gdb/windows-nat.c | 58 ++++++++++++++++++++++++-------------------------------
>  1 file changed, 25 insertions(+), 33 deletions(-)
> 
> diff --git a/gdb/windows-nat.c b/gdb/windows-nat.c
> index 5009418..039eb13 100644
> --- a/gdb/windows-nat.c
> +++ b/gdb/windows-nat.c
> @@ -430,12 +430,10 @@ thread_rec (DWORD id, int get_context)
>  
>     PTID is the ptid of the thread to be added.
>     H is its Windows handle.
> -   TLB is its thread local base.
> -   MAIN_THREAD_P should be true if the thread to be added is
> -   the main thread, false otherwise.  */
> +   TLB is its thread local base.  */
>  
>  static windows_thread_info *
> -windows_add_thread (ptid_t ptid, HANDLE h, void *tlb, bool main_thread_p)
> +windows_add_thread (ptid_t ptid, HANDLE h, void *tlb)
>  {
>    windows_thread_info *th;
>    DWORD id;
> @@ -456,13 +454,18 @@ windows_add_thread (ptid_t ptid, HANDLE h, void *tlb, bool main_thread_p)
>  
>    /* Add this new thread to the list of threads.
>  
> -     To be consistent with what's done on other platforms, we add
> -     the main thread silently (in reality, this thread is really
> -     more of a process to the user than a thread).  */
> -  if (main_thread_p)
> -    add_thread_silent (ptid);
> -  else
> -    add_thread (ptid);
> +     On many platforms, we add the main program's thread silently.
> +     But unfortunately, it does not seem to identify such thread

Not clear what "it" is referring to here.  Maybe you
meant -- "it does not seem possible" ?

> +     on Windows. We used to think that the main program's thread

Double space after periods.

> +     is the thread reported by the CREATE_PROCESS_DEBUG_EVENT, but
> +     we've seen since then that this thread sometimes exits before
> +     other threads in the program -- when that happens, what we see
> +     is that the CREATE_PROCESS_DEBUG_EVENT thread gets reported
> +     as exited via an EXIT_THREAD_DEBUG_EVENT, and another thread
> +     gets used in the EXIT_PROCESS_DEBUG_EVENT notification instead.

I'd think that the thread reported in the CREATE_PROCESS_DEBUG_EVENT
event _is_ always the main thread, but that then it's possible for that
thread to _not_ be the last to exit the process.  Seems more logical to
me than what the comments above, and the commit log, suggest.

Note that <https://docs.microsoft.com/en-us/windows/desktop/api/minwinbase/ns-minwinbase-create_process_debug_info>
says that the CREATE_PROCESS_DEBUG_INFO structure includes a field for
the _initial_ thread: current_event.u.CreateProcessInfo.hThread.
The initial thread is what we call the main thread.

We're reading that handle when creating the gdb thread:

      main_thread_id = current_event.dwThreadId;
      /* Add the main thread.  */
      th = windows_add_thread
        (ptid_t (current_event.dwProcessId, 0,
		 current_event.dwThreadId),
	 current_event.u.CreateProcessInfo.hThread,
	 current_event.u.CreateProcessInfo.lpThreadLocalBase,

And I'd be very surprised if that handle (and thus the thread id
reported by the event), wasn't always the initial thread.

Note also that
<https://docs.microsoft.com/en-us/windows/desktop/api/minwinbase/ns-minwinbase-exit_process_debug_info>
shows that the EXIT_PROCESS_DEBUG_INFO structure does not have an associated
thread handle at all.  This seems to add a bit more strength to the hypothesis that it's the
assumption that the main thread is always the last to exit that is what's incorrect,
not that we can't determine which is the main/initial thread.

From <https://docs.microsoft.com/en-us/windows/desktop/procthread/terminating-a-thread>,
we see:

  "If the thread is the only active thread in the process, the process is terminated."

which again suggests that there's nothing special about the initial thread wrt to
process termination.

Did you try writing a small test program with a couple threads, that has
the main thread call ExitThread, or better, TerminateThread, while the other
thread still runs?  I wouldn't be surprised if that reproduced the scenario
in question.

> @@ -1637,11 +1629,11 @@ get_windows_debug_event (struct target_ops *ops,
>        else if (saw_create == 1)
>  	{
>  	  windows_delete_thread (ptid_t (current_event.dwProcessId, 0,
> -					 main_thread_id),
> -				 0, true /* main_thread_p */);
> +					 current_event.dwThreadId),
> +				 0);
>  	  ourstatus->kind = TARGET_WAITKIND_EXITED;
>  	  ourstatus->value.integer = current_event.u.ExitProcess.dwExitCode;
> -	  thread_id = main_thread_id;
> +	  thread_id = current_event.dwThreadId;
>  	}
>        break;

This looks right.

Thanks,
Pedro Alves
  
Pedro Alves April 18, 2019, 3:04 p.m. UTC | #2
On 4/18/19 3:52 PM, Pedro Alves wrote:

>> @@ -1637,11 +1629,11 @@ get_windows_debug_event (struct target_ops *ops,
>>        else if (saw_create == 1)
>>  	{
>>  	  windows_delete_thread (ptid_t (current_event.dwProcessId, 0,
>> -					 main_thread_id),
>> -				 0, true /* main_thread_p */);
>> +					 current_event.dwThreadId),
>> +				 0);
>>  	  ourstatus->kind = TARGET_WAITKIND_EXITED;
>>  	  ourstatus->value.integer = current_event.u.ExitProcess.dwExitCode;
>> -	  thread_id = main_thread_id;
>> +	  thread_id = current_event.dwThreadId;
>>  	}
>>        break;
> 
> This looks right.

I forgot to add: reading this made me wonder -- what about gdbserver?
I looked, and while gdbserver/win32-low.c had an equivalent of main_thread_id,
it doesn't use it for EXIT_PROCESS_DEBUG_EVENT either.

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/windows-nat.c b/gdb/windows-nat.c
index 5009418..039eb13 100644
--- a/gdb/windows-nat.c
+++ b/gdb/windows-nat.c
@@ -430,12 +430,10 @@  thread_rec (DWORD id, int get_context)
 
    PTID is the ptid of the thread to be added.
    H is its Windows handle.
-   TLB is its thread local base.
-   MAIN_THREAD_P should be true if the thread to be added is
-   the main thread, false otherwise.  */
+   TLB is its thread local base.  */
 
 static windows_thread_info *
-windows_add_thread (ptid_t ptid, HANDLE h, void *tlb, bool main_thread_p)
+windows_add_thread (ptid_t ptid, HANDLE h, void *tlb)
 {
   windows_thread_info *th;
   DWORD id;
@@ -456,13 +454,18 @@  windows_add_thread (ptid_t ptid, HANDLE h, void *tlb, bool main_thread_p)
 
   /* Add this new thread to the list of threads.
 
-     To be consistent with what's done on other platforms, we add
-     the main thread silently (in reality, this thread is really
-     more of a process to the user than a thread).  */
-  if (main_thread_p)
-    add_thread_silent (ptid);
-  else
-    add_thread (ptid);
+     On many platforms, we add the main program's thread silently.
+     But unfortunately, it does not seem to identify such thread
+     on Windows. We used to think that the main program's thread
+     is the thread reported by the CREATE_PROCESS_DEBUG_EVENT, but
+     we've seen since then that this thread sometimes exits before
+     other threads in the program -- when that happens, what we see
+     is that the CREATE_PROCESS_DEBUG_EVENT thread gets reported
+     as exited via an EXIT_THREAD_DEBUG_EVENT, and another thread
+     gets used in the EXIT_PROCESS_DEBUG_EVENT notification instead.
+
+     So, we just add all threads verbosely.  */
+  add_thread (ptid);
 
   /* Set the debug registers for the new thread if they are used.  */
   if (debug_registers_used)
@@ -503,12 +506,10 @@  windows_init_thread_list (void)
 /* Delete a thread from the list of threads.
 
    PTID is the ptid of the thread to be deleted.
-   EXIT_CODE is the thread's exit code.
-   MAIN_THREAD_P should be true if the thread to be deleted is
-   the main thread, false otherwise.  */
+   EXIT_CODE is the thread's exit code.  */
 
 static void
-windows_delete_thread (ptid_t ptid, DWORD exit_code, bool main_thread_p)
+windows_delete_thread (ptid_t ptid, DWORD exit_code)
 {
   windows_thread_info *th;
   DWORD id;
@@ -517,16 +518,11 @@  windows_delete_thread (ptid_t ptid, DWORD exit_code, bool main_thread_p)
 
   id = ptid.tid ();
 
-  /* Emit a notification about the thread being deleted.
-
-     Note that no notification was printed when the main thread
-     was created, and thus, unless in verbose mode, we should be
-     symetrical, and avoid that notification for the main thread
-     here as well.  */
+  /* Emit a notification about the thread being deleted.  */
 
   if (info_verbose)
     printf_unfiltered ("[Deleting %s]\n", target_pid_to_str (ptid).c_str ());
-  else if (print_thread_events && !main_thread_p)
+  else if (print_thread_events)
     printf_unfiltered (_("[%s exited with code %u]\n"),
 		       target_pid_to_str (ptid).c_str (),
 		       (unsigned) exit_code);
@@ -1411,8 +1407,7 @@  fake_create_process (void)
     = windows_add_thread (ptid_t (current_event.dwProcessId, 0,
 				  current_event.dwThreadId),
 			  current_event.u.CreateThread.hThread,
-			  current_event.u.CreateThread.lpThreadLocalBase,
-			  true /* main_thread_p */);
+			  current_event.u.CreateThread.lpThreadLocalBase);
   return main_thread_id;
 }
 
@@ -1584,8 +1579,7 @@  get_windows_debug_event (struct target_ops *ops,
       th = windows_add_thread
         (ptid_t (current_event.dwProcessId, 0, current_event.dwThreadId),
 	 current_event.u.CreateThread.hThread,
-	 current_event.u.CreateThread.lpThreadLocalBase,
-	 false /* main_thread_p */);
+	 current_event.u.CreateThread.lpThreadLocalBase);
 
       break;
 
@@ -1596,8 +1590,7 @@  get_windows_debug_event (struct target_ops *ops,
 		     "EXIT_THREAD_DEBUG_EVENT"));
       windows_delete_thread (ptid_t (current_event.dwProcessId, 0,
 				     current_event.dwThreadId),
-			     current_event.u.ExitThread.dwExitCode,
-			     false /* main_thread_p */);
+			     current_event.u.ExitThread.dwExitCode);
       th = &dummy_thread_info;
       break;
 
@@ -1617,8 +1610,7 @@  get_windows_debug_event (struct target_ops *ops,
         (ptid_t (current_event.dwProcessId, 0,
 		 current_event.dwThreadId),
 	 current_event.u.CreateProcessInfo.hThread,
-	 current_event.u.CreateProcessInfo.lpThreadLocalBase,
-	 true /* main_thread_p */);
+	 current_event.u.CreateProcessInfo.lpThreadLocalBase);
       thread_id = current_event.dwThreadId;
       break;
 
@@ -1637,11 +1629,11 @@  get_windows_debug_event (struct target_ops *ops,
       else if (saw_create == 1)
 	{
 	  windows_delete_thread (ptid_t (current_event.dwProcessId, 0,
-					 main_thread_id),
-				 0, true /* main_thread_p */);
+					 current_event.dwThreadId),
+				 0);
 	  ourstatus->kind = TARGET_WAITKIND_EXITED;
 	  ourstatus->value.integer = current_event.u.ExitProcess.dwExitCode;
-	  thread_id = main_thread_id;
+	  thread_id = current_event.dwThreadId;
 	}
       break;