Fix AIX core dump while main thread exits.
Checks
Context | Check | Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 | success | Build passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm | success | Build passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm | success | Test passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 | success | Test passed
Commit Message
From: Aditya Vidyadhar Kamath <aditya.kamath1@ibm.com>
Consider the test case:
#include <iostream>
#include <pthread.h>
#include <unistd.h>

void *thread_main(void *) {
  std::cout << getpid() << std::endl;
  sleep(20);
  return nullptr;
}

int main(void) {
  pthread_t thread;
  pthread_create(&thread, nullptr, thread_main, nullptr);
  pthread_join(thread, nullptr);
  return 0;
}
In this program, main creates a second thread that sleeps for 20 seconds.
When we debug it with GDB we get:
Reading symbols from ./test...
(gdb) b main
Breakpoint 1 at 0x10000934: file test.c, line 11.
(gdb) r
Starting program: /read_only_gdb/binutils-gdb/gdb/test
Breakpoint 1, main () at test.c:11
11 pthread_create(&thread, nullptr, thread_main, nullptr);
(gdb) c
Continuing.
15335884
[New Thread 258 (tid 31130079)]
Thread 2 received signal SIGINT, Interrupt.
[Switching to Thread 258 (tid 31130079)]
0xd0611d70 in _p_nsleep () from /usr/lib/libpthread.a(_shr_xpg5.o)
(gdb) thread 1
[Switching to thread 1 (Thread 1 (tid 25493845))]
(gdb) c
Continuing.
[Thread 1 (tid 25493845) exited]
[Thread 258 (tid 31130079) exited]
inferior.c:405: internal-error: find_inferior_pid: Assertion `pid != 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
There are two bugs here. One is the core dump. The other is that the main thread's
information is not captured.
While debugging the first one, I traced the cause to
the last for loop in sync_threadlists ().
Once both threads exit, we delete them as below:
for (struct thread_info *it : all_threads ())
  {
    aix_thread_info *priv = get_aix_thread_info (it);
    if (in_queue_threads.count (priv->pdtid) == 0
        && in_thread_list (proc_target, it->ptid)
        && pid == it->ptid.pid ())
      {
        delete_thread (it);
        data->exited_threads.insert (priv->pdtid);
      }
  }
But once these two threads are deleted, all_threads ()
has one more thread whose tid and pid are 0.
(gdb) c
Continuing.
In for loop 8782296 is pid, 19857879 is tid
[Thread 1 (tid 19857879) exited]
In for loop 8782296 is pid, 30933401 is tid
[Thread 258 (tid 30933401) exited]
In for loop 0 is pid, 0 is tid
[Inferior 1 (process 8782296) exited normally]
(gdb) q
I added a printf to the for loop mentioned above to show this.
As you can see, the loop is entered a third time with 0 as the pid.
Hence I propose breaking out of the loop when the pid is 0, that is,
when the process itself has completed execution.
Kindly let me know if this is okay.
The second part of the bug is the missing information for the main thread.
Andrew was right here (https://sourceware.org/pipermail/gdb-patches/2024-September/211875.html)
Thank you, Andrew.
The thread has loaded, but the ptrace () call made from fetch_regs_kernel_thread ()
failed, with errno set to EPERM.
if (!ptrace32 (PTT_READ_GPRS, tid, (uintptr_t) gprs32, 0, NULL))
memset (gprs32, 0, sizeof (gprs32));
Hence all registers were set to 0 and we did not get the required information.
(gdb) thread 1
[Switching to thread 1 (Thread 1 (tid 31916423))]
(gdb) info registers
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
r16 0x0 0
r17 0x0 0
r18 0x0 0
r19 0x0 0
r20 0x0 0
r21 0x0 0
r22 0x0 0
r23 0x0 0
r24 0x0 0
r25 0x0 0
r26 0x0 0
r27 0x0 0
r28 0x0 0
r29 0x0 0
r30 0x0 0
r31 0x0 0
pc 0x0 0x0
msr 0x0 0
cr 0x0 0
lr 0x0 0x0
ctr 0x0 0
xer 0x0 0
fpscr 0x0 0
vscr 0x0 0
vrsave 0x0 0
(gdb) c
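The effect of the fallback quoted above can be reproduced in isolation. The following is only a sketch, not AIX or GDB code: ptrace32 () and PTT_READ_GPRS are AIX-specific, so stub_read_gprs_fails () is a hypothetical stand-in that always fails with EPERM, and fetch_gprs () with its bool return value is an illustrative name and signature, not part of aix-thread.c.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstring>

constexpr int NUM_GPRS = 32;

// Hypothetical stand-in for ptrace32 (PTT_READ_GPRS, ...).  It never
// touches the buffer, sets errno to EPERM and returns 0 (failure),
// which is the behaviour reported for the main thread.
int
stub_read_gprs_fails (std::uint32_t *)
{
  errno = EPERM;
  return 0;
}

// Mirrors the quoted fallback: if the read fails, the buffer is zeroed.
// The bool result is an addition for illustration, so a caller can tell
// "registers really are 0" apart from "registers could not be read".
bool
fetch_gprs (int (*read_gprs) (std::uint32_t *),
            std::uint32_t (&gprs32)[NUM_GPRS])
{
  if (!read_gprs (gprs32))
    {
      std::memset (gprs32, 0, sizeof (gprs32));
      return false;
    }
  return true;
}
```

With the failing stub, every entry of the buffer ends up as 0, which matches the all-zero "info registers" output shown above.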
For some reason the main thread is in kernel mode and I am not able
to read its register contents no matter what I try.
If any other target has faced this type of issue and/or handled this situation
differently, then kindly let me know. I will also coordinate with the kernel folks
on potential solutions.
Kindly let me know your opinion about this patch, at least for the first part.
---
gdb/aix-thread.c | 2 ++
1 file changed, 2 insertions(+)
Comments
Respected community members,
Hi,
This is a patch to fix a core dump that occurred while we were debugging threads using GDB on AIX.
Kindly give me feedback on it.
Thanks and regards,
Aditya.
From: Aditya Vidyadhar Kamath <akamath996@gmail.com>
Date: Monday, 28 October 2024 at 3:57 PM
To: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Cc: gdb-patches@sourceware.org <gdb-patches@sourceware.org>, Aditya Kamath <Aditya.Kamath1@ibm.com>, SANGAMESH MALLAYYA <sangamesh.swamy@in.ibm.com>, Aditya Kamath <Aditya.Kamath1@ibm.com>
Subject: [EXTERNAL] [PATCH] Fix AIX core dump while main thread exits.
---
gdb/aix-thread.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/gdb/aix-thread.c b/gdb/aix-thread.c
index 9e6952b974f..94ad0a2d90a 100644
--- a/gdb/aix-thread.c
+++ b/gdb/aix-thread.c
@@ -856,6 +856,8 @@ sync_threadlists (pid_t pid)
is to manually delete such threads. */
for (struct thread_info *it : all_threads ())
{
+ if (it->ptid.pid () == 0)
+ break;
aix_thread_info *priv = get_aix_thread_info (it);
if (in_queue_threads.count (priv->pdtid) == 0
&& in_thread_list (proc_target, it->ptid)
--
2.41.0
Aditya Vidyadhar Kamath <akamath996@gmail.com> wrote:
> for (struct thread_info *it : all_threads ())
> {
>+ if (it->ptid.pid () == 0)
>+ break;
> aix_thread_info *priv = get_aix_thread_info (it);
> if (in_queue_threads.count (priv->pdtid) == 0
> && in_thread_list (proc_target, it->ptid)
This looks a bit suspicious to me. Why is that thread with
PID 0 in the list to begin with? We probably should rather
fix the problem by not adding that thread to the list in the
first place ... (Or maybe the thread should be there and
we just got the PID wrong for some reason?)
Bye,
Ulrich
Respected Ulrich and community members,
Hi,
>This looks a bit suspicious to me. Why is that thread with
>PID 0 in the list to begin with? We probably should rather
>fix the problem by not adding that thread to the list in the
>first place
I debugged further and we now know why.
When delete_thread () is called on a thread that has terminated/exited [with status PST_TERM/PST_UNKNOWN] and that thread is the current thread, GDB does not delete it, because the current thread is not deletable.
In thread.c we see:
bool
thread_info::deletable () const
{
/* If this is the current thread, or there's code out there that
relies on it existing (refcount > 0) we can't delete yet. */
return refcount () == 0 && !is_current_thread (this);
}
So what happens is delete_thread () -> delete_thread_1 (), which returns early here:
if (!thr->deletable ())
{
/* Will be really deleted some other time. */
return;
}
As a result, the main thread in the use case below was never removed; it was only marked as exited by set_thread_exited ().
The other targets (as far as I checked) do not call delete_thread () while iterating over all_threads () in a for loop. Only AIX does, hence we hit this.
So we run the program below. A new thread is created in addition to the main thread. We switch to the main thread and continue. Both threads exit, which is handled in sync_threadlists (), but the main thread is only reported as exited and is not actually deleted.
So its pid and tid are 0, and the for loop is entered a third time. When we then check whether this thread is in the thread list, GDB crashes, and rightly so, since find_inferior_pid should never have been called with pid 0.
for (struct thread_info *it : all_threads ())
  {
    /* Temporary debug output; ptid.lwp () returns a long.  */
    printf ("Hii 2 %ld is tid \n", it->ptid.lwp ());
    aix_thread_info *priv = get_aix_thread_info (it);
    if (in_queue_threads.count (priv->pdtid) == 0
        && in_thread_list (proc_target, it->ptid)
        && pid == it->ptid.pid ())
      {
        delete_thread (it);
        data->exited_threads.insert (priv->pdtid);
      }
  }
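To make the deferred deletion concrete, here is a minimal toy model of the behaviour described above. It is not GDB code: toy_thread, toy_delete_thread () and toy_delete_all () are hypothetical names, and the model only reproduces the two facts quoted from thread.c, namely that a thread is deletable only when its refcount is 0 and it is not the current thread, and that a non-deletable thread is marked exited but left in the list.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Toy stand-in for GDB's thread_info (hypothetical, heavily simplified).
struct toy_thread
{
  long tid = 0;
  int refcount = 0;
  bool is_current = false;
  bool exited = false;

  // Same condition as the thread_info::deletable () quoted from thread.c.
  bool deletable () const { return refcount == 0 && !is_current; }
};

using toy_thread_list = std::list<toy_thread>;

// Models delete_thread (): the thread is always marked exited, but it is
// unlinked from the list only if it is deletable.  Returns the iterator
// that follows THR so the caller can keep walking the list.
toy_thread_list::iterator
toy_delete_thread (toy_thread_list &threads, toy_thread_list::iterator thr)
{
  thr->exited = true;
  if (!thr->deletable ())
    return std::next (thr);   // "Will be really deleted some other time."
  return threads.erase (thr);
}

// Tries to delete every thread; returns how many entries remain.
std::size_t
toy_delete_all (toy_thread_list &threads)
{
  for (auto it = threads.begin (); it != threads.end ();)
    it = toy_delete_thread (threads, it);
  return threads.size ();
}
```

With a current main thread and one worker in the list, toy_delete_all () leaves exactly one entry behind: the main thread, marked exited but still linked, which is the leftover entry the loop in sync_threadlists () visits again.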
If delete_thread () is indeed not able to delete the current thread, how should we fix this? Should we touch the GDB core code, or handle this scenario differently so that we skip the main/current thread?
Kindly let me know your opinion about this, and we can proceed with a fix from there.
Have a nice day ahead.
Thanks and regards,
Aditya.
==========================================================
Consider the test case:
#include <iostream>
#include <pthread.h>
#include <unistd.h>

void *thread_main(void *) {
  std::cout << getpid() << std::endl;
  sleep(20);
  return nullptr;
}

int main(void) {
  pthread_t thread;
  pthread_create(&thread, nullptr, thread_main, nullptr);
  pthread_join(thread, nullptr);
  return 0;
}
This program creates a thread via main that sleeps for 20 seconds.
From: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Date: Monday, 28 October 2024 at 9:56 PM
To: akamath996@gmail.com <akamath996@gmail.com>
Cc: gdb-patches@sourceware.org <gdb-patches@sourceware.org>, Aditya Kamath <Aditya.Kamath1@ibm.com>, SANGAMESH MALLAYYA <sangamesh.swamy@in.ibm.com>
Subject: Re: [PATCH] Fix AIX core dump while main thread exits.
Aditya Vidyadhar Kamath <akamath996@gmail.com> wrote:
> for (struct thread_info *it : all_threads ())
> {
>+ if (it->ptid.pid () == 0)
>+ break;
> aix_thread_info *priv = get_aix_thread_info (it);
> if (in_queue_threads.count (priv->pdtid) == 0
> && in_thread_list (proc_target, it->ptid)
This looks a bit suspicious to me. Why is that thread with
PID 0 in the list to begin with? We probably should rather
fix the problem by not adding that thread to the list in the
first place ... (Or maybe the thread should be there and
we just got the PID wrong for some reason?)
Bye,
Ulrich
Aditya Kamath <Aditya.Kamath1@ibm.com> wrote:
>The other targets (As far as I checked) do not delete_thread () when
>using all_threads () in a for loop.
Ah, I see. This is indeed not supported. If you use delete_thread ()
in the loop, you have to use the all_threads_safe () iterator
instead of all_threads ().
Can you verify whether this fixes the problem?
Bye,
Ulrich
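The general idea behind a "safe" iteration like this can be sketched on a plain std::list. This is not GDB's implementation of all_threads_safe (), and for_each_safe () and erase_even () are hypothetical names used only for illustration: the point is that the successor is fetched before the loop body runs, so the body may remove the element it was handed without invalidating the traversal.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Calls BODY (items, it) for every element of ITEMS.  The next position
// is remembered before BODY runs, so BODY is allowed to erase the
// element referred to by IT (but no other element).
template <typename T, typename Body>
void
for_each_safe (std::list<T> &items, Body body)
{
  auto it = items.begin ();
  while (it != items.end ())
    {
      auto next = std::next (it);
      body (items, it);
      it = next;
    }
}

// Example use: erase every even number while walking the list.
std::vector<int>
erase_even (std::list<int> items)
{
  for_each_safe (items,
                 [] (std::list<int> &l, std::list<int>::iterator it)
                 {
                   if (*it % 2 == 0)
                     l.erase (it);
                 });
  return std::vector<int> (items.begin (), items.end ());
}
```

A plain range-based for loop would instead advance from the element that was just erased, which is undefined behaviour for std::list and presumably the same kind of hazard as calling delete_thread () inside a loop over all_threads ().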
Respected Ulrich and community members,
Hi,
>Ah, I see. This is indeed not supported. If you use delete_thread ()
>in the loop, you have to use the all_threads_safe () iterator
>instead of all_threads ().
>Can you verify whether this fixes the problem?
Yes, this fixes the problem. I have emailed a v2 of this patch requesting approval.
Have a nice day ahead.
Thanks and regards,
Aditya.
From: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Date: Wednesday, 30 October 2024 at 6:26 PM
To: akamath996@gmail.com <akamath996@gmail.com>, Aditya Kamath <Aditya.Kamath1@ibm.com>
Cc: gdb-patches@sourceware.org <gdb-patches@sourceware.org>, SANGAMESH MALLAYYA <sangamesh.swamy@in.ibm.com>
Subject: Re: [PATCH] Fix AIX core dump while main thread exits.
Aditya Kamath <Aditya.Kamath1@ibm.com> wrote:
>The other targets (As far as I checked) do not delete_thread () when
>using all_threads () in a for loop.
Ah, I see. This is indeed not supported. If you use delete_thread ()
in the loop, you have to use the all_threads_safe () iterator
instead of all_threads ().
Can you verify whether this fixes the problem?
Bye,
Ulrich