Fix sporadic XFAILs in, gdb.threads/attach-many-short-lived-threads.exp

Message ID AS8P193MB128576058A1486DC7A6CABACE4382@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
State New
Headers
Series Fix sporadic XFAILs in, gdb.threads/attach-many-short-lived-threads.exp |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Bernd Edlinger March 31, 2024, 10:47 a.m. UTC
  This is about random test failures like those:

XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: attach (EPERM)
XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 7: attach (EPERM)
XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (EPERM)
XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 9: attach (EPERM)
XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 10: attach (EPERM)

The reason for this effect is apparently as follows:

There is a race condition when gdb tries to attach a thread but the
thread exits at the same time.  Normally when that happens the return
code of ptrace(PTRACE_ATTACH, x) is EPERM, which could also have other
reasons.  To detect the true reason, we try to open /proc/<pid>/status
which normally fails in that situation, but it may happen that the
fopen succeeds, and the thread disappears while reading the content,
then the read has the errno=ESRCH, use that as an indication that the
thread has exited between fopen and reading of the status file.
---
 gdb/nat/linux-procfs.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Thiago Jung Bauermann April 4, 2024, 1:41 a.m. UTC | #1
Hello Bernd,

Great fix! Just a couple of comments below:

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This is about random test failures like those:
>
> XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: attach (EPERM)
> XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 7: attach (EPERM)
> XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (EPERM)
> XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 9: attach (EPERM)
> XFAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 10: attach (EPERM)
>
> The reason for this effect is apparently as follows:
>
> There is a race condition when gdb tries to attach a thread but the
> thread exits at the same time.  Normally when that happens the return
> code of ptrace(PTRACE_ATTACH, x) is EPERM, which could also have other
> reasons.  To detect the true reason, we try to open /proc/<pid>/status
> which normally fails in that situation, but it may happen that the
> fopen succeeds, and the thread disappears while reading the content,
> then the read has the errno=ESRCH, use that as an indication that the
> thread has exited between fopen and reading of the status file.
> ---
>  gdb/nat/linux-procfs.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
> index e2086952ce6..36e0f5bf16a 100644
> --- a/gdb/nat/linux-procfs.c
> +++ b/gdb/nat/linux-procfs.c
> @@ -165,6 +165,9 @@ linux_proc_pid_is_gone (pid_t pid)
>      }
>    else if (have_state == 0)
>      {
> +      /* errno is ESRCH "No such process": assume thread has disappeared.  */

I found this behaviour mentioned in the kernel documentation:

https://www.kernel.org/doc/html/latest/filesystems/proc.html#process-specific-subdirectories

> Note that an open file descriptor to /proc/<pid> or to any of its
> contained files or subdirectories does not prevent <pid> being reused
> for some other process in the event that <pid> exits. Operations on
> open /proc/<pid> file descriptors corresponding to dead processes
> never act on any new process that the kernel may, through chance, have
> also assigned the process ID <pid>. Instead, operations on these FDs
> usually fail with ESRCH.

I think it would be useful to mention this link in the comment above.

> +      if (errno == ESRCH)
> +	return 1;
>        /* No "State:" line, assume thread is alive.  */
>        return 0;
>      }

With this patch applied on top of my patch series fixing attach to
zombie threads¹, I don't see these XFAILs anymore on an aarch64-linux
machine where previously I saw them on every run of this testcase. Nice!

I would even suggest removing the XFAIL from the testcase, if other
people can confirm similar results.

In any case:

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>

--
Thiago

¹ https://inbox.sourceware.org/gdb-patches/20240321231149.519549-1-thiago.bauermann@linaro.org/
  
Thiago Jung Bauermann April 4, 2024, 4:25 p.m. UTC | #2
Hello again,

Sorry, one more comment that occurred to me today.

Thiago Jung Bauermann <thiago.bauermann@linaro.org> writes:

> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
>> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
>> index e2086952ce6..36e0f5bf16a 100644
>> --- a/gdb/nat/linux-procfs.c
>> +++ b/gdb/nat/linux-procfs.c
>> @@ -165,6 +165,9 @@ linux_proc_pid_is_gone (pid_t pid)
>>      }
>>    else if (have_state == 0)
>>      {
>> +      /* errno is ESRCH "No such process": assume thread has disappeared.  */
>> +      if (errno == ESRCH)
>> +	return 1;
>>        /* No "State:" line, assume thread is alive.  */
>>        return 0;
>>      }
>
> With this patch applied on top of my patch series fixing attach to
> zombie threads¹, I don't see these XFAILs anymore on an aarch64-linux
> machine where previously I saw them on every run of this testcase. Nice!
>
> I would even suggest removing the XFAIL from the testcase, if other
> people can confirm similar results.
>
> In any case:
>
> Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>

Actually, thinking more about it I think it would be better if
linux_proc_pid_get_state checked for errno == ESRCH itself and returned
-1 in that case (and also warn if that parameter is true), instead of
making its caller do that check.

The function documentation says:

/* Fill in STATE, a buffer with BUFFER_SIZE bytes with the 'State'
   line of /proc/PID/status.  Returns -1 on failure to open the /proc
   file, 1 if the line is found, and 0 if not found.  If WARN, warn on
   failure to open the /proc file.  */

I think that getting the ESRCH error while reading is semantically
equivalent to failing to open the /proc file. Returning 0 when the line
wasn't found because of the ESRCH error adheres to the letter of that
comment but not to its spirit. :-)

--
Thiago
  
Bernd Edlinger April 5, 2024, 5 a.m. UTC | #3
On 4/4/24 18:25, Thiago Jung Bauermann wrote:
> 
> Hello again,
> 
> Sorry, one more comment that occurred to me today.
> 
> Thiago Jung Bauermann <thiago.bauermann@linaro.org> writes:
> 
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>
>>> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
>>> index e2086952ce6..36e0f5bf16a 100644
>>> --- a/gdb/nat/linux-procfs.c
>>> +++ b/gdb/nat/linux-procfs.c
>>> @@ -165,6 +165,9 @@ linux_proc_pid_is_gone (pid_t pid)
>>>      }
>>>    else if (have_state == 0)
>>>      {
>>> +      /* errno is ESRCH "No such process": assume thread has disappeared.  */
>>> +      if (errno == ESRCH)
>>> +	return 1;
>>>        /* No "State:" line, assume thread is alive.  */
>>>        return 0;
>>>      }
>>
>> With this patch applied on top of my patch series fixing attach to
>> zombie threads¹, I don't see these XFAILs anymore on an aarch64-linux
>> machine where previously I saw them on every run of this testcase. Nice!
>>
>> I would even suggest removing the XFAIL from the testcase, if other
>> people can confirm similar results.
>>
>> In any case:
>>
>> Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
> 
> Actually, thinking more about it I think it would be better if
> linux_proc_pid_get_state checked for errno == ESRCH itself and returned
> -1 in that case (and also warn if that parameter is true), instead of
> making its caller do that check.
> 
> The function documentation says:
> 
> /* Fill in STATE, a buffer with BUFFER_SIZE bytes with the 'State'
>    line of /proc/PID/status.  Returns -1 on failure to open the /proc
>    file, 1 if the line is found, and 0 if not found.  If WARN, warn on
>    failure to open the /proc file.  */
> 
> I think that getting the ESRCH error while reading is semantically
> equivalent to failing to open the /proc file. Returning 0 when the line
> wasn't found because of the ESRCH error adheres to the letter of that
> comment but not to its spirit. :-)
> 

The patch only works because linux_proc_pid_is_gone is just called
on two places, where errno == EPERM:

gdb/linux-nat.c:	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
gdbserver/linux-low.cc:	  || (err == EPERM && linux_proc_pid_is_gone (lwpid)))

so when errno is changed from EPERM->ESRCH and there was no Status line
found, then it is pretty clear what happened.

But linux_proc_pid_get_state is called from other places where errno may
be by chance already ESRCH.

I've spent some time reading the kernel sources, where the ESRCH return code
was first introduced with v2.6.18, but unless the ESRCH is returned, all
linux versions seem to always return the "Name:" followed by "State:" 
etc.

So actually I don't see under which condition a thread can be alive
when no "State:" line is found, as the comment here is probably
just wrong:

      /* No "State:" line, assume thread is alive.  */
      return 0;


Bernd.

> --
> Thiago
  
Thiago Jung Bauermann April 6, 2024, 3:40 a.m. UTC | #4
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 4/4/24 18:25, Thiago Jung Bauermann wrote:
>>
>> Hello again,
>>
>> Sorry, one more comment that occurred to me today.
>>
>> Thiago Jung Bauermann <thiago.bauermann@linaro.org> writes:
>>
>> Actually, thinking more about it I think it would be better if
>> linux_proc_pid_get_state checked for errno == ESRCH itself and returned
>> -1 in that case (and also warn if that parameter is true), instead of
>> making its caller do that check.
>>
>> The function documentation says:
>>
>> /* Fill in STATE, a buffer with BUFFER_SIZE bytes with the 'State'
>>    line of /proc/PID/status.  Returns -1 on failure to open the /proc
>>    file, 1 if the line is found, and 0 if not found.  If WARN, warn on
>>    failure to open the /proc file.  */
>>
>> I think that getting the ESRCH error while reading is semantically
>> equivalent to failing to open the /proc file. Returning 0 when the line
>> wasn't found because of the ESRCH error adheres to the letter of that
>> comment but not to its spirit. :-)
>>
>
> The patch only works because linux_proc_pid_is_gone is just called
> on two places, where errno == EPERM:
>
> gdb/linux-nat.c:	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
> gdbserver/linux-low.cc:	  || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
>
> so when errno is changed from EPERM->ESRCH and there was no Status line
> found, then it is pretty clear what happened.
>
> But linux_proc_pid_get_state is called from other places where errno may
> be by chance already ESRCH.

I don't understand why the previous errno value is relevant. The code
needs to make sure that the errno it is checking comes from the fgets
call.

Which actually shows a problem with this patch that I didn't notice
before: it either needs to set errno = 0 before calling fgets, or call
ferror after the fgets call and before checking errno.

> I've spent some time reading the kernel sources, where the ESRCH return code
> was first introduced with v2.6.18, but unless the ESRCH is returned, all
> linux versions seem to always return the "Name:" followed by "State:"
> etc.
>
> So actually I don't see under which condition a thread can be alive
> when no "State:" line is found, as the comment here is probably
> just wrong:
>
>       /* No "State:" line, assume thread is alive.  */
>       return 0;

Good point. At least since Linux 2.6.12-rc2 (the first commit in git
history) there's always a "State:"" line. I also can't think of a
scenario where the assumption in the comment is valid.

--
Thiago
  

Patch

diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
index e2086952ce6..36e0f5bf16a 100644
--- a/gdb/nat/linux-procfs.c
+++ b/gdb/nat/linux-procfs.c
@@ -165,6 +165,9 @@  linux_proc_pid_is_gone (pid_t pid)
     }
   else if (have_state == 0)
     {
+      /* errno is ESRCH "No such process": assume thread has disappeared.  */
+      if (errno == ESRCH)
+	return 1;
       /* No "State:" line, assume thread is alive.  */
       return 0;
     }