help debugging flaky posix/tst-waitid.c test

Message ID Z6d8PX0cdfqVuAAs@aurel32.net (mailing list archive)
State Under Review
Delegated to: Carlos O'Donell
Series: help debugging flaky posix/tst-waitid.c test

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed

Commit Message

Aurelien Jarno Feb. 8, 2025, 3:46 p.m. UTC
  Hi,

The posix/tst-waitid.c test appears to be flaky, although with a low
probability of failing, which is bad for continuous integration. Under
heavy load, it fails around once every 80000 runs in my tests, but
there is a way to increase the probability (see below). This happens at
least on aarch64, riscv64 and x86-64, so it does not seem architecture
specific at first glance.

When it happens, it fails with:

  tst-waitid.c:73: numeric comparison failure
     left: 0 (0x0); from: siginfo.si_status
    right: 19 (0x13); from: status
  error: 1 test failures

This happens for the test on lines 156-173, which basically:
- sends a SIGSTOP to the child
- waits for the child to appear in "T" state in /proc/$pid/status
- calls waitid with WEXITED|WSTOPPED on the child
- calls sigwaitinfo on the child

In that last call, the siginfo.si_status field for the child ends up
as 0 instead of SIGSTOP.

I have done some more tests with separate code, and it seems that this
happens when kill, waitid and sigwaitinfo are called in very quick
succession. My guess is that this behaviour is explained by the
following comment in that code:

  /* Wait the child stop.  The waitid call below will block until it has
     stopped, but if we are real quick and enter the waitid system call
     before the SIGCHLD has been generated, then it will be discarded and
     never delivered.  */

It seems that the call to support_process_state_wait (pid, stop_state)
should prevent that, but I wonder if a small race condition is left.
What I observed is that the 10ms nanosleep in
support_process_state_wait between checks of /proc/$pid/status makes the
issue less probable. In other words, the following patch increases the
chances for the tst-waitid test to fail:

--- a/support/support_process_state.c
+++ b/support/support_process_state.c
@@ -80,9 +80,6 @@ support_process_state_wait (pid_t pid, enum support_process_state state)
 
       rewind (fstatus);
       fflush (fstatus);
-
-      if (nanosleep (&(struct timespec) { 0, 10000000 }, NULL) != 0)
-	FAIL_EXIT1 ("nanosleep: %m");
     }
 
   free (line);


Of course using strace here just hides the issue. Any idea how to debug
or fix the issue?

Thanks
Aurelien
  

Comments

Adhemerval Zanella Feb. 12, 2025, 12:33 p.m. UTC | #1
On 08/02/25 12:46, Aurelien Jarno wrote:
> Hi,
> 
> The posix/tst-waitid.c appears to be flaky, although with a low
> probability to fail, but that's bad for continuous integration. Under
> heavy load, it crashes around once over 80000 times in my tests, but
> there is a way to increase the probability (see below). This happens at
> least on aarch64, riscv64 and x86-64, so it does not seems architecture
> specific at a first glance.
> 
> When it happens, it fails with:
> 
>   tst-waitid.c:73: numeric comparison failure
>      left: 0 (0x0); from: siginfo.si_status
>     right: 19 (0x13); from: status
>   error: 1 test failures
> 
> This happens for the test on lines 156-173, which basically:
> - send a SIGSTOP to the child
> - wait for the child to appear in "T" state in /proc/$pid/status
> - call waitid with WEXITED|WSTOPPED on the child
> - call sigwaitinfo on the child
> 
> In that later call, the siginfo.si_status field on the child ends-up
> with 0 instead of SIGSTOP.
> 
> I have done some more tests on a separate code and it seems that this
> happens when kill, waitid and sigwaitinfo are called very quickly after
> each other. My guess is that this behaviour is explained by the
> following comment in that code:
> 
>   /* Wait the child stop.  The waitid call below will block until it has
>      stopped, but if we are real quick and enter the waitid system call
>      before the SIGCHLD has been generated, then it will be discarded and
>      never delivered.  */
> 
> It seems that the call to support_process_state_wait (pid, stop_state)
> should prevent that, but I wonder if there is a small race condition
> left.  What I observed is that the 10ms nanosleep in
> support_process_state_wait between check of /proc/$pid/status makes the
> issue less probable. In other word, the following patch increases the
> chances for the tst-waitid test to fail:
> 
> --- a/support/support_process_state.c
> +++ b/support/support_process_state.c
> @@ -80,9 +80,6 @@ support_process_state_wait (pid_t pid, enum support_process_state state)
>  
>        rewind (fstatus);
>        fflush (fstatus);
> -
> -      if (nanosleep (&(struct timespec) { 0, 10000000 }, NULL) != 0)
> -	FAIL_EXIT1 ("nanosleep: %m");
>      }
>  
>    free (line);
> 
> Of course using strace here just hides the issue. Any idea how to debug
> or fix the issue?

It seems that we can't really rely on siginfo_t::si_status being SIGSTOP
after a waitid call that reports the child as stopped (CLD_STOPPED). Skimming
through kernel code, it seems that for CLD_STOPPED the si_status is obtained
from tsk->signal->group_exit_code (kernel/signal.c - do_notify_parent_cldstop),
but I can't really tell how it can be zero in this scenario.

The tools/testing/selftests/pidfd/pidfd_wait.c test also checks
CLD_STOPPED, but it does not check si_status.  I think we should ignore
si_status for now.
  
Aurelien Jarno Feb. 13, 2025, 10:11 p.m. UTC | #2
On 2025-02-12 09:33, Adhemerval Zanella Netto wrote:
> 
> 
> On 08/02/25 12:46, Aurelien Jarno wrote:
> > Hi,
> > 
> > The posix/tst-waitid.c appears to be flaky, although with a low
> > probability to fail, but that's bad for continuous integration. Under
> > heavy load, it crashes around once over 80000 times in my tests, but
> > there is a way to increase the probability (see below). This happens at
> > least on aarch64, riscv64 and x86-64, so it does not seems architecture
> > specific at a first glance.
> > 
> > When it happens, it fails with:
> > 
> >   tst-waitid.c:73: numeric comparison failure
> >      left: 0 (0x0); from: siginfo.si_status
> >     right: 19 (0x13); from: status
> >   error: 1 test failures
> > 
> > This happens for the test on lines 156-173, which basically:
> > - send a SIGSTOP to the child
> > - wait for the child to appear in "T" state in /proc/$pid/status
> > - call waitid with WEXITED|WSTOPPED on the child
> > - call sigwaitinfo on the child
> > 
> > In that later call, the siginfo.si_status field on the child ends-up
> > with 0 instead of SIGSTOP.
> > 
> > I have done some more tests on a separate code and it seems that this
> > happens when kill, waitid and sigwaitinfo are called very quickly after
> > each other. My guess is that this behaviour is explained by the
> > following comment in that code:
> > 
> >   /* Wait the child stop.  The waitid call below will block until it has
> >      stopped, but if we are real quick and enter the waitid system call
> >      before the SIGCHLD has been generated, then it will be discarded and
> >      never delivered.  */
> > 
> > It seems that the call to support_process_state_wait (pid, stop_state)
> > should prevent that, but I wonder if there is a small race condition
> > left.  What I observed is that the 10ms nanosleep in
> > support_process_state_wait between check of /proc/$pid/status makes the
> > issue less probable. In other word, the following patch increases the
> > chances for the tst-waitid test to fail:
> > 
> > --- a/support/support_process_state.c
> > +++ b/support/support_process_state.c
> > @@ -80,9 +80,6 @@ support_process_state_wait (pid_t pid, enum support_process_state state)
> >  
> >        rewind (fstatus);
> >        fflush (fstatus);
> > -
> > -      if (nanosleep (&(struct timespec) { 0, 10000000 }, NULL) != 0)
> > -	FAIL_EXIT1 ("nanosleep: %m");
> >      }
> >  
> >    free (line);
> > 
> > Of course using strace here just hides the issue. Any idea how to debug
> > or fix the issue?
> 
> It seems that we can't really rely on siginfo_t::si_status being SIGSTOP
> after a waitid call that returns info.si_status equal to SIGSTOP. Skimming
> through kernel code, it seems that for CLD_STOPPED the si_status is obtained
> from tsk->signal->group_exit_code (kernel/signal.c - do_notify_parent_cldstop),
> but I can't really tell why on how it can be zero in this scenario.
> 
> The tools/testing/selftests/pidfd/pidfd_wait.c also checks CLD_STOPPED,
> but it does not check si_status.  I think we should ignore it for now.

Thanks for your feedback and for looking at that.

I have done further tests, and from them and from the comment in
tst-waitid.c, siginfo_t::si_status ends up being 0 if waitid() is called
before the SIGCHLD has been generated.

The test waits for the child to switch to the stopped state, but it
seems the SIGCHLD is generated a tiny bit later. So there is a tiny
window of time where the child is stopped but calling waitid followed by
sigwaitinfo will return siginfo_t::si_status as 0.

In that regard, the following patch is enough to fix the issue for me
and reliably gets siginfo_t::si_status to be SIGSTOP (tested over 10
million executions):

--- a/posix/tst-waitid.c
+++ b/posix/tst-waitid.c
@@ -162,6 +162,7 @@ do_test_waitd_common (idtype_t type, pid_t pid)
      before the SIGCHLD has been generated, then it will be discarded and
      never delivered.  */
   support_process_state_wait (pid, stop_state);
+  nanosleep (&(struct timespec) { 0, 10000000 }, NULL);
 
   fail = waitid (type, pid, &info, WEXITED|WSTOPPED);
   TEST_COMPARE (fail, 0);

However, using a fixed delay in a test is never satisfactory, and maybe
the best option is to stop checking that info.si_status equals SIGSTOP.
  
Adhemerval Zanella Feb. 14, 2025, 11:46 a.m. UTC | #3
On 13/02/25 19:11, Aurelien Jarno wrote:
> On 2025-02-12 09:33, Adhemerval Zanella Netto wrote:
>>
>>
>> On 08/02/25 12:46, Aurelien Jarno wrote:
>>> Hi,
>>>
>>> The posix/tst-waitid.c appears to be flaky, although with a low
>>> probability to fail, but that's bad for continuous integration. Under
>>> heavy load, it crashes around once over 80000 times in my tests, but
>>> there is a way to increase the probability (see below). This happens at
>>> least on aarch64, riscv64 and x86-64, so it does not seems architecture
>>> specific at a first glance.
>>>
>>> When it happens, it fails with:
>>>
>>>   tst-waitid.c:73: numeric comparison failure
>>>      left: 0 (0x0); from: siginfo.si_status
>>>     right: 19 (0x13); from: status
>>>   error: 1 test failures
>>>
>>> This happens for the test on lines 156-173, which basically:
>>> - send a SIGSTOP to the child
>>> - wait for the child to appear in "T" state in /proc/$pid/status
>>> - call waitid with WEXITED|WSTOPPED on the child
>>> - call sigwaitinfo on the child
>>>
>>> In that later call, the siginfo.si_status field on the child ends-up
>>> with 0 instead of SIGSTOP.
>>>
>>> I have done some more tests on a separate code and it seems that this
>>> happens when kill, waitid and sigwaitinfo are called very quickly after
>>> each other. My guess is that this behaviour is explained by the
>>> following comment in that code:
>>>
>>>   /* Wait the child stop.  The waitid call below will block until it has
>>>      stopped, but if we are real quick and enter the waitid system call
>>>      before the SIGCHLD has been generated, then it will be discarded and
>>>      never delivered.  */
>>>
>>> It seems that the call to support_process_state_wait (pid, stop_state)
>>> should prevent that, but I wonder if there is a small race condition
>>> left.  What I observed is that the 10ms nanosleep in
>>> support_process_state_wait between check of /proc/$pid/status makes the
>>> issue less probable. In other word, the following patch increases the
>>> chances for the tst-waitid test to fail:
>>>
>>> --- a/support/support_process_state.c
>>> +++ b/support/support_process_state.c
>>> @@ -80,9 +80,6 @@ support_process_state_wait (pid_t pid, enum support_process_state state)
>>>  
>>>        rewind (fstatus);
>>>        fflush (fstatus);
>>> -
>>> -      if (nanosleep (&(struct timespec) { 0, 10000000 }, NULL) != 0)
>>> -	FAIL_EXIT1 ("nanosleep: %m");
>>>      }
>>>  
>>>    free (line);
>>>
>>> Of course using strace here just hides the issue. Any idea how to debug
>>> or fix the issue?
>>
>> It seems that we can't really rely on siginfo_t::si_status being SIGSTOP
>> after a waitid call that returns info.si_status equal to SIGSTOP. Skimming
>> through kernel code, it seems that for CLD_STOPPED the si_status is obtained
>> from tsk->signal->group_exit_code (kernel/signal.c - do_notify_parent_cldstop),
>> but I can't really tell why on how it can be zero in this scenario.
>>
>> The tools/testing/selftests/pidfd/pidfd_wait.c also checks CLD_STOPPED,
>> but it does not check si_status.  I think we should ignore it for now.
> 
> Thanks for your feedback and for looking at that.
> 
> I have done further test, and from them and from the comment in
> tst-waitid.c, siginfo_t::si_status ends up being 0 if waitid() is called
> before SIGCHLD has been generated.
> 
> The test is waiting for the child to switch to CLD_STOPPED, but it seems
> the SIGCHLD is generated a tiny bit after. So there is a tiny window of
> time where the child is stopped but calling waitid followed by
> sigwaitinfo will return siginfo_t::si_status as 0.
> 
> In that regard the following patch is enough to fix the issue for me and
> reliably get siginfo_t::si_status being SIGSTOP (tested on 10 millions
> executions):
> 
> --- a/posix/tst-waitid.c
> +++ b/posix/tst-waitid.c
> @@ -162,6 +162,7 @@ do_test_waitd_common (idtype_t type, pid_t pid)
>       before the SIGCHLD has been generated, then it will be discarded and
>       never delivered.  */
>    support_process_state_wait (pid, stop_state);
> +  nanosleep (&(struct timespec) { 0, 10000000 }, NULL);
>  
>    fail = waitid (type, pid, &info, WEXITED|WSTOPPED);
>    TEST_COMPARE (fail, 0);
> 
> However using a fixed delay in a test is never satisfactory and maybe
> the best is to stop checking that info.si_status equal to SIGSTOP.
> 

I think a better way would be to just remove the si_status check if we
cannot guarantee the required synchronization.  Sleep-based workarounds
like this sometimes need to be reevaluated when new chips and/or kernel
versions appear.
  

Patch

--- a/support/support_process_state.c
+++ b/support/support_process_state.c
@@ -80,9 +80,6 @@  support_process_state_wait (pid_t pid, enum support_process_state state)
 
       rewind (fstatus);
       fflush (fstatus);
-
-      if (nanosleep (&(struct timespec) { 0, 10000000 }, NULL) != 0)
-	FAIL_EXIT1 ("nanosleep: %m");
     }
 
   free (line);