posix: Do not use WNOHANG in waitpid call for Linux posix_spawn

Message ID 1508705517-31558-1-git-send-email-adhemerval.zanella@linaro.org
State Dropped
Headers

Commit Message

Adhemerval Zanella Oct. 22, 2017, 8:51 p.m. UTC
  As shown in some buildbot issues on aarch64 and powerpc, calling
clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
is ready to be collected.  This patch changes the call back to 0
as before fe05e1cb6d64 fix.

This change can lead to the scenario 4.3 described in the commit,
where the waitpid call can hang undefinitely on the call.  However
this is also a very unlikely and also undefinied situation where
both the caller is trying to terminate a pid before posix_spawn
returns and the race pid reuse is triggered.  I don't see how to
correct handle this specific situation within posix_spawn.

Checked on x86_64-linux-gnu, aarch64-linux-gnu and
powerpc64-linux-gnu.

	* sysdeps/unix/sysv/linux/spawni.c (__spawnix): Use 0 instead of
	WNOHANG in waitpid call.
---
 ChangeLog                        |  5 +++++
 sysdeps/unix/sysv/linux/spawni.c | 10 +++++-----
 2 files changed, 10 insertions(+), 5 deletions(-)
  

Comments

Florian Weimer Oct. 23, 2017, 6:32 a.m. UTC | #1
On 10/22/2017 10:51 PM, Adhemerval Zanella wrote:
> As shown in some buildbot issues on aarch64 and powerpc, calling
> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
> is ready to be collected.  This patch changes the call back to 0
> as before fe05e1cb6d64 fix.

I see it on x86-64, too.  It does look like a kernel bug.

> This change can lead to the scenario 4.3 described in the commit,
> where the waitpid call can hang undefinitely on the call.  However
> this is also a very unlikely and also undefinied situation where
> both the caller is trying to terminate a pid before posix_spawn
> returns and the race pid reuse is triggered.  I don't see how to
> correct handle this specific situation within posix_spawn.

Agreed.  I wish we could do better here, but it seems we can't.

Thanks,
Florian
  
Szabolcs Nagy Oct. 23, 2017, 10:38 a.m. UTC | #2
On 23/10/17 07:32, Florian Weimer wrote:
> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote:
>> As shown in some buildbot issues on aarch64 and powerpc, calling
>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
>> is ready to be collected.  This patch changes the call back to 0
>> as before fe05e1cb6d64 fix.
> 
> I see it on x86-64, too.  It does look like a kernel bug.
> 
>> This change can lead to the scenario 4.3 described in the commit,
>> where the waitpid call can hang undefinitely on the call.  However
>> this is also a very unlikely and also undefinied situation where
>> both the caller is trying to terminate a pid before posix_spawn
>> returns and the race pid reuse is triggered.  I don't see how to
>> correct handle this specific situation within posix_spawn.
> 
> Agreed.  I wish we could do better here, but it seems we can't.
> 

musl writes a close-on-exec pipe in the child on error,
reading it in the parent tells if the child died before
exec or not. (so the waitpid can be made precise)

VFORK is not precise (i think under ptrace parent can
continue before child execs) so using close-on-exec
fd is a better way to sync with exec.
  
Florian Weimer Oct. 23, 2017, 10:40 a.m. UTC | #3
On 10/23/2017 12:38 PM, Szabolcs Nagy wrote:
> On 23/10/17 07:32, Florian Weimer wrote:
>> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote:
>>> As shown in some buildbot issues on aarch64 and powerpc, calling
>>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
>>> is ready to be collected.  This patch changes the call back to 0
>>> as before fe05e1cb6d64 fix.
>>
>> I see it on x86-64, too.  It does look like a kernel bug.
>>
>>> This change can lead to the scenario 4.3 described in the commit,
>>> where the waitpid call can hang undefinitely on the call.  However
>>> this is also a very unlikely and also undefinied situation where
>>> both the caller is trying to terminate a pid before posix_spawn
>>> returns and the race pid reuse is triggered.  I don't see how to
>>> correct handle this specific situation within posix_spawn.
>>
>> Agreed.  I wish we could do better here, but it seems we can't.
>>
> 
> musl writes a close-on-exec pipe in the child on error,
> reading it in the parent tells if the child died before
> exec or not. (so the waitpid can be made precise)

How so?  A wildcard wait can still collect the PID, which is the problem 
we were trying to defend against with the WNOHANG flag.

Thanks,
Florian
  
Szabolcs Nagy Oct. 23, 2017, 10:46 a.m. UTC | #4
On 23/10/17 11:40, Florian Weimer wrote:
> On 10/23/2017 12:38 PM, Szabolcs Nagy wrote:
>> On 23/10/17 07:32, Florian Weimer wrote:
>>> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote:
>>>> As shown in some buildbot issues on aarch64 and powerpc, calling
>>>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
>>>> is ready to be collected.  This patch changes the call back to 0
>>>> as before fe05e1cb6d64 fix.
>>>
>>> I see it on x86-64, too.  It does look like a kernel bug.
>>>
>>>> This change can lead to the scenario 4.3 described in the commit,
>>>> where the waitpid call can hang undefinitely on the call.  However
>>>> this is also a very unlikely and also undefinied situation where
>>>> both the caller is trying to terminate a pid before posix_spawn
>>>> returns and the race pid reuse is triggered.  I don't see how to
>>>> correct handle this specific situation within posix_spawn.
>>>
>>> Agreed.  I wish we could do better here, but it seems we can't.
>>>
>>
>> musl writes a close-on-exec pipe in the child on error,
>> reading it in the parent tells if the child died before
>> exec or not. (so the waitpid can be made precise)
> 
> How so?  A wildcard wait can still collect the PID, which is the problem we were trying to defend against with
> the WNOHANG flag.
> 

ah i misunderstood the problem, i have no solution to that.
(that just tells us that doing wild card wait is a bug,
may be we should remove that from the test case?)
  
Adhemerval Zanella Oct. 23, 2017, 1:41 p.m. UTC | #5
On 23/10/2017 08:38, Szabolcs Nagy wrote:
> On 23/10/17 07:32, Florian Weimer wrote:
>> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote:
>>> As shown in some buildbot issues on aarch64 and powerpc, calling
>>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
>>> is ready to be collected.  This patch changes the call back to 0
>>> as before fe05e1cb6d64 fix.
>>
>> I see it on x86-64, too.  It does look like a kernel bug.
>>
>>> This change can lead to the scenario 4.3 described in the commit,
>>> where the waitpid call can hang undefinitely on the call.  However
>>> this is also a very unlikely and also undefinied situation where
>>> both the caller is trying to terminate a pid before posix_spawn
>>> returns and the race pid reuse is triggered.  I don't see how to
>>> correct handle this specific situation within posix_spawn.
>>
>> Agreed.  I wish we could do better here, but it seems we can't.
>>
> 
> musl writes a close-on-exec pipe in the child on error,
> reading it in the parent tells if the child died before
> exec or not. (so the waitpid can be made precise)
> 
> VFORK is not precise (i think under ptrace parent can
> continue before child execs) so using close-on-exec
> fd is a better way to sync with exec.
> 

That was my first approach and Rasmus Villemoes proposed to use
CLONE_VFORK to improve the error reporting on 4b4d4056bb1 [1].

Rich raised this very issue back then [2] and Andreas replied
that with recent supported kernels this is not an issue [3]
(at least on CentOS 6 it seems to work properly).

Also, this approach have the advantage of using less resources
and fewer chanches for the spawn to fail (due EMFILE).

[1] https://sourceware.org/ml/libc-alpha/2016-09/msg00360.html
[2] https://sourceware.org/ml/libc-alpha/2016-09/msg00559.html
[3] https://sourceware.org/ml/libc-alpha/2016-09/msg00561.html
  
Tulio Magno Quites Machado Filho Oct. 23, 2017, 1:50 p.m. UTC | #6
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> As shown in some buildbot issues on aarch64 and powerpc, calling
> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
> is ready to be collected.  This patch changes the call back to 0
> as before fe05e1cb6d64 fix.
>
> This change can lead to the scenario 4.3 described in the commit,
> where the waitpid call can hang undefinitely on the call.  However
> this is also a very unlikely and also undefinied situation where
> both the caller is trying to terminate a pid before posix_spawn
> returns and the race pid reuse is triggered.  I don't see how to
> correct handle this specific situation within posix_spawn.
>
> Checked on x86_64-linux-gnu, aarch64-linux-gnu and
> powerpc64-linux-gnu.

Tested on powerpc64le too and I can't reproduce the issue anymore.

Thanks!
  
Pedro Alves Oct. 23, 2017, 3:11 p.m. UTC | #7
On 10/23/2017 11:38 AM, Szabolcs Nagy wrote:

> VFORK is not precise (i think under ptrace parent can
> continue before child execs) 

It can not, modulo ancient kernel bugs, perhaps.  Even if the
ptracer PTRACE_CONTs the parent, the parent remains frozen
in "D (disk sleep)" until the child execs or dies, at which
point the ptracer is notified with a PTRACE_EVENT_VFORK_DONE
event, if enabled with PTRACE_O_TRACEVFORKDONE.

(GDB has to be careful to avoid resuming a vfork parent if not
resuming the child, otherwise the whole debug session deadlocks,
and Ctrl-C won't unblock it (because the parent/child's process
group would be made the foreground process group on resumption
and that's where the kernel would be sending the SIGINT.  Since
the parent nor the child are scheduled, the SIGINT remains
pending forever.)

Now, if a multi-threaded program vforks, then all other
non-parent threads in the parent continue running.  Only
the thread that vforked is frozen.  But you're likely already
considering that undefined behavior.

> so using close-on-exec fd is a better way to sync with exec.

Thanks,
Pedro Alves
  
Florian Weimer Oct. 23, 2017, 3:21 p.m. UTC | #8
* Pedro Alves:

> On 10/23/2017 11:38 AM, Szabolcs Nagy wrote:
>
>> VFORK is not precise (i think under ptrace parent can
>> continue before child execs) 
>
> It can not, modulo ancient kernel bugs, perhaps.  Even if the
> ptracer PTRACE_CONTs the parent, the parent remains frozen
> in "D (disk sleep)" until the child execs or dies, at which
> point the ptracer is notified with a PTRACE_EVENT_VFORK_DONE
> event, if enabled with PTRACE_O_TRACEVFORKDONE.

What seems to happen here is that the kernel puts the exit
notification on some internal queue, where it lingers for some time,
and the clone system call returns immediately.  At this point, the
queued exit status is not visible.  I think it's a kernel bug
(potentially related to the incorrect reporting of EAGAIN from
clone/fork), but the kernel increasingly does things like that because
they look good in benchmarks.  This way, the kernel can batch multiple
deallocations to use a single (global) synchronization, which is
cheaper.  But it also means that any kind of resource accounting is
inaccurate and can fail spuriously, and it also results in bugs like
the externally synchronized waitpid not showing correct process state.

  <https://bugzilla.kernel.org/show_bug.cgi?id=154011>

I hear that namespace deallocation now has a similar bug.
  

Patch

diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index d15fbb1..fb83c2e 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -374,12 +374,12 @@  __spawnix (pid_t * pid, const char *file,
       ec = args.err;
       if (ec > 0)
 	/* There still an unlikely case where the child is cancelled after
-	   setting args.err, due to a positive error value.  Also due a
+	   setting args.err, due to a positive error value.  Also there is
 	   possible pid reuse race (where the kernel allocated the same pid
-	   to unrelated process) we need not to undefinitely hang expecting
-	   an invalid pid.  In both cases an error is returned to the
-	   caller.  */
-	__waitpid (new_pid, NULL, WNOHANG);
+	   to an unrelated process).  Unfortunately due synchronization
+	   issues where the kernel might not have the process collected
+	   the waitpid below can not use WNOHANG.  */
+	__waitpid (new_pid, NULL, 0);
     }
   else
     ec = -new_pid;