[v2,3/3] posix: New Linux posix_spawn{p} implementation

Message ID CAKwiHFiuZVLX+S8b7OZxJfcdvZ2mjWV6p0CAnQRJHOfHcmn-HQ@mail.gmail.com
State New, archived

Commit Message

Rasmus Villemoes Sept. 1, 2016, 9:28 a.m. UTC
  On 1 September 2016 at 00:08, Joseph Myers <joseph@codesourcery.com> wrote:
> On Wed, 31 Aug 2016, Rasmus Villemoes wrote:
>
>> Rather late to the party, but I think there's a few bugs here. Most
>> importantly, dup() doesn't preserve the CLOEXEC flag, so if we do move
>> the write end around like this, the fd will not automatically be closed
>> during the exec, and hence the parent won't receive EOF and will block
>> in read() call until the child finally exits. That's easily fixable with
>> fcntl(p, F_DUPFD_CLOEXEC, 0). Pretty annoying to add a test for, though.
>
> In Linux-specific code we can assume the presence of dup3 (which needs to
> be called by the name __dup3 in implementations of POSIX functions).

What I meant was that it is a little hard to write a regression test
for this bug, since we don't know beforehand what fds the pipe2() call
will give us, making it hard to create actions that are guaranteed to
exercise this code.
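
The CLOEXEC behaviour described in the quoted message can at least be
checked in isolation, even if driving it through posix_spawn file actions
is awkward. The following is a minimal sketch, not part of the patch: the
helper names fd_is_cloexec and compare_dup_cloexec are made up for
illustration. It duplicates the write end of an O_CLOEXEC pipe once with
dup() and once with fcntl(F_DUPFD_CLOEXEC) and reports which duplicate
kept the flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Return 1 if FD has the close-on-exec flag set, 0 if not, -1 on error.  */
static int
fd_is_cloexec (int fd)
{
  int flags = fcntl (fd, F_GETFD);
  if (flags < 0)
    return -1;
  return (flags & FD_CLOEXEC) != 0;
}

/* Hypothetical helper: create an O_CLOEXEC pipe and duplicate its write
   end twice, once with dup and once with F_DUPFD_CLOEXEC.  Stores the
   close-on-exec state of each duplicate and returns 0, or returns -1 if
   any of the system calls failed.  */
int
compare_dup_cloexec (int *via_dup, int *via_fcntl)
{
  assert (via_dup != NULL && via_fcntl != NULL);

  int p[2];
  if (pipe2 (p, O_CLOEXEC) != 0)
    return -1;

  int a = dup (p[1]);                        /* Flag is cleared on the copy.  */
  int b = fcntl (p[1], F_DUPFD_CLOEXEC, 0);  /* Flag is set on the copy.  */
  int ok = (a >= 0 && b >= 0);

  if (ok)
    {
      *via_dup = fd_is_cloexec (a);
      *via_fcntl = fd_is_cloexec (b);
    }

  if (a >= 0)
    close (a);
  if (b >= 0)
    close (b);
  close (p[0]);
  close (p[1]);
  return ok ? 0 : -1;
}
```

Calling compare_dup_cloexec (&d, &f) should leave d at 0 and f at 1,
which is the difference that keeps the parent blocked in read() when the
pipe end is moved with plain dup().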

But, thinking a bit more about this, why do we even need a pipe to
ensure the child is gone, when we already set CLONE_VFORK? Can't we
just exploit the fact that we run in the same VM as the parent and
make the child write a non-zero error code to the spawn_args
structure? That would eliminate this problem entirely. Something like
below (sorry if gmail has whitespace-damaged it).

Rasmus



From: Rasmus Villemoes <rv@rasmusvillemoes.dk>
Date: Thu, 1 Sep 2016 11:03:43 +0200
Subject: [PATCH] linux: spawni.c: simplify error reporting to parent

Using VFORK already ensures that the parent does not run until the
child has either exec'ed successfully or called _exit. Hence we don't
need to read from a CLOEXEC pipe to ensure proper synchronization - we
just make explicit use of the fact that the child and parent run in the
same VM, so the child can write an error code to a field of the
posix_spawn_args struct instead of sending it through a pipe.

This eliminates some annoying bookkeeping that is necessary to prevent
the file actions from clobbering the write end of the pipe, and
getting rid of the pipe creation in the first place means fewer system
calls and fewer chances for the spawn to fail (e.g. if we're close to
EMFILE).

Signed-off-by: Rasmus Villemoes <rv@rasmusvillemoes.dk>
---
 sysdeps/unix/sysv/linux/spawni.c | 63 ++++++++++++----------------------------
 1 file changed, 18 insertions(+), 45 deletions(-)
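
The mechanism the commit message relies on can be sketched outside glibc.
The code below is an illustration under stated assumptions, not the
spawni.c implementation: struct demo_args, demo_child and demo_spawn are
hypothetical names, the stack size is arbitrary, the stack is assumed to
grow downwards, and the signal blocking and file actions that the real
code performs are left out. It uses the public clone() wrapper with
CLONE_VM | CLONE_VFORK so the child's store to args->err is visible to
the parent once the parent resumes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for struct posix_spawn_args.  */
struct demo_args
{
  const char *file;
  char *const *argv;
  char *const *envp;
  int err;                      /* Written by the child, read by the parent.  */
};

static int
demo_child (void *arguments)
{
  struct demo_args *args = arguments;

  execve (args->file, args->argv, args->envp);

  /* Only reached if execve failed.  Because of CLONE_VM this store lands
     in the parent's copy of ARGS.  */
  args->err = errno ? errno : ECHILD;
  _exit (127);
}

/* Returns 0 and stores the child's pid in *PID on success, otherwise
   returns an errno value.  */
int
demo_spawn (pid_t *pid, const char *file, char *const argv[],
            char *const envp[])
{
  size_t stack_size = 64 * 1024;        /* Arbitrary size for the sketch.  */
  void *stack = mmap (NULL, stack_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return errno;

  struct demo_args args = { file, argv, envp, 0 };

  /* CLONE_VFORK suspends this thread until the child has called execve
     or _exit, so no pipe is needed for synchronization.  The top of the
     mapping is passed because the stack is assumed to grow downwards.  */
  pid_t child = clone (demo_child, (char *) stack + stack_size,
                       CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
  int ec;
  if (child > 0)
    {
      ec = args.err;
      if (ec != 0)
        waitpid (child, NULL, 0);       /* Reap the failed child.  */
    }
  else
    ec = errno;

  munmap (stack, stack_size);

  if (ec == 0 && pid != NULL)
    *pid = child;
  return ec;
}
```

With a path that does not exist, demo_spawn should return ENOENT without
any pipe having been created; with a valid executable it returns 0 and the
caller is responsible for waiting on the returned pid.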
  

Comments

Adhemerval Zanella Sept. 14, 2016, 1:13 p.m. UTC | #1
On 01/09/2016 06:28, Rasmus Villemoes wrote:
> On 1 September 2016 at 00:08, Joseph Myers <joseph@codesourcery.com> wrote:
>> On Wed, 31 Aug 2016, Rasmus Villemoes wrote:
>>
>>> Rather late to the party, but I think there's a few bugs here. Most
>>> importantly, dup() doesn't preserve the CLOEXEC flag, so if we do move
>>> the write end around like this, the fd will not automatically be closed
>>> during the exec, and hence the parent won't receive EOF and will block
>>> in read() call until the child finally exits. That's easily fixable with
>>> fcntl(p, F_DUPFD_CLOEXEC, 0). Pretty annoying to add a test for, though.
>>
>> In Linux-specific code we can assume the presence of dup3 (which needs to
>> be called by the name __dup3 in implementations of POSIX functions).
> 
> What I meant was that it is a little hard to write a regression test
> for this bug, since we don't know beforehand what fds the pipe2() call
> will give us, making it hard to create actions that are guaranteed to
> exercise this code.
> 
> But, thinking a bit more about this, why do we even need a pipe to
> ensure the child is gone, when we already set CLONE_VFORK? Can't we
> just exploit the fact that we run in the same VM as the parent and
> make the child write a non-zero error code to the spawn_args
> structure? That would eliminate this problem entirely. Something like
> below (sorry if gmail has whitespace-damaged it).

I think the patch is OK and fixes the issues you noted about using the pipe2
call to signal the execv issue.  I just have one remark about it below.


> @@ -280,14 +267,12 @@ __spawni_child (void *arguments)
>       (2.15).  */
>    maybe_script_execute (args);
> 
> -  ret = -errno;
> -
>  fail:
> -  /* Since sizeof errno < PIPE_BUF, the write is atomic. */
> -  ret = -ret;
> -  if (ret)
> -    while (write_not_cancel (p, &ret, sizeof ret) < 0)
> -      continue;
> +  /* errno should have an appropriate non-zero value, but make sure
> +     that's the case so that our parent knows we failed to
> +     exec. There's no EUNKNOWN or EINTERNALBUG, so we use a value
> +     which is clearly bogus.  */
> +  args->err = errno ? : EHOSTDOWN;
>    _exit (SPAWN_ERROR);
>  }

I would prefer an assert call here to ensure errno is non-zero in the
failure case instead of reporting a bogus errno to the program.  Since
this unexpected situation means either the kernel reported something
wrong or there is an underlying bug, it would be better to fail at once
than to document in the manual that this is potentially an unknown
issue.
  
Rasmus Villemoes Sept. 14, 2016, 6:58 p.m. UTC | #2
On Wed, Sep 14 2016, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> I think the patch is OK and fixes the issues you noted about using the pipe2
> call to signal the execv issue.  I just have one remark about it below.
>
>
>> @@ -280,14 +267,12 @@ __spawni_child (void *arguments)
>>       (2.15).  */
>>    maybe_script_execute (args);
>> 
>> -  ret = -errno;
>> -
>>  fail:
>> -  /* Since sizeof errno < PIPE_BUF, the write is atomic. */
>> -  ret = -ret;
>> -  if (ret)
>> -    while (write_not_cancel (p, &ret, sizeof ret) < 0)
>> -      continue;
>> +  /* errno should have an appropriate non-zero value, but make sure
>> +     that's the case so that our parent knows we failed to
>> +     exec. There's no EUNKNOWN or EINTERNALBUG, so we use a value
>> +     which is clearly bogus.  */
>> +  args->err = errno ? : EHOSTDOWN;
>>    _exit (SPAWN_ERROR);
>>  }
>
> I would prefer an assert call here to ensure errno is non-zero in the
> failure case instead of reporting a bogus errno to the program.  Since
> this unexpected situation means either the kernel reported something
> wrong or there is an underlying bug, it would be better to fail at once
> than to document in the manual that this is potentially an unknown
> issue.

But asserting/aborting in the child doesn't really solve the problem; we
still need to write some non-zero value for the parent to pick up once
we're gone. We could of course write -1 to indicate this really
exceptional situation, but that still leaves deciding how to handle that
in the parent. IMO an assert/abort is a little too harsh, but then the
parent has to return _some_ error code to its caller.

Rasmus
  
Adhemerval Zanella Sept. 14, 2016, 7:59 p.m. UTC | #3
On 14/09/2016 15:58, Rasmus Villemoes wrote:
> On Wed, Sep 14 2016, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
>> I think the patch is OK and fixes the issues you noted about using the pipe2
>> call to signal the execv issue.  I just have one remark about it below.
>>
>>
>>> @@ -280,14 +267,12 @@ __spawni_child (void *arguments)
>>>       (2.15).  */
>>>    maybe_script_execute (args);
>>>
>>> -  ret = -errno;
>>> -
>>>  fail:
>>> -  /* Since sizeof errno < PIPE_BUF, the write is atomic. */
>>> -  ret = -ret;
>>> -  if (ret)
>>> -    while (write_not_cancel (p, &ret, sizeof ret) < 0)
>>> -      continue;
>>> +  /* errno should have an appropriate non-zero value, but make sure
>>> +     that's the case so that our parent knows we failed to
>>> +     exec. There's no EUNKNOWN or EINTERNALBUG, so we use a value
>>> +     which is clearly bogus.  */
>>> +  args->err = errno ? : EHOSTDOWN;
>>>    _exit (SPAWN_ERROR);
>>>  }
>>
>> I would prefer an assert call here to ensure errno is non-zero in the
>> failure case instead of reporting a bogus errno to the program.  Since
>> this unexpected situation means either the kernel reported something
>> wrong or there is an underlying bug, it would be better to fail at once
>> than to document in the manual that this is potentially an unknown
>> issue.
> 
> But asserting/aborting in the child doesn't really solve the problem; we
> still need to write some non-zero value for the parent to pick up once
> we're gone. We could of course write -1 to indicate this really
> exceptional situation, but that still leaves deciding how to handle that
> in the parent. IMO an assert/abort is a little too harsh, but then the
> parent has to return _some_ error code to its caller.

My idea was in fact not to return to the parent, but rather to terminate
program execution in the face of an unknown issue.  However, I do not have
a strong opinion on whether that is really the desirable behaviour, and on
second thought aborting the program does seem too harsh.  I think -1 would
suffice.
  
Florian Weimer Sept. 20, 2016, 8:25 p.m. UTC | #4
* Rasmus Villemoes:

> +  /* errno should have an appropriate non-zero value, but make sure
> +     that's the case so that our parent knows we failed to
> +     exec. There's no EUNKNOWN or EINTERNALBUG, so we use a value
> +     which is clearly bogus.  */
> +  args->err = errno ? : EHOSTDOWN;
>    _exit (SPAWN_ERROR);
>  }

I think ECHILD is probably a better fake error code.

You should set args->err to 0 on success ...

> +  args.err = 0;

... and initialize it to -1.

>    if (new_pid > 0)
>      {
> -      if (__read (args.pipe[0], &ec, sizeof ec) != sizeof ec)
> -    ec = 0;
> -      else
> +      ec = args.err;
> +      if (ec != 0)
>      __waitpid (new_pid, NULL, 0);
>      }

You should check (assert?) here that args.err is not -1.  Otherwise we
will never notice if the page is not shared between parent and child,
and the error reporting mechanism does not work.
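
The sentinel protocol suggested here can be sketched as follows. This is
an illustration only, assuming a simplified setup: struct sentinel_args,
sentinel_child and sentinel_spawn are hypothetical names, the stack size
is arbitrary, the stack is assumed to grow downwards, and signal handling
and file actions are omitted. The parent initializes err to -1, the child
overwrites it with 0 just before exec and with an errno value if exec
fails, and the parent asserts that the sentinel is gone.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical argument block; only ERR matters for this sketch.  */
struct sentinel_args
{
  const char *file;
  char *const *argv;
  char *const *envp;
  int err;
};

static int
sentinel_child (void *arguments)
{
  struct sentinel_args *args = arguments;

  /* Overwrite the parent's -1 sentinel before exec.  After a successful
     execve the child no longer shares memory with the parent, so 0 is
     the value the parent will observe.  */
  args->err = 0;
  execve (args->file, args->argv, args->envp);

  /* execve returned, so it failed.  ECHILD is the fallback in case
     errno is unexpectedly 0.  */
  args->err = errno ? errno : ECHILD;
  _exit (127);
}

/* Returns 0 and stores the child's pid in *PID on success, otherwise
   returns an errno value.  */
int
sentinel_spawn (pid_t *pid, const char *file, char *const argv[],
                char *const envp[])
{
  size_t stack_size = 64 * 1024;        /* Arbitrary size for the sketch.  */
  void *stack = mmap (NULL, stack_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return errno;

  /* -1 means "the child has not written anything yet".  */
  struct sentinel_args args = { file, argv, envp, -1 };

  pid_t child = clone (sentinel_child, (char *) stack + stack_size,
                       CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
  int ec;
  if (child > 0)
    {
      /* If this fires, the child's store never reached us, i.e. the
         memory was not actually shared.  */
      assert (args.err != -1);
      ec = args.err;
      if (ec != 0)
        waitpid (child, NULL, 0);       /* Reap the failed child.  */
    }
  else
    ec = errno;

  munmap (stack, stack_size);

  if (ec == 0 && pid != NULL)
    *pid = child;
  return ec;
}
```

The design point is that 0 alone cannot distinguish "exec succeeded" from
"the child's write was never visible"; starting from -1 makes the second
case detectable in the parent.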
  
Rasmus Villemoes Sept. 20, 2016, 8:54 p.m. UTC | #5
On Tue, Sep 20 2016, Florian Weimer <fw@deneb.enyo.de> wrote:

> * Rasmus Villemoes:
>
>> +  /* errno should have an appropriate non-zero value, but make sure
>> +     that's the case so that our parent knows we failed to
>> +     exec. There's no EUNKNOWN or EINTERNALBUG, so we use a value
>> +     which is clearly bogus.  */
>> +  args->err = errno ? : EHOSTDOWN;
>>    _exit (SPAWN_ERROR);
>>  }
>
> I think ECHILD is probably a better fake error code.

Yeah, that's probably ok. It's consistent with its use in pclose() where
the POSIX description reads "The status of the child process could not
be obtained". We just use it in the sense "something went wrong, we just
don't know what".

I really wish EINTERNALBUG existed.

> You should set args->err to 0 on success ...
>
>> +  args.err = 0;
>
> ... and initialize it to -1.
>
>>    if (new_pid > 0)
>>      {
>> -      if (__read (args.pipe[0], &ec, sizeof ec) != sizeof ec)
>> -    ec = 0;
>> -      else
>> +      ec = args.err;
>> +      if (ec != 0)
>>      __waitpid (new_pid, NULL, 0);
>>      }
>
> You should check (assert?) here that args.err is not -1.  Otherwise we
> will never notice if the page is not shared between parent and child,
> and the error reporting mechanism does not work.

Good point. I'll send an updated patch in a moment.

Rasmus
  

Patch

diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index bb3eecf..a3c4175 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -44,11 +44,12 @@ 
    3. Child must synchronize with parent to enforce 2. and to possible
       return execv issues.

-   The first issue is solved by blocking all signals in child, even the
-   NPTL-internal ones (SIGCANCEL and SIGSETXID).  The second and third issue
-   is done by a stack allocation in parent and a synchronization with using
-   a pipe or waitpid (in case or error).  The pipe has the advantage of
-   allowing the child the communicate an exec error.  */
+   The first issue is solved by blocking all signals in child, even
+   the NPTL-internal ones (SIGCANCEL and SIGSETXID).  The second and
+   third issue is done by a stack allocation in parent, and by using a
+   field in struct spawn_args where the child can write an error
+   code. CLONE_VFORK ensures that the parent does not run until the
+   child has either exec'ed successfully or exited.  */


 /* The Unix standard contains a long explanation of the way to signal
@@ -79,7 +80,6 @@ 

 struct posix_spawn_args
 {
-  int pipe[2];
   sigset_t oldmask;
   const char *file;
   int (*exec) (const char *, char *const *, char *const *);
@@ -89,6 +89,7 @@  struct posix_spawn_args
   ptrdiff_t argc;
   char *const *envp;
   int xflags;
+  int err;
 };

 /* Older version requires that shell script without shebang definition
@@ -125,11 +126,8 @@  __spawni_child (void *arguments)
   struct posix_spawn_args *args = arguments;
   const posix_spawnattr_t *restrict attr = args->attr;
   const posix_spawn_file_actions_t *file_actions = args->fa;
-  int p = args->pipe[1];
   int ret;

-  close_not_cancel (args->pipe[0]);
-
   /* The child must ensure that no signal handler are enabled because it shared
      memory with parent, so the signal disposition must be either SIG_DFL or
      SIG_IGN.  It does by iterating over all signals and although it could
@@ -203,17 +201,6 @@  __spawni_child (void *arguments)
     {
       struct __spawn_action *action = &file_actions->__actions[cnt];

-      /* Dup the pipe fd onto an unoccupied one to avoid any file
-         operation to clobber it.  */
-      if ((action->action.close_action.fd == p)
-          || (action->action.open_action.fd == p)
-          || (action->action.dup2_action.fd == p))
-        {
-          if ((ret = __dup (p)) < 0)
-        goto fail;
-          p = ret;
-        }
-
       switch (action->tag)
         {
         case spawn_do_close:
@@ -280,14 +267,12 @@  __spawni_child (void *arguments)
      (2.15).  */
   maybe_script_execute (args);

-  ret = -errno;
-
 fail:
-  /* Since sizeof errno < PIPE_BUF, the write is atomic. */
-  ret = -ret;
-  if (ret)
-    while (write_not_cancel (p, &ret, sizeof ret) < 0)
-      continue;
+  /* errno should have an appropriate non-zero value, but make sure
+     that's the case so that our parent knows we failed to
+     exec. There's no EUNKNOWN or EINTERNALBUG, so we use a value
+     which is clearly bogus.  */
+  args->err = errno ? : EHOSTDOWN;
   _exit (SPAWN_ERROR);
 }

@@ -304,9 +289,6 @@  __spawnix (pid_t * pid, const char *file,
   struct posix_spawn_args args;
   int ec;

-  if (__pipe2 (args.pipe, O_CLOEXEC))
-    return errno;
-
   /* To avoid imposing hard limits on posix_spawn{p} the total number of
      arguments is first calculated to allocate a mmap to hold all possible
      values.  */
@@ -333,15 +315,12 @@  __spawnix (pid_t * pid, const char *file,
   void *stack = __mmap (NULL, stack_size, prot,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
   if (__glibc_unlikely (stack == MAP_FAILED))
-    {
-      close_not_cancel (args.pipe[0]);
-      close_not_cancel (args.pipe[1]);
-      return errno;
-    }
+    return errno;

   /* Disable asynchronous cancellation.  */
   int cs = LIBC_CANCEL_ASYNC ();

+  args.err = 0;
   args.file = file;
   args.exec = exec;
   args.fa = file_actions;
@@ -355,9 +334,8 @@  __spawnix (pid_t * pid, const char *file,

   /* The clone flags used will create a new child that will run in the same
      memory space (CLONE_VM) and the execution of calling thread will be
-     suspend until the child calls execve or _exit.  These condition as
-     signal below either by pipe write (_exit with SPAWN_ERROR) or
-     a successful execve.
+     suspend until the child calls execve or _exit.
+
      Also since the calling thread execution will be suspend, there is not
      need for CLONE_SETTLS.  Although parent and child share the same TLS
      namespace, there will be no concurrent access for TLS variables (errno
@@ -365,13 +343,10 @@  __spawnix (pid_t * pid, const char *file,
   new_pid = CLONE (__spawni_child, STACK (stack, stack_size), stack_size,
            CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

-  close_not_cancel (args.pipe[1]);
-
   if (new_pid > 0)
     {
-      if (__read (args.pipe[0], &ec, sizeof ec) != sizeof ec)
-    ec = 0;
-      else
+      ec = args.err;
+      if (ec != 0)
     __waitpid (new_pid, NULL, 0);
     }
   else
@@ -379,8 +354,6 @@  __spawnix (pid_t * pid, const char *file,

   __munmap (stack, stack_size);

-  close_not_cancel (args.pipe[0]);
-
   if ((ec == 0) && (pid != NULL))
     *pid = new_pid;