[v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115)

Message ID 20220503170032.2531471-1-adhemerval.zanella@linaro.org
State Superseded
Headers
Series [v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115) |

Checks

Context Check Description
dj/TryBot-apply_patch fail Patch failed to apply to master at the time it was sent
dj/TryBot-32bit fail Patch series failed to apply

Commit Message

Adhemerval Zanella Netto May 3, 2022, 5 p.m. UTC
  Even though it might characterizeis as a kernel bug (incomplete kernel
support for CLONE_VM used along with CLONE_VFORK), time namespace
support
is already presented in a some kernel releases (since 2020).

Although I agree with Carlos that it results in unexpected surprise
behaviour, I also see that providing a reasonable fallback that does not
add possible issues like race conditions (as some syscall fallbacks in
the
past) provides a easier transition and make the interfaca to work on
multiple scenarios.

The fix itself does adds some corner cases that need to be handled, but
I still think it is maintanable and a general improvement.

--

Linux clone with CLONE_VM may fail for some namespace restriction (for
instance if kernel does not allow processes in different time namespaces
to share the sameaddress space).

In this case clone fails with EINVAL and posix_spawn can not spawn a new
process.  However the same process can be spawned with fork and exec.

The patch fixes by retrying the clone syscall with just CLONE_VFORK
if clone fails with a non transient failure (ENOMEM and EAGAIN still
returns failure to caller).

Error on preparation phase or execve is returned by a pipe now that
there is no shared memory that ca be used.  It requires some extra care
for some file operations:

  * If the file operation would clobber the pipe file descriptor, the
    helper process dup the pipe onto an unoccupied file descriptor.

  * dup2 file action that targets the pipe file descriptor returns
    EBADF.

  * If closefrom argument overlaps the pipe file descriptor, it is
    splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].

Failure on prepare phase in helper process does not trigger the fork
and exec fallback.

Checked on x86_64-linux-gnu.
---
v3: Fixed closerange fallback,.
v2: Fixed closerange file actions, handle EPIPE, return all errors,
    handle if pipe fd is used in file actions.
---
 include/unistd.h                             |   4 +-
 io/closefrom.c                               |   2 +-
 sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +-
 sysdeps/unix/sysv/linux/spawni.c             | 235 +++++++++++++++----
 4 files changed, 196 insertions(+), 49 deletions(-)
  

Comments

Alexey Izbyshev May 3, 2022, 5:36 p.m. UTC | #1
On 2022-05-03 20:00, Adhemerval Zanella wrote:
> Even though it might characterizeis as a kernel bug (incomplete kernel
> support for CLONE_VM used along with CLONE_VFORK), time namespace
> support
> is already presented in a some kernel releases (since 2020).
> 
> Although I agree with Carlos that it results in unexpected surprise
> behaviour, I also see that providing a reasonable fallback that does 
> not
> add possible issues like race conditions (as some syscall fallbacks in
> the
> past) provides a easier transition and make the interfaca to work on
> multiple scenarios.
> 
> The fix itself does adds some corner cases that need to be handled, but
> I still think it is maintanable and a general improvement.
> 
> --
> 
> Linux clone with CLONE_VM may fail for some namespace restriction (for
> instance if kernel does not allow processes in different time 
> namespaces
> to share the sameaddress space).
> 
> In this case clone fails with EINVAL and posix_spawn can not spawn a 
> new
> process.  However the same process can be spawned with fork and exec.
> 
> The patch fixes by retrying the clone syscall with just CLONE_VFORK
> if clone fails with a non transient failure (ENOMEM and EAGAIN still
> returns failure to caller).
> 
> Error on preparation phase or execve is returned by a pipe now that
> there is no shared memory that ca be used.  It requires some extra care
> for some file operations:
> 
>   * If the file operation would clobber the pipe file descriptor, the
>     helper process dup the pipe onto an unoccupied file descriptor.
> 
>   * dup2 file action that targets the pipe file descriptor returns
>     EBADF.
> 
>   * If closefrom argument overlaps the pipe file descriptor, it is
>     splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].
> 
> Failure on prepare phase in helper process does not trigger the fork
> and exec fallback.
> 
> Checked on x86_64-linux-gnu.
> ---
> v3: Fixed closerange fallback,.
> v2: Fixed closerange file actions, handle EPIPE, return all errors,
>     handle if pipe fd is used in file actions.
> ---
>  include/unistd.h                             |   4 +-
>  io/closefrom.c                               |   2 +-
>  sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +-
>  sysdeps/unix/sysv/linux/spawni.c             | 235 +++++++++++++++----
>  4 files changed, 196 insertions(+), 49 deletions(-)
> 
> diff --git a/include/unistd.h b/include/unistd.h
> index af795a37c8..2643991a09 100644
> --- a/include/unistd.h
> +++ b/include/unistd.h
> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int
> __lowfd, _Bool dirfd_fallback)
>    return false;
>  }
>  #  else
> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) 
> attribute_hidden;
> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int 
> __to,
> +				   _Bool)
> +     attribute_hidden;
>  #  endif
>  extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
>  libc_hidden_proto (__read)
> diff --git a/io/closefrom.c b/io/closefrom.c
> index 48cac2e3d1..a07de06e9b 100644
> --- a/io/closefrom.c
> +++ b/io/closefrom.c
> @@ -30,7 +30,7 @@ __closefrom (int lowfd)
>    if (r == 0)
>      return ;
> 
> -  if (!__closefrom_fallback (l, true))
> +  if (!__closefrom_fallback (l, ~0U, true))
>      __fortify_fail ("closefrom failed to close a file descriptor");
>  }
>  weak_alias (__closefrom, closefrom)
> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c
> b/sysdeps/unix/sysv/linux/closefrom_fallback.c
> index a9dd0c46b2..ad301e1910 100644
> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
> @@ -28,7 +28,7 @@
>     /proc/self/fd open will trigger a fallback that tries to close a 
> file
>     descriptor before proceed.  */
>  _Bool
> -__closefrom_fallback (int from, _Bool dirfd_fallback)
> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool 
> dirfd_fallback)
>  {
>    int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | 
> O_DIRECTORY,
>                                 0);
> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback)
> 
>            if (fd == dirfd || fd < from)
>              continue;
> +	  if (fd > to)
> +	    break;
> 
__closefrom_fallback() currently ignores "to" if it can't open 
/proc/self/fd and dirfd_fallback is true. This path is not used by 
posix_spawn(), but it's a potential future bug.

>            /* We ignore close errors because EBADF, EINTR, and EIO 
> means the
>               descriptor has been released.  */
> diff --git a/sysdeps/unix/sysv/linux/spawni.c 
> b/sysdeps/unix/sysv/linux/spawni.c
> index d6f5ca89cd..0481791013 100644
> --- a/sysdeps/unix/sysv/linux/spawni.c
> +++ b/sysdeps/unix/sysv/linux/spawni.c
> @@ -44,7 +44,11 @@
>     third issue is done by a stack allocation in parent, and by using a
>     field in struct spawn_args where the child can write an error
>     code. CLONE_VFORK ensures that the parent does not run until the
> -   child has either exec'ed successfully or exited.  */
> +   child has either exec'ed successfully or exited.
> +
> +   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel 
> limitation
> +   such as time namespace), only CLONE_VFORK is used instead and the
> +   preparation and execve failures are communicated with a pipe.  */
> 
> 
>  /* The Unix standard contains a long explanation of the way to signal
> @@ -67,6 +71,7 @@ struct posix_spawn_args
>    char *const *envp;
>    int xflags;
>    int err;
> +  int pipe[2];
>  };
> 
>  /* Older version requires that shell script without shebang definition
> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args 
> *args)
>      }
>  }
> 
> +/* If the file operation would clobber the pipe fd used to communicate 
> with
> +   parent, dup the pipe onto an unoccupied file descriptor.  */
> +static inline bool
> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
> +{
> +  int fd;
> +
> +  switch (fa->tag)
> +    {
> +    case spawn_do_close:
> +      fd = fa->action.close_action.fd;
> +      break;
> +    case spawn_do_open:
> +      fd = fa->action.open_action.fd;
> +      break;
> +    case spawn_do_dup2:
> +      fd = fa->action.dup2_action.newfd;
> +      break;
> +    case spawn_do_fchdir:
> +      fd = fa->action.fchdir_action.fd;
> +    default:
> +      return true;
> +    }
> +
> +  if (fd == p[1])
> +    {
> +      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
> +      if (r < 0)
> +	return false;
> +      __close_nocancel (p[1]);
> +      p[1] = r;
> +    }
> +
> +  return true;
> +}
> +
> +static inline bool
> +spawni_fa_closerange (int from, int to)
> +{
> +#if 0
> +  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
> +  return r == 0 || __closefrom_fallback (from, to, false);
> +#else
> +  return __closefrom_fallback (from, to, false);
> +#endif
> +}
> +
>  /* Function used in the clone call to setup the signals mask, 
> posix_spawn
>     attributes, and file actions.  It run on its own stack (provided by 
> the
>     posix_spawn call).  */
> -static int
> -__spawni_child (void *arguments)
> +static _Noreturn int
> +spawni_child (void *arguments)
>  {
>    struct posix_spawn_args *args = arguments;
>    const posix_spawnattr_t *restrict attr = args->attr;
>    const posix_spawn_file_actions_t *file_actions = args->fa;
> +  bool use_pipe = args->pipe[0] != -1;
> 
>    /* The child must ensure that no signal handler are enabled because 
> it shared
>       memory with parent, so the signal disposition must be either 
> SIG_DFL or
> @@ -113,6 +166,9 @@ __spawni_child (void *arguments)
>    struct sigaction sa;
>    memset (&sa, '\0', sizeof (sa));
> 
> +  if (use_pipe)
> +    __close (args->pipe[0]);
> +
>    sigset_t hset;
>    __sigprocmask (SIG_BLOCK, 0, &hset);
>    for (int sig = 1; sig < _NSIG; ++sig)
> @@ -181,6 +237,9 @@ __spawni_child (void *arguments)
>  	{
>  	  struct __spawn_action *action = &file_actions->__actions[cnt];
> 
> +	  if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
> +	    goto fail;
> +
>  	  switch (action->tag)
>  	    {
>  	    case spawn_do_close:
> @@ -233,6 +292,11 @@ __spawni_child (void *arguments)
>  	      break;
> 
>  	    case spawn_do_dup2:
> +	      if (use_pipe && action->action.dup2_action.fd == args->pipe[1])
> +		{
> +		  errno = EBADF;
> +		  goto fail;
> +		}
>  	      /* Austin Group issue #411 requires adddup2 action with source
>  		 and destination being equal to remove close-on-exec flag.  */
>  	      if (action->action.dup2_action.fd
> @@ -264,8 +328,18 @@ __spawni_child (void *arguments)
>  	    case spawn_do_closefrom:
>  	      {
>  		int lowfd = action->action.closefrom_action.from;
> -	        int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0);
> -		if (r != 0 && !__closefrom_fallback (lowfd, false))
> +		/* Skip the pipe descriptor if it is used.  No need to handle
> +		   it since it is created with O_CLOEXEC.  */
> +		if (use_pipe && args->pipe[1] >= lowfd)
> +		  {
> +		    if (args->pipe[1] == lowfd
> +			 && !spawni_fa_closerange (lowfd + 1, ~0U))
> +		      goto fail;
> +		    else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
> +			     || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))

The "else" branch will also run if "args->pipe[1] == lowfd" and 
spawni_fa_closerange (lowfd + 1, ~0U) succeeds.

> +		      goto fail;
> +		  }
> +		else if (!spawni_fa_closerange (lowfd, ~0U))
>  		  goto fail;
>  	      } break;
> 
> @@ -300,10 +374,112 @@ fail:
>       (EINTERNALBUG) describing that, use ECHILD.  Another option would
>       be to set args->err to some negative sentinel and have the parent
>       abort(), but that seems needlessly harsh.  */
> -  args->err = errno ? : ECHILD;
> +  int ret = errno ? : ECHILD;
> +  if (use_pipe)
> +    {
> +      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0)
> +	if (errno == EPIPE || errno == EBADF)
> +	  break;
> +    }
> +  else
> +    args->err = ret;
> +
>    _exit (SPAWN_ERROR);
>  }
> 
> +static pid_t
> +clone_call (struct posix_spawn_args *args, int flags, void *stack,
> +	    size_t stack_size)
> +{
> +  struct clone_args clone_args =
> +    {
> +      .flags = flags,
> +      .exit_signal = SIGCHLD,
> +      .stack = (uintptr_t) stack,
> +      .stack_size = stack_size,
> +    };
> +  return __clone_internal (&clone_args, spawni_child, args);
> +}
> +
> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to 
> optimize
> +   memory and overcommit) and return TRUE if the helper was created or 
> if the
> +   failure was not due resource exhaustion.  */
> +static bool
> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
> +	      void *stack, size_t stack_size)
> +{
> +  /* The clone flags used will create a new child that will run in the 
> same
> +     memory space (CLONE_VM) and the execution of calling thread will 
> be
> +     suspend until the child calls execve or _exit.
> +
> +     Also since the calling thread execution will be suspend, there is 
> not
> +     need for CLONE_SETTLS.  Although parent and child share the same 
> TLS
> +     namespace, there will be no concurrent access for TLS variables 
> (errno
> +     for instance).  */
> +  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, 
> stack_size);
> +
> +  /* It needs to collect the case where the auxiliary process was 
> created
> +     but failed to execute the file (due either any preparation step 
> or
> +     for execve itself).  */
> +  if (*new_pid > 0)
> +    {
> +      /* Also, it handles the unlikely case where the auxiliary 
> process was
> +	 terminated before calling execve as if it was successfully.  The
> +	 args.err is set to 0 as default and changed to a positive value
> +	 only in case of failure, so in case of premature termination
> +	 due a signal args.err will remain zeroed and it will be up to
> +	 caller to actually collect it.  */
> +      *ec = args->err;
> +      if (*ec > 0)
> +	/* There still an unlikely case where the child is cancelled after
> +	   setting args.err, due to a positive error value.  Also there is
> +	   possible pid reuse race (where the kernel allocated the same pid
> +	   to an unrelated process).  Unfortunately due synchronization
> +	   issues where the kernel might not have the process collected
> +	   the waitpid below can not use WNOHANG.  */
> +	__waitpid (*new_pid, NULL, 0);
> +    }
> +  else
> +    *ec = errno;
> +
> +  /* There is no much point in retrying with fork and exec if kernel 
> returns a
> +     failure due resource exhaustion.  */
> +  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
> +}
> +
> +/* Fallback spawn case which does not use CLONE_VM.  Any preparation 
> step or
> +   execve failure is passed with a pipe, which requires additional 
> care by
> +   the helper stating process since it additional file descriptors 
> handle.  */
> +static void
> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
> +	     char *stack, size_t stack_size)
> +{
> +  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
> +    {
> +      *ec = errno;
> +      return;
> +    }
> +
> +  /* Do not trigger atfork handler nor any internal state reset since 
> the
> +     helper process will call execve.  */
> +  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
> +
> +  __close (args->pipe[1]);
> +
> +  if (*new_pid > 0)
> +    {
> +      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
> +	/* A successful execve will close the helper process pipe end.  */
> +	*ec = 0;
> +      else
> +	__waitpid (*new_pid, NULL, 0);
> +    }
> +  else
> +    *ec = errno;
> +
> +  __close (args->pipe[0]);
> +}
> +
>  /* Spawn a new process executing PATH with the attributes describes in 
> *ATTRP.
>     Before running the process perform the actions described in 
> FILE-ACTIONS. */
>  static int
> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file,
>    args.argc = argc;
>    args.envp = envp;
>    args.xflags = xflags;
> +  args.pipe[0] = args.pipe[1] = -1;
> 
>    __libc_signal_block_all (&args.oldmask);
> 
> -  /* The clone flags used will create a new child that will run in the 
> same
> -     memory space (CLONE_VM) and the execution of calling thread will 
> be
> -     suspend until the child calls execve or _exit.
> -
> -     Also since the calling thread execution will be suspend, there is 
> not
> -     need for CLONE_SETTLS.  Although parent and child share the same 
> TLS
> -     namespace, there will be no concurrent access for TLS variables 
> (errno
> -     for instance).  */
> -  struct clone_args clone_args =
> -    {
> -      .flags = CLONE_VM | CLONE_VFORK,
> -      .exit_signal = SIGCHLD,
> -      .stack = (uintptr_t) stack,
> -      .stack_size = stack_size,
> -    };
> -  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
> -
> -  /* It needs to collect the case where the auxiliary process was 
> created
> -     but failed to execute the file (due either any preparation step 
> or
> -     for execve itself).  */
> -  if (new_pid > 0)
> -    {
> -      /* Also, it handles the unlikely case where the auxiliary 
> process was
> -	 terminated before calling execve as if it was successfully.  The
> -	 args.err is set to 0 as default and changed to a positive value
> -	 only in case of failure, so in case of premature termination
> -	 due a signal args.err will remain zeroed and it will be up to
> -	 caller to actually collect it.  */
> -      ec = args.err;
> -      if (ec > 0)
> -	/* There still an unlikely case where the child is cancelled after
> -	   setting args.err, due to a positive error value.  Also there is
> -	   possible pid reuse race (where the kernel allocated the same pid
> -	   to an unrelated process).  Unfortunately due synchronization
> -	   issues where the kernel might not have the process collected
> -	   the waitpid below can not use WNOHANG.  */
> -	__waitpid (new_pid, NULL, 0);
> -    }
> -  else
> -    ec = errno;
> +  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace 
> restriction
> +     (for instance Linux does not allow processes in different time 
> namespaces
> +     to share address space) and in this case clone fails with EINVAL. 
>  Retry
> +     with fork and exec.  */
> +  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
> +    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
> 
>    __munmap (stack, stack_size);
  
Adhemerval Zanella Netto May 3, 2022, 5:50 p.m. UTC | #2
On 03/05/2022 14:36, Alexey Izbyshev wrote:
> On 2022-05-03 20:00, Adhemerval Zanella wrote:
>> Even though it might characterizeis as a kernel bug (incomplete kernel
>> support for CLONE_VM used along with CLONE_VFORK), time namespace
>> support
>> is already presented in a some kernel releases (since 2020).
>>
>> Although I agree with Carlos that it results in unexpected surprise
>> behaviour, I also see that providing a reasonable fallback that does not
>> add possible issues like race conditions (as some syscall fallbacks in
>> the
>> past) provides a easier transition and make the interfaca to work on
>> multiple scenarios.
>>
>> The fix itself does adds some corner cases that need to be handled, but
>> I still think it is maintanable and a general improvement.
>>
>> -- 
>>
>> Linux clone with CLONE_VM may fail for some namespace restriction (for
>> instance if kernel does not allow processes in different time namespaces
>> to share the sameaddress space).
>>
>> In this case clone fails with EINVAL and posix_spawn can not spawn a new
>> process.  However the same process can be spawned with fork and exec.
>>
>> The patch fixes by retrying the clone syscall with just CLONE_VFORK
>> if clone fails with a non transient failure (ENOMEM and EAGAIN still
>> returns failure to caller).
>>
>> Error on preparation phase or execve is returned by a pipe now that
>> there is no shared memory that ca be used.  It requires some extra care
>> for some file operations:
>>
>>   * If the file operation would clobber the pipe file descriptor, the
>>     helper process dup the pipe onto an unoccupied file descriptor.
>>
>>   * dup2 file action that targets the pipe file descriptor returns
>>     EBADF.
>>
>>   * If closefrom argument overlaps the pipe file descriptor, it is
>>     splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].
>>
>> Failure on prepare phase in helper process does not trigger the fork
>> and exec fallback.
>>
>> Checked on x86_64-linux-gnu.
>> ---
>> v3: Fixed closerange fallback,.
>> v2: Fixed closerange file actions, handle EPIPE, return all errors,
>>     handle if pipe fd is used in file actions.
>> ---
>>  include/unistd.h                             |   4 +-
>>  io/closefrom.c                               |   2 +-
>>  sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +-
>>  sysdeps/unix/sysv/linux/spawni.c             | 235 +++++++++++++++----
>>  4 files changed, 196 insertions(+), 49 deletions(-)
>>
>> diff --git a/include/unistd.h b/include/unistd.h
>> index af795a37c8..2643991a09 100644
>> --- a/include/unistd.h
>> +++ b/include/unistd.h
>> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int
>> __lowfd, _Bool dirfd_fallback)
>>    return false;
>>  }
>>  #  else
>> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) attribute_hidden;
>> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int __to,
>> +                   _Bool)
>> +     attribute_hidden;
>>  #  endif
>>  extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
>>  libc_hidden_proto (__read)
>> diff --git a/io/closefrom.c b/io/closefrom.c
>> index 48cac2e3d1..a07de06e9b 100644
>> --- a/io/closefrom.c
>> +++ b/io/closefrom.c
>> @@ -30,7 +30,7 @@ __closefrom (int lowfd)
>>    if (r == 0)
>>      return ;
>>
>> -  if (!__closefrom_fallback (l, true))
>> +  if (!__closefrom_fallback (l, ~0U, true))
>>      __fortify_fail ("closefrom failed to close a file descriptor");
>>  }
>>  weak_alias (__closefrom, closefrom)
>> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c
>> b/sysdeps/unix/sysv/linux/closefrom_fallback.c
>> index a9dd0c46b2..ad301e1910 100644
>> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
>> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
>> @@ -28,7 +28,7 @@
>>     /proc/self/fd open will trigger a fallback that tries to close a file
>>     descriptor before proceed.  */
>>  _Bool
>> -__closefrom_fallback (int from, _Bool dirfd_fallback)
>> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool dirfd_fallback)
>>  {
>>    int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | O_DIRECTORY,
>>                                 0);
>> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback)
>>
>>            if (fd == dirfd || fd < from)
>>              continue;
>> +      if (fd > to)
>> +        break;
>>
> __closefrom_fallback() currently ignores "to" if it can't open /proc/self/fd and dirfd_fallback is true. This path is not used by posix_spawn(), but it's a potential future bug.

Indeed, I think iterate over [from, to] instead of [from, INT_MAX] should fix it.


> 
>>            /* We ignore close errors because EBADF, EINTR, and EIO means the
>>               descriptor has been released.  */
>> diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
>> index d6f5ca89cd..0481791013 100644
>> --- a/sysdeps/unix/sysv/linux/spawni.c
>> +++ b/sysdeps/unix/sysv/linux/spawni.c
>> @@ -44,7 +44,11 @@
>>     third issue is done by a stack allocation in parent, and by using a
>>     field in struct spawn_args where the child can write an error
>>     code. CLONE_VFORK ensures that the parent does not run until the
>> -   child has either exec'ed successfully or exited.  */
>> +   child has either exec'ed successfully or exited.
>> +
>> +   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel limitation
>> +   such as time namespace), only CLONE_VFORK is used instead and the
>> +   preparation and execve failures are communicated with a pipe.  */
>>
>>
>>  /* The Unix standard contains a long explanation of the way to signal
>> @@ -67,6 +71,7 @@ struct posix_spawn_args
>>    char *const *envp;
>>    int xflags;
>>    int err;
>> +  int pipe[2];
>>  };
>>
>>  /* Older version requires that shell script without shebang definition
>> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args *args)
>>      }
>>  }
>>
>> +/* If the file operation would clobber the pipe fd used to communicate with
>> +   parent, dup the pipe onto an unoccupied file descriptor.  */
>> +static inline bool
>> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
>> +{
>> +  int fd;
>> +
>> +  switch (fa->tag)
>> +    {
>> +    case spawn_do_close:
>> +      fd = fa->action.close_action.fd;
>> +      break;
>> +    case spawn_do_open:
>> +      fd = fa->action.open_action.fd;
>> +      break;
>> +    case spawn_do_dup2:
>> +      fd = fa->action.dup2_action.newfd;
>> +      break;
>> +    case spawn_do_fchdir:
>> +      fd = fa->action.fchdir_action.fd;
>> +    default:
>> +      return true;
>> +    }
>> +
>> +  if (fd == p[1])
>> +    {
>> +      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
>> +      if (r < 0)
>> +    return false;
>> +      __close_nocancel (p[1]);
>> +      p[1] = r;
>> +    }
>> +
>> +  return true;
>> +}
>> +
>> +static inline bool
>> +spawni_fa_closerange (int from, int to)
>> +{
>> +#if 0
>> +  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
>> +  return r == 0 || __closefrom_fallback (from, to, false);
>> +#else
>> +  return __closefrom_fallback (from, to, false);
>> +#endif
>> +}
>> +
>>  /* Function used in the clone call to setup the signals mask, posix_spawn
>>     attributes, and file actions.  It run on its own stack (provided by the
>>     posix_spawn call).  */
>> -static int
>> -__spawni_child (void *arguments)
>> +static _Noreturn int
>> +spawni_child (void *arguments)
>>  {
>>    struct posix_spawn_args *args = arguments;
>>    const posix_spawnattr_t *restrict attr = args->attr;
>>    const posix_spawn_file_actions_t *file_actions = args->fa;
>> +  bool use_pipe = args->pipe[0] != -1;
>>
>>    /* The child must ensure that no signal handler are enabled because it shared
>>       memory with parent, so the signal disposition must be either SIG_DFL or
>> @@ -113,6 +166,9 @@ __spawni_child (void *arguments)
>>    struct sigaction sa;
>>    memset (&sa, '\0', sizeof (sa));
>>
>> +  if (use_pipe)
>> +    __close (args->pipe[0]);
>> +
>>    sigset_t hset;
>>    __sigprocmask (SIG_BLOCK, 0, &hset);
>>    for (int sig = 1; sig < _NSIG; ++sig)
>> @@ -181,6 +237,9 @@ __spawni_child (void *arguments)
>>      {
>>        struct __spawn_action *action = &file_actions->__actions[cnt];
>>
>> +      if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
>> +        goto fail;
>> +
>>        switch (action->tag)
>>          {
>>          case spawn_do_close:
>> @@ -233,6 +292,11 @@ __spawni_child (void *arguments)
>>            break;
>>
>>          case spawn_do_dup2:
>> +          if (use_pipe && action->action.dup2_action.fd == args->pipe[1])
>> +        {
>> +          errno = EBADF;
>> +          goto fail;
>> +        }
>>            /* Austin Group issue #411 requires adddup2 action with source
>>           and destination being equal to remove close-on-exec flag.  */
>>            if (action->action.dup2_action.fd
>> @@ -264,8 +328,18 @@ __spawni_child (void *arguments)
>>          case spawn_do_closefrom:
>>            {
>>          int lowfd = action->action.closefrom_action.from;
>> -            int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0);
>> -        if (r != 0 && !__closefrom_fallback (lowfd, false))
>> +        /* Skip the pipe descriptor if it is used.  No need to handle
>> +           it since it is created with O_CLOEXEC.  */
>> +        if (use_pipe && args->pipe[1] >= lowfd)
>> +          {
>> +            if (args->pipe[1] == lowfd
>> +             && !spawni_fa_closerange (lowfd + 1, ~0U))
>> +              goto fail;
>> +            else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
>> +                 || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))
> 
> The "else" branch will also run if "args->pipe[1] == lowfd" and spawni_fa_closerange (lowfd + 1, ~0U) succeeds.

I was trying to be too clever here, I think the following is more clear:

                if (use_pipe && args->pipe[1] == lowfd)
                  {
                    if (!spawni_fa_closerange (lowfd + 1, ~0U))
                      goto fail;
                  }
                else if (use_pipe && args->pipe[1] > lowfd)
                  {
                    if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
                        || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))
                      goto fail;
                  }
                else if (!spawni_fa_closerange (lowfd, ~0U))
                  goto fail;


> 
>> +              goto fail;
>> +          }
>> +        else if (!spawni_fa_closerange (lowfd, ~0U))
>>            goto fail;
>>            } break;
>>
>> @@ -300,10 +374,112 @@ fail:
>>       (EINTERNALBUG) describing that, use ECHILD.  Another option would
>>       be to set args->err to some negative sentinel and have the parent
>>       abort(), but that seems needlessly harsh.  */
>> -  args->err = errno ? : ECHILD;
>> +  int ret = errno ? : ECHILD;
>> +  if (use_pipe)
>> +    {
>> +      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0)
>> +    if (errno == EPIPE || errno == EBADF)
>> +      break;
>> +    }
>> +  else
>> +    args->err = ret;
>> +
>>    _exit (SPAWN_ERROR);
>>  }
>>
>> +static pid_t
>> +clone_call (struct posix_spawn_args *args, int flags, void *stack,
>> +        size_t stack_size)
>> +{
>> +  struct clone_args clone_args =
>> +    {
>> +      .flags = flags,
>> +      .exit_signal = SIGCHLD,
>> +      .stack = (uintptr_t) stack,
>> +      .stack_size = stack_size,
>> +    };
>> +  return __clone_internal (&clone_args, spawni_child, args);
>> +}
>> +
>> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to optimize
>> +   memory and overcommit) and return TRUE if the helper was created or if the
>> +   failure was not due resource exhaustion.  */
>> +static bool
>> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
>> +          void *stack, size_t stack_size)
>> +{
>> +  /* The clone flags used will create a new child that will run in the same
>> +     memory space (CLONE_VM) and the execution of calling thread will be
>> +     suspend until the child calls execve or _exit.
>> +
>> +     Also since the calling thread execution will be suspend, there is not
>> +     need for CLONE_SETTLS.  Although parent and child share the same TLS
>> +     namespace, there will be no concurrent access for TLS variables (errno
>> +     for instance).  */
>> +  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, stack_size);
>> +
>> +  /* It needs to collect the case where the auxiliary process was created
>> +     but failed to execute the file (due either any preparation step or
>> +     for execve itself).  */
>> +  if (*new_pid > 0)
>> +    {
>> +      /* Also, it handles the unlikely case where the auxiliary process was
>> +     terminated before calling execve as if it was successfully.  The
>> +     args.err is set to 0 as default and changed to a positive value
>> +     only in case of failure, so in case of premature termination
>> +     due a signal args.err will remain zeroed and it will be up to
>> +     caller to actually collect it.  */
>> +      *ec = args->err;
>> +      if (*ec > 0)
>> +    /* There still an unlikely case where the child is cancelled after
>> +       setting args.err, due to a positive error value.  Also there is
>> +       possible pid reuse race (where the kernel allocated the same pid
>> +       to an unrelated process).  Unfortunately due synchronization
>> +       issues where the kernel might not have the process collected
>> +       the waitpid below can not use WNOHANG.  */
>> +    __waitpid (*new_pid, NULL, 0);
>> +    }
>> +  else
>> +    *ec = errno;
>> +
>> +  /* There is no much point in retrying with fork and exec if kernel returns a
>> +     failure due resource exhaustion.  */
>> +  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
>> +}
>> +
>> +/* Fallback spawn case which does not use CLONE_VM.  Any preparation step or
>> +   execve failure is passed with a pipe, which requires additional care by
>> +   the helper stating process since it additional file descriptors handle.  */
>> +static void
>> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
>> +         char *stack, size_t stack_size)
>> +{
>> +  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
>> +    {
>> +      *ec = errno;
>> +      return;
>> +    }
>> +
>> +  /* Do not trigger atfork handler nor any internal state reset since the
>> +     helper process will call execve.  */
>> +  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
>> +
>> +  __close (args->pipe[1]);
>> +
>> +  if (*new_pid > 0)
>> +    {
>> +      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
>> +    /* A successful execve will close the helper process pipe end.  */
>> +    *ec = 0;
>> +      else
>> +    __waitpid (*new_pid, NULL, 0);
>> +    }
>> +  else
>> +    *ec = errno;
>> +
>> +  __close (args->pipe[0]);
>> +}
>> +
>>  /* Spawn a new process executing PATH with the attributes describes in *ATTRP.
>>     Before running the process perform the actions described in FILE-ACTIONS. */
>>  static int
>> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file,
>>    args.argc = argc;
>>    args.envp = envp;
>>    args.xflags = xflags;
>> +  args.pipe[0] = args.pipe[1] = -1;
>>
>>    __libc_signal_block_all (&args.oldmask);
>>
>> -  /* The clone flags used will create a new child that will run in the same
>> -     memory space (CLONE_VM) and the execution of calling thread will be
>> -     suspend until the child calls execve or _exit.
>> -
>> -     Also since the calling thread execution will be suspend, there is not
>> -     need for CLONE_SETTLS.  Although parent and child share the same TLS
>> -     namespace, there will be no concurrent access for TLS variables (errno
>> -     for instance).  */
>> -  struct clone_args clone_args =
>> -    {
>> -      .flags = CLONE_VM | CLONE_VFORK,
>> -      .exit_signal = SIGCHLD,
>> -      .stack = (uintptr_t) stack,
>> -      .stack_size = stack_size,
>> -    };
>> -  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
>> -
>> -  /* It needs to collect the case where the auxiliary process was created
>> -     but failed to execute the file (due either any preparation step or
>> -     for execve itself).  */
>> -  if (new_pid > 0)
>> -    {
>> -      /* Also, it handles the unlikely case where the auxiliary process was
>> -     terminated before calling execve as if it was successfully.  The
>> -     args.err is set to 0 as default and changed to a positive value
>> -     only in case of failure, so in case of premature termination
>> -     due a signal args.err will remain zeroed and it will be up to
>> -     caller to actually collect it.  */
>> -      ec = args.err;
>> -      if (ec > 0)
>> -    /* There still an unlikely case where the child is cancelled after
>> -       setting args.err, due to a positive error value.  Also there is
>> -       possible pid reuse race (where the kernel allocated the same pid
>> -       to an unrelated process).  Unfortunately due synchronization
>> -       issues where the kernel might not have the process collected
>> -       the waitpid below can not use WNOHANG.  */
>> -    __waitpid (new_pid, NULL, 0);
>> -    }
>> -  else
>> -    ec = errno;
>> +  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace restriction
>> +     (for instance Linux does not allow processes in different time namespaces
>> +     to share address space) and in this case clone fails with EINVAL.  Retry
>> +     with fork and exec.  */
>> +  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
>> +    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
>>
>>    __munmap (stack, stack_size);
  
Alexey Izbyshev May 3, 2022, 10:49 p.m. UTC | #3
On 2022-05-03 20:50, Adhemerval Zanella wrote:
> On 03/05/2022 14:36, Alexey Izbyshev wrote:
>> On 2022-05-03 20:00, Adhemerval Zanella wrote:
>>> Even though it might characterizeis as a kernel bug (incomplete 
>>> kernel
>>> support for CLONE_VM used along with CLONE_VFORK), time namespace
>>> support
>>> is already presented in a some kernel releases (since 2020).
>>> 
>>> Although I agree with Carlos that it results in unexpected surprise
>>> behaviour, I also see that providing a reasonable fallback that does 
>>> not
>>> add possible issues like race conditions (as some syscall fallbacks 
>>> in
>>> the
>>> past) provides a easier transition and make the interfaca to work on
>>> multiple scenarios.
>>> 
>>> The fix itself does adds some corner cases that need to be handled, 
>>> but
>>> I still think it is maintanable and a general improvement.
>>> 
>>> --
>>> 
>>> Linux clone with CLONE_VM may fail for some namespace restriction 
>>> (for
>>> instance if kernel does not allow processes in different time 
>>> namespaces
>>> to share the sameaddress space).
>>> 
>>> In this case clone fails with EINVAL and posix_spawn can not spawn a 
>>> new
>>> process.  However the same process can be spawned with fork and exec.
>>> 
>>> The patch fixes by retrying the clone syscall with just CLONE_VFORK
>>> if clone fails with a non transient failure (ENOMEM and EAGAIN still
>>> returns failure to caller).
>>> 
>>> Error on preparation phase or execve is returned by a pipe now that
>>> there is no shared memory that ca be used.  It requires some extra 
>>> care
>>> for some file operations:
>>> 
>>>   * If the file operation would clobber the pipe file descriptor, the
>>>     helper process dup the pipe onto an unoccupied file descriptor.
>>> 
>>>   * dup2 file action that targets the pipe file descriptor returns
>>>     EBADF.
>>> 
>>>   * If closefrom argument overlaps the pipe file descriptor, it is
>>>     splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].
>>> 
>>> Failure on prepare phase in helper process does not trigger the fork
>>> and exec fallback.
>>> 
>>> Checked on x86_64-linux-gnu.
>>> ---
>>> v3: Fixed closerange fallback,.
>>> v2: Fixed closerange file actions, handle EPIPE, return all errors,
>>>     handle if pipe fd is used in file actions.
>>> ---
>>>  include/unistd.h                             |   4 +-
>>>  io/closefrom.c                               |   2 +-
>>>  sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +-
>>>  sysdeps/unix/sysv/linux/spawni.c             | 235 
>>> +++++++++++++++----
>>>  4 files changed, 196 insertions(+), 49 deletions(-)
>>> 
>>> diff --git a/include/unistd.h b/include/unistd.h
>>> index af795a37c8..2643991a09 100644
>>> --- a/include/unistd.h
>>> +++ b/include/unistd.h
>>> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int
>>> __lowfd, _Bool dirfd_fallback)
>>>    return false;
>>>  }
>>>  #  else
>>> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) 
>>> attribute_hidden;
>>> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int 
>>> __to,
>>> +                   _Bool)
>>> +     attribute_hidden;
>>>  #  endif
>>>  extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
>>>  libc_hidden_proto (__read)
>>> diff --git a/io/closefrom.c b/io/closefrom.c
>>> index 48cac2e3d1..a07de06e9b 100644
>>> --- a/io/closefrom.c
>>> +++ b/io/closefrom.c
>>> @@ -30,7 +30,7 @@ __closefrom (int lowfd)
>>>    if (r == 0)
>>>      return ;
>>> 
>>> -  if (!__closefrom_fallback (l, true))
>>> +  if (!__closefrom_fallback (l, ~0U, true))
>>>      __fortify_fail ("closefrom failed to close a file descriptor");
>>>  }
>>>  weak_alias (__closefrom, closefrom)
>>> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> b/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> index a9dd0c46b2..ad301e1910 100644
>>> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> @@ -28,7 +28,7 @@
>>>     /proc/self/fd open will trigger a fallback that tries to close a 
>>> file
>>>     descriptor before proceed.  */
>>>  _Bool
>>> -__closefrom_fallback (int from, _Bool dirfd_fallback)
>>> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool 
>>> dirfd_fallback)
>>>  {
>>>    int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | 
>>> O_DIRECTORY,
>>>                                 0);
>>> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool 
>>> dirfd_fallback)
>>> 
>>>            if (fd == dirfd || fd < from)
>>>              continue;
>>> +      if (fd > to)
>>> +        break;
>>> 
>> __closefrom_fallback() currently ignores "to" if it can't open 
>> /proc/self/fd and dirfd_fallback is true. This path is not used by 
>> posix_spawn(), but it's a potential future bug.
> 
> Indeed, I think iterate over [from, to] instead of [from, INT_MAX]
> should fix it.
> 
Yes, with some care for the fact that "to" is unsigned.
> 
>> 
>>>            /* We ignore close errors because EBADF, EINTR, and EIO 
>>> means the
>>>               descriptor has been released.  */
>>> diff --git a/sysdeps/unix/sysv/linux/spawni.c 
>>> b/sysdeps/unix/sysv/linux/spawni.c
>>> index d6f5ca89cd..0481791013 100644
>>> --- a/sysdeps/unix/sysv/linux/spawni.c
>>> +++ b/sysdeps/unix/sysv/linux/spawni.c
>>> @@ -44,7 +44,11 @@
>>>     third issue is done by a stack allocation in parent, and by using 
>>> a
>>>     field in struct spawn_args where the child can write an error
>>>     code. CLONE_VFORK ensures that the parent does not run until the
>>> -   child has either exec'ed successfully or exited.  */
>>> +   child has either exec'ed successfully or exited.
>>> +
>>> +   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel 
>>> limitation
>>> +   such as time namespace), only CLONE_VFORK is used instead and the
>>> +   preparation and execve failures are communicated with a pipe.  */
>>> 
>>> 
>>>  /* The Unix standard contains a long explanation of the way to 
>>> signal
>>> @@ -67,6 +71,7 @@ struct posix_spawn_args
>>>    char *const *envp;
>>>    int xflags;
>>>    int err;
>>> +  int pipe[2];
>>>  };
>>> 
>>>  /* Older version requires that shell script without shebang 
>>> definition
>>> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args 
>>> *args)
>>>      }
>>>  }
>>> 
>>> +/* If the file operation would clobber the pipe fd used to 
>>> communicate with
>>> +   parent, dup the pipe onto an unoccupied file descriptor.  */
>>> +static inline bool
>>> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
>>> +{
>>> +  int fd;
>>> +
>>> +  switch (fa->tag)
>>> +    {
>>> +    case spawn_do_close:
>>> +      fd = fa->action.close_action.fd;
>>> +      break;
>>> +    case spawn_do_open:
>>> +      fd = fa->action.open_action.fd;
>>> +      break;
>>> +    case spawn_do_dup2:
>>> +      fd = fa->action.dup2_action.newfd;
>>> +      break;
>>> +    case spawn_do_fchdir:
>>> +      fd = fa->action.fchdir_action.fd;
>>> +    default:
>>> +      return true;
>>> +    }
>>> +
>>> +  if (fd == p[1])
>>> +    {
>>> +      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
>>> +      if (r < 0)
>>> +    return false;
>>> +      __close_nocancel (p[1]);
>>> +      p[1] = r;
>>> +    }
>>> +
>>> +  return true;
>>> +}
>>> +
>>> +static inline bool
>>> +spawni_fa_closerange (int from, int to)
>>> +{
>>> +#if 0
>>> +  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
>>> +  return r == 0 || __closefrom_fallback (from, to, false);
>>> +#else
>>> +  return __closefrom_fallback (from, to, false);
>>> +#endif
>>> +}
>>> +
>>>  /* Function used in the clone call to setup the signals mask, 
>>> posix_spawn
>>>     attributes, and file actions.  It run on its own stack (provided 
>>> by the
>>>     posix_spawn call).  */
>>> -static int
>>> -__spawni_child (void *arguments)
>>> +static _Noreturn int
>>> +spawni_child (void *arguments)
>>>  {
>>>    struct posix_spawn_args *args = arguments;
>>>    const posix_spawnattr_t *restrict attr = args->attr;
>>>    const posix_spawn_file_actions_t *file_actions = args->fa;
>>> +  bool use_pipe = args->pipe[0] != -1;
>>> 
>>>    /* The child must ensure that no signal handler are enabled 
>>> because it shared
>>>       memory with parent, so the signal disposition must be either 
>>> SIG_DFL or
>>> @@ -113,6 +166,9 @@ __spawni_child (void *arguments)
>>>    struct sigaction sa;
>>>    memset (&sa, '\0', sizeof (sa));
>>> 
>>> +  if (use_pipe)
>>> +    __close (args->pipe[0]);
>>> +
>>>    sigset_t hset;
>>>    __sigprocmask (SIG_BLOCK, 0, &hset);
>>>    for (int sig = 1; sig < _NSIG; ++sig)
>>> @@ -181,6 +237,9 @@ __spawni_child (void *arguments)
>>>      {
>>>        struct __spawn_action *action = &file_actions->__actions[cnt];
>>> 
>>> +      if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
>>> +        goto fail;
>>> +
>>>        switch (action->tag)
>>>          {
>>>          case spawn_do_close:
>>> @@ -233,6 +292,11 @@ __spawni_child (void *arguments)
>>>            break;
>>> 
>>>          case spawn_do_dup2:
>>> +          if (use_pipe && action->action.dup2_action.fd == 
>>> args->pipe[1])
>>> +        {
>>> +          errno = EBADF;
>>> +          goto fail;
>>> +        }
>>>            /* Austin Group issue #411 requires adddup2 action with 
>>> source
>>>           and destination being equal to remove close-on-exec flag.  
>>> */
>>>            if (action->action.dup2_action.fd
>>> @@ -264,8 +328,18 @@ __spawni_child (void *arguments)
>>>          case spawn_do_closefrom:
>>>            {
>>>          int lowfd = action->action.closefrom_action.from;
>>> -            int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 
>>> 0);
>>> -        if (r != 0 && !__closefrom_fallback (lowfd, false))
>>> +        /* Skip the pipe descriptor if it is used.  No need to 
>>> handle
>>> +           it since it is created with O_CLOEXEC.  */
>>> +        if (use_pipe && args->pipe[1] >= lowfd)
>>> +          {
>>> +            if (args->pipe[1] == lowfd
>>> +             && !spawni_fa_closerange (lowfd + 1, ~0U))
>>> +              goto fail;
>>> +            else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 
>>> 1)
>>> +                 || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))
>> 
>> The "else" branch will also run if "args->pipe[1] == lowfd" and 
>> spawni_fa_closerange (lowfd + 1, ~0U) succeeds.
> 
> I was trying to be too clever here, I think the following is more 
> clear:
> 
>                 if (use_pipe && args->pipe[1] == lowfd)
>                   {
>                     if (!spawni_fa_closerange (lowfd + 1, ~0U))
>                       goto fail;
>                   }
>                 else if (use_pipe && args->pipe[1] > lowfd)
>                   {
>                     if (!spawni_fa_closerange (lowfd, args->pipe[1] - 
> 1)
>                         || !spawni_fa_closerange (args->pipe[1] + 1, 
> ~0U))
>                       goto fail;
>                   }
>                 else if (!spawni_fa_closerange (lowfd, ~0U))
>                   goto fail;
> 
Yeah, it should work. In a (theoretical?) case when args->pipe[1] == 
INT_MAX it might also make sense to avoid signed overflow (UB) on 
additions (e.g. by using "lowfd + 1u").

>>>            } break;
>>> 
>>> @@ -300,10 +374,112 @@ fail:
>>>       (EINTERNALBUG) describing that, use ECHILD.  Another option 
>>> would
>>>       be to set args->err to some negative sentinel and have the 
>>> parent
>>>       abort(), but that seems needlessly harsh.  */
>>> -  args->err = errno ? : ECHILD;
>>> +  int ret = errno ? : ECHILD;
>>> +  if (use_pipe)
>>> +    {
>>> +      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 
>>> 0)
>>> +    if (errno == EPIPE || errno == EBADF)
>>> +      break;
>>> +    }
>>> +  else
>>> +    args->err = ret;
>>> +
>>>    _exit (SPAWN_ERROR);
>>>  }
>>> 
>>> +static pid_t
>>> +clone_call (struct posix_spawn_args *args, int flags, void *stack,
>>> +        size_t stack_size)
>>> +{
>>> +  struct clone_args clone_args =
>>> +    {
>>> +      .flags = flags,
>>> +      .exit_signal = SIGCHLD,
>>> +      .stack = (uintptr_t) stack,
>>> +      .stack_size = stack_size,
>>> +    };
>>> +  return __clone_internal (&clone_args, spawni_child, args);
>>> +}
>>> +
>>> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to 
>>> optimize
>>> +   memory and overcommit) and return TRUE if the helper was created 
>>> or if the
>>> +   failure was not due resource exhaustion.  */
>>> +static bool
>>> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int 
>>> *ec,
>>> +          void *stack, size_t stack_size)
>>> +{
>>> +  /* The clone flags used will create a new child that will run in 
>>> the same
>>> +     memory space (CLONE_VM) and the execution of calling thread 
>>> will be
>>> +     suspend until the child calls execve or _exit.
>>> +
>>> +     Also since the calling thread execution will be suspend, there 
>>> is not
>>> +     need for CLONE_SETTLS.  Although parent and child share the 
>>> same TLS
>>> +     namespace, there will be no concurrent access for TLS variables 
>>> (errno
>>> +     for instance).  */
>>> +  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, 
>>> stack_size);
>>> +
>>> +  /* It needs to collect the case where the auxiliary process was 
>>> created
>>> +     but failed to execute the file (due either any preparation step 
>>> or
>>> +     for execve itself).  */
>>> +  if (*new_pid > 0)
>>> +    {
>>> +      /* Also, it handles the unlikely case where the auxiliary 
>>> process was
>>> +     terminated before calling execve as if it was successfully.  
>>> The
>>> +     args.err is set to 0 as default and changed to a positive value
>>> +     only in case of failure, so in case of premature termination
>>> +     due a signal args.err will remain zeroed and it will be up to
>>> +     caller to actually collect it.  */
>>> +      *ec = args->err;
>>> +      if (*ec > 0)
>>> +    /* There still an unlikely case where the child is cancelled 
>>> after
>>> +       setting args.err, due to a positive error value.  Also there 
>>> is
>>> +       possible pid reuse race (where the kernel allocated the same 
>>> pid
>>> +       to an unrelated process).  Unfortunately due synchronization
>>> +       issues where the kernel might not have the process collected
>>> +       the waitpid below can not use WNOHANG.  */
>>> +    __waitpid (*new_pid, NULL, 0);
>>> +    }
>>> +  else
>>> +    *ec = errno;
>>> +
>>> +  /* There is no much point in retrying with fork and exec if kernel 
>>> returns a
>>> +     failure due resource exhaustion.  */
>>> +  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
>>> +}
>>> +
>>> +/* Fallback spawn case which does not use CLONE_VM.  Any preparation 
>>> step or
>>> +   execve failure is passed with a pipe, which requires additional 
>>> care by
>>> +   the helper stating process since it additional file descriptors 
>>> handle.  */
>>> +static void
>>> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
>>> +         char *stack, size_t stack_size)
>>> +{
>>> +  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
>>> +    {
>>> +      *ec = errno;
>>> +      return;
>>> +    }
>>> +
>>> +  /* Do not trigger atfork handler nor any internal state reset 
>>> since the
>>> +     helper process will call execve.  */
>>> +  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
>>> +
>>> +  __close (args->pipe[1]);
>>> +
>>> +  if (*new_pid > 0)
>>> +    {
>>> +      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
>>> +    /* A successful execve will close the helper process pipe end.  
>>> */
>>> +    *ec = 0;
>>> +      else
>>> +    __waitpid (*new_pid, NULL, 0);
>>> +    }
>>> +  else
>>> +    *ec = errno;
>>> +
>>> +  __close (args->pipe[0]);
>>> +}
>>> +
>>>  /* Spawn a new process executing PATH with the attributes describes 
>>> in *ATTRP.
>>>     Before running the process perform the actions described in 
>>> FILE-ACTIONS. */
>>>  static int
>>> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file,
>>>    args.argc = argc;
>>>    args.envp = envp;
>>>    args.xflags = xflags;
>>> +  args.pipe[0] = args.pipe[1] = -1;
>>> 
>>>    __libc_signal_block_all (&args.oldmask);
>>> 
>>> -  /* The clone flags used will create a new child that will run in 
>>> the same
>>> -     memory space (CLONE_VM) and the execution of calling thread 
>>> will be
>>> -     suspend until the child calls execve or _exit.
>>> -
>>> -     Also since the calling thread execution will be suspend, there 
>>> is not
>>> -     need for CLONE_SETTLS.  Although parent and child share the 
>>> same TLS
>>> -     namespace, there will be no concurrent access for TLS variables 
>>> (errno
>>> -     for instance).  */
>>> -  struct clone_args clone_args =
>>> -    {
>>> -      .flags = CLONE_VM | CLONE_VFORK,
>>> -      .exit_signal = SIGCHLD,
>>> -      .stack = (uintptr_t) stack,
>>> -      .stack_size = stack_size,
>>> -    };
>>> -  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
>>> -
>>> -  /* It needs to collect the case where the auxiliary process was 
>>> created
>>> -     but failed to execute the file (due either any preparation step 
>>> or
>>> -     for execve itself).  */
>>> -  if (new_pid > 0)
>>> -    {
>>> -      /* Also, it handles the unlikely case where the auxiliary 
>>> process was
>>> -     terminated before calling execve as if it was successfully.  
>>> The
>>> -     args.err is set to 0 as default and changed to a positive value
>>> -     only in case of failure, so in case of premature termination
>>> -     due a signal args.err will remain zeroed and it will be up to
>>> -     caller to actually collect it.  */
>>> -      ec = args.err;
>>> -      if (ec > 0)
>>> -    /* There still an unlikely case where the child is cancelled 
>>> after
>>> -       setting args.err, due to a positive error value.  Also there 
>>> is
>>> -       possible pid reuse race (where the kernel allocated the same 
>>> pid
>>> -       to an unrelated process).  Unfortunately due synchronization
>>> -       issues where the kernel might not have the process collected
>>> -       the waitpid below can not use WNOHANG.  */
>>> -    __waitpid (new_pid, NULL, 0);
>>> -    }
>>> -  else
>>> -    ec = errno;
>>> +  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace 
>>> restriction
>>> +     (for instance Linux does not allow processes in different time 
>>> namespaces
>>> +     to share address space) and in this case clone fails with 
>>> EINVAL.  Retry
>>> +     with fork and exec.  */
>>> +  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
>>> +    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
>>> 
>>>    __munmap (stack, stack_size);
  

Patch

diff --git a/include/unistd.h b/include/unistd.h
index af795a37c8..2643991a09 100644
--- a/include/unistd.h
+++ b/include/unistd.h
@@ -167,7 +167,9 @@  static inline _Bool __closefrom_fallback (int __lowfd, _Bool dirfd_fallback)
   return false;
 }
 #  else
-extern _Bool __closefrom_fallback (int __lowfd, _Bool) attribute_hidden;
+extern _Bool __closefrom_fallback (unsigned int __from, unsigned int __to,
+				   _Bool)
+     attribute_hidden;
 #  endif
 extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
 libc_hidden_proto (__read)
diff --git a/io/closefrom.c b/io/closefrom.c
index 48cac2e3d1..a07de06e9b 100644
--- a/io/closefrom.c
+++ b/io/closefrom.c
@@ -30,7 +30,7 @@  __closefrom (int lowfd)
   if (r == 0)
     return ;
 
-  if (!__closefrom_fallback (l, true))
+  if (!__closefrom_fallback (l, ~0U, true))
     __fortify_fail ("closefrom failed to close a file descriptor");
 }
 weak_alias (__closefrom, closefrom)
diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c b/sysdeps/unix/sysv/linux/closefrom_fallback.c
index a9dd0c46b2..ad301e1910 100644
--- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
+++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
@@ -28,7 +28,7 @@ 
    /proc/self/fd open will trigger a fallback that tries to close a file
    descriptor before proceed.  */
 _Bool
-__closefrom_fallback (int from, _Bool dirfd_fallback)
+__closefrom_fallback (unsigned int from, unsigned int to, _Bool dirfd_fallback)
 {
   int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | O_DIRECTORY,
                                0);
@@ -84,6 +84,8 @@  __closefrom_fallback (int from, _Bool dirfd_fallback)
 
           if (fd == dirfd || fd < from)
             continue;
+	  if (fd > to)
+	    break;
 
           /* We ignore close errors because EBADF, EINTR, and EIO means the
              descriptor has been released.  */
diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index d6f5ca89cd..0481791013 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -44,7 +44,11 @@ 
    third issue is done by a stack allocation in parent, and by using a
    field in struct spawn_args where the child can write an error
    code. CLONE_VFORK ensures that the parent does not run until the
-   child has either exec'ed successfully or exited.  */
+   child has either exec'ed successfully or exited.
+
+   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel limitation
+   such as time namespace), only CLONE_VFORK is used instead and the
+   preparation and execve failures are communicated with a pipe.  */
 
 
 /* The Unix standard contains a long explanation of the way to signal
@@ -67,6 +71,7 @@  struct posix_spawn_args
   char *const *envp;
   int xflags;
   int err;
+  int pipe[2];
 };
 
 /* Older version requires that shell script without shebang definition
@@ -94,15 +99,63 @@  maybe_script_execute (struct posix_spawn_args *args)
     }
 }
 
+/* If the file operation would clobber the pipe fd used to communicate with
+   parent, dup the pipe onto an unoccupied file descriptor.  */
+static inline bool
+spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
+{
+  int fd;
+
+  switch (fa->tag)
+    {
+    case spawn_do_close:
+      fd = fa->action.close_action.fd;
+      break;
+    case spawn_do_open:
+      fd = fa->action.open_action.fd;
+      break;
+    case spawn_do_dup2:
+      fd = fa->action.dup2_action.newfd;
+      break;
+    case spawn_do_fchdir:
+      fd = fa->action.fchdir_action.fd;
+    default:
+      return true;
+    }
+
+  if (fd == p[1])
+    {
+      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
+      if (r < 0)
+	return false;
+      __close_nocancel (p[1]);
+      p[1] = r;
+    }
+
+  return true;
+}
+
+static inline bool
+spawni_fa_closerange (int from, int to)
+{
+#if 0
+  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
+  return r == 0 || __closefrom_fallback (from, to, false);
+#else
+  return __closefrom_fallback (from, to, false);
+#endif
+}
+
 /* Function used in the clone call to setup the signals mask, posix_spawn
    attributes, and file actions.  It run on its own stack (provided by the
    posix_spawn call).  */
-static int
-__spawni_child (void *arguments)
+static _Noreturn int
+spawni_child (void *arguments)
 {
   struct posix_spawn_args *args = arguments;
   const posix_spawnattr_t *restrict attr = args->attr;
   const posix_spawn_file_actions_t *file_actions = args->fa;
+  bool use_pipe = args->pipe[0] != -1;
 
   /* The child must ensure that no signal handler are enabled because it shared
      memory with parent, so the signal disposition must be either SIG_DFL or
@@ -113,6 +166,9 @@  __spawni_child (void *arguments)
   struct sigaction sa;
   memset (&sa, '\0', sizeof (sa));
 
+  if (use_pipe)
+    __close (args->pipe[0]);
+
   sigset_t hset;
   __sigprocmask (SIG_BLOCK, 0, &hset);
   for (int sig = 1; sig < _NSIG; ++sig)
@@ -181,6 +237,9 @@  __spawni_child (void *arguments)
 	{
 	  struct __spawn_action *action = &file_actions->__actions[cnt];
 
+	  if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
+	    goto fail;
+
 	  switch (action->tag)
 	    {
 	    case spawn_do_close:
@@ -233,6 +292,11 @@  __spawni_child (void *arguments)
 	      break;
 
 	    case spawn_do_dup2:
+	      if (use_pipe && action->action.dup2_action.fd == args->pipe[1])
+		{
+		  errno = EBADF;
+		  goto fail;
+		}
 	      /* Austin Group issue #411 requires adddup2 action with source
 		 and destination being equal to remove close-on-exec flag.  */
 	      if (action->action.dup2_action.fd
@@ -264,8 +328,18 @@  __spawni_child (void *arguments)
 	    case spawn_do_closefrom:
 	      {
 		int lowfd = action->action.closefrom_action.from;
-	        int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0);
-		if (r != 0 && !__closefrom_fallback (lowfd, false))
+		/* Skip the pipe descriptor if it is used.  No need to handle
+		   it since it is created with O_CLOEXEC.  */
+		if (use_pipe && args->pipe[1] >= lowfd)
+		  {
+		    if (args->pipe[1] == lowfd
+			 && !spawni_fa_closerange (lowfd + 1, ~0U))
+		      goto fail;
+		    else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
+			     || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))
+		      goto fail;
+		  }
+		else if (!spawni_fa_closerange (lowfd, ~0U))
 		  goto fail;
 	      } break;
 
@@ -300,10 +374,112 @@  fail:
      (EINTERNALBUG) describing that, use ECHILD.  Another option would
      be to set args->err to some negative sentinel and have the parent
      abort(), but that seems needlessly harsh.  */
-  args->err = errno ? : ECHILD;
+  int ret = errno ? : ECHILD;
+  if (use_pipe)
+    {
+      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0)
+	if (errno == EPIPE || errno == EBADF)
+	  break;
+    }
+  else
+    args->err = ret;
+
   _exit (SPAWN_ERROR);
 }
 
+static pid_t
+clone_call (struct posix_spawn_args *args, int flags, void *stack,
+	    size_t stack_size)
+{
+  struct clone_args clone_args =
+    {
+      .flags = flags,
+      .exit_signal = SIGCHLD,
+      .stack = (uintptr_t) stack,
+      .stack_size = stack_size,
+    };
+  return __clone_internal (&clone_args, spawni_child, args);
+}
+
+/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to optimize
+   memory and overcommit) and return TRUE if the helper was created or if the
+   failure was not due resource exhaustion.  */
+static bool
+spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
+	      void *stack, size_t stack_size)
+{
+  /* The clone flags used will create a new child that will run in the same
+     memory space (CLONE_VM) and the execution of calling thread will be
+     suspend until the child calls execve or _exit.
+
+     Also since the calling thread execution will be suspend, there is not
+     need for CLONE_SETTLS.  Although parent and child share the same TLS
+     namespace, there will be no concurrent access for TLS variables (errno
+     for instance).  */
+  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, stack_size);
+
+  /* It needs to collect the case where the auxiliary process was created
+     but failed to execute the file (due either any preparation step or
+     for execve itself).  */
+  if (*new_pid > 0)
+    {
+      /* Also, it handles the unlikely case where the auxiliary process was
+	 terminated before calling execve as if it was successfully.  The
+	 args.err is set to 0 as default and changed to a positive value
+	 only in case of failure, so in case of premature termination
+	 due a signal args.err will remain zeroed and it will be up to
+	 caller to actually collect it.  */
+      *ec = args->err;
+      if (*ec > 0)
+	/* There still an unlikely case where the child is cancelled after
+	   setting args.err, due to a positive error value.  Also there is
+	   possible pid reuse race (where the kernel allocated the same pid
+	   to an unrelated process).  Unfortunately due synchronization
+	   issues where the kernel might not have the process collected
+	   the waitpid below can not use WNOHANG.  */
+	__waitpid (*new_pid, NULL, 0);
+    }
+  else
+    *ec = errno;
+
+  /* There is no much point in retrying with fork and exec if kernel returns a
+     failure due resource exhaustion.  */
+  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
+}
+
+/* Fallback spawn case which does not use CLONE_VM.  Any preparation step or
+   execve failure is passed with a pipe, which requires additional care by
+   the helper stating process since it additional file descriptors handle.  */
+static void
+spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
+	     char *stack, size_t stack_size)
+{
+  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
+    {
+      *ec = errno;
+      return;
+    }
+
+  /* Do not trigger atfork handler nor any internal state reset since the
+     helper process will call execve.  */
+  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
+
+  __close (args->pipe[1]);
+
+  if (*new_pid > 0)
+    {
+      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
+	/* A successful execve will close the helper process pipe end.  */
+	*ec = 0;
+      else
+	__waitpid (*new_pid, NULL, 0);
+    }
+  else
+    *ec = errno;
+
+  __close (args->pipe[0]);
+}
+
 /* Spawn a new process executing PATH with the attributes describes in *ATTRP.
    Before running the process perform the actions described in FILE-ACTIONS. */
 static int
@@ -367,49 +543,16 @@  __spawnix (pid_t * pid, const char *file,
   args.argc = argc;
   args.envp = envp;
   args.xflags = xflags;
+  args.pipe[0] = args.pipe[1] = -1;
 
   __libc_signal_block_all (&args.oldmask);
 
-  /* The clone flags used will create a new child that will run in the same
-     memory space (CLONE_VM) and the execution of calling thread will be
-     suspend until the child calls execve or _exit.
-
-     Also since the calling thread execution will be suspend, there is not
-     need for CLONE_SETTLS.  Although parent and child share the same TLS
-     namespace, there will be no concurrent access for TLS variables (errno
-     for instance).  */
-  struct clone_args clone_args =
-    {
-      .flags = CLONE_VM | CLONE_VFORK,
-      .exit_signal = SIGCHLD,
-      .stack = (uintptr_t) stack,
-      .stack_size = stack_size,
-    };
-  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
-
-  /* It needs to collect the case where the auxiliary process was created
-     but failed to execute the file (due either any preparation step or
-     for execve itself).  */
-  if (new_pid > 0)
-    {
-      /* Also, it handles the unlikely case where the auxiliary process was
-	 terminated before calling execve as if it was successfully.  The
-	 args.err is set to 0 as default and changed to a positive value
-	 only in case of failure, so in case of premature termination
-	 due a signal args.err will remain zeroed and it will be up to
-	 caller to actually collect it.  */
-      ec = args.err;
-      if (ec > 0)
-	/* There still an unlikely case where the child is cancelled after
-	   setting args.err, due to a positive error value.  Also there is
-	   possible pid reuse race (where the kernel allocated the same pid
-	   to an unrelated process).  Unfortunately due synchronization
-	   issues where the kernel might not have the process collected
-	   the waitpid below can not use WNOHANG.  */
-	__waitpid (new_pid, NULL, 0);
-    }
-  else
-    ec = errno;
+  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace restriction
+     (for instance Linux does not allow processes in different time namespaces
+     to share address space) and in this case clone fails with EINVAL.  Retry
+     with fork and exec.  */
+  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
+    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
 
   __munmap (stack, stack_size);