[v4,0/3] Linux: Fix posix_spawn when user with time namespaces

Message ID 20220510191155.1998575-1-adhemerval.zanella@linaro.org
Headers
Series Linux: Fix posix_spawn when user with time namespaces |

Message

Adhemerval Zanella Netto May 10, 2022, 7:11 p.m. UTC
  The patchset adds some support to tests the fallback code to
use only use CLONE_VFORK.  It uses unshare directly because
it simpler than add container support.

Adhemerval Zanella (3):
  linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
  support: Add support_enter_time_namespace
  linux: Add fallback for clone failure on posix_spawn (BZ #29115)

 include/unistd.h                             |   4 +-
 io/closefrom.c                               |   2 +-
 posix/Makefile                               |  16 +-
 posix/tst-spawn-chdir-timens.c               |   2 +
 posix/tst-spawn-chdir.c                      |   8 +
 posix/tst-spawn-timens.c                     |   2 +
 posix/tst-spawn.c                            |   6 +
 posix/tst-spawn2-timens.c                    |   2 +
 posix/tst-spawn2.c                           |   8 +
 posix/tst-spawn4-timens.c                    |   2 +
 posix/tst-spawn4.c                           |  10 +-
 posix/tst-spawn5-timens.c                    |   2 +
 posix/tst-spawn5.c                           |  10 +-
 posix/tst-spawn6-timens.c                    |   2 +
 posix/tst-spawn6.c                           |  12 +-
 support/Makefile                             |   1 +
 support/namespace.h                          |   5 +
 support/support_enter_time_namespace.c       |  34 +++
 sysdeps/unix/sysv/linux/bits/sched.h         |   4 +
 sysdeps/unix/sysv/linux/closefrom_fallback.c |   6 +-
 sysdeps/unix/sysv/linux/spawni.c             | 233 +++++++++++++++----
 21 files changed, 315 insertions(+), 56 deletions(-)
 create mode 100644 posix/tst-spawn-chdir-timens.c
 create mode 100644 posix/tst-spawn-timens.c
 create mode 100644 posix/tst-spawn2-timens.c
 create mode 100644 posix/tst-spawn4-timens.c
 create mode 100644 posix/tst-spawn5-timens.c
 create mode 100644 posix/tst-spawn6-timens.c
 create mode 100644 support/support_enter_time_namespace.c
  

Comments

Florian Weimer May 10, 2022, 7:18 p.m. UTC | #1
* Adhemerval Zanella:

> The patchset adds some support to tests the fallback code to
> use only use CLONE_VFORK.  It uses unshare directly because
> it simpler than add container support.
>
> Adhemerval Zanella (3):
>   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
>   support: Add support_enter_time_namespace
>   linux: Add fallback for clone failure on posix_spawn (BZ #29115)

Christan, how likely is it that we'd get another time namespace variant
that would only become effective after execve (when the DSO is remapped
anyway)?

This is a lot of complexity in posix_spawn implementations for what is a
rather limited use case.

Thanks,
Florian
  
Christian Brauner May 11, 2022, 9:21 a.m. UTC | #2
On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
> > The patchset adds some support to tests the fallback code to
> > use only use CLONE_VFORK.  It uses unshare directly because
> > it simpler than add container support.
> >
> > Adhemerval Zanella (3):
> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> >   support: Add support_enter_time_namespace
> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> 
> Christan, how likely is it that we'd get another time namespace variant
> that would only become effective after execve (when the DSO is remapped
> anyway)?

Not unlikely if it helps you avoid a lot of complexity. I will need some
time to track down Andrei and others to discuss though.

Thanks,
Christian
  
Alexey Izbyshev May 11, 2022, 4:01 p.m. UTC | #3
On 2022-05-11 12:21, Christian Brauner wrote:
> On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>> > The patchset adds some support to tests the fallback code to
>> > use only use CLONE_VFORK.  It uses unshare directly because
>> > it simpler than add container support.
>> >
>> > Adhemerval Zanella (3):
>> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
>> >   support: Add support_enter_time_namespace
>> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
>> 
>> Christan, how likely is it that we'd get another time namespace 
>> variant
>> that would only become effective after execve (when the DSO is 
>> remapped
>> anyway)?
> 
> Not unlikely if it helps you avoid a lot of complexity. I will need 
> some
> time to track down Andrei and others to discuss though.
> 
A more general question is whether the kernel aims to maintain the 
property that vfork() can always be used in place of fork() in the 
future. If it doesn't, code that currently relies on vfork() (even 
unwittingly via high-level APIs like posix_spawn()) will still have to 
grow workarounds.

Thanks,
Alexey
  
Florian Weimer May 25, 2022, 12:24 p.m. UTC | #4
* Christian Brauner:

> On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>> > The patchset adds some support to tests the fallback code to
>> > use only use CLONE_VFORK.  It uses unshare directly because
>> > it simpler than add container support.
>> >
>> > Adhemerval Zanella (3):
>> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
>> >   support: Add support_enter_time_namespace
>> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
>> 
>> Christan, how likely is it that we'd get another time namespace variant
>> that would only become effective after execve (when the DSO is remapped
>> anyway)?
>
> Not unlikely if it helps you avoid a lot of complexity. I will need some
> time to track down Andrei and others to discuss though.

Any progress with that?  (I hope I guessed the right Andrei.)

Breaking vfork is really a bit of a hassle for us, and the workaround
code is quite non-trivial and will have to implemented across many
projects (not just glibc).  An unshare request that takes effect on
execve only would really help.

Thanks,
Florian
  
Andrei Vagin May 27, 2022, 3:53 p.m. UTC | #5
On Wed, May 25, 2022 at 5:24 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Christian Brauner:
>
> > On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> >> * Adhemerval Zanella:
> >>
> >> > The patchset adds some support to tests the fallback code to
> >> > use only use CLONE_VFORK.  It uses unshare directly because
> >> > it simpler than add container support.
> >> >
> >> > Adhemerval Zanella (3):
> >> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> >> >   support: Add support_enter_time_namespace
> >> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> >>
> >> Christan, how likely is it that we'd get another time namespace variant
> >> that would only become effective after execve (when the DSO is remapped
> >> anyway)?
> >
> > Not unlikely if it helps you avoid a lot of complexity. I will need some
> > time to track down Andrei and others to discuss though.
>
> Any progress with that?  (I hope I guessed the right Andrei.)

I think this is the right me. Have I missed something?

>
> Breaking vfork is really a bit of a hassle for us, and the workaround
> code is quite non-trivial and will have to implemented across many
> projects (not just glibc).  An unshare request that takes effect on
> execve only would really help.

Is the problem that vfork fails if a process has half-entered a time namespace?

>
> Thanks,
> Florian
>
  
Andrei Vagin May 28, 2022, 12:03 a.m. UTC | #6
On Fri, May 27, 2022 at 08:53:54AM -0700, Andrei Vagin wrote:
> On Wed, May 25, 2022 at 5:24 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * Christian Brauner:
> >
> > > On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> > >> * Adhemerval Zanella:
> > >>
> > >> > The patchset adds some support to tests the fallback code to
> > >> > use only use CLONE_VFORK.  It uses unshare directly because
> > >> > it simpler than add container support.
> > >> >
> > >> > Adhemerval Zanella (3):
> > >> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> > >> >   support: Add support_enter_time_namespace
> > >> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> > >>
> > >> Christan, how likely is it that we'd get another time namespace variant
> > >> that would only become effective after execve (when the DSO is remapped
> > >> anyway)?
> > >
> > > Not unlikely if it helps you avoid a lot of complexity. I will need some
> > > time to track down Andrei and others to discuss though.
> >
> > Any progress with that?  (I hope I guessed the right Andrei.)
> 
> I think this is the right me. Have I missed something?
> 
> >
> > Breaking vfork is really a bit of a hassle for us, and the workaround
> > code is quite non-trivial and will have to implemented across many
> > projects (not just glibc).  An unshare request that takes effect on
> > execve only would really help.
> 
> Is the problem that vfork fails if a process has half-entered a time namespace?

I like the idea of entering a time namespace on exec and don't fail
vfork.  The only worry is that the behavior becomes more complicated and
less obvious.

Here is the untested patch:

diff --git a/fs/exec.c b/fs/exec.c
index 14b4b3755580..96f650ec7533 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
 #include <linux/io_uring.h>
 #include <linux/syscall_user_dispatch.h>
 #include <linux/coredump.h>
+#include <linux/time_namespace.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1266,6 +1267,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	timens_on_fork(me->nsproxy, me);
 	/*
 	 * Cancel any io_uring activity across execve
 	 */
diff --git a/kernel/fork.c b/kernel/fork.c
index 124829ed0163..653b80524a54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2030,15 +2030,6 @@ static __latent_entropy struct task_struct
*copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
-	/*
-	 * If the new process will be in a different time namespace
-	 * do not allow it to share VM or a thread group with the
	 forking task.
-	 */
-	if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
-		if (nsp->time_ns != nsp->time_ns_for_children)
-			return ERR_PTR(-EINVAL);
-	}
-
 	if (clone_flags & CLONE_PIDFD) {
 		/*
 		 * - CLONE_DETACHED is blocked so that we can
 		 * potentially
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index eec72ca962e2..b4cbb406bc28 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -179,7 +179,8 @@ int copy_namespaces(unsigned long flags, struct
task_struct *tsk)
 	if (IS_ERR(new_ns))
 		return  PTR_ERR(new_ns);
 
-	timens_on_fork(new_ns, tsk);
+	if ((flags & CLONE_VM) == 0)
+		timens_on_fork(new_ns, tsk);
 
 	tsk->nsproxy = new_ns;
 	return 0;

If this is what we want, I can prepare a final patch. But I will be on
vacation next week, so it will happen after it.

Thanks,
Andrei
  
Christian Brauner May 29, 2022, 2:33 p.m. UTC | #7
On Fri, May 27, 2022 at 05:03:14PM -0700, Andrei Vagin wrote:
> On Fri, May 27, 2022 at 08:53:54AM -0700, Andrei Vagin wrote:
> > On Wed, May 25, 2022 at 5:24 AM Florian Weimer <fweimer@redhat.com> wrote:
> > >
> > > * Christian Brauner:
> > >
> > > > On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> > > >> * Adhemerval Zanella:
> > > >>
> > > >> > The patchset adds some support to tests the fallback code to
> > > >> > use only use CLONE_VFORK.  It uses unshare directly because
> > > >> > it simpler than add container support.
> > > >> >
> > > >> > Adhemerval Zanella (3):
> > > >> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> > > >> >   support: Add support_enter_time_namespace
> > > >> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> > > >>
> > > >> Christan, how likely is it that we'd get another time namespace variant
> > > >> that would only become effective after execve (when the DSO is remapped
> > > >> anyway)?
> > > >
> > > > Not unlikely if it helps you avoid a lot of complexity. I will need some
> > > > time to track down Andrei and others to discuss though.
> > >
> > > Any progress with that?  (I hope I guessed the right Andrei.)
> > 
> > I think this is the right me. Have I missed something?
> > 
> > >
> > > Breaking vfork is really a bit of a hassle for us, and the workaround
> > > code is quite non-trivial and will have to implemented across many
> > > projects (not just glibc).  An unshare request that takes effect on
> > > execve only would really help.
> > 
> > Is the problem that vfork fails if a process has half-entered a time namespace?
> 
> I like the idea of entering a time namespace on exec and don't fail
> vfork.  The only worry is that the behavior becomes more complicated and
> less obvious.
> 
> Here is the untested patch:
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 14b4b3755580..96f650ec7533 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
>  #include <linux/io_uring.h>
>  #include <linux/syscall_user_dispatch.h>
>  #include <linux/coredump.h>
> +#include <linux/time_namespace.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1266,6 +1267,7 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> +	timens_on_fork(me->nsproxy, me);
>  	/*
>  	 * Cancel any io_uring activity across execve
>  	 */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 124829ed0163..653b80524a54 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2030,15 +2030,6 @@ static __latent_entropy struct task_struct
> *copy_process(
>  			return ERR_PTR(-EINVAL);
>  	}
>  
> -	/*
> -	 * If the new process will be in a different time namespace
> -	 * do not allow it to share VM or a thread group with the
> 	 forking task.
> -	 */
> -	if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
> -		if (nsp->time_ns != nsp->time_ns_for_children)
> -			return ERR_PTR(-EINVAL);
> -	}
> -
>  	if (clone_flags & CLONE_PIDFD) {
>  		/*
>  		 * - CLONE_DETACHED is blocked so that we can
>  		 * potentially
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index eec72ca962e2..b4cbb406bc28 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -179,7 +179,8 @@ int copy_namespaces(unsigned long flags, struct
> task_struct *tsk)
>  	if (IS_ERR(new_ns))
>  		return  PTR_ERR(new_ns);
>  
> -	timens_on_fork(new_ns, tsk);
> +	if ((flags & CLONE_VM) == 0)
> +		timens_on_fork(new_ns, tsk);
>  
>  	tsk->nsproxy = new_ns;
>  	return 0;
> 
> If this is what we want, I can prepare a final patch. But I will be on
> vacation next week, so it will happen after it.

No problem. It's the merge window anyway and it's May which means
bank holiday galore in parts of Europe.
  
Florian Weimer May 30, 2022, 12:58 p.m. UTC | #8
* Andrei Vagin:

[CLONE_NEWTIME and vfork]

>> Breaking vfork is really a bit of a hassle for us, and the workaround
>> code is quite non-trivial and will have to implemented across many
>> projects (not just glibc).  An unshare request that takes effect on
>> execve only would really help.
>
> Is the problem that vfork fails if a process has half-entered a time
> namespace?

Exactly.  Anything that implements a general-purpose process launching
facility on top of vfork now needs to implement fork fallback (after
vfork failure), so that launching processes still works if the original
process has called unshare(CLONE_NEWTIME).  

In glibc, this affects posix_spawn, but other libcs are also impacted,
and so is any custom posix_spawn-like interface that uses vfork
internally.  Without fork fallback, they turn unusable if anything in
the process has previously called unshare(CLONE_NEWTIME).  The fallback
implementation tends to be complicated if it's necessary to report
execve and other errors to the caller.  There is a choice between the
O_CLOEXEC pipe hack (which has become more complex to implement due to
close_range support), or a shared mapping has to be created using
MAP_SHARED, and the subprocess writes error information to that (which
adds more potentially costly MM updates).  MAP_SHARED is probably easier
to implement than the pipe approach (no interference possible from file
actions), but for glibc, Adhemerval wrote something based on the pipe
approach.

But the key point is that any general-purpose wrapper around vfork now
has to implement fork fallback.

Regarding the patch you sketched, we'd probably have to introduce a new
flag (not CLONE_NEWTIME) for this because the difference in behavior is
quite visible.

Thanks,
Florian
  
Andrei Vagin June 8, 2022, 7:21 a.m. UTC | #9
On Mon, May 30, 2022 at 02:58:01PM +0200, Florian Weimer wrote:
> * Andrei Vagin:
> 
> [CLONE_NEWTIME and vfork]
> 
> >> Breaking vfork is really a bit of a hassle for us, and the workaround
> >> code is quite non-trivial and will have to implemented across many
> >> projects (not just glibc).  An unshare request that takes effect on
> >> execve only would really help.
> >
> > Is the problem that vfork fails if a process has half-entered a time
> > namespace?
> 
> Exactly.  Anything that implements a general-purpose process launching
> facility on top of vfork now needs to implement fork fallback (after
> vfork failure), so that launching processes still works if the original
> process has called unshare(CLONE_NEWTIME).  
> 
> In glibc, this affects posix_spawn, but other libcs are also impacted,
> and so is any custom posix_spawn-like interface that uses vfork
> internally.  Without fork fallback, they turn unusable if anything in
> the process has previously called unshare(CLONE_NEWTIME).  The fallback
> implementation tends to be complicated if it's necessary to report
> execve and other errors to the caller.  There is a choice between the
> O_CLOEXEC pipe hack (which has become more complex to implement due to
> close_range support), or a shared mapping has to be created using
> MAP_SHARED, and the subprocess writes error information to that (which
> adds more potentially costly MM updates).  MAP_SHARED is probably easier
> to implement than the pipe approach (no interference possible from file
> actions), but for glibc, Adhemerval wrote something based on the pipe
> approach.
> 
> But the key point is that any general-purpose wrapper around vfork now
> has to implement fork fallback.
> 
> Regarding the patch you sketched, we'd probably have to introduce a new
> flag (not CLONE_NEWTIME) for this because the difference in behavior is
> quite visible.

I agree that if we want to switch timens on exec, we need to introduce a
new clone flag. But I would like to avoid doing that. I think we can
live with the current clone flag if we allow switching timens only when
exec is executed by a vfork-ed process. In this case, the chance to
affect existing users is very low, isn't it?


Without the new change, vfork fails if the currect process has unshared
timens. With the change, vfork creates a new process, and the following
exec executes a binary in the target timens.

Thanks,
Andrei
  
Florian Weimer June 8, 2022, 8:35 a.m. UTC | #10
* Andrei Vagin:

>> Regarding the patch you sketched, we'd probably have to introduce a new
>> flag (not CLONE_NEWTIME) for this because the difference in behavior is
>> quite visible.
>
> I agree that if we want to switch timens on exec, we need to introduce a
> new clone flag. But I would like to avoid doing that. I think we can
> live with the current clone flag if we allow switching timens only when
> exec is executed by a vfork-ed process. In this case, the chance to
> affect existing users is very low, isn't it?
>
> Without the new change, vfork fails if the currect process has unshared
> timens. With the change, vfork creates a new process, and the following
> exec executes a binary in the target timens.

This is an interesting observation.  The behavior is a bit non-obvious
(but so is the vfork failure), but I agree that it could work.  There
are already restrictions and oddities after vfork, so I think it's not
too bad after all.  And I think it will make posix_spawn-like functions
magically work that are currently broken, in pretty much all cases.

Thanks,
Florian