[v3] Allow for unpriviledged nested containers

Message ID xnh7cd3tlw.fsf@greed.delorie.com
State Superseded
Headers
Series [v3] Allow for unpriviledged nested containers |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

DJ Delorie Nov. 15, 2021, 11:08 p.m. UTC
  Subject: Allow for unpriviledged nested containers

When running a "make check" in an untrusted podman container, we do
not have priviledges to mount a new /proc.  Previously, we just failed
to initialize the container and thus all test-container tests were
"unsupported".  With this change, we bind mount the parent's /proc,
which we have priviledges to do.  Note that MS_REC is needed as /proc
typically has things mounted within it, and not mounting those would
be a security hole[*].

[*] https://stackoverflow.com/questions/23417521/mounting-proc-in-non-privileged-namespace-sandbox
  

Comments

Florian Weimer Nov. 17, 2021, 11:06 a.m. UTC | #1
* DJ Delorie:

> Subject: Allow for unpriviledged nested containers
>
> When running a "make check" in an untrusted podman container, we do
> not have priviledges to mount a new /proc.  Previously, we just failed
> to initialize the container and thus all test-container tests were
> "unsupported".  With this change, we bind mount the parent's /proc,
> which we have priviledges to do.  Note that MS_REC is needed as /proc
> typically has things mounted within it, and not mounting those would
> be a security hole[*].

I see a new test failure with this, elf/tst-pldd.  Happens with
kernel-5.14.17-301.fc35.x86_64, kernel-5.14.13-100.fc33.x86_64,
linux-image-5.10.0-9-amd64_5.10.70-1, as a non-root user.

Thanks,
Florian
  
DJ Delorie Nov. 17, 2021, 10:44 p.m. UTC | #2
Florian Weimer <fweimer@redhat.com> writes:
>> When running a "make check" in an untrusted podman container, we do
>> not have priviledges to mount a new /proc.  Previously, we just failed
>> to initialize the container and thus all test-container tests were
>> "unsupported".  With this change, we bind mount the parent's /proc,
>> which we have priviledges to do.  Note that MS_REC is needed as /proc
>> typically has things mounted within it, and not mounting those would
>> be a security hole[*].
>
> I see a new test failure with this, elf/tst-pldd.  Happens with
> kernel-5.14.17-301.fc35.x86_64, kernel-5.14.13-100.fc33.x86_64,
> linux-image-5.10.0-9-amd64_5.10.70-1, as a non-root user.

Heh, this is a fun one.  If you bind mount /proc, you get the
/proc/<pid> from the parent namespace.  If you direct mount it as type
"proc" you get the /proc/<pid> from the child namespace.

I.e. the pldd test fails because it's looking at the wrong process.

I suspect the only way around this is to check for the specific
permission (capability?) we need, early, so we can bind mount /proc only
if we know in advance that the direct mount will fail.  Or decide that
having the parent's /proc/<pid> would cause more problems than it's
worth and just not have a /proc at all in that case.
  
Florian Weimer Nov. 18, 2021, 11:35 a.m. UTC | #3
* DJ Delorie:

> Florian Weimer <fweimer@redhat.com> writes:
>>> When running a "make check" in an untrusted podman container, we do
>>> not have priviledges to mount a new /proc.  Previously, we just failed
>>> to initialize the container and thus all test-container tests were
>>> "unsupported".  With this change, we bind mount the parent's /proc,
>>> which we have priviledges to do.  Note that MS_REC is needed as /proc
>>> typically has things mounted within it, and not mounting those would
>>> be a security hole[*].
>>
>> I see a new test failure with this, elf/tst-pldd.  Happens with
>> kernel-5.14.17-301.fc35.x86_64, kernel-5.14.13-100.fc33.x86_64,
>> linux-image-5.10.0-9-amd64_5.10.70-1, as a non-root user.
>
> Heh, this is a fun one.  If you bind mount /proc, you get the
> /proc/<pid> from the parent namespace.  If you direct mount it as type
> "proc" you get the /proc/<pid> from the child namespace.
>
> I.e. the pldd test fails because it's looking at the wrong process.
>
> I suspect the only way around this is to check for the specific
> permission (capability?) we need, early, so we can bind mount /proc only
> if we know in advance that the direct mount will fail.  Or decide that
> having the parent's /proc/<pid> would cause more problems than it's
> worth and just not have a /proc at all in that case.

Do we actually need to enter a PID namespace?

Maybe we should do that only if we know we can mount /proc in the
namespace.

Thanks,
Florian
  
DJ Delorie Nov. 18, 2021, 6:37 p.m. UTC | #4
Florian Weimer <fweimer@redhat.com> writes:
> Do we actually need to enter a PID namespace?

Well, for the pldd test, obviously we do ;-)

> Maybe we should do that only if we know we can mount /proc in the
> namespace.

So given three options:

1. No /proc
2. /proc in wrong namespace
3. /proc in correct namespace

We'd prefer 3, then 1, but not 2?
  
Florian Weimer Nov. 18, 2021, 7:47 p.m. UTC | #5
* DJ Delorie:

> Florian Weimer <fweimer@redhat.com> writes:
>> Do we actually need to enter a PID namespace?
>
> Well, for the pldd test, obviously we do ;-)

Do we?  /proc and the PIDs of the processes have to be consistent for
sure, but do we really need them to be separate from the host?

>> Maybe we should do that only if we know we can mount /proc in the
>> namespace.
>
> So given three options:
>
> 1. No /proc
> 2. /proc in wrong namespace
> 3. /proc in correct namespace
>
> We'd prefer 3, then 1, but not 2?

Yeah, 2 is quite bad for some tests at least.  Some thread-exit tests
will suffer as well, I think, because they read TIDs from
/proc/self/task.

Thanks,
Florian
  
DJ Delorie Nov. 18, 2021, 7:52 p.m. UTC | #6
Florian Weimer <fweimer@redhat.com> writes:
>> Well, for the pldd test, obviously we do ;-)
>
> Do we?  /proc and the PIDs of the processes have to be consistent for
> sure, but do we really need them to be separate from the host?

It's the consistency that's the problem.  If getpid() (which returns a
pid in the child namespace) returns a value that's useless in
/proc/<pid> (because those are pids in the parent namespace) then the
test fails.

One process can have different PIDs depending on how you look at it.

>> 1. No /proc
>> 2. /proc in wrong namespace
>> 3. /proc in correct namespace
>>
>> We'd prefer 3, then 1, but not 2?
>
> Yeah, 2 is quite bad for some tests at least.  Some thread-exit tests
> will suffer as well, I think, because they read TIDs from
> /proc/self/task.

3-then-1 returns us to my original patch, which attempted to mount it in
the child namespace, or failed but let the test run anyway.
  
Florian Weimer Nov. 18, 2021, 7:55 p.m. UTC | #7
* DJ Delorie:

> Florian Weimer <fweimer@redhat.com> writes:
>>> Well, for the pldd test, obviously we do ;-)
>>
>> Do we?  /proc and the PIDs of the processes have to be consistent for
>> sure, but do we really need them to be separate from the host?
>
> It's the consistency that's the problem.  If getpid() (which returns a
> pid in the child namespace) returns a value that's useless in
> /proc/<pid> (because those are pids in the parent namespace) then the
> test fails.
>
> One process can have different PIDs depending on how you look at it.

Then elf/tst-pldd should be fine with 4.

>>> 1. No /proc
>>> 2. /proc in wrong namespace
>>> 3. /proc in correct namespace
>>>
>>> We'd prefer 3, then 1, but not 2?
>>
>> Yeah, 2 is quite bad for some tests at least.  Some thread-exit tests
>> will suffer as well, I think, because they read TIDs from
>> /proc/self/task.
>
> 3-then-1 returns us to my original patch, which attempted to mount it in
> the child namespace, or failed but let the test run anyway.

Sorry, I missed that there is no 3b:

3b. /proc in correct namespace (but host PID namespace)

So instead:

  /* The unshare here gives us our own spaces and capabilities.  */
  if (unshare (CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS) < 0)

This:

  /* The unshare here gives us our own spaces and capabilities.  */
  if (unshare (CLONE_NEWUSER | CLONE_NEWNS) < 0)

Not sure if it will work.  CLONE_NEWPID is preferable for better test
isolation, but maybe it's still better to run tests with reduced
isolation.  I think our main goal is the unprivileged chroot, followed
by avoiding Netfilter connection tracking table overflow.

Thanks,
Florian
  
DJ Delorie Nov. 18, 2021, 8:18 p.m. UTC | #8
IIRC part of the unshared pid namespace was to test processes that act
differently if they're pid 1 (init) but I don't think we have any of
those yet.  Certainly the unshared mount namespace has been used to test
corrupt config files etc.

But the security issue is for the CICD tools, which test unvetted
patches from the mailing list.  Bind mounting /proc doesn't exit that
security (it just gives you the same /proc the build already saw) but
*enabling* a non-bind-mounted proc means giving security privs to the
build that I'd rather not give.

Florian Weimer <fweimer@redhat.com> writes:
>> One process can have different PIDs depending on how you look at it.
>
> Then elf/tst-pldd should be fine with 4.

Sure, but if getpid() returns 4, and /proc/4/ is for a different process
(or doesn't exist), the test fails.  pldd would have to know the pid in
the parent's namespace, for the same process, which might be something
like /proc/41768423/ instead of /proc/4/

I wouldn't be surprised if /proc/self/ referred to the wrong process
too.
  
Florian Weimer Nov. 18, 2021, 8:20 p.m. UTC | #9
* DJ Delorie:

> IIRC part of the unshared pid namespace was to test processes that act
> differently if they're pid 1 (init) but I don't think we have any of
> those yet.  Certainly the unshared mount namespace has been used to test
> corrupt config files etc.
>
> But the security issue is for the CICD tools, which test unvetted
> patches from the mailing list.  Bind mounting /proc doesn't exit that
> security (it just gives you the same /proc the build already saw) but
> *enabling* a non-bind-mounted proc means giving security privs to the
> build that I'd rather not give.

There is no new security issue because the patch could as well change
support/test-container.c.

The bind mount doesn't give you the /proc from the host, but from the
container running the glibc build.  That still doesn't match the new PID
namespace created by test-container, though.

> Florian Weimer <fweimer@redhat.com> writes:
>>> One process can have different PIDs depending on how you look at it.
>>
>> Then elf/tst-pldd should be fine with 4.
>
> Sure, but if getpid() returns 4, and /proc/4/ is for a different process
> (or doesn't exist), the test fails.  pldd would have to know the pid in
> the parent's namespace, for the same process, which might be something
> like /proc/41768423/ instead of /proc/4/

Sorry I meant 3b there.

Thanks,
Florian
  
DJ Delorie Nov. 18, 2021, 8:25 p.m. UTC | #10
Florian Weimer <fweimer@redhat.com> writes:
> There is no new security issue because the patch could as well change
> support/test-container.c.

Right.  Our problem is a side-effect of the build container's security,
not extra security we're trying to impose to the test-container.

> The bind mount doesn't give you the /proc from the host, but from the
> container running the glibc build.  That still doesn't match the new PID
> namespace created by test-container, though.

Right.

>> Florian Weimer <fweimer@redhat.com> writes:
>>>> One process can have different PIDs depending on how you look at it.
>>>
>>> Then elf/tst-pldd should be fine with 4.
>>
>> Sure, but ...
>
> Sorry I meant 3b there.

Ah.

Hmmm... yes, I suppose if we don't create a new namespace, we'd lose the
ability to test pid 1 things, but /proc/<pid> would be consistent with
getpid().
  

Patch

diff --git a/support/test-container.c b/support/test-container.c
index 94498d39019..8f47f136e75 100644
--- a/support/test-container.c
+++ b/support/test-container.c
@@ -231,7 +231,7 @@  concat (const char *str, ...)
 static void
 trymount (const char *src, const char *dest)
 {
-  if (mount (src, dest, "", MS_BIND, NULL) < 0)
+  if (mount (src, dest, "", MS_BIND|MS_REC, NULL) < 0)
     FAIL_EXIT1 ("can't mount %s onto %s\n", src, dest);
 }
 
@@ -1094,6 +1094,13 @@  main (int argc, char **argv)
   trymount (support_srcdir_root, new_srcdir_path);
   trymount (support_objdir_root, new_objdir_path);
 
+  /* It may not be possible to mount /proc directly.  */
+  {
+    char *new_proc = concat (new_root_path, "/proc", NULL);
+    xmkdirp (new_proc, 0755);
+    trymount ("/proc", new_proc);
+  }
+
   xmkdirp (concat (new_root_path, "/dev", NULL), 0755);
   devmount (new_root_path, "null");
   devmount (new_root_path, "zero");
@@ -1163,11 +1170,6 @@  main (int argc, char **argv)
 
   maybe_xmkdir ("/tmp", 0755);
 
-  /* Now that we're pid 1 (effectively "root") we can mount /proc  */
-  maybe_xmkdir ("/proc", 0777);
-  if (mount ("proc", "/proc", "proc", 0, NULL) < 0)
-    FAIL_EXIT1 ("Unable to mount /proc: ");
-
   /* We map our original UID to the same UID in the container so we
      can own our own files normally.  */
   UMAP = open ("/proc/self/uid_map", O_WRONLY);