[RFC] Linux: Workaround seccomp() issue with faccessat2()

Message ID 20210225194702.6113-1-pvorel@suse.cz
State Rejected
Headers
Series [RFC] Linux: Workaround seccomp() issue with faccessat2() |

Commit Message

Petr Vorel Feb. 25, 2021, 7:47 p.m. UTC
  3d3ab573a5 ("Linux: Use faccessat2 to implement faccessat (bug 18683)")
started to use faccessat2() which breaks docker/podman/... containers
with guest running glibc 2.33 running on host with older kernel and are
built with older libseccomp.

See also: https://bugzilla.opensuse.org/show_bug.cgi?id=1182451#c17

Signed-off-by: Petr Vorel <pvorel@suse.cz>
---
Hi,

I admit that this is a very ugly workaround and wouldn't be surprised if
you just don't care about seccomp() incompatibilities. But it'd be nice
to have unified approach for this incompatibility, as it hits any distro
with glibc 2.33 (currently openSUSE Tumbleweed, Arch Linux, Fedora
rawhide). And after some time (when old LTS distros EOL) this crap could be removed.

More info:
https://github.com/opencontainers/runc/pull/2750
https://github.com/seccomp/libseccomp/issues/314

Kind regards,
Petr

 sysdeps/unix/sysv/linux/faccessat.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
  

Comments

Dmitry V. Levin Feb. 25, 2021, 10:38 p.m. UTC | #1
Hi,

On Thu, Feb 25, 2021 at 08:47:02PM +0100, Petr Vorel wrote:
> 3d3ab573a5 ("Linux: Use faccessat2 to implement faccessat (bug 18683)")
> started to use faccessat2() which breaks docker/podman/... containers
> with guest running glibc 2.33 running on host with older kernel and are
> built with older libseccomp.
> 
> See also: https://bugzilla.opensuse.org/show_bug.cgi?id=1182451#c17
> 
> Signed-off-by: Petr Vorel <pvorel@suse.cz>
> ---
> Hi,
> 
> I admit that this is a very ugly workaround and wouldn't be surprised if
> you just don't care about seccomp() incompatibilities. But it'd be nice
> to have unified approach for this incompatibility, as it hits any distro
> with glibc 2.33 (currently openSUSE Tumbleweed, Arch Linux, Fedora
> rawhide). And after some time (when old LTS distros EOL) this crap could be removed.
> 
> More info:
> https://github.com/opencontainers/runc/pull/2750
> https://github.com/seccomp/libseccomp/issues/314
> 
> Kind regards,
> Petr

Petr, you must have missed the whole discussion on this subject [1][2],
the consensus was that problematic container runtimes need to be fixed
to make their seccomp filters return ENOSYS for unknown syscalls.

[1] https://sourceware.org/pipermail/libc-alpha/2020-November/119955.html
[2] https://lore.kernel.org/linux-api/87lfer2c0b.fsf@oldenburg2.str.redhat.com/T/#u
  
Petr Vorel Feb. 26, 2021, 4:11 a.m. UTC | #2
Hi Dmitry,

> Petr, you must have missed the whole discussion on this subject [1][2],
> the consensus was that problematic container runtimes need to be fixed
> to make their seccomp filters return ENOSYS for unknown syscalls.

Thanks for info and sorry for spam then.

Kind regards,
Petr

> [1] https://sourceware.org/pipermail/libc-alpha/2020-November/119955.html
> [2] https://lore.kernel.org/linux-api/87lfer2c0b.fsf@oldenburg2.str.redhat.com/T/#u
  
Mike Frysinger Feb. 28, 2021, 6:03 a.m. UTC | #3
On 26 Feb 2021 05:11, Petr Vorel wrote:
> Hi Dmitry,
> > Petr, you must have missed the whole discussion on this subject [1][2],
> > the consensus was that problematic container runtimes need to be fixed
> > to make their seccomp filters return ENOSYS for unknown syscalls.
> 
> Thanks for info and sorry for spam then.

no need to apologize.  can't expect everyone to read everything all the time.
-mike
  
Aleksa Sarai Feb. 28, 2021, 7:56 a.m. UTC | #4
On 2021-02-26, Dmitry V. Levin <ldv@altlinux.org> wrote:
> On Thu, Feb 25, 2021 at 08:47:02PM +0100, Petr Vorel wrote:
> > 3d3ab573a5 ("Linux: Use faccessat2 to implement faccessat (bug 18683)")
> > started to use faccessat2() which breaks docker/podman/... containers
> > with guest running glibc 2.33 running on host with older kernel and are
> > built with older libseccomp.
> > 
> > See also: https://bugzilla.opensuse.org/show_bug.cgi?id=1182451#c17
> > 
> > Signed-off-by: Petr Vorel <pvorel@suse.cz>
> > ---
> > Hi,
> > 
> > I admit that this is a very ugly workaround and wouldn't be surprised if
> > you just don't care about seccomp() incompatibilities. But it'd be nice
> > to have unified approach for this incompatibility, as it hits any distro
> > with glibc 2.33 (currently openSUSE Tumbleweed, Arch Linux, Fedora
> > rawhide). And after some time (when old LTS distros EOL) this crap could be removed.
> > 
> > More info:
> > https://github.com/opencontainers/runc/pull/2750
> > https://github.com/seccomp/libseccomp/issues/314
> > 
> > Kind regards,
> > Petr
> 
> Petr, you must have missed the whole discussion on this subject [1][2],
> the consensus was that problematic container runtimes need to be fixed
> to make their seccomp filters return ENOSYS for unknown syscalls.

It should also be noted that we fixed this in runc a month ago[1], which
means that it's up to distributions and cloud vendors to update their
runc packages to the latest version or backport the patch.

Docker's packaging hasn't been updated to use the latest runc yet
(that'll happen in the next patch release), but distributions can ship
newer runc versions -- that's what we do in openSUSE.

[1]: https://github.com/opencontainers/runc/pull/2750
  
Florian Weimer March 1, 2021, 11:54 a.m. UTC | #5
* Aleksa Sarai:

> It should also be noted that we fixed this in runc a month ago[1], which
> means that it's up to distributions and cloud vendors to update their
> runc packages to the latest version or backport the patch.
>
> Docker's packaging hasn't been updated to use the latest runc yet
> (that'll happen in the next patch release), but distributions can ship
> newer runc versions -- that's what we do in openSUSE.
>
> [1]: https://github.com/opencontainers/runc/pull/2750

There are some indications that not all container runtimes will pick up
the runc kludge (thanks for developing that by the way).  So it's likely
that the general issue will be with us for a while longer.  Maybe the
competitive pressure from other working container runtimes will
encourage other re-evaluate their approach, I don't know.

We still don't plan to throw in downstream-only glibc patches to paper
over this (given that it's been rejected by kernel and glibc developers
alike, I really think it's the wrong way to go).  So far management
isn't breathing down our necks.

Thanks,
Florian
  
Petr Vorel March 4, 2021, 8:27 a.m. UTC | #6
Hi all,

> There are some indications that not all container runtimes will pick up
> the runc kludge (thanks for developing that by the way).  So it's likely
> that the general issue will be with us for a while longer.  Maybe the
> competitive pressure from other working container runtimes will
> encourage other re-evaluate their approach, I don't know.
Hopefully.

> We still don't plan to throw in downstream-only glibc patches to paper
> over this (given that it's been rejected by kernel and glibc developers
> alike, I really think it's the wrong way to go).  So far management
> isn't breathing down our necks.
As workaround exists (for openSUSE using podman with newest runc v1.0.0-rc93)
I understand the reluctance to accept a workaround. It just reminds me occasional
musl approach to be correct no matter what problems it brings to users.

Kind regards,
Petr

> Thanks,
> Florian
  

Patch

diff --git a/sysdeps/unix/sysv/linux/faccessat.c b/sysdeps/unix/sysv/linux/faccessat.c
index 13160d3249..f01c59b6e7 100644
--- a/sysdeps/unix/sysv/linux/faccessat.c
+++ b/sysdeps/unix/sysv/linux/faccessat.c
@@ -30,9 +30,22 @@  __faccessat (int fd, const char *file, int mode, int flag)
 #if __ASSUME_FACCESSAT2
   return ret;
 #else
-  if (ret == 0 || errno != ENOSYS)
+  if (ret == 0 || (errno != ENOSYS && errno != EPERM))
     return ret;
 
+  /*
+   * Check seccomp() issue with faccessat2(). Additional EPERM means seccomp()
+   * in use, ENOSYS or EBADF real EPERM.
+   */
+  if (errno == EPERM) {
+    int backup = errno;
+    INLINE_SYSCALL_CALL (faccessat2, -2, ".", 0, 0);
+    int err = errno;
+    errno = backup;
+    if (err != EPERM)
+      return ret;
+  }
+
   if (flag & ~(AT_SYMLINK_NOFOLLOW | AT_EACCESS))
     return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);