diff mbox series

RFC: Disable clone3 for glibc 2.34

Message ID	87eebkf8ph.fsf@oldenburg.str.redhat.com
State	Dropped
Headers	DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org ED7F5397203C To: libc-alpha@sourceware.org Subject: RFC: Disable clone3 for glibc 2.34 Date: Tue, 27 Jul 2021 10:43:22 +0200 Message-ID: <87eebkf8ph.fsf@oldenburg.str.redhat.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Precedence: list From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Florian Weimer <fweimer@redhat.com> Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
Series	RFC: Disable clone3 for glibc 2.34 \| RFC: Disable clone3 for glibc 2.34

Checks

Context	Check	Description
dj/TryBot-apply_patch	success	Patch applied to master at the time it was sent
dj/TryBot-32bit	fail	Patch series failed to apply

Commit Message

Florian Weimer July 27, 2021, 8:43 a.m. UTC

  Reportedly, the docker package in Ubuntu as used by Github Actions and
others does not provide a way to enable the clone3 system call.  It
always fails with EPERM.

Should we apply a patch like this for the release?


My concern with this is that we don't know yet where the CET kernel API
will land exactly and if CET will require clone3.  So clone3 might have
to come back once we turn on CET, which is hopefully soon.

Thanks,
Florian

Comments

Florian Weimer July 27, 2021, 9:11 a.m. UTC | #1

* Florian Weimer via Libc-alpha:

> Reportedly, the docker package in Ubuntu as used by Github Actions and
> others does not provide a way to enable the clone3 system call.  It
> always fails with EPERM.
>
> Should we apply a patch like this for the release?
>
> diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> index 1e7a8f6b35..4046c81180 100644
> --- a/sysdeps/unix/sysv/linux/clone-internal.c
> +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
>  		  int (*func) (void *arg), void *arg)
>  {
>    int ret;
> -#ifdef HAVE_CLONE3_WAPPER
> -  /* Try clone3 first.  */
> -  int saved_errno = errno;
> -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> -  if (ret != -1 || errno != ENOSYS)
> -    return ret;
> -
> -  /* NB: Restore errno since errno may be checked against non-zero
> -     return value.  */
> -  __set_errno (saved_errno);
> -#endif
>  
>    /* Map clone3 arguments to clone arguments.  NB: No need to check
>       invalid clone3 specific bits in flags nor exit_signal since this
>
> My concern with this is that we don't know yet where the CET kernel API
> will land exactly and if CET will require clone3.  So clone3 might have
> to come back once we turn on CET, which is hopefully soon.

Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
issue with its docker.io/containerd/runc packages.

I could trivially fix a previously failing Github Action with:

diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
index d2381ec..7b10286 100644
--- a/.github/workflows/fedora.yml
+++ b/.github/workflows/fedora.yml
@@ -22,6 +22,7 @@ jobs:
     runs-on: ubuntu-latest
     container:
       image: fedora:${{matrix.release}}
+      options: --security-opt seccomp=unconfined
 
     steps:
       - name: Checkout repository

So I think we need to figure out what people are actually complaining
about.

Thanks,
Florian

Christian Brauner July 27, 2021, 9:24 a.m. UTC | #2

On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> * Florian Weimer via Libc-alpha:
> 
> > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > others does not provide a way to enable the clone3 system call.  It
> > always fails with EPERM.
> >
> > Should we apply a patch like this for the release?
> >
> > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > index 1e7a8f6b35..4046c81180 100644
> > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> >  		  int (*func) (void *arg), void *arg)
> >  {
> >    int ret;
> > -#ifdef HAVE_CLONE3_WAPPER
> > -  /* Try clone3 first.  */
> > -  int saved_errno = errno;
> > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > -  if (ret != -1 || errno != ENOSYS)
> > -    return ret;
> > -
> > -  /* NB: Restore errno since errno may be checked against non-zero
> > -     return value.  */
> > -  __set_errno (saved_errno);
> > -#endif
> >  
> >    /* Map clone3 arguments to clone arguments.  NB: No need to check
> >       invalid clone3 specific bits in flags nor exit_signal since this
> >
> > My concern with this is that we don't know yet where the CET kernel API
> > will land exactly and if CET will require clone3.  So clone3 might have
> > to come back once we turn on CET, which is hopefully soon.
> 
> Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> issue with its docker.io/containerd/runc packages.
> 
> I could trivially fix a previously failing Github Action with:
> 
> diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> index d2381ec..7b10286 100644
> --- a/.github/workflows/fedora.yml
> +++ b/.github/workflows/fedora.yml
> @@ -22,6 +22,7 @@ jobs:
>      runs-on: ubuntu-latest
>      container:
>        image: fedora:${{matrix.release}}
> +      options: --security-opt seccomp=unconfined
>  
>      steps:
>        - name: Checkout repository
> 
> So I think we need to figure out what people are actually complaining
> about.

This relates to the discussion what errno value should be used in a
seccomp filter to indicate that a syscall is blocked.

So there are two problems I see with seccomp and clone3():
1. the profile doesn't include clone3() at all and therefore the syscall
   is blocked and the default action is EPERM
2. the profile does include clone3() and decided to block it but the
   runtime has decided to make seccomp return EPERM and not ENOSYS when
   clone3() is attempted

The correct fix in both scenarios is to add clone3() to the seccomp
profile and either allow it or return ENOSYS.

Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
doesn't know when to fallback gracefully other tools don't know either.
Application container usually just get lucky because their applications
don't need to issue the syscalls that are blocked. On a generic system
container with systemd inside this is always an issue and not using
ENOSYS is guaranteed to fail across the board.

Christian

Christian Brauner July 27, 2021, 9:41 a.m. UTC | #3

On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote:
> On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> > * Florian Weimer via Libc-alpha:
> > 
> > > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > > others does not provide a way to enable the clone3 system call.  It
> > > always fails with EPERM.
> > >
> > > Should we apply a patch like this for the release?
> > >
> > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > > index 1e7a8f6b35..4046c81180 100644
> > > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> > >  		  int (*func) (void *arg), void *arg)
> > >  {
> > >    int ret;
> > > -#ifdef HAVE_CLONE3_WAPPER
> > > -  /* Try clone3 first.  */
> > > -  int saved_errno = errno;
> > > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > > -  if (ret != -1 || errno != ENOSYS)
> > > -    return ret;
> > > -
> > > -  /* NB: Restore errno since errno may be checked against non-zero
> > > -     return value.  */
> > > -  __set_errno (saved_errno);
> > > -#endif
> > >  
> > >    /* Map clone3 arguments to clone arguments.  NB: No need to check
> > >       invalid clone3 specific bits in flags nor exit_signal since this
> > >
> > > My concern with this is that we don't know yet where the CET kernel API
> > > will land exactly and if CET will require clone3.  So clone3 might have
> > > to come back once we turn on CET, which is hopefully soon.
> > 
> > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> > issue with its docker.io/containerd/runc packages.
> > 
> > I could trivially fix a previously failing Github Action with:
> > 
> > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> > index d2381ec..7b10286 100644
> > --- a/.github/workflows/fedora.yml
> > +++ b/.github/workflows/fedora.yml
> > @@ -22,6 +22,7 @@ jobs:
> >      runs-on: ubuntu-latest
> >      container:
> >        image: fedora:${{matrix.release}}
> > +      options: --security-opt seccomp=unconfined
> >  
> >      steps:
> >        - name: Checkout repository
> > 
> > So I think we need to figure out what people are actually complaining
> > about.
> 
> This relates to the discussion what errno value should be used in a
> seccomp filter to indicate that a syscall is blocked.
> 
> So there are two problems I see with seccomp and clone3():
> 1. the profile doesn't include clone3() at all and therefore the syscall
>    is blocked and the default action is EPERM
> 2. the profile does include clone3() and decided to block it but the
>    runtime has decided to make seccomp return EPERM and not ENOSYS when
>    clone3() is attempted
> 
> The correct fix in both scenarios is to add clone3() to the seccomp
> profile and either allow it or return ENOSYS.
> 
> Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
> doesn't know when to fallback gracefully other tools don't know either.
> Application container usually just get lucky because their applications
> don't need to issue the syscalls that are blocked. On a generic system
> container with systemd inside this is always an issue and not using
> ENOSYS is guaranteed to fail across the board.

Aleksa, this is fixed in runC, right?

Christian

Aleksa Sarai July 27, 2021, 10:22 a.m. UTC | #4

On 2021-07-27, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote:
> > On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> > > * Florian Weimer via Libc-alpha:
> > > 
> > > > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > > > others does not provide a way to enable the clone3 system call.  It
> > > > always fails with EPERM.
> > > >
> > > > Should we apply a patch like this for the release?
> > > >
> > > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > > > index 1e7a8f6b35..4046c81180 100644
> > > > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> > > >  		  int (*func) (void *arg), void *arg)
> > > >  {
> > > >    int ret;
> > > > -#ifdef HAVE_CLONE3_WAPPER
> > > > -  /* Try clone3 first.  */
> > > > -  int saved_errno = errno;
> > > > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > > > -  if (ret != -1 || errno != ENOSYS)
> > > > -    return ret;
> > > > -
> > > > -  /* NB: Restore errno since errno may be checked against non-zero
> > > > -     return value.  */
> > > > -  __set_errno (saved_errno);
> > > > -#endif
> > > >  
> > > >    /* Map clone3 arguments to clone arguments.  NB: No need to check
> > > >       invalid clone3 specific bits in flags nor exit_signal since this
> > > >
> > > > My concern with this is that we don't know yet where the CET kernel API
> > > > will land exactly and if CET will require clone3.  So clone3 might have
> > > > to come back once we turn on CET, which is hopefully soon.
> > > 
> > > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> > > issue with its docker.io/containerd/runc packages.
> > > 
> > > I could trivially fix a previously failing Github Action with:
> > > 
> > > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> > > index d2381ec..7b10286 100644
> > > --- a/.github/workflows/fedora.yml
> > > +++ b/.github/workflows/fedora.yml
> > > @@ -22,6 +22,7 @@ jobs:
> > >      runs-on: ubuntu-latest
> > >      container:
> > >        image: fedora:${{matrix.release}}
> > > +      options: --security-opt seccomp=unconfined
> > >  
> > >      steps:
> > >        - name: Checkout repository
> > > 
> > > So I think we need to figure out what people are actually complaining
> > > about.
> > 
> > This relates to the discussion what errno value should be used in a
> > seccomp filter to indicate that a syscall is blocked.
> > 
> > So there are two problems I see with seccomp and clone3():
> > 1. the profile doesn't include clone3() at all and therefore the syscall
> >    is blocked and the default action is EPERM
> > 2. the profile does include clone3() and decided to block it but the
> >    runtime has decided to make seccomp return EPERM and not ENOSYS when
> >    clone3() is attempted
> > 
> > The correct fix in both scenarios is to add clone3() to the seccomp
> > profile and either allow it or return ENOSYS.
> > 
> > Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
> > doesn't know when to fallback gracefully other tools don't know either.
> > Application container usually just get lucky because their applications
> > don't need to issue the syscalls that are blocked. On a generic system
> > container with systemd inside this is always an issue and not using
> > ENOSYS is guaranteed to fail across the board.
> 
> Aleksa, this is fixed in runC, right?

Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.

The way it works is that any syscall which has a larger syscall number
than any syscall specified in the filter will get -ENOSYS (this works
even if libseccomp is outdated). The only way you could get the -EPERM
behaviour with modern runc is if you write a seccomp profile that had
rules for newer syscalls (openat2 for instance) but not clone3 -- but
Docker doesn't do that. (The reason for this slightly convoluted
behaviour was to make sure that intentional omissions actually give you
-EPERM.)

However this requires the container host to have an updated version of
runc which is up to GitHub. (Though we fixed a security issue in runc
recently, so I would expect that they've updated their versions of runc
by now.)

Szabolcs Nagy July 27, 2021, 10:48 a.m. UTC | #5

The 07/27/2021 20:22, Aleksa Sarai wrote:
> Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> 
> The way it works is that any syscall which has a larger syscall number
> than any syscall specified in the filter will get -ENOSYS (this works
> even if libseccomp is outdated). The only way you could get the -EPERM
> behaviour with modern runc is if you write a seccomp profile that had
> rules for newer syscalls (openat2 for instance) but not clone3 -- but
> Docker doesn't do that. (The reason for this slightly convoluted
> behaviour was to make sure that intentional omissions actually give you
> -EPERM.)

this sounds broken. it really should return ENOSYS unless
a user specifically asked for a different errno value for
a syscall. EPERM is just wrong.

we will see random breakage in the future depending on
what unrelated but newer syscalls users added to their
whitelist. who thought this was a good idea?

i can't believe this is still broken, this seccomp filter
bug was reported ages ago.

Andreas K. Huettel July 27, 2021, 11:07 p.m. UTC | #6

Semi-related...

>    int ret;
> -#ifdef HAVE_CLONE3_WAPPER
> -  /* Try clone3 first.  */
> -  int saved_errno = errno;

shouldn't this be WRAPPER ?

(It seems to be consistent in the code, so no harm done so far...)

Florian Weimer July 28, 2021, 4:58 a.m. UTC | #7

* Andreas K. Huettel via Libc-alpha:

> Semi-related...
>
>>    int ret;
>> -#ifdef HAVE_CLONE3_WAPPER
>> -  /* Try clone3 first.  */
>> -  int saved_errno = errno;
>
> shouldn't this be WRAPPER ?
>
> (It seems to be consistent in the code, so no harm done so far...)

Yes, it should be.  And yes, it's currently consistent.  But it's
confusing if you search for it and don't make the typo.

Thanks,
Florian

Florian Weimer July 28, 2021, 5:44 p.m. UTC | #8

* Aleksa Sarai:

> Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
>
> The way it works is that any syscall which has a larger syscall number
> than any syscall specified in the filter will get -ENOSYS (this works
> even if libseccomp is outdated). The only way you could get the -EPERM
> behaviour with modern runc is if you write a seccomp profile that had
> rules for newer syscalls (openat2 for instance) but not clone3 -- but
> Docker doesn't do that. (The reason for this slightly convoluted
> behaviour was to make sure that intentional omissions actually give you
> -EPERM.)
>
> However this requires the container host to have an updated version of
> runc which is up to GitHub. (Though we fixed a security issue in runc
> recently, so I would expect that they've updated their versions of runc
> by now.)

Indeed I wasn't able to reproduce this locally.  Ubuntu's docker.io
package behaves as expected, even for “docker build” as far as I can
see.

So far, the reported breakage has been focused on Github Actions and
Azure Devops.  They use a custom Docker-Moby build, and I don't know
what's in it.  The net effect is that clone3 does not work in containers
by default.  “docker build” still does not allow “--security-opt
seccomp=unconfined” for unknown reasons, but that workaround still
applies to “docker create”.

Daniel P. Berrangé reported that Moby mentions a system call in its
policy whose number is larger than clone3, effectively turning ENOSYS
into ENOPERM for clone3.  Looking at the recent change, it could be the
addition of close_range and epoll_pwait2 in this commit:

commit 54eff4354b17a9c460b851300f28aed1408a8615
Author: Aleksa Sarai <asarai@suse.de>
Date:   Sun Jan 17 23:39:31 2021 +1100

    profiles: seccomp: update to Linux 5.11 syscall list
    
    These syscalls (some of which have been in Linux for a while but were
    missing from the profile) fall into a few buckets:
    
     * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
       for everyone" syscalls.
    
     * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
       all equivalent to aspects of mount(2) and thus go into the
       CAP_SYS_ADMIN category.
    
     * process_madvise(2) is similar to the other process_*(2) syscalls and
       thus goes in the CAP_SYS_PTRACE category.
    
    Signed-off-by: Aleksa Sarai <asarai@suse.de>

Maybe we don't see this everywhere because these higher system call
numbers become available only if the system libseccomp version is recent
enough to know about them.  Once that is the case, the ENOSYS/EPERM line
shifts and clone3 is on the wrong side of it.

If that's indeed the explanation, then maybe we can simply fix moby and
ask Microsoft to respin their images?

Thanks,
Florian

Daniel P. Berrangé July 29, 2021, 8:36 a.m. UTC | #9

On Wed, Jul 28, 2021 at 07:44:03PM +0200, Florian Weimer wrote:
> * Aleksa Sarai:
> 
> > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> >
> > The way it works is that any syscall which has a larger syscall number
> > than any syscall specified in the filter will get -ENOSYS (this works
> > even if libseccomp is outdated). The only way you could get the -EPERM
> > behaviour with modern runc is if you write a seccomp profile that had
> > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > Docker doesn't do that. (The reason for this slightly convoluted
> > behaviour was to make sure that intentional omissions actually give you
> > -EPERM.)
> >
> > However this requires the container host to have an updated version of
> > runc which is up to GitHub. (Though we fixed a security issue in runc
> > recently, so I would expect that they've updated their versions of runc
> > by now.)
> 
> Indeed I wasn't able to reproduce this locally.  Ubuntu's docker.io
> package behaves as expected, even for “docker build” as far as I can
> see.
> 
> So far, the reported breakage has been focused on Github Actions and
> Azure Devops.  They use a custom Docker-Moby build, and I don't know
> what's in it.  The net effect is that clone3 does not work in containers
> by default.  “docker build” still does not allow “--security-opt
> seccomp=unconfined” for unknown reasons, but that workaround still
> applies to “docker create”.

FYI I found this issue describing the docker build + seccomp feature
gap. Seems it was intentional to not allow it to be used:

   https://github.com/moby/moby/issues/34454

To workaround this gap in "docker build", you can use the --seccomp-profile
option to dockerd daemon when starting it up, to pass a new profile that
applies by default to everything. Doesn't help if you're just using a
dockerd instance started/managed by someone/something else though.

> Daniel P. Berrangé reported that Moby mentions a system call in its
> policy whose number is larger than clone3, effectively turning ENOSYS
> into ENOPERM for clone3.  Looking at the recent change, it could be the
> addition of close_range and epoll_pwait2 in this commit:

I'm not 100% convinced my understanding is right, as there are quite
a few moving parts involved. Wierdly I managed to get things working
with existing docker 20.10.7 simply by using the seccomp profile from
docker git master, which is counter to my understanding described
above. The heuristics involved in runc for EPERM/ENOSYS are very
hard to understand and rationalize behaviour for :-(

To try to make it simpler I send a pull request to explicitly list
clone3 with ENOSYS, so that its not subject to the wierd heuristics
in runc

   https://github.com/moby/moby/pull/42681

> commit 54eff4354b17a9c460b851300f28aed1408a8615
> Author: Aleksa Sarai <asarai@suse.de>
> Date:   Sun Jan 17 23:39:31 2021 +1100
> 
>     profiles: seccomp: update to Linux 5.11 syscall list
>     
>     These syscalls (some of which have been in Linux for a while but were
>     missing from the profile) fall into a few buckets:
>     
>      * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
>        for everyone" syscalls.
>     
>      * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
>        all equivalent to aspects of mount(2) and thus go into the
>        CAP_SYS_ADMIN category.
>     
>      * process_madvise(2) is similar to the other process_*(2) syscalls and
>        thus goes in the CAP_SYS_PTRACE category.
>     
>     Signed-off-by: Aleksa Sarai <asarai@suse.de>
> 
> Maybe we don't see this everywhere because these higher system call
> numbers become available only if the system libseccomp version is recent
> enough to know about them.  Once that is the case, the ENOSYS/EPERM line
> shifts and clone3 is on the wrong side of it.
> 
> If that's indeed the explanation, then maybe we can simply fix moby and
> ask Microsoft to respin their images?

Regards,
Daniel

Aleksa Sarai July 29, 2021, 8:56 a.m. UTC | #10

On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 07/27/2021 20:22, Aleksa Sarai wrote:
> > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > 
> > The way it works is that any syscall which has a larger syscall number
> > than any syscall specified in the filter will get -ENOSYS (this works
> > even if libseccomp is outdated). The only way you could get the -EPERM
> > behaviour with modern runc is if you write a seccomp profile that had
> > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > Docker doesn't do that. (The reason for this slightly convoluted
> > behaviour was to make sure that intentional omissions actually give you
> > -EPERM.)
> 
> this sounds broken. it really should return ENOSYS unless
> a user specifically asked for a different errno value for
> a syscall. EPERM is just wrong.

Yes, if I was designing it from scratch, that's what I would've done.

But there are already existing filters that are written assuming the
default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
existing profiles is not a workable solution.

Should we fix all existing profiles and then change the behaviour again?
Sure, but given we solved this problem in a period of time when people
were screaming about glibc being broken in containers, I hope you'll
excuse the fact that we didn't really have time to co-ordinate updating
every downstream runc user.

> we will see random breakage in the future depending on
> what unrelated but newer syscalls users added to their
> whitelist. who thought this was a good idea?

If you update your syscall profile without knowing what you're doing,
things will break. That will always be the case.

The plan is/was to eventually implement this by explicitly stating a
minimum kernel version (so that all syscalls missing in the profile that
were available in that kernel version get ENOSYS) but libseccomp doesn't
provide that information at the moment, and given that such a filter
would be more complicated than the one we have at the moment, that
behaviour probably belongs in libseccomp (there are several issues open
in the libseccomp repo describing this issue and possible solutions).

Florian Weimer July 29, 2021, 10:50 a.m. UTC | #11

* Aleksa Sarai:

> If you update your syscall profile without knowing what you're doing,
> things will break. That will always be the case.

But with the current syscall number dependency, this is jusy *way* too
hard.  Who would think that adding close_range (#436) to the policy
would switch clone3 (#435) from ENOSYS to ENOPERM?

I realized that Github actually provides a way to report image bugs, so
I filed:

  Docker seccomp policy incompatible with glibc 2.34
  <https://github.com/actions/virtual-environments/issues/3812>

Thanks,
Florian

Szabolcs Nagy July 29, 2021, 11:38 a.m. UTC | #12

The 07/29/2021 18:56, Aleksa Sarai wrote:
> On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > The 07/27/2021 20:22, Aleksa Sarai wrote:
> > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > > 
> > > The way it works is that any syscall which has a larger syscall number
> > > than any syscall specified in the filter will get -ENOSYS (this works
> > > even if libseccomp is outdated). The only way you could get the -EPERM
> > > behaviour with modern runc is if you write a seccomp profile that had
> > > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > > Docker doesn't do that. (The reason for this slightly convoluted
> > > behaviour was to make sure that intentional omissions actually give you
> > > -EPERM.)
> > 
> > this sounds broken. it really should return ENOSYS unless
> > a user specifically asked for a different errno value for
> > a syscall. EPERM is just wrong.
> 
> Yes, if I was designing it from scratch, that's what I would've done.
> 
> But there are already existing filters that are written assuming the
> default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
> existing profiles is not a workable solution.
> 
> Should we fix all existing profiles and then change the behaviour again?
> Sure, but given we solved this problem in a period of time when people
> were screaming about glibc being broken in containers, I hope you'll
> excuse the fact that we didn't really have time to co-ordinate updating
> every downstream runc user.

i think this can be fixed backward compatibly by
returning EPERM for old syscalls.

> > we will see random breakage in the future depending on
> > what unrelated but newer syscalls users added to their
> > whitelist. who thought this was a good idea?
> 
> If you update your syscall profile without knowing what you're doing,
> things will break. That will always be the case.
> 
> The plan is/was to eventually implement this by explicitly stating a
> minimum kernel version (so that all syscalls missing in the profile that
> were available in that kernel version get ENOSYS) but libseccomp doesn't
> provide that information at the moment, and given that such a filter
> would be more complicated than the one we have at the moment, that
> behaviour probably belongs in libseccomp (there are several issues open
> in the libseccomp repo describing this issue and possible solutions).

i dont think you need to do anything complicated
with a fixed cut off, e.g.

  return nr < 403 ? EPERM : ENOSYS

or you can give an explicit list of syscalls that
should return EPERM for bw compat reasons and the
rest is ENOSYS.

(and there should be an easy way to opt-out of
the bw compat behaviour and always do ENOSYS)

Aleksa Sarai July 30, 2021, 12:16 p.m. UTC | #13

On 2021-07-29, Florian Weimer <fweimer@redhat.com> wrote:
> * Aleksa Sarai:
> 
> > If you update your syscall profile without knowing what you're doing,
> > things will break. That will always be the case.
> 
> But with the current syscall number dependency, this is jusy *way* too
> hard.  Who would think that adding close_range (#436) to the policy
> would switch clone3 (#435) from ENOSYS to ENOPERM?

Yeah, I expected that the Docker folks would've been aware of this when
updating the profile (the maintainers were aware of the runc change at
the time) so it does seem this is a bit too complicated...

I think changing this to one of the older versions of the feature I had
(only EPERM for syscalls that were present in Linux 3.0) is probably
less likely to cause confusion, until we have the whole minimum kernel
version infrastructure I mentioned.

Aleksa Sarai July 30, 2021, 3:08 p.m. UTC | #14

On 2021-07-29, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 07/29/2021 18:56, Aleksa Sarai wrote:
> > On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > > The 07/27/2021 20:22, Aleksa Sarai wrote:
> > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > > > 
> > > > The way it works is that any syscall which has a larger syscall number
> > > > than any syscall specified in the filter will get -ENOSYS (this works
> > > > even if libseccomp is outdated). The only way you could get the -EPERM
> > > > behaviour with modern runc is if you write a seccomp profile that had
> > > > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > > > Docker doesn't do that. (The reason for this slightly convoluted
> > > > behaviour was to make sure that intentional omissions actually give you
> > > > -EPERM.)
> > > 
> > > this sounds broken. it really should return ENOSYS unless
> > > a user specifically asked for a different errno value for
> > > a syscall. EPERM is just wrong.
> > 
> > Yes, if I was designing it from scratch, that's what I would've done.
> > 
> > But there are already existing filters that are written assuming the
> > default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
> > existing profiles is not a workable solution.
> > 
> > Should we fix all existing profiles and then change the behaviour again?
> > Sure, but given we solved this problem in a period of time when people
> > were screaming about glibc being broken in containers, I hope you'll
> > excuse the fact that we didn't really have time to co-ordinate updating
> > every downstream runc user.
> 
> i think this can be fixed backward compatibly by
> returning EPERM for old syscalls.

I just remembered why this was a problem (I tried to implement exactly
this behaviour when I first worked on the patch) -- in runc we use
libseccomp to generate profiles, but libseccomp has several limitations
that mean we cannot implement *any* syscall-number-based fallback
behaviour (I tried every way I could think of for at least a week or
two).

The end result is that in runc we use libseccomp to generate the filter,
then output the BPF program, then patch it (add a program to the start
which does the syscall check and returns ENOSYS in the correct case) and
then we run the program. The only other option would be to basically
rewrite libseccomp for runc (which I did consider, but then thought
better of).

Now, having the fallback being a fixed syscall number seems like it
would be trivial -- but because we have to use libseccomp's filter
generatiion and patch it, patching the return value of the program to be
syscall-specific would require patching every return statement in the
generated BPF (and then possibly rewriting every jump depending on where
the returns are). I think in practice there's only two returns, but
hardcoding that is going to cause issues if libseccomp ever changes that
behaviour.

But I think you're right that this is probably less likely to cause
confusion. Unfortunately I'm not really sure that there's a
straightforward way to implement it, outside of implementing the
behaviour in libseccomp (or at the very least expanding libseccomp to
let us work around it).

> > > we will see random breakage in the future depending on
> > > what unrelated but newer syscalls users added to their
> > > whitelist. who thought this was a good idea?
> > 
> > If you update your syscall profile without knowing what you're doing,
> > things will break. That will always be the case.
> > 
> > The plan is/was to eventually implement this by explicitly stating a
> > minimum kernel version (so that all syscalls missing in the profile that
> > were available in that kernel version get ENOSYS) but libseccomp doesn't
> > provide that information at the moment, and given that such a filter
> > would be more complicated than the one we have at the moment, that
> > behaviour probably belongs in libseccomp (there are several issues open
> > in the libseccomp repo describing this issue and possible solutions).
> 
> i dont think you need to do anything complicated
> with a fixed cut off, e.g.
> 
>   return nr < 403 ? EPERM : ENOSYS

See my above point about how this is non-trivial.

> or you can give an explicit list of syscalls that
> should return EPERM for bw compat reasons and the
> rest is ENOSYS.

The issue with requiring explicit EPERM rules is that libseccomp doesn't
support certain sets of argument rule combinations, meaning that you
cannot generate nor require users to specify a set of inverse rules to
return EPERM for syscalls like clone() where only certain flags are
being blocked. (Another one of my attempts was to generate the set of
inverse rules, and use ENOSYS as the default.)

(Also some of the inverse rules you can technically generate -- most
notably the inverse of SCMP_MASKED_EQ requires 2^n rules to be generated
(where n is the number of bits in the mask).)

> (and there should be an easy way to opt-out of
> the bw compat behaviour and always do ENOSYS)

You can opt of out of it already, by setting the defaultErrnoRet to
ENOSYS. And someone has submitted a patch to Docker to do exactly
that[1].

[1]: https://github.com/moby/moby/pull/42649

diff mbox series

Patch

diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
index 1e7a8f6b35..4046c81180 100644
--- a/sysdeps/unix/sysv/linux/clone-internal.c
+++ b/sysdeps/unix/sysv/linux/clone-internal.c
@@ -48,17 +48,6 @@  __clone_internal (struct clone_args *cl_args,
 		  int (*func) (void *arg), void *arg)
 {
   int ret;
-#ifdef HAVE_CLONE3_WAPPER
-  /* Try clone3 first.  */
-  int saved_errno = errno;
-  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
-  if (ret != -1 || errno != ENOSYS)
-    return ret;
-
-  /* NB: Restore errno since errno may be checked against non-zero
-     return value.  */
-  __set_errno (saved_errno);
-#endif
 
   /* Map clone3 arguments to clone arguments.  NB: No need to check
      invalid clone3 specific bits in flags nor exit_signal since this