[PATCHv3] tst-pidfd.c: UNSUPPORTED if we get EPERM on pidfd_open or pidfd_getfd

Message ID 20220701112502.13458-1-mark@klomp.org
State Rejected
Headers
Series [PATCHv3] tst-pidfd.c: UNSUPPORTED if we get EPERM on pidfd_open or pidfd_getfd |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Mark Wielaard July 1, 2022, 11:25 a.m. UTC
  pidfd_open or pidfd_getfd can fail with errno EPERM for various reasons
in a restricted environment. Use FAIL_UNSUPPORTED in that case.
---

v3: Also test for EPERM on pidfd_open, don't mention
    PTRACE_MODE_ATTACH_REALCREDS since it is just one reason for
    getting EPERM.
v2: separate ENOSYS and EPERM checks and FAIL_UNSUPPORTED messages

https://code.wildebeest.org/git/user/mjw/glibc/commit/?h=container-perms&id=3e1211cb6e3f0dba98201c12610a6cb2cb106d2d

 sysdeps/unix/sysv/linux/tst-pidfd.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
  

Comments

Adhemerval Zanella July 4, 2022, 7:46 p.m. UTC | #1
> On 1 Jul 2022, at 08:25, Mark Wielaard <mark@klomp.org> wrote:
> 
> pidfd_open or pidfd_getfd can fail with errno EPERM for various reasons
> in a restricted environment. Use FAIL_UNSUPPORTED in that case.
> ---
> 
> v3: Also test for EPERM on pidfd_open, don't mention
>    PTRACE_MODE_ATTACH_REALCREDS since it is just one reason for
>    getting EPERM.
> v2: separate ENOSYS and EPERM checks and FAIL_UNSUPPORTED messages
> 
> https://code.wildebeest.org/git/user/mjw/glibc/commit/?h=container-perms&id=3e1211cb6e3f0dba98201c12610a6cb2cb106d2d
> 
> sysdeps/unix/sysv/linux/tst-pidfd.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/sysdeps/unix/sysv/linux/tst-pidfd.c b/sysdeps/unix/sysv/linux/tst-pidfd.c
> index d93b6faa6f..2655d94636 100644
> --- a/sysdeps/unix/sysv/linux/tst-pidfd.c
> +++ b/sysdeps/unix/sysv/linux/tst-pidfd.c
> @@ -92,11 +92,15 @@ do_test (void)
>   {
>     /* The pidfd_getfd syscall was the last in the set of pidfd related
>        syscalls added to the kernel.  Use pidfd_getfd to decide if this
> -       kernel has pidfd support that we can test.  */
> +       kernel has pidfd support that we can test.  And that we have
> +       permission to use pidfd_getfd.  */
>     int r = pidfd_getfd (0, 0, 1);
>     TEST_VERIFY_EXIT (r == -1);
>     if (errno == ENOSYS)
>       FAIL_UNSUPPORTED ("kernel does not support pidfd_getfd, skipping test");
> +    if (errno == EPERM)
> +      FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
> +			"skipping test");

Sorry, but if this is really failing with EPERM for a valid call the
syscall filtering is broken: either kernel should return ENOSYS
or EINVAL (assuming 1 in an invalid flag, we might need to update
it once kernel starts to implement possible flags).

>   }
> 
>   ppid = getpid ();
> @@ -113,9 +117,15 @@ do_test (void)
>   xclose (sockets[1]);
> 
>   TEST_COMPARE (pidfd_open (-1, 0), -1);
> +  if (errno == EPERM)
> +    FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
> +		      "skipping test");
>   TEST_COMPARE (errno, EINVAL);
> 
>   int pidfd = pidfd_open (pid, 0);
> +  if (pidfd == -1 && errno == EPERM)
> +    FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
> +		      "skipping test");
>   TEST_VERIFY (pidfd != -1);
> 
>   /* Wait for first sigtimedwait.  */
> -- 
> 2.18.4
>
  
Mark Wielaard July 4, 2022, 9:20 p.m. UTC | #2
Hi Adhemerval,

On Mon, Jul 04, 2022 at 04:46:58PM -0300, Adhemerval Zanella wrote:
> > On 1 Jul 2022, at 08:25, Mark Wielaard <mark@klomp.org> wrote:
> > diff --git a/sysdeps/unix/sysv/linux/tst-pidfd.c b/sysdeps/unix/sysv/linux/tst-pidfd.c
> > index d93b6faa6f..2655d94636 100644
> > --- a/sysdeps/unix/sysv/linux/tst-pidfd.c
> > +++ b/sysdeps/unix/sysv/linux/tst-pidfd.c
> > @@ -92,11 +92,15 @@ do_test (void)
> >   {
> >     /* The pidfd_getfd syscall was the last in the set of pidfd related
> >        syscalls added to the kernel.  Use pidfd_getfd to decide if this
> > -       kernel has pidfd support that we can test.  */
> > +       kernel has pidfd support that we can test.  And that we have
> > +       permission to use pidfd_getfd.  */
> >     int r = pidfd_getfd (0, 0, 1);
> >     TEST_VERIFY_EXIT (r == -1);
> >     if (errno == ENOSYS)
> >       FAIL_UNSUPPORTED ("kernel does not support pidfd_getfd, skipping test");
> > +    if (errno == EPERM)
> > +      FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
> > +			"skipping test");
> 
> Sorry, but if this is really failing with EPERM for a valid call the
> syscall filtering is broken: either kernel should return ENOSYS
> or EINVAL (assuming 1 in an invalid flag, we might need to update
> it once kernel starts to implement possible flags).

The call isn't really valid and I don't think we can assume the flag
value is checked first and just produces an EINVAL in this case. The
point of the check is that we cannot run the test if we get an EPERM
at this point. pidfd_getfd can fail with EPERM for various reasons (an
LSM module, a ptrace mode check, a [broken] seccomp filter). In all
cases we simply record the issue and mark the testcase as UNSUPPORTED.

Cheers,

Mark
  
Adhemerval Zanella July 5, 2022, 2:39 a.m. UTC | #3
> On 4 Jul 2022, at 18:20, Mark Wielaard <mark@klomp.org> wrote:
> 
> Hi Adhemerval,
> 
> On Mon, Jul 04, 2022 at 04:46:58PM -0300, Adhemerval Zanella wrote:
>>> On 1 Jul 2022, at 08:25, Mark Wielaard <mark@klomp.org> wrote:
>>> diff --git a/sysdeps/unix/sysv/linux/tst-pidfd.c b/sysdeps/unix/sysv/linux/tst-pidfd.c
>>> index d93b6faa6f..2655d94636 100644
>>> --- a/sysdeps/unix/sysv/linux/tst-pidfd.c
>>> +++ b/sysdeps/unix/sysv/linux/tst-pidfd.c
>>> @@ -92,11 +92,15 @@ do_test (void)
>>>  {
>>>    /* The pidfd_getfd syscall was the last in the set of pidfd related
>>>       syscalls added to the kernel.  Use pidfd_getfd to decide if this
>>> -       kernel has pidfd support that we can test.  */
>>> +       kernel has pidfd support that we can test.  And that we have
>>> +       permission to use pidfd_getfd.  */
>>>    int r = pidfd_getfd (0, 0, 1);
>>>    TEST_VERIFY_EXIT (r == -1);
>>>    if (errno == ENOSYS)
>>>      FAIL_UNSUPPORTED ("kernel does not support pidfd_getfd, skipping test");
>>> +    if (errno == EPERM)
>>> +      FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
>>> +			"skipping test");
>> 
>> Sorry, but if this is really failing with EPERM for a valid call the
>> syscall filtering is broken: either kernel should return ENOSYS
>> or EINVAL (assuming 1 in an invalid flag, we might need to update
>> it once kernel starts to implement possible flags).
> 
> The call isn't really valid and I don't think we can assume the flag
> value is checked first and just produces an EINVAL in this case. The

That’s why the usual kernel interface is suppose to return, otherwise
we can’t really distinguish between a faulty argument or syscall filtering
and a missing kernel support.

> point of the check is that we cannot run the test if we get an EPERM
> at this point. pidfd_getfd can fail with EPERM for various reasons (an
> LSM module, a ptrace mode check, a [broken] seccomp filter). In all
> cases we simply record the issue and mark the testcase as UNSUPPORTED.

No, in glibc we assume that if the syscall is not implemented by the kernel
is should return ENOSYS, not EPERM.  If kernel is doing otherwise it is
returning bogus values and we moved away to either try to emulate fallbacks
it or handle it (just check the issues we had with clone3).

Sorry, but you really need to sort this out on your syscall filtering
mechanism. 

> 
> Cheers,
> 
> Mark
>
  
Carlos O'Donell July 11, 2022, 2:16 p.m. UTC | #4
On 7/4/22 22:39, Adhemerval Zanella via Libc-alpha wrote:
> No, in glibc we assume that if the syscall is not implemented by the kernel
> is should return ENOSYS, not EPERM.  If kernel is doing otherwise it is
> returning bogus values and we moved away to either try to emulate fallbacks
> it or handle it (just check the issues we had with clone3).
> 
> Sorry, but you really need to sort this out on your syscall filtering
> mechanism. 

I agree with Adhemerval.

The glibc testsuite will only mark tests as unsupported when the underlying
system is correctly configured. An incorrectly configured system should result
in the tests being marked as failed to indicate that incorrect configuration.

Running the glibc testsuite requires a configuration where the kernel returns
ENOSYS. If you have a custom LSM module that returns ENOPERM, then that is an
error because such a module should return ENOSYS.

Given we have two developers disagreeing on v3's direction I'm marking this
as rejected in patchwork.
  
Mark Wielaard July 11, 2022, 4:22 p.m. UTC | #5
Hi Adhemerval,

On Mon, 2022-07-04 at 23:39 -0300, Adhemerval Zanella wrote:
> > On 4 Jul 2022, at 18:20, Mark Wielaard <mark@klomp.org> wrote:
> > On Mon, Jul 04, 2022 at 04:46:58PM -0300, Adhemerval Zanella wrote:
> > point of the check is that we cannot run the test if we get an EPERM
> > at this point. pidfd_getfd can fail with EPERM for various reasons (an
> > LSM module, a ptrace mode check, a [broken] seccomp filter). In all
> > cases we simply record the issue and mark the testcase as UNSUPPORTED.
> 
> No, in glibc we assume that if the syscall is not implemented by the kernel
> is should return ENOSYS, not EPERM.  If kernel is doing otherwise it is
> returning bogus values and we moved away to either try to emulate fallbacks
> it or handle it (just check the issues we had with clone3).

I see where you are coming from, but I don't think these situations are
comparable. All we do here is check the test environment, not implement
some fallback mechanism.

That said, I think I went overboard wanting to catch all possible EPERM
causes. And I can see that doing this as a "pre-check" which also
covers the ENOSYS case might be confusing. And also not really
necessary if you don't like that.

I'll post a v4 that drops this early-check and also on the pid_fdopen
calls. Because that really isn't the point here. Just checking that
that a pidfd_getfd with valid pidfd and remotefd gets an EPERM is all
that is really needed. And can really be rejected by the ptrace/lsm
checks without caring whether there are also syscall filters that might
also reject it in other cases.

> Sorry, but you really need to sort this out on your syscall filtering
> mechanism. 

I don't think the syscall filtering mechanism is doing anything wrong
here, but I can see why you say that given I added those extra EPERM
checks where you don't want to see them.

BTW. It isn't "my" syscall filtering mechanism, it really is the
standard one used by docker. Which you will probably encounter in a lot
of container environments. The idea is that we we can use a buildbot
running tests in containers for all glibc developers to do pre-commit
checks and to test patchwork patches with the try-bot (and currently
for all commits to the main branch): 
https://builder.sourceware.org/buildbot/#/builders?tags=glibc

All the configuration is in git:
https://sourceware.org/git/builder.git (see SETUP and README_workers)
If you really think it was setup wrongly (which is certainly a
possibility) please do suggest a patch so we can run the containers in
a better way.

Thanks,

Mark
  
Adhemerval Zanella July 11, 2022, 5 p.m. UTC | #6
> On 11 Jul 2022, at 13:22, Mark Wielaard <mark@klomp.org> wrote:
> 
> Hi Adhemerval,
> 
> On Mon, 2022-07-04 at 23:39 -0300, Adhemerval Zanella wrote:
>>> On 4 Jul 2022, at 18:20, Mark Wielaard <mark@klomp.org> wrote:
>>> On Mon, Jul 04, 2022 at 04:46:58PM -0300, Adhemerval Zanella wrote:
>>> point of the check is that we cannot run the test if we get an EPERM
>>> at this point. pidfd_getfd can fail with EPERM for various reasons (an
>>> LSM module, a ptrace mode check, a [broken] seccomp filter). In all
>>> cases we simply record the issue and mark the testcase as UNSUPPORTED.
>> 
>> No, in glibc we assume that if the syscall is not implemented by the kernel
>> is should return ENOSYS, not EPERM.  If kernel is doing otherwise it is
>> returning bogus values and we moved away to either try to emulate fallbacks
>> it or handle it (just check the issues we had with clone3).
> 
> I see where you are coming from, but I don't think these situations are
> comparable. All we do here is check the test environment, not implement
> some fallback mechanism.
> 
> That said, I think I went overboard wanting to catch all possible EPERM
> causes. And I can see that doing this as a "pre-check" which also
> covers the ENOSYS case might be confusing. And also not really
> necessary if you don't like that.
> 
> I'll post a v4 that drops this early-check and also on the pid_fdopen
> calls. Because that really isn't the point here. Just checking that
> that a pidfd_getfd with valid pidfd and remotefd gets an EPERM is all
> that is really needed. And can really be rejected by the ptrace/lsm
> checks without caring whether there are also syscall filters that might
> also reject it in other cases.

Thanks, from the patch I got the impression that the syscall filtering
was returning EPERM on early check which prevents users check for
syscall availability.

> 
>> Sorry, but you really need to sort this out on your syscall filtering
>> mechanism. 
> 
> I don't think the syscall filtering mechanism is doing anything wrong
> here, but I can see why you say that given I added those extra EPERM
> checks where you don't want to see them.

We had multiple issues with docker filtering syscalls by returning
EPERM where glibc expects ENOSYS (so either a fallback will be used
or to handle the error correctly) and I though it was still the same
issue.

My understanding is it was fixed on latest docker.

> 
> BTW. It isn't "my" syscall filtering mechanism, it really is the
> standard one used by docker. Which you will probably encounter in a lot
> of container environments. The idea is that we we can use a buildbot
> running tests in containers for all glibc developers to do pre-commit
> checks and to test patchwork patches with the try-bot (and currently
> for all commits to the main branch): 
> https://builder.sourceware.org/buildbot/#/builders?tags=glibc
> 
> All the configuration is in git:
> https://sourceware.org/git/builder.git (see SETUP and README_workers)
> If you really think it was setup wrongly (which is certainly a
> possibility) please do suggest a patch so we can run the containers in
> a better way.
> 
> Thanks,
> 
> Mark
  
Mark Wielaard July 18, 2022, 12:26 p.m. UTC | #7
Hi Adhemerval,

On Mon, Jul 11, 2022 at 02:00:33PM -0300, Adhemerval Zanella via Libc-alpha wrote:
> >> Sorry, but you really need to sort this out on your syscall filtering
> >> mechanism. 
> > 
> > I don't think the syscall filtering mechanism is doing anything wrong
> > here, but I can see why you say that given I added those extra EPERM
> > checks where you don't want to see them.
> 
> We had multiple issues with docker filtering syscalls by returning
> EPERM where glibc expects ENOSYS (so either a fallback will be used
> or to handle the error correctly) and I though it was still the same
> issue.
> 
> My understanding is it was fixed on latest docker.

I wish I could say that it is. There certainly have been improvements,
so that a syscall that is newer than any syscall mentioned in the
seccomp filter will produce an ENOSYS instead of an EPERM:
https://github.com/opencontainers/runc/pull/2750 So you won't see that
particular issue with EPERM for new syscalls instead of ENOSYS issue
anymore with newer container runtimes.

But in general seccomp filters are somewhat fragile and hard to get
right. The basic issue is that they run before the kernel gets a
chance to do its own normal checks. And it is hard to write a seccomp
filter that replicates all the normal checks the kernel does. So in
practice you might still receive an EPERM instead of an EINVAL for
example. Also the default for known, but not listed, syscalls is still
EPERM: https://github.com/moby/moby/issues/42871

So yes, there are some fixes in the latest container runtimes, but in
general seccomp filters are slightly fragile and unfortunately used
by default in a lot of cases.

Cheers,

Mark
  
Xi Ruoyao July 18, 2022, 12:31 p.m. UTC | #8
On Mon, 2022-07-18 at 14:26 +0200, Mark Wielaard wrote:

> But in general seccomp filters are somewhat fragile and hard to get
> right. The basic issue is that they run before the kernel gets a
> chance to do its own normal checks. And it is hard to write a seccomp
> filter that replicates all the normal checks the kernel does. So in
> practice you might still receive an EPERM instead of an EINVAL for
> example. Also the default for known, but not listed, syscalls is still
> EPERM: https://github.com/moby/moby/issues/42871
> 
> So yes, there are some fixes in the latest container runtimes, but in
> general seccomp filters are slightly fragile and unfortunately used
> by default in a lot of cases.

So should we really take account those seccomp cases into our testsuite?
If "some container environment" just uses SECCOMP_RET_KILL we'll have no
way to "do things correctly".  To me running Glibc testsuite in an
arbitrary container environment is completely impossible.
  
Adhemerval Zanella July 18, 2022, 12:35 p.m. UTC | #9
On 18/07/22 09:31, Xi Ruoyao wrote:
> On Mon, 2022-07-18 at 14:26 +0200, Mark Wielaard wrote:
> 
>> But in general seccomp filters are somewhat fragile and hard to get
>> right. The basic issue is that they run before the kernel gets a
>> chance to do its own normal checks. And it is hard to write a seccomp
>> filter that replicates all the normal checks the kernel does. So in
>> practice you might still receive an EPERM instead of an EINVAL for
>> example. Also the default for known, but not listed, syscalls is still
>> EPERM: https://github.com/moby/moby/issues/42871
>>
>> So yes, there are some fixes in the latest container runtimes, but in
>> general seccomp filters are slightly fragile and unfortunately used
>> by default in a lot of cases.
> 
> So should we really take account those seccomp cases into our testsuite?
> If "some container environment" just uses SECCOMP_RET_KILL we'll have no
> way to "do things correctly".  To me running Glibc testsuite in an
> arbitrary container environment is completely impossible.

Not only that, but also for some glibc syscalls and internal implementation
it does expects kernel to return ENOSYS, for instance if the architecture
supports clone3 (x86 only for now).  If the container starts to return 
EPERM instead of expected ENOSYS, users will start to see that calls like
pthread_create will just not work (as we saw with some browsers containers,
like chrome and firefox).  We have decided that we will not support such
environments and I think we should extend it to testsuite as well.
  
Florian Weimer July 18, 2022, 12:43 p.m. UTC | #10
* Adhemerval Zanella via Libc-alpha:

> We had multiple issues with docker filtering syscalls by returning
> EPERM where glibc expects ENOSYS (so either a fallback will be used
> or to handle the error correctly) and I though it was still the same
> issue.
>
> My understanding is it was fixed on latest docker.

Moby regresses from time to time because people updating the default
filters are not aware of the extra work required to avoid them.

Thanks,
Florian
  
Carlos O'Donell July 18, 2022, 1:51 p.m. UTC | #11
On 7/18/22 08:35, Adhemerval Zanella Netto via Libc-alpha wrote:
> 
> 
> On 18/07/22 09:31, Xi Ruoyao wrote:
>> On Mon, 2022-07-18 at 14:26 +0200, Mark Wielaard wrote:
>>
>>> But in general seccomp filters are somewhat fragile and hard to get
>>> right. The basic issue is that they run before the kernel gets a
>>> chance to do its own normal checks. And it is hard to write a seccomp
>>> filter that replicates all the normal checks the kernel does. So in
>>> practice you might still receive an EPERM instead of an EINVAL for
>>> example. Also the default for known, but not listed, syscalls is still
>>> EPERM: https://github.com/moby/moby/issues/42871
>>>
>>> So yes, there are some fixes in the latest container runtimes, but in
>>> general seccomp filters are slightly fragile and unfortunately used
>>> by default in a lot of cases.
>>
>> So should we really take account those seccomp cases into our testsuite?
>> If "some container environment" just uses SECCOMP_RET_KILL we'll have no
>> way to "do things correctly".  To me running Glibc testsuite in an
>> arbitrary container environment is completely impossible.
> 
> Not only that, but also for some glibc syscalls and internal implementation
> it does expects kernel to return ENOSYS, for instance if the architecture
> supports clone3 (x86 only for now).  If the container starts to return 
> EPERM instead of expected ENOSYS, users will start to see that calls like
> pthread_create will just not work (as we saw with some browsers containers,
> like chrome and firefox).  We have decided that we will not support such
> environments and I think we should extend it to testsuite as well.

I agree.

Just for clarity. *I* want to be able to run as much of the glibc testsuite as I can
within a container because our pre-commit CI is doing exactly that. This requires careful
attention to the VM and container environment, and that's OK for pre-commit CI runs.

We have our own runner for the "dj/32-bit" i686 testing in patchwork CI which uses the
Fedora Rawhide container to do the testing. We control the VM and container.
  

Patch

diff --git a/sysdeps/unix/sysv/linux/tst-pidfd.c b/sysdeps/unix/sysv/linux/tst-pidfd.c
index d93b6faa6f..2655d94636 100644
--- a/sysdeps/unix/sysv/linux/tst-pidfd.c
+++ b/sysdeps/unix/sysv/linux/tst-pidfd.c
@@ -92,11 +92,15 @@  do_test (void)
   {
     /* The pidfd_getfd syscall was the last in the set of pidfd related
        syscalls added to the kernel.  Use pidfd_getfd to decide if this
-       kernel has pidfd support that we can test.  */
+       kernel has pidfd support that we can test.  And that we have
+       permission to use pidfd_getfd.  */
     int r = pidfd_getfd (0, 0, 1);
     TEST_VERIFY_EXIT (r == -1);
     if (errno == ENOSYS)
       FAIL_UNSUPPORTED ("kernel does not support pidfd_getfd, skipping test");
+    if (errno == EPERM)
+      FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
+			"skipping test");
   }
 
   ppid = getpid ();
@@ -113,9 +117,15 @@  do_test (void)
   xclose (sockets[1]);
 
   TEST_COMPARE (pidfd_open (-1, 0), -1);
+  if (errno == EPERM)
+    FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
+		      "skipping test");
   TEST_COMPARE (errno, EINVAL);
 
   int pidfd = pidfd_open (pid, 0);
+  if (pidfd == -1 && errno == EPERM)
+    FAIL_UNSUPPORTED ("don't have permission to use pidfd_getfd, "
+		      "skipping test");
   TEST_VERIFY (pidfd != -1);
 
   /* Wait for first sigtimedwait.  */