[2/3] Linux: Use rseq in sched_getcpu if available (v9)

Message ID 20200629190036.26982-3-mathieu.desnoyers@efficios.com
State Committed
Commit 6e29cb3f61ff5432c78a1c84b0d9b123a350ab36
Headers
Series Restartable Sequences enablement |

Commit Message

Mathieu Desnoyers June 29, 2020, 7 p.m. UTC
  When available, use the cpu_id field from __rseq_abi on Linux to
implement sched_getcpu().  Fall-back on the vgetcpu vDSO if unavailable.

Benchmarks:

x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

glibc sched_getcpu():                     13.7 ns (baseline)
glibc sched_getcpu() using rseq:           2.5 ns (speedup:  5.5x)
inline load cpuid from __rseq_abi TLS:     0.8 ns (speedup: 17.1x)

CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
CC: Joseph Myers <joseph@codesourcery.com>
CC: Szabolcs Nagy <szabolcs.nagy@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Maurer <bmaurer@fb.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Paul Turner <pjt@google.com>
CC: libc-alpha@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
---
Changes since v1:
- rseq is only used if both __NR_rseq and RSEQ_SIG are defined.

Changes since v2:
- remove duplicated __rseq_abi extern declaration.

Changes since v3:
- update ChangeLog.

Changes since v4:
- Use atomic_load_relaxed to load the __rseq_abi.cpu_id field, a
  consequence of the fact that __rseq_abi is not volatile anymore.
- Include atomic.h which provides atomic_load_relaxed.

Changes since v5:
- Use __ASSUME_RSEQ to detect rseq availability.

Changes since v6:
- Remove use of __ASSUME_RSEQ.

Changes since v7:
- Fix incorrect merge with commit d0def09ff6 ("linux: Fix vDSO macros
  build with time64 interfaces")

Changes since v8:
- Update patch title.
- Add /* RSEQ_SIG */ for #else and #endif.
---
 sysdeps/unix/sysv/linux/sched_getcpu.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)
  

Comments

Florian Weimer July 6, 2020, 1:59 p.m. UTC | #1
* Mathieu Desnoyers:

> When available, use the cpu_id field from __rseq_abi on Linux to
> implement sched_getcpu().  Fall-back on the vgetcpu vDSO if
> unavailable.

I've pushed this to glibc master, but unfortunately it looks like this
exposes a kernel bug related to affinity mask changes.

After building and testing glibc, this

  for x in {1..2000} ; do posix/tst-affinity-static  & done

produces some “error:” lines for me:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

“expected 0” is a result of how the test has been written, it bails out
on the first failure, which happens with CPU ID 0.

Smaller systems can use a smaller count than 2000 to reproduce this.  It
also happens sporadically when running the glibc test suite itself
(which is why it took further testing to reveal this issue).

I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
4.18.0-193.el8 (all x86_64).

As to the cause, I'd guess that the exit path in the sched_setaffinity
system call fails to update the rseq area, so that userspace can observe
the outdated CPU ID there.

Thanks,
Florian
  
Mathieu Desnoyers July 6, 2020, 2:49 p.m. UTC | #2
----- On Jul 6, 2020, at 9:59 AM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> When available, use the cpu_id field from __rseq_abi on Linux to
>> implement sched_getcpu().  Fall-back on the vgetcpu vDSO if
>> unavailable.
> 
> I've pushed this to glibc master, but unfortunately it looks like this
> exposes a kernel bug related to affinity mask changes.
> 
> After building and testing glibc, this
> 
>  for x in {1..2000} ; do posix/tst-affinity-static  & done
> 
> produces some “error:” lines for me:
> 
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> 
> “expected 0” is a result of how the test has been written, it bails out
> on the first failure, which happens with CPU ID 0.
> 
> Smaller systems can use a smaller count than 2000 to reproduce this.  It
> also happens sporadically when running the glibc test suite itself
> (which is why it took further testing to reveal this issue).
> 
> I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
> Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
> 4.18.0-193.el8 (all x86_64).
> 
> As to the cause, I'd guess that the exit path in the sched_setaffinity
> system call fails to update the rseq area, so that userspace can observe
> the outdated CPU ID there.

Hi Florian,

We have a similar test in Linux, see tools/testing/selftests/rseq/basic_test.c.
That test does not trigger this issue, even when executed repeatedly.

I'll investigate further what is happening within the glibc test.

Thanks,

Mathieu
  
Mathieu Desnoyers July 6, 2020, 5:33 p.m. UTC | #3
----- On Jul 6, 2020, at 10:49 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jul 6, 2020, at 9:59 AM, Florian Weimer fweimer@redhat.com wrote:
> 
>> * Mathieu Desnoyers:
>> 
>>> When available, use the cpu_id field from __rseq_abi on Linux to
>>> implement sched_getcpu().  Fall-back on the vgetcpu vDSO if
>>> unavailable.
>> 
>> I've pushed this to glibc master, but unfortunately it looks like this
>> exposes a kernel bug related to affinity mask changes.
>> 
>> After building and testing glibc, this
>> 
>>  for x in {1..2000} ; do posix/tst-affinity-static  & done
>> 
>> produces some “error:” lines for me:
>> 
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> 
>> “expected 0” is a result of how the test has been written, it bails out
>> on the first failure, which happens with CPU ID 0.
>> 
>> Smaller systems can use a smaller count than 2000 to reproduce this.  It
>> also happens sporadically when running the glibc test suite itself
>> (which is why it took further testing to reveal this issue).
>> 
>> I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
>> Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
>> 4.18.0-193.el8 (all x86_64).
>> 
>> As to the cause, I'd guess that the exit path in the sched_setaffinity
>> system call fails to update the rseq area, so that userspace can observe
>> the outdated CPU ID there.
> 
> Hi Florian,
> 
> We have a similar test in Linux, see tools/testing/selftests/rseq/basic_test.c.
> That test does not trigger this issue, even when executed repeatedly.
> 
> I'll investigate further what is happening within the glibc test.

Here are the missing kernel bits. Just adding those in wake_up_new_task() is
enough to make the problem go away, but to err on the safe side I'm planning to
add an rseq_migrate within sched_fork:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1757381cabd4..ff83f790a9b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3043,6 +3043,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
         * so use __set_task_cpu().
         */
        __set_task_cpu(p, smp_processor_id());
+       rseq_migrate(p);
        if (p->sched_class->task_fork)
                p->sched_class->task_fork(p);
        raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -3103,6 +3104,7 @@ void wake_up_new_task(struct task_struct *p)
         */
        p->recent_used_cpu = task_cpu(p);
        __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+       rseq_migrate(p);
 #endif
        rq = __task_rq_lock(p, &rf);
        update_rq_clock(rq);

I'm not sure why your test catches it fairly quickly but ours does not. That's a good
catch.

Now we need to discuss how we introduce that fix in a way that will allow user-space
to trust the __rseq_abi.cpu_id field's content.

The usual approach to kernel bug fixing is typically to push the fix, mark it for
stable kernels, and expect everyone to pick up the fixes. I wonder how comfortable
glibc would be to replace its sched_getcpu implementation with a broken-until-fixed
kernel rseq implementation without any mechanism in place to know whether it can
trust the value of the cpu_id field. I am extremely reluctant to do so.

One possible way to introduce this fix would be to use the rseq flags argument to allow
glibc to query whether the kernel implements a rseq with a cpu_id field it can trust.
So glibc could do, for registration:

  ret = INTERNAL_SYSCALL_CALL (rseq, &__rseq_abi, sizeof (struct rseq),
                               RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID,
                               RSEQ_SIG);

(I'm open to bike-shedding the actual flag name)

So if the kernel does not implement the fix, the registration would return EINVAL.
When getting EINVAL like this, we could then re-issue the rseq syscall:

  ret = INTERNAL_SYSCALL_CALL (rseq, NULL, 0, RSEQ_FLAG_RELIABLE_CPU_ID, 0);

to check whether EINVAL has been caused by other invalid system call parameters,
or it's really because the RSEQ_FLAG_RELIABLE_CPU_ID flag is unknown. Being able
to distinguish between invalid parameters and unknown flags like this would end
up requiring one extra system call on failed registration attempt on kernels which
implement a broken rseq.

This also means glibc would only start really activating rseq on kernels implementing
this fix.

Thoughts ?

Thanks,

Mathieu
  
Florian Weimer July 6, 2020, 5:50 p.m. UTC | #4
* Mathieu Desnoyers:

> Now we need to discuss how we introduce that fix in a way that will
> allow user-space to trust the __rseq_abi.cpu_id field's content.

I don't think that's necessary.  We can mention it in the glibc
distribution notes on the wiki.

> The usual approach to kernel bug fixing is typically to push the fix,
> mark it for stable kernels, and expect everyone to pick up the
> fixes. I wonder how comfortable glibc would be to replace its
> sched_getcpu implementation with a broken-until-fixed kernel rseq
> implementation without any mechanism in place to know whether it can
> trust the value of the cpu_id field. I am extremely reluctant to do
> so.

We have already had similar regressions in sched_getcpu, and we didn't
put anything into glibc to deal with those.

Just queue the fix for the stable kernels.  I expect that all
distributions track stable kernel branches in some way, so just put into
the kernel commit message that this commit is needed for a working
sched_getcpu in glibc 2.32 and later.

Once the upstream fix is in Linus' tree, I'm going to file a request to
backport the fix into the Red Hat Enterprise Linux 8.

Thanks for finding the root cause so quickly,
Florian
  
Mathieu Desnoyers July 6, 2020, 6:02 p.m. UTC | #5
----- On Jul 6, 2020, at 1:50 PM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> Now we need to discuss how we introduce that fix in a way that will
>> allow user-space to trust the __rseq_abi.cpu_id field's content.
> 
> I don't think that's necessary.  We can mention it in the glibc
> distribution notes on the wiki.
> 
>> The usual approach to kernel bug fixing is typically to push the fix,
>> mark it for stable kernels, and expect everyone to pick up the
>> fixes. I wonder how comfortable glibc would be to replace its
>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>> implementation without any mechanism in place to know whether it can
>> trust the value of the cpu_id field. I am extremely reluctant to do
>> so.
> 
> We have already had similar regressions in sched_getcpu, and we didn't
> put anything into glibc to deal with those.

Was that acceptable because having a wrong cpu number would never trigger
corruption, only slowdowns ?

In the case of rseq, having the wrong cpu_id value is a real issue which
will lead to corruption and crashes. So I maintain my reluctance to introduce
the fix without any way for userspace to know whether the cpu_id field
value is reliable.

What were the reasons why it was OK to have this kind of regression in
sched_getcpu in the past, and are they still valid in the context of
rseq ?

Thanks,

Mathieu

> 
> Just queue the fix for the stable kernels.  I expect that all
> distributions track stable kernel branches in some way, so just put into
> the kernel commit message that this commit is needed for a working
> sched_getcpu in glibc 2.32 and later.
> 
> Once the upstream fix is in Linus' tree, I'm going to file a request to
> backport the fix into the Red Hat Enterprise Linux 8.
> 
> Thanks for finding the root cause so quickly,
> Florian
  
Florian Weimer July 6, 2020, 6:11 p.m. UTC | #6
* Mathieu Desnoyers:

> ----- On Jul 6, 2020, at 1:50 PM, Florian Weimer fweimer@redhat.com wrote:
>
>> * Mathieu Desnoyers:
>> 
>>> Now we need to discuss how we introduce that fix in a way that will
>>> allow user-space to trust the __rseq_abi.cpu_id field's content.
>> 
>> I don't think that's necessary.  We can mention it in the glibc
>> distribution notes on the wiki.
>> 
>>> The usual approach to kernel bug fixing is typically to push the fix,
>>> mark it for stable kernels, and expect everyone to pick up the
>>> fixes. I wonder how comfortable glibc would be to replace its
>>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>>> implementation without any mechanism in place to know whether it can
>>> trust the value of the cpu_id field. I am extremely reluctant to do
>>> so.
>> 
>> We have already had similar regressions in sched_getcpu, and we didn't
>> put anything into glibc to deal with those.
>
> Was that acceptable because having a wrong cpu number would never trigger
> corruption, only slowdowns ?

First of all, it's a kernel bug.  It's rare that we put workarounds for
kernel bugs into glibc.

And yes, in pretty much all cases it's just a performance issue for
sched_getcpu.  When you know the CPU ID of a thread due to pinning to a
single CPU, why would you call sched_getcpu?  (That's the case where you
could get corruption in theory.)

> In the case of rseq, having the wrong cpu_id value is a real issue
> which will lead to corruption and crashes. So I maintain my reluctance
> to introduce the fix without any way for userspace to know whether the
> cpu_id field value is reliable.

Yes, for rseq itself, the scenario is somewhat different.  Still, it's
just another kernel bug.  There will be others. 8-/

From a schedule point of view, it looks tough to get the magic flag into
the mainline kernel in time for the upcoming glibc 2.32 release.  If you
insist on registering rseq only if the bug is not present, we'll
probably have to back out some or all of the rseq changes.

Thanks,
Florian
  
Mathieu Desnoyers July 6, 2020, 9:08 p.m. UTC | #7
----- On Jul 6, 2020, at 2:11 PM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> ----- On Jul 6, 2020, at 1:50 PM, Florian Weimer fweimer@redhat.com wrote:
>>
>>> * Mathieu Desnoyers:
>>> 
>>>> Now we need to discuss how we introduce that fix in a way that will
>>>> allow user-space to trust the __rseq_abi.cpu_id field's content.
>>> 
>>> I don't think that's necessary.  We can mention it in the glibc
>>> distribution notes on the wiki.
>>> 
>>>> The usual approach to kernel bug fixing is typically to push the fix,
>>>> mark it for stable kernels, and expect everyone to pick up the
>>>> fixes. I wonder how comfortable glibc would be to replace its
>>>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>>>> implementation without any mechanism in place to know whether it can
>>>> trust the value of the cpu_id field. I am extremely reluctant to do
>>>> so.
>>> 
>>> We have already had similar regressions in sched_getcpu, and we didn't
>>> put anything into glibc to deal with those.
>>
>> Was that acceptable because having a wrong cpu number would never trigger
>> corruption, only slowdowns ?
> 
> First of all, it's a kernel bug.  It's rare that we put workarounds for
> kernel bugs into glibc.
> 
> And yes, in pretty much all cases it's just a performance issue for
> sched_getcpu.  When you know the CPU ID of a thread due to pinning to a
> single CPU, why would you call sched_getcpu?  (That's the case where you
> could get corruption in theory.)
> 
>> In the case of rseq, having the wrong cpu_id value is a real issue
>> which will lead to corruption and crashes. So I maintain my reluctance
>> to introduce the fix without any way for userspace to know whether the
>> cpu_id field value is reliable.
> 
> Yes, for rseq itself, the scenario is somewhat different.  Still, it's
> just another kernel bug.  There will be others. 8-/
> 
> From a schedule point of view, it looks tough to get the magic flag into
> the mainline kernel in time for the upcoming glibc 2.32 release.  If you
> insist on registering rseq only if the bug is not present, we'll
> probably have to back out some or all of the rseq changes.

I've just submitted the fix and a the new rseq flag as RFC to lkml:

https://lore.kernel.org/lkml/20200706204913.20347-1-mathieu.desnoyers@efficios.com/

Let's see how quickly we can come to an agreement on this on the kernel
side.

Thanks,

Mathieu
  

Patch

diff --git a/sysdeps/unix/sysv/linux/sched_getcpu.c b/sysdeps/unix/sysv/linux/sched_getcpu.c
index c019cfb3cf..c0f992e056 100644
--- a/sysdeps/unix/sysv/linux/sched_getcpu.c
+++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
@@ -18,10 +18,12 @@ 
 #include <errno.h>
 #include <sched.h>
 #include <sysdep.h>
+#include <atomic.h>
 #include <sysdep-vdso.h>
+#include <sys/rseq.h>
 
-int
-sched_getcpu (void)
+static int
+vsyscall_sched_getcpu (void)
 {
   unsigned int cpu;
   int r = -1;
@@ -32,3 +34,19 @@  sched_getcpu (void)
 #endif
   return r == -1 ? r : cpu;
 }
+
+#ifdef RSEQ_SIG
+int
+sched_getcpu (void)
+{
+  int cpu_id = atomic_load_relaxed (&__rseq_abi.cpu_id);
+
+  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
+}
+#else /* RSEQ_SIG */
+int
+sched_getcpu (void)
+{
+  return vsyscall_sched_getcpu ();
+}
+#endif /* RSEQ_SIG */