[RFC,v2,1/4] rseq: Add sched_state field to struct rseq

Message ID 20230529191416.53955-2-mathieu.desnoyers@efficios.com
State Not applicable
Series Extend rseq with sched_state_ptr field

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch fail Patch failed to apply to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm fail Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 fail Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 pending Test started
linaro-tcwg-bot/tcwg_glibc_check--master-arm pending Test started

Commit Message

Mathieu Desnoyers May 29, 2023, 7:14 p.m. UTC
Expose the "on-cpu" state for each thread through struct rseq to allow
adaptive mutexes to decide more accurately between busy-waiting and
calling sys_futex() to release the CPU, based on the on-cpu state of the
mutex owner.

It is only provided as an optimization hint, because there is no
guarantee that the page containing this field is in the page cache, and
therefore the scheduler may very well fail to clear the on-cpu state on
preemption. This is expected to be rare though, and is resolved as soon
as the task returns to user-space.

The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org
---
 include/linux/sched.h     | 16 +++++++++++++++
 include/uapi/linux/rseq.h | 41 +++++++++++++++++++++++++++++++++++++
 kernel/rseq.c             | 43 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+)
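
To illustrate the intended use on the userspace side, here is a minimal
sketch of an adaptive lock consuming this hint. Only struct
rseq_sched_state and RSEQ_SCHED_STATE_FLAG_ON_CPU come from this series;
the lock layout, the owner_sched_state publication and the futex_wait()
helper are assumptions made for the example:

/*
 * Sketch only: assumes the owner publishes a pointer to its
 * struct rseq_sched_state in lock->owner_sched_state when it acquires
 * the lock, and that futex_wait() wraps FUTEX_WAIT on lock->word.
 */
static void adaptive_lock(struct my_lock *lock)
{
	while (__atomic_exchange_n(&lock->word, 1, __ATOMIC_ACQUIRE) != 0) {
		struct rseq_sched_state *owner;

		owner = __atomic_load_n(&lock->owner_sched_state, __ATOMIC_RELAXED);
		/* Busy-wait only while the owner is reported on-cpu. */
		if (owner &&
		    (__atomic_load_n(&owner->state, __ATOMIC_RELAXED) &
		     RSEQ_SCHED_STATE_FLAG_ON_CPU))
			continue;
		/* Owner likely preempted or sleeping: release the CPU. */
		futex_wait(&lock->word, 1);
	}
}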
  

Comments

Florian Weimer May 29, 2023, 7:35 p.m. UTC | #1
* Mathieu Desnoyers:

> +/*
> + * rseq_sched_state should be aligned on the cache line size.
> + */
> +struct rseq_sched_state {
> +	/*
> +	 * Version of this structure. Populated by the kernel, read by
> +	 * user-space.
> +	 */
> +	__u32 version;
> +	/*
> +	 * The state is updated by the kernel. Read by user-space with
> +	 * single-copy atomicity semantics. This field can be read by any
> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
> +	 * rseq_sched_state_flags. This field is provided as a hint by the
> +	 * scheduler, and requires that the page holding this state is
> +	 * faulted-in for the state update to be performed by the scheduler.
> +	 */
> +	__u32 state;
> +	/*
> +	 * Thread ID associated with the thread registering this structure.
> +	 * Initialized by user-space before registration.
> +	 */
> +	__u32 tid;
> +};

How does the version handshake protocol work in practice?  Given that this
is user-allocated?

I don't see why we can't stick this directly into struct rseq because
it's all public anyway.

The TID field would be useful in its own right.

Thanks,
Florian
  
Mathieu Desnoyers May 29, 2023, 7:48 p.m. UTC | #2
On 5/29/23 15:35, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> +/*
>> + * rseq_sched_state should be aligned on the cache line size.
>> + */
>> +struct rseq_sched_state {
>> +	/*
>> +	 * Version of this structure. Populated by the kernel, read by
>> +	 * user-space.
>> +	 */
>> +	__u32 version;
>> +	/*
>> +	 * The state is updated by the kernel. Read by user-space with
>> +	 * single-copy atomicity semantics. This field can be read by any
>> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
>> +	 * rseq_sched_state_flags. This field is provided as a hint by the
>> +	 * scheduler, and requires that the page holding this state is
>> +	 * faulted-in for the state update to be performed by the scheduler.
>> +	 */
>> +	__u32 state;
>> +	/*
>> +	 * Thread ID associated with the thread registering this structure.
>> +	 * Initialized by user-space before registration.
>> +	 */
>> +	__u32 tid;
>> +};
> 
> How does the version handshake protocol work in practice?  Given that this
> is user-allocated?

Good point, I must admit that I have not thought this specific version 
protocol through. :) As you say, userspace is responsible for 
allocation, and the kernel is responsible for implementing features.

Let's first see if we can get away with embedding these fields in struct 
rseq.

> 
> I don't see why we can't stick this directly into struct rseq because
> it's all public anyway.

The motivation for moving this to a different cache line is to handle 
the prior comment from Boqun, who is concerned that busy-waiting 
repeatedly loading a field from struct rseq will cause false-sharing and 
make other stores to that cache line slower, especially stores to 
rseq_cs to begin rseq critical sections, thus slightly increasing the 
overhead of rseq critical sections taken while mutexes are held.

If we want to embed this field into struct rseq with its own cache line, 
then we need to add a lot of padding, which is inconvenient.

That being said, perhaps this is premature optimization, what do you think ?

> 
> The TID field would be useful in its own right.

Indeed, good point.

While we are there, I wonder if we should use the thread_pointer() as 
lock identifier, or if the address of struct rseq is fine ?

Thanks,

Mathieu

> 
> Thanks,
> Florian
>
  
Florian Weimer May 30, 2023, 8:20 a.m. UTC | #3
* Mathieu Desnoyers:

>> I don't see why we can't stick this directly into struct rseq because
>> it's all public anyway.
>
> The motivation for moving this to a different cache line is to handle
> the prior comment from Boqun, who is concerned that busy-waiting
> repeatedly loading a field from struct rseq will cause false-sharing
> and make other stores to that cache line slower, especially stores to
> rseq_cs to begin rseq critical sections, thus slightly increasing the
> overhead of rseq critical sections taken while mutexes are held.

Hmm.  For context, in glibc, we have to place struct rseq on a fixed
offset from the start of a page (or even some larger alignment) for all
threads.  In the future (once we move the thread control block off the
top of the userspace stack, where it resides since the LinuxThreads
days), it is likely that the pointer difference between different
threads will also be a multiple of a fairly large power of two
(something like 2**20 might be common).  Maybe this will make caching
even more difficult?

> If we want to embed this field into struct rseq with its own cache
> line, then we need to add a lot of padding, which is inconvenient.
>
> That being said, perhaps this is premature optimization, what do you
> think ?

Maybe?  I don't know what the access patterns will look like.  But I
suspect that once we hit this case, performance will not be great
anyway, so the optimization is perhaps unnecessary?

The challenge is that once we put stuff at fixed offsets, we can't
transparently fix it later.  It would need more auxv entries with
further offsets, or accessing this data through some indirection,
perhaps via vDSO helpers.

>> The TID field would be useful in its own right.
>
> Indeed, good point.
>
> While we are there, I wonder if we should use the thread_pointer() as
> lock identifier, or if the address of struct rseq is fine ?

Hard to tell until we see what the futex integration looks like, I
think.

Thanks,
Florian
  
Mathieu Desnoyers May 30, 2023, 2:25 p.m. UTC | #4
On 5/30/23 04:20, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>>> I don't see why we can't stick this directly into struct rseq because
>>> it's all public anyway.
>>
>> The motivation for moving this to a different cache line is to handle
>> the prior comment from Boqun, who is concerned that busy-waiting
>> repeatedly loading a field from struct rseq will cause false-sharing
>> and make other stores to that cache line slower, especially stores to
>> rseq_cs to begin rseq critical sections, thus slightly increasing the
>> overhead of rseq critical sections taken while mutexes are held.
> 
> Hmm.  For context, in glibc, we have to place struct rseq on a fixed
> offset from the start of a page (or even some larger alignment) for all
> threads.  In the future (once we move the thread control block off the
> top of the userspace stack, where it resides since the LinuxThreads
> days), it is likely that the pointer difference between different
> threads will also be a multiple of a fairly large power of two
> (something like 2**20 might be common).  Maybe this will make caching
> even more difficult?
> 
>> If we want to embed this field into struct rseq with its own cache
>> line, then we need to add a lot of padding, which is inconvenient.
>>
>> That being said, perhaps this is premature optimization, what do you
>> think ?
> 
> Maybe?  I don't know how the access patterns will look like.  But I
> suspect that once we hit this case, performance will not be great
> anyway, so the optimization is perhaps unnecessary?

What I dislike though is that contention for any lock which busy-waits 
on the rseq sched_state would slow down all rseq critical sections of 
that thread, which is a side-effect we want to avoid.

I've done some additional benchmarks on my 8-core AMD laptop, and I
notice that things get especially bad whenever the store to
rseq_abi->rseq_cs is surrounded by other instructions that need to be
ordered with that store, e.g. a for loop doing 10 stores to local
variables. If it's surrounded by instructions that don't need to be
ordered wrt that store (e.g. a for loop of 10 iterations issuing
barrier() "memory" asm clobbers), then the overhead is no longer
noticeable.

> 
> The challenge is that once we put stuff at fixed offsets, we can't
> transparently fix it later.  It would need more auxv entries with
> further offsets, or accessing this data through some indirection,
> perhaps via vDSO helpers.

Perhaps this is more flexibility/complexity than we really need. One 
possible approach would be to split struct rseq into sub-structures, e.g.:

rseq_len = overall size of all sub-structures.
auxv AT_RSEQ_ALIGN = 256

auxv AT_RSEQ_FEATURE_SIZE = size of first portion of struct rseq,
                             at most 256 bytes, meant to contain fields
                             stored/loaded from the thread doing the
                             registration.
auxv AT_RSEQ_SHARED_FEATURE_SIZE =
                             size of 2nd portion of struct rseq,
                             starts at offset 256, at most 256 bytes,
                             meant to contain fields stored/loaded by
                             any thread.

Then we have this layout:

struct rseq {
   struct rseq_local {
     /* Fields accessed from local thread. */

   } __attribute__((aligned(256)));
   struct rseq_shared {
     /* Shared fields. */

   } __attribute__((aligned(256)));
} __attribute__((aligned(256)));

And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes
(32 * u64), we can still extend with a new auxv entry after the "shared"
features.


> 
>>> The TID field would be useful in its own right.
>>
>> Indeed, good point.
>>
>> While we are there, I wonder if we should use the thread_pointer() as
>> lock identifier, or if the address of struct rseq is fine ?
> 
> Hard to tell until we'll see what the futex integration looks like, I
> think.

Good point. I can choose one way or another for the prototype, and then 
we'll see how things go with futex integration.

Thanks,

Mathieu
  
Mathieu Desnoyers May 30, 2023, 3:13 p.m. UTC | #5
On 5/30/23 10:25, Mathieu Desnoyers wrote:
> On 5/30/23 04:20, Florian Weimer wrote:
[...]
>>
>> The challenge is that once we put stuff at fixed offsets, we can't
>> transparently fix it later.  It would need more auxv entries with
>> further offsets, or accessing this data through some indirection,
>> perhaps via vDSO helpers.
> 
> Perhaps this is more flexibility/complexity than we really need. One 
> possible approach would be to split struct rseq into sub-structures, e.g.:
> 
> rseq_len = overall size of all sub-structures.
> auxv AT_RSEQ_ALIGN = 256
> 
> auxv AT_RSEQ_FEATURE_SIZE = size of first portion of struct rseq,
>                              at most 256 bytes, meant to contain fields
>                              stored/loaded from the thread doing the
>                              registration.
> auxv AT_RSEQ_SHARED_FEATURE_SIZE =
>                              size of 2nd portion of struct rseq,
>                              starts at offset 256, at most 256 bytes,
>                              meant to contain fields stored/loaded by
>                              any thread.
> 
> Then we have this layout:
> 
> struct rseq {
>    struct rseq_local {
>      /* Fields accessed from local thread. */
> 
>    } __attribute__((aligned(256)));
>    struct rseq_shared {
>      /* Shared fields. */
> 
>    } __attribute__((aligned(256)));
> } __attribute__((aligned(256)));
> 
> And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes
> (32 * u64), we can still extend with a new auxv entry after the "shared"
> features.

Actually, after giving it some more thoughts, I think we can do better:

- Add a sys_rseq() rseq_flag RSEQ_FLAG_SHARED, which changes the 
behavior of sys_rseq() to expect an additional "struct rseq_shared *" 
argument.

- Introduce auxv AT_RSEQ_SHARED_FEATURE_SIZE.

This way, it's up to the libc to decide how to allocate its private vs 
shared rseq structures.

The auxv "AT_RSEQ_ALIGN" would dictate the minimal alignment required 
for both private and shared rseq structures.

I don't think we need to express the size of the rseq_shared memory area 
allocated by libc because we know that it needs to be large enough to 
handle the shared feature size.
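
Roughly, the registration under that proposal would look like the
following sketch (the flag, the extra argument and the allocation helper
are all part of the proposal, none of this is existing ABI):

/* Proposal sketch only, not existing ABI. */
struct rseq_shared *shared;

/*
 * libc decides where the shared part lives (e.g. a pool that is never
 * unmapped), aligned as dictated by auxv AT_RSEQ_ALIGN.
 */
shared = alloc_shared_rseq_area();	/* hypothetical helper */

/* Private part registered as today; RSEQ_FLAG_SHARED passes the shared part too. */
sys_rseq(rseq, rseq_len, RSEQ_FLAG_SHARED, RSEQ_SIG, shared);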

Thoughts ?

Thanks,

Mathieu
  
Dmitry Vyukov Sept. 26, 2023, 8:52 p.m. UTC | #6
>> I don't see why we can't stick this directly into struct rseq because
>> it's all public anyway.
>
> The motivation for moving this to a different cache line is to handle 
> the prior comment from Boqun, who is concerned that busy-waiting 
> repeatedly loading a field from struct rseq will cause false-sharing and 
> make other stores to that cache line slower, especially stores to 
> rseq_cs to begin rseq critical sections, thus slightly increasing the 
> overhead of rseq critical sections taken while mutexes are held.
>
> If we want to embed this field into struct rseq with its own cache line, 
> then we need to add a lot of padding, which is inconvenient.
>
> That being said, perhaps this is premature optimization, what do you think ?

Hi Mathieu, Florian,

This is exciting!

I thought the motivation for moving rseq_sched_state out of struct rseq
is a lifetime management problem. I assume when a thread locks a mutex,
it stores pointer to rseq_sched_state in the mutex state for other
threads to poll. So the waiting thread would do something along the following
lines:

struct rseq_sched_state *state = __atomic_load_n(&mutex->sched_state, __ATOMIC_RELAXED);
if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
	futex_wait();

Now if the state is struct rseq, which is stored in TLS,
then the owning thread can unlock the mutex, exit and unmap TLS in between.
Consequently, load of state->state will cause a paging fault.

And we do want rseq in TLS to save 1 indirection.

If rseq_sched_state is separated from struct rseq, then it can be allocated
in type stable memory that is never unmapped.

What am I missing here?

However, if we can store this state in struct rseq, then an alternative
interface would be for the kernel to do:

rseq->cpu_id = -1;

to denote that the thread is not running on any CPU.
I think it kinda makes sense, rseq->cpu_id is the thread's current CPU,
and -1 naturally means "not running at all". And we already store -1
right after init, so it shouldn't be a surprising value.
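
With that variant, the waiter side of the earlier snippet would become
something along these lines (sketch; the mutex layout and the lifetime of
the owner's struct rseq are still open questions):

struct rseq *owner_rseq;

owner_rseq = __atomic_load_n(&mutex->owner_rseq, __ATOMIC_RELAXED);
if (owner_rseq &&
    __atomic_load_n(&owner_rseq->cpu_id, __ATOMIC_RELAXED) == (__u32)-1)
	futex_wait();	/* owner is not running on any CPU */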
  
Dmitry Vyukov Sept. 26, 2023, 11:49 p.m. UTC | #7
On Tue, 26 Sept 2023 at 13:52, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> >> I don't see why we can't stick this directly into struct rseq because
> >> it's all public anyway.
> >
> > The motivation for moving this to a different cache line is to handle
> > the prior comment from Boqun, who is concerned that busy-waiting
> > repeatedly loading a field from struct rseq will cause false-sharing and
> > make other stores to that cache line slower, especially stores to
> > rseq_cs to begin rseq critical sections, thus slightly increasing the
> > overhead of rseq critical sections taken while mutexes are held.
> >
> > If we want to embed this field into struct rseq with its own cache line,
> > then we need to add a lot of padding, which is inconvenient.
> >
> > That being said, perhaps this is premature optimization, what do you think ?
>
> Hi Mathieu, Florian,
>
> This is exciting!
>
> I thought the motivation for moving rseq_sched_state out of struct rseq
> is lifetime management problem. I assume when a thread locks a mutex,
> it stores pointer to rseq_sched_state in the mutex state for other
> threads to poll. So the waiting thread would do something along the following
> lines:
>
> rseq_sched_state* state = __atomic_load_n(mutex->sched_state, __ATOMIC_RELAXED);
> if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
>         futex_wait();
>
> Now if the state is struct rseq, which is stored in TLS,
> then the owning thread can unlock the mutex, exit and unmap TLS in between.
> Consequently, load of state->state will cause a paging fault.
>
> And we do want rseq in TLS to save 1 indirection.
>
> If rseq_sched_state is separated from struct rseq, then it can be allocated
> in type stable memory that is never unmapped.
>
> What am I missing here?
>
> However, if we can store this state in struct rseq, then an alternative
> interface would for the kernel to do:
>
> rseq->cpu_id = -1;
>
> to denote that the thread is not running on any CPU.
> I think it kinda makes sense, rseq->cpu_id is the thread's current CPU,
> and -1 naturally means "not running at all". And we already store -1
> right after init, so it shouldn't be a surprising value.

As you may know, we experimented with "virtual CPUs" in tcmalloc. The
extension allows the kernel to assign dense virtual CPU numbers to running
threads instead of real sparse CPU numbers:

https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41

Recently I added another change that [ab]uses rseq in an interesting
way. We want to get notifications about thread re-scheduling. A slightly
simplified version of this is as follows:
we don't use rseq.cpu_id_start for its original purpose, so instead we
store something else there with a high bit set. Real CPU numbers don't
have a high bit set (at least while you have less than 2B CPUs :)).
This allows us to distinguish the value we stored in rseq.cpu_id_start
from real CPU id stored by the kernel.
Inside of rseq critical section we check if rseq.cpu_id_start has high
bit set, and if not, then we know that we were just rescheduled, so we
can do some additional work and update rseq.cpu_id_start to have high
bit set.
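
In pseudo-C, the simplified scheme looks roughly like this (sketch;
__rseq_abi stands for the registered TLS struct rseq, the marker value is
arbitrary, and in the real code the check/update runs inside an rseq
critical section):

#define MARKER_BIT	0x80000000u	/* real CPU ids never have this bit set */

static inline int check_rescheduled(void)
{
	__u32 v = __atomic_load_n(&__rseq_abi.cpu_id_start, __ATOMIC_RELAXED);

	if (v & MARKER_BIT)
		return 0;	/* our marker survived: not rescheduled since last check */
	/*
	 * The kernel overwrote the field with a real CPU id, so we were
	 * rescheduled: do the additional work, then re-arm the marker.
	 */
	__atomic_store_n(&__rseq_abi.cpu_id_start, v | MARKER_BIT, __ATOMIC_RELAXED);
	return 1;
}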

In reality it's a bit more involved since the field is actually 8
bytes and only partially overlaps with rseq.cpu_id_start (it's an
8-byte pointer with high 4 bytes overlap rseq.cpu_id_start):

https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165

I am thinking if we could extend the current proposed interface in a
way that would be more flexible and would satisfy all of these use
cases (spinlocks, and possibility of using virtual CPUs and
rescheduling notifications). In the end they all need a very similar
thing: kernel writing some value at some user address when a thread is
de-scheduled.

The minimal support we need for tcmalloc is an 8-byte user address +
kernel writing 0 at that address when a thread is descheduled.

The most flexible option to support multiple users
(malloc/spinlocks/something else) would be as follows:

User-space passes an array of structs with address + size (1/2/4/8
bytes) + value.
Kernel intereates over the array when the thread is de-scheduled and
writes the specified value at the provided address/size.
Something along the following lines (pseudo-code):

struct rseq {
    ...
    struct rseq_desched_notif_t* desched_notifs;
    int desched_notif_count;
};

struct rseq_desched_notif_t {
    void* addr;
    uint64_t value;
    int size;
};

static inline void rseq_preempt(struct task_struct *t)
{
    ...
    for (int i = 0; i < t->rseq->desched_notif_count; i++) {
        switch (t->rseq->desched_notifs[i].size) {
        case 1: put_user1(t->rseq->desched_notifs[i].addr,
                          t->rseq->desched_notifs[i].value);
                break;
        case 2: put_user2(t->rseq->desched_notifs[i].addr,
                          t->rseq->desched_notifs[i].value);
                break;
        case 4: put_user4(t->rseq->desched_notifs[i].addr,
                          t->rseq->desched_notifs[i].value);
                break;
        case 8: put_user8(t->rseq->desched_notifs[i].addr,
                          t->rseq->desched_notifs[i].value);
                break;
        }
    }
}
  
Dmitry Vyukov Sept. 26, 2023, 11:54 p.m. UTC | #8
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (unlikely(t->flags & PF_EXITING))
> +		return;

Don't we want to do this even if the task is exiting?
If rseq is only read by the current task, then it does not matter
(an exiting task won't read its own rseq anymore). But if we are extending
the expected use cases to tasks reading the rseq of other tasks, we may
need to do the update even in the PF_EXITING case.

> +	pagefault_disable();

This is a bit concerning. Does this mean updates are not done if the
page is currently not in memory? Can we make it as reliable as the
main rseq data (cpu_id)?
For tcmalloc uses (and other uses that require reliable information)
this may be problematic. User-space may start thinking that all "CPU
slots" are busy and that there are more threads running than there are
CPUs.

> +	(void) put_user(state, &t->rseq_sched_state->state);
> +	pagefault_enable();
> +}
  
Florian Weimer Sept. 27, 2023, 4:51 a.m. UTC | #9
* Dmitry Vyukov:

> In reality it's a bit more involved since the field is actually 8
> bytes and only partially overlaps with rseq.cpu_id_start (it's an
> 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start):
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165

This does not compose with other rseq users, as noted in the sources:

  // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.

For a core library such as a malloc replacement, that is a very bad trap.

Thanks,
Florian
  
Dmitry Vyukov Sept. 27, 2023, 3:58 p.m. UTC | #10
On Tue, 26 Sept 2023 at 21:51, Florian Weimer <fweimer@redhat.com> wrote:
>
> * Dmitry Vyukov:
>
> > In reality it's a bit more involved since the field is actually 8
> > bytes and only partially overlaps with rseq.cpu_id_start (it's an
> > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start):
> >
> > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
>
> This does not compose with other rseq users, as noted in the sources:
>
>   // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
>
> For a core library such a malloc replacement, that is a very bad trap.

I agree. I wouldn't do this if there were other options. That's why I
am looking for official kernel support for this.
If we had a separate 8 bytes that are overwritten with 0 when a
thread is descheduled, that would be perfect.
  
Florian Weimer Sept. 28, 2023, 8:52 a.m. UTC | #11
* Dmitry Vyukov:

> On Tue, 26 Sept 2023 at 21:51, Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Dmitry Vyukov:
>>
>> > In reality it's a bit more involved since the field is actually 8
>> > bytes and only partially overlaps with rseq.cpu_id_start (it's an
>> > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start):
>> >
>> > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
>>
>> This does not compose with other rseq users, as noted in the sources:
>>
>>   // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
>>
>> For a core library such a malloc replacement, that is a very bad trap.
>
> I agree. I wouldn't do this if there were other options. That's why I
> am looking for official kernel support for this.
> If we would have a separate 8 bytes that are overwritten with 0 when a
> thread is descheduled, that would be perfect.

That only solves part of the problem because these fields would still
have to be locked to tcmalloc.  I think you'd need a rescheduling
counter, then every library could keep their reference values in
library-private thread-local storage.
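
Something like the following sketch, where sched_generation is a made-up
kernel-maintained counter in struct rseq:

/* Each library keeps its own reference value in private TLS. */
static __thread __u32 my_lib_last_gen;

static inline int lib_was_rescheduled(void)
{
	__u32 gen = __atomic_load_n(&__rseq_abi.sched_generation, __ATOMIC_RELAXED);

	if (gen == my_lib_last_gen)
		return 0;
	my_lib_last_gen = gen;	/* rescheduled at least once since last check */
	return 1;
}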

Thanks,
Florian
  
Peter Zijlstra Sept. 28, 2023, 10:39 a.m. UTC | #12
On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
> Expose the "on-cpu" state for each thread through struct rseq to allow
> adaptative mutexes to decide more accurately between busy-waiting and
> calling sys_futex() to release the CPU, based on the on-cpu state of the
> mutex owner.
> 
> It is only provided as an optimization hint, because there is no
> guarantee that the page containing this field is in the page cache, and
> therefore the scheduler may very well fail to clear the on-cpu state on
> preemption. This is expected to be rare though, and is resolved as soon
> as the task returns to user-space.
> 
> The goal is to improve use-cases where the duration of the critical
> sections for a given lock follows a multi-modal distribution, preventing
> statistical guesses from doing a good job at choosing between busy-wait
> and futex wait behavior.

As always, are syscalls really *that* expensive? Why can't we busy wait
in the kernel instead?

I mean, sure, meltdown sucked, but most people should now be running
chips that are not affected by that particular horror show, no?
  
David Laight Sept. 28, 2023, 11:22 a.m. UTC | #13
From: Peter Zijlstra
> Sent: 28 September 2023 11:39
> 
> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
> > Expose the "on-cpu" state for each thread through struct rseq to allow
> > adaptative mutexes to decide more accurately between busy-waiting and
> > calling sys_futex() to release the CPU, based on the on-cpu state of the
> > mutex owner.

Are you trying to avoid spinning when the owning process is sleeping?
Or trying to avoid the system call when it will find that the futex
is no longer held?

The latter is really horribly detrimental.

> >
> > It is only provided as an optimization hint, because there is no
> > guarantee that the page containing this field is in the page cache, and
> > therefore the scheduler may very well fail to clear the on-cpu state on
> > preemption. This is expected to be rare though, and is resolved as soon
> > as the task returns to user-space.
> >
> > The goal is to improve use-cases where the duration of the critical
> > sections for a given lock follows a multi-modal distribution, preventing
> > statistical guesses from doing a good job at choosing between busy-wait
> > and futex wait behavior.
> 
> As always, are syscalls really *that* expensive? Why can't we busy wait
> in the kernel instead?
> 
> I mean, sure, meltdown sucked, but most people should now be running
> chips that are not affected by that particular horror show, no?

IIRC 'page table separation', which is what makes system calls expensive,
is only a compile-time option, so it is likely to be enabled on any 'distro'
kernel.
But a lot of other mitigations (eg RSB stuffing) are also pretty detrimental.

OTOH if you have a 'hot' userspace mutex you are going to lose whatever.
All that needs to happen is for an ethernet interrupt to decide to discard
completed transmits and refill the rx ring, and then for the softint code
to free a load of stuff deferred by rcu while you've grabbed the mutex
and no matter how short the user-space code path the mutex won't be
released for absolutely ages.

I had to change a load of code to use arrays and atomic increments
to avoid delays acquiring the mutex.

	David

  
Mathieu Desnoyers Sept. 28, 2023, 1:20 p.m. UTC | #14
On 9/28/23 07:22, David Laight wrote:
> From: Peter Zijlstra
>> Sent: 28 September 2023 11:39
>>
>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>> adaptative mutexes to decide more accurately between busy-waiting and
>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
>>> mutex owner.
> 
> Are you trying to avoid spinning when the owning process is sleeping?

Yes, this is my main intent.

> Or trying to avoid the system call when it will find that the futex
> is no longer held?
> 
> The latter is really horribly detremental.

That's a good question. What should we do in those three situations
when trying to grab the lock:

1) Lock has no owner

We probably want to simply grab the lock with an atomic instruction. But 
then if other threads are queued on sys_futex and did not manage to grab 
the lock yet, this would be detrimental to fairness.

2) Lock owner is running:

The lock owner is certainly running on another cpu (I'm using the term 
"cpu" here as logical cpu).

I guess we could either decide to bypass sys_futex entirely and try to 
grab the lock with an atomic, or we go through sys_futex nevertheless to 
allow futex to guarantee some fairness across threads.

3) Lock owner is sleeping:

The lock owner may be either tied to the same cpu as the requester, or a 
different cpu. Here calling FUTEX_WAIT and friends is pretty much required.

Can you elaborate on why skipping sys_futex in scenario (2) would be so 
bad ? I wonder if we could get away with skipping futex entirely in this 
scenario and still guarantee fairness by implementing MCS locking or 
ticket locks in userspace. Basically, if userspace queues itself on the 
lock through either MCS locking or ticket locks, it could guarantee 
fairness on its own.
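
For reference, a plain userspace ticket lock along those lines could look
like the sketch below; the spin loop is where the on-cpu hint would decide
between busy-waiting and futex wait:

#include <stdatomic.h>
#include <stdint.h>

/*
 * Minimal ticket lock sketch: FIFO ordering comes from the ticket order,
 * independently of futex wake-up ordering.
 */
struct ticket_lock {
	_Atomic uint32_t next;
	_Atomic uint32_t serving;
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	uint32_t ticket = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	while (atomic_load_explicit(&l->serving, memory_order_acquire) != ticket)
		;	/* spin if the lock holder is on-cpu, futex wait otherwise */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
}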

Of course things are more complicated with PI-futex, is that what you 
have in mind ?

> 
>>>
>>> It is only provided as an optimization hint, because there is no
>>> guarantee that the page containing this field is in the page cache, and
>>> therefore the scheduler may very well fail to clear the on-cpu state on
>>> preemption. This is expected to be rare though, and is resolved as soon
>>> as the task returns to user-space.
>>>
>>> The goal is to improve use-cases where the duration of the critical
>>> sections for a given lock follows a multi-modal distribution, preventing
>>> statistical guesses from doing a good job at choosing between busy-wait
>>> and futex wait behavior.
>>
>> As always, are syscalls really *that* expensive? Why can't we busy wait
>> in the kernel instead?
>>
>> I mean, sure, meltdown sucked, but most people should now be running
>> chips that are not affected by that particular horror show, no?
> 
> IIRC 'page table separation' which is what makes system calls expensive
> is only a compile-time option. So is likely to be enabled on any 'distro'
> kernel.
> But a lot of other mitigations (eg RSB stuffing) are also pretty detrimental.
> 
> OTOH if you have a 'hot' userspace mutex you are going to lose whatever.
> All that needs to happen is for a ethernet interrupt to decide to discard
> completed transmits and refill the rx ring, and then for the softint code
> to free a load of stuff deferred by rcu while you've grabbed the mutex
> and no matter how short the user-space code path the mutex won't be
> released for absolutely ages.
> 
> I had to change a load of code to use arrays and atomic increments
> to avoid delays acquiring mutex.

That's good input, thanks! I mostly defer to André Almeida on the 
use-case motivation. I mostly provided this POC patch to show that it 
_can_ be done with sys_rseq(2).

Thanks!

Mathieu

> 
> 	David
> 
>
  
Peter Zijlstra Sept. 28, 2023, 2:26 p.m. UTC | #15
On Thu, Sep 28, 2023 at 09:20:36AM -0400, Mathieu Desnoyers wrote:

> Of course things are more complicated with PI-futex, is that what you have
> in mind ?

PI futex has to go through the kernel, also, they already spin-wait on
owner.
  
David Laight Sept. 28, 2023, 2:33 p.m. UTC | #16
From: Mathieu Desnoyers
> Sent: 28 September 2023 14:21
> 
> On 9/28/23 07:22, David Laight wrote:
> > From: Peter Zijlstra
> >> Sent: 28 September 2023 11:39
> >>
> >> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
> >>> Expose the "on-cpu" state for each thread through struct rseq to allow
> >>> adaptative mutexes to decide more accurately between busy-waiting and
> >>> calling sys_futex() to release the CPU, based on the on-cpu state of the
> >>> mutex owner.
> >
> > Are you trying to avoid spinning when the owning process is sleeping?
> 
> Yes, this is my main intent.
> 
> > Or trying to avoid the system call when it will find that the futex
> > is no longer held?
> >
> > The latter is really horribly detremental.
> 
> That's a good questions. What should we do in those three situations
> when trying to grab the lock:
> 
> 1) Lock has no owner
> 
> We probably want to simply grab the lock with an atomic instruction. But
> then if other threads are queued on sys_futex and did not manage to grab
> the lock yet, this would be detrimental to fairness.
> 
> 2) Lock owner is running:
> 
> The lock owner is certainly running on another cpu (I'm using the term
> "cpu" here as logical cpu).
> 
> I guess we could either decide to bypass sys_futex entirely and try to
> grab the lock with an atomic, or we go through sys_futex nevertheless to
> allow futex to guarantee some fairness across threads.

I'd not worry about 'fairness'.
If the mutex is that contended you've already lost!

I had a big problem trying to avoid the existing 'fairness' code.
Consider 30 RT threads blocked in cv_wait() on the same condvar.
Something does cv_broadcast() and you want them all to wake up.
They'll all release the mutex pretty quickly - it doesn't matter if they spin.
But what actually happens is one thread is woken up.
Once it has been scheduled (after the cpu has come out of a sleep state
and/or any hardware interrupts completed, etc.) the next thread is woken.
If you are lucky it'll 'only' take a few ms to get them all running.
Not good when you are trying to process audio every 10ms.
I had to use a separate cv for each thread and get the woken threads
to help with the wakeups. God knows what happens with 256 threads!

> 3) Lock owner is sleeping:
> 
> The lock owner may be either tied to the same cpu as the requester, or a
> different cpu. Here calling FUTEX_WAIT and friends is pretty much required.

You'd need the 'holding process is sleeping' test to be significantly
faster than the 'optimistic spin hoping the mutex will be released'.
And for the 'spin' to be longer than the syscall time for futex.
Otherwise you are optimising an already slow path.
If the thread is going to have to sleep until the thread that owns
a mutex wakes up then I can't imagine performance mattering.

OTOH it is much more usual for the owning thread to be running and
release the mutex quickly.

I wouldn't have thought it was really worth optimising for the
'lock owner is sleeping' case.

	David

  
Steven Rostedt Sept. 28, 2023, 2:43 p.m. UTC | #17
On Thu, 28 Sep 2023 12:39:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> As always, are syscalls really *that* expensive? Why can't we busy wait
> in the kernel instead?

Yes syscalls are that expensive. Several years ago I had a good talk
with Robert Haas (one of the PostgreSQL maintainers) at Linux Plumbers,
and I asked him if they used futexes. His answer was "no". He told me
how they did several benchmarks and it was a huge performance hit (and
this was before Spectre/Meltdown made things much worse). He explained
to me that most locks are taken just to flip a few bits. Going into the
kernel and coming back was orders of magnitude longer than the critical
sections. By going into the kernel, it caused a ripple effect and led
to even more contention. Their answer was to implement their locking
completely in user space without any help from the kernel.

This is when I thought that having an adaptive spinner that could get
hints from the kernel via memory mapping would be extremely useful.

The obvious problem with their implementation is that if the owner is
sleeping, there's no point in spinning. Worse, the owner may even be
waiting for the spinner to get off the CPU before it can run again. But
according to Robert, the gain in the general performance greatly
outweighed the few times this happened in practice.

But still, if userspace could figure out if the owner is running on
another CPU or not, to act just like the adaptive mutexes in the
kernel, that would prevent the problem of a spinner keeping the owner
from running.

-- Steve
  
Dmitry Vyukov Sept. 28, 2023, 2:44 p.m. UTC | #18
On Thu, 28 Sept 2023 at 01:52, Florian Weimer <fweimer@redhat.com> wrote:
>
> * Dmitry Vyukov:
>
> > On Tue, 26 Sept 2023 at 21:51, Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Dmitry Vyukov:
> >>
> >> > In reality it's a bit more involved since the field is actually 8
> >> > bytes and only partially overlaps with rseq.cpu_id_start (it's an
> >> > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start):
> >> >
> >> > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
> >>
> >> This does not compose with other rseq users, as noted in the sources:
> >>
> >>   // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
> >>
> >> For a core library such a malloc replacement, that is a very bad trap.
> >
> > I agree. I wouldn't do this if there were other options. That's why I
> > am looking for official kernel support for this.
> > If we would have a separate 8 bytes that are overwritten with 0 when a
> > thread is descheduled, that would be perfect.
>
> That only solves part of the problem because these fields would still
> have to be locked to tcmalloc.  I think you'd need a rescheduling
> counter, then every library could keep their reference values in
> library-private thread-local storage.

This unfortunately won't work for tcmalloc.
This data is accessed on the very hot path of malloc/free. We need a
ready-to-use pointer in TLS, which is reset by the kernel to 0 (or
some user-space specified value). Doing two separate loads for counters
in different cache lines would be too expensive.

It may be possible to make several libraries use this feature  with an
array of notifications (see rseq_desched_notif_t in my previous
email).
  
Dmitry Vyukov Sept. 28, 2023, 2:47 p.m. UTC | #19
On Tue, 26 Sept 2023 at 16:49, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> > >> I don't see why we can't stick this directly into struct rseq because
> > >> it's all public anyway.
> > >
> > > The motivation for moving this to a different cache line is to handle
> > > the prior comment from Boqun, who is concerned that busy-waiting
> > > repeatedly loading a field from struct rseq will cause false-sharing and
> > > make other stores to that cache line slower, especially stores to
> > > rseq_cs to begin rseq critical sections, thus slightly increasing the
> > > overhead of rseq critical sections taken while mutexes are held.
> > >
> > > If we want to embed this field into struct rseq with its own cache line,
> > > then we need to add a lot of padding, which is inconvenient.
> > >
> > > That being said, perhaps this is premature optimization, what do you think ?
> >
> > Hi Mathieu, Florian,
> >
> > This is exciting!
> >
> > I thought the motivation for moving rseq_sched_state out of struct rseq
> > is lifetime management problem. I assume when a thread locks a mutex,
> > it stores pointer to rseq_sched_state in the mutex state for other
> > threads to poll. So the waiting thread would do something along the following
> > lines:
> >
> > rseq_sched_state* state = __atomic_load_n(mutex->sched_state, __ATOMIC_RELAXED);
> > if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
> >         futex_wait();
> >
> > Now if the state is struct rseq, which is stored in TLS,
> > then the owning thread can unlock the mutex, exit and unmap TLS in between.
> > Consequently, load of state->state will cause a paging fault.
> >
> > And we do want rseq in TLS to save 1 indirection.
> >
> > If rseq_sched_state is separated from struct rseq, then it can be allocated
> > in type stable memory that is never unmapped.
> >
> > What am I missing here?
> >
> > However, if we can store this state in struct rseq, then an alternative
> > interface would for the kernel to do:
> >
> > rseq->cpu_id = -1;
> >
> > to denote that the thread is not running on any CPU.
> > I think it kinda makes sense, rseq->cpu_id is the thread's current CPU,
> > and -1 naturally means "not running at all". And we already store -1
> > right after init, so it shouldn't be a surprising value.
>
> As you may know we experimented with "virtual CPUs" in tcmalloc. The
> extension allows kernel to assign dense virtual CPU numbers to running
> threads instead of real sparse CPU numbers:
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41
>
> Recently I added another change that [ab]uses rseq in an interesting
> way. We want to get notifications about thread re-scheduling. A bit
> simplified version of this is as follows:
> we don't use rseq.cpu_id_start for its original purpose, so instead we
> store something else there with a high bit set. Real CPU numbers don't
> have a high bit set (at least while you have less than 2B CPUs :)).
> This allows us to distinguish the value we stored in rseq.cpu_id_start
> from real CPU id stored by the kernel.
> Inside of rseq critical section we check if rseq.cpu_id_start has high
> bit set, and if not, then we know that we were just rescheduled, so we
> can do some additional work and update rseq.cpu_id_start to have high
> bit set.
>
> In reality it's a bit more involved since the field is actually 8
> bytes and only partially overlaps with rseq.cpu_id_start (it's an
> 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start):
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
>
> I am thinking if we could extend the current proposed interface in a
> way that would be more flexible and would satisfy all of these use
> cases (spinlocks, and possibility of using virtual CPUs and
> rescheduling notifications). In the end they all need a very similar
> thing: kernel writing some value at some user address when a thread is
> de-scheduled.
>
> The minimal support we need for tcmalloc is an 8-byte user address +
> kernel writing 0 at that address when a thread is descheduled.
>
> The most flexible option to support multiple users
> (malloc/spinlocks/something else) would be as follows:
>
> User-space passes an array of structs with address + size (1/2/4/8
> bytes) + value.
> Kernel intereates over the array when the thread is de-scheduled and
> writes the specified value at the provided address/size.
> Something along the following lines (pseudo-code):
>
> struct rseq {
>     ...
>     struct rseq_desched_notif_t* desched_notifs;
>     int desched_notif_count;
> };
>
> struct rseq_desched_notif_t {
>     void* addr;
>     uint64_t value;
>     int size;
> };
>
> static inline void rseq_preempt(struct task_struct *t)
> {
>     ...
>     for (int i = 0; i < t->rseq->desched_notif_count; i++) {
>         switch (t->rseq->desched_notifs[i].size) {
>         case 1: put_user1(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         case 2: put_user2(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         case 4: put_user4(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         case 8: put_user8(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         }
>     }
> }

One thing I forgot to mention: ideally the kernel also writes a
timestamp of descheduling somewhere.
We are using this logic to assign per-CPU malloc caches to threads,
and it's useful to know which caches were used very recently (still
hot in cache) and which ones were not used for a long time.
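
For instance, the array entries sketched earlier could grow an optional
timestamp destination (a made-up extension of the earlier pseudo-code):

struct rseq_desched_notif_t {
    void* addr;
    uint64_t value;
    int size;
    uint64_t* desched_time_addr;	/* if non-NULL, kernel stores a timestamp here */
};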
  
André Almeida Sept. 28, 2023, 3:05 p.m. UTC | #20
On 9/28/23 15:20, Mathieu Desnoyers wrote:
> On 9/28/23 07:22, David Laight wrote:
>> From: Peter Zijlstra
>>> Sent: 28 September 2023 11:39
>>>
>>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptative mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state 
>>>> of the
>>>> mutex owner.
>>
>> Are you trying to avoid spinning when the owning process is sleeping?
>
> Yes, this is my main intent.
>
>> Or trying to avoid the system call when it will find that the futex
>> is no longer held?
>>
>> The latter is really horribly detremental.
>
> That's a good questions. What should we do in those three situations 
> when trying to grab the lock:
>
> 1) Lock has no owner
>
> We probably want to simply grab the lock with an atomic instruction. 
> But then if other threads are queued on sys_futex and did not manage 
> to grab the lock yet, this would be detrimental to fairness.
>
> 2) Lock owner is running:
>
> The lock owner is certainly running on another cpu (I'm using the term 
> "cpu" here as logical cpu).
>
> I guess we could either decide to bypass sys_futex entirely and try to 
> grab the lock with an atomic, or we go through sys_futex nevertheless 
> to allow futex to guarantee some fairness across threads.

About the fairness part:

Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of wakeups. The current implementation tries
to be fair, but I don't think it works for every case. I wouldn't be
much concerned about being fair here, given that it's an inherent problem
of the futex anyway.

 From the man pages:

"No guarantee is provided about which waiters are awoken"

>
> 3) Lock owner is sleeping:
>
> The lock owner may be either tied to the same cpu as the requester, or 
> a different cpu. Here calling FUTEX_WAIT and friends is pretty much 
> required.
>
> Can you elaborate on why skipping sys_futex in scenario (2) would be 
> so bad ? I wonder if we could get away with skipping futex entirely in 
> this scenario and still guarantee fairness by implementing MCS locking 
> or ticket locks in userspace. Basically, if userspace queues itself on 
> the lock through either MCS locking or ticket locks, it could 
> guarantee fairness on its own.
>
> Of course things are more complicated with PI-futex, is that what you 
> have in mind ?
>
>>
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache, 
>>>> and
>>>> therefore the scheduler may very well fail to clear the on-cpu 
>>>> state on
>>>> preemption. This is expected to be rare though, and is resolved as 
>>>> soon
>>>> as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution, 
>>>> preventing
>>>> statistical guesses from doing a good job at choosing between 
>>>> busy-wait
>>>> and futex wait behavior.
>>>
>>> As always, are syscalls really *that* expensive? Why can't we busy wait
>>> in the kernel instead?
>>>
>>> I mean, sure, meltdown sucked, but most people should now be running
>>> chips that are not affected by that particular horror show, no?
>>
>> IIRC 'page table separation' which is what makes system calls expensive
>> is only a compile-time option. So is likely to be enabled on any 
>> 'distro'
>> kernel.
>> But a lot of other mitigations (eg RSB stuffing) are also pretty 
>> detrimental.
>>
>> OTOH if you have a 'hot' userspace mutex you are going to lose whatever.
>> All that needs to happen is for a ethernet interrupt to decide to 
>> discard
>> completed transmits and refill the rx ring, and then for the softint 
>> code
>> to free a load of stuff deferred by rcu while you've grabbed the mutex
>> and no matter how short the user-space code path the mutex won't be
>> released for absolutely ages.
>>
>> I had to change a load of code to use arrays and atomic increments
>> to avoid delays acquiring mutex.
>
> That's good input, thanks! I mostly defer to André Almeida on the 
> use-case motivation. I mostly provided this POC patch to show that it 
> _can_ be done with sys_rseq(2).
>
> Thanks!
>
> Mathieu
>
>>
>>     David
>>
>>
>
  
David Laight Sept. 28, 2023, 3:51 p.m. UTC | #21
From: Steven Rostedt
> Sent: 28 September 2023 15:43
> 
> On Thu, 28 Sep 2023 12:39:26 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > As always, are syscalls really *that* expensive? Why can't we busy wait
> > in the kernel instead?
> 
> Yes syscalls are that expensive. Several years ago I had a good talk
> with Robert Haas (one of the PostgreSQL maintainers) at Linux Plumbers,
> and I asked him if they used futexes. His answer was "no". He told me
> how they did several benchmarks and it was a huge performance hit (and
> this was before Spectre/Meltdown made things much worse). He explained
> to me that most locks are taken just to flip a few bits. Going into the
> kernel and coming back was orders of magnitude longer than the critical
> sections. By going into the kernel, it caused a ripple effect and lead
> to even more contention. There answer was to implement their locking
> completely in user space without any help from the kernel.

That matches what I found with code that was using a mutex to take
work items off a global list.
Although the mutex was only held for a few instructions (probably
several 100 because the list wasn't that well written), what happened
was that as soon as there was any contention (which might start
with a hardware interrupt) performance went through the floor.

The fix was to replace the linked list with an array and use
atomic add to 'grab' blocks of entries.
(Even the atomic operations slowed things down.)

> This is when I thought that having an adaptive spinner that could get
> hints from the kernel via memory mapping would be extremely useful.

Did you consider writing a timestamp into the mutex when it was
acquired - or even as the 'acquired' value?
A 'moderately synched TSC' should do.
Then the waiter should be able to tell how long the mutex
has been held for - and then not spin if it had been held for ages.

> The obvious problem with their implementation is that if the owner is
> sleeping, there's no point in spinning. Worse, the owner may even be
> waiting for the spinner to get off the CPU before it can run again. But
> according to Robert, the gain in the general performance greatly
> outweighed the few times this happened in practice.

Unless you can use atomics (ok for bits and linked lists) you
always have the problem that userspace can't disable interrupts.
So, unlike the kernel, you can't implement a proper spinlock.

I've NFI how CONFIG_RT manages to get anything done with all
the spinlocks replaced by sleep locks.
Clearly there are spinlocks that are held for far too long.
But you really do want to spin most of the time.
...

	David

  
Thomas Gleixner Sept. 28, 2023, 8:21 p.m. UTC | #22
On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote:
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> +
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (t->rseq_sched_state)
> +		__rseq_set_sched_state(t, state);

This is invoked on every context switch and writes over that state
unconditionally even in the case that the state was already
cleared. There are enough situations where tasks are scheduled out
several times while being in the kernel.

>  /* rseq_preempt() requires preemption to be disabled. */
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>  	rseq_set_notify_resume(t);
> +	rseq_set_sched_state(t, 0);

This code is already stupid to begin with. __set_bit() is cheap, but
rseq_set_notify_resume() is not as it has a conditional and a locked
instruction and now you add two more conditionals into the context
switch path.

Thanks,

        tglx
  
Mathieu Desnoyers Sept. 28, 2023, 8:43 p.m. UTC | #23
On 9/28/23 16:21, Thomas Gleixner wrote:
> On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote:
>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
>> +
>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
>> +{
>> +	if (t->rseq_sched_state)
>> +		__rseq_set_sched_state(t, state);
> 
> This is invoked on every context switch and writes over that state
> unconditionally even in the case that the state was already
> cleared. There are enough situations where tasks are scheduled out
> several times while being in the kernel.

Right, if this becomes more than a PoC, I'll make sure to keep track of 
the current state within the task struct, and only update userspace on 
state transition.
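
Something along these lines, for instance (sketch; the cached field name
is made up):

/*
 * Sketch: remember the last state written to user-space in the task
 * struct and skip the update when nothing changed.
 */
static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
{
	if (t->rseq_sched_state && t->rseq_sched_state_cache != state) {
		t->rseq_sched_state_cache = state;
		__rseq_set_sched_state(t, state);
	}
}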

> 
>>   /* rseq_preempt() requires preemption to be disabled. */
>>   static inline void rseq_preempt(struct task_struct *t)
>>   {
>>   	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>>   	rseq_set_notify_resume(t);
>> +	rseq_set_sched_state(t, 0);
> 
> This code is already stupid to begin with. __set_bit() is cheap, but
> rseq_set_notify_resume() is not as it has a conditional and a locked
> instruction

What alternative would you recommend to set a per-thread state that has 
the same effect as TIF_NOTIFY_RESUME ? Indeed all it really needs to 
synchronize with is the thread owning the flags, but AFAIU having this 
flag part of the TIF flags requires use of an atomic instruction to 
synchronize updates against concurrent threads.

If we move this thread flag into a separate field of struct thread_info, 
then we could turn this atomic set bit into a more lightweight store, 
but then we'd need to check against an extra field on return to userspace.

And if we want to remove the conditional branch on the scheduler 
fast-path, we could always load and test both the task struct's rseq 
pointer and the thread_info "preempted" state on return to userspace.

The tradeoff there would be to add extra loads and conditional branches 
on return to userspace to speed up the scheduler fast-path.

Is this what you have in mind or am I missing your point ?

> and now you add two more conditionals into the context
> switch path.

I'm open to suggestions on how to improve this if this goes beyond PoC 
stage and we observe measurable benefits on the userspace side.

Thanks,

Mathieu

> 
> Thanks,
> 
>          tglx
  
Thomas Gleixner Sept. 28, 2023, 8:54 p.m. UTC | #24
On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote:
> +/*
> + * rseq_sched_state should be aligned on the cache line size.
> + */
> +struct rseq_sched_state {
> +	/*
> +	 * Version of this structure. Populated by the kernel, read by
> +	 * user-space.
> +	 */
> +	__u32 version;
> +	/*
> +	 * The state is updated by the kernel. Read by user-space with
> +	 * single-copy atomicity semantics. This field can be read by any
> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
> +	 * rseq_sched_state_flags. This field is provided as a hint by the
> +	 * scheduler, and requires that the page holding this state is
> +	 * faulted-in for the state update to be performed by the scheduler.
> +	 */
> +	__u32 state;
> +	/*
> +	 * Thread ID associated with the thread registering this structure.
> +	 * Initialized by user-space before registration.
> +	 */
> +	__u32 tid;
> +};
> +
>  /*
>   * struct rseq is aligned on 4 * 8 bytes to ensure it is always
>   * contained within a single cache-line.
> @@ -148,6 +180,15 @@ struct rseq {
>  	 */
>  	__u32 mm_cid;
>  
> +	__u32 padding1;
> +
> +	/*
> +	 * Restartable sequences sched_state_ptr field. Initialized by
> +	 * userspace to the address at which the struct rseq_sched_state is
> +	 * located. Read by the kernel on rseq registration.
> +	 */
> +	__u64 sched_state_ptr;
> +

Why is this a separate entity instead of being simply embedded into
struct rseq?

Neither the code comment nor the changelog tells anything about that.

If your intention was to provide a solution for process shared futexes
then you completely failed to understand how process shared futexes
work. If not, then please explain why you need a pointer and the
associated hackery to deal with it.

Thanks,

        tglx
  
Mathieu Desnoyers Sept. 28, 2023, 10:11 p.m. UTC | #25
On 9/28/23 16:54, Thomas Gleixner wrote:
> On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote:
>> +/*
>> + * rseq_sched_state should be aligned on the cache line size.
>> + */
>> +struct rseq_sched_state {
>> +	/*
>> +	 * Version of this structure. Populated by the kernel, read by
>> +	 * user-space.
>> +	 */
>> +	__u32 version;
>> +	/*
>> +	 * The state is updated by the kernel. Read by user-space with
>> +	 * single-copy atomicity semantics. This field can be read by any
>> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
>> +	 * rseq_sched_state_flags. This field is provided as a hint by the
>> +	 * scheduler, and requires that the page holding this state is
>> +	 * faulted-in for the state update to be performed by the scheduler.
>> +	 */
>> +	__u32 state;
>> +	/*
>> +	 * Thread ID associated with the thread registering this structure.
>> +	 * Initialized by user-space before registration.
>> +	 */
>> +	__u32 tid;
>> +};
>> +
>>   /*
>>    * struct rseq is aligned on 4 * 8 bytes to ensure it is always
>>    * contained within a single cache-line.
>> @@ -148,6 +180,15 @@ struct rseq {
>>   	 */
>>   	__u32 mm_cid;
>>   
>> +	__u32 padding1;
>> +
>> +	/*
>> +	 * Restartable sequences sched_state_ptr field. Initialized by
>> +	 * userspace to the address at which the struct rseq_sched_state is
>> +	 * located. Read by the kernel on rseq registration.
>> +	 */
>> +	__u64 sched_state_ptr;
>> +
> 
> Why is this a separate entity instead of being simply embedded into
> struct rseq?
> 
> Neither the code comment nor the changelog tells anything about that.

Here is the email thread from v1 that led to this:

https://lore.kernel.org/lkml/ZGevZxOjJLMO9zlM@boqun-archlinux/

The reason for moving this sched_state to its own structure was to 
optimize for a scenario where we have:

- many threads contending for the lock, thus loading the value of the 
struct rseq "sched_state".
- the lock owner thread doing rseq critical sections with the lock held, 
thus updating the value of struct rseq "rseq_cs" field (in the same 
cache line).

The loads of the sched_state from other threads would slow down (even 
slightly) the update to the rseq_cs field, thus causing the critical 
sections to take a few more cycles.

I am not convinced that the complexity vs speed tradeoff of moving this 
into its own separate structure is worth it though, especially given 
that it would require userspace to wire things up with an additional 
side-structure rather than just extending naturally with the 
extensible-rseq ABI. Another approach would be to cacheline-align this 
field within struct rseq and waste space on padding.
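
For illustration, a standalone sketch of that padding approach (the
field names and the 64-byte cache-line assumption are illustrative only,
not the actual struct rseq uapi layout):

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /*
         * Illustration only: push the contended field onto its own cache
         * line inside a larger structure, paying for it with padding.
         */
        struct rseq_like {
                uint64_t rseq_cs;       /* hot store by the lock owner's critical sections */
                uint32_t cpu_id;
                uint32_t mm_cid;
                /* contended load by waiting threads, moved to its own cache line */
                uint32_t sched_state __attribute__((aligned(64)));
        };

        int main(void)
        {
                printf("sched_state offset: %zu, padded size: %zu\n",
                       offsetof(struct rseq_like, sched_state),
                       sizeof(struct rseq_like));
                return 0;
        }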

I am inclined to just embed this into struct rseq without caring too 
much about slight overhead caused by sharing the cache line with other 
fields.

> 
> If your intention was to provide a solution for process shared futexes
> then you completely failed to understand how process shared futexes
> work. If not, then please explain why you need a pointer and the
> associated hackery to deal with it.

I have not given a single thought to shared futexes in this PoC so far. :)

So let's see: we could do the following to support shared futexes: move 
the information about the owner's struct rseq (could be the thread 
pointer of the owner thread) to a field separate from the "lock" state 
(which is ABI as you pointed out). Therefore, grabbing the lock would be 
done with an atomic instruction, and then setting this extra information 
about the lock owner would be done with a simple store.

Given that the lock owner information is only used to statistically 
decide whether other threads should spin or block when contended, we 
don't really care if this information is set a few instructions after 
acquiring the lock.
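
A minimal user-space sketch of that idea (not part of this series; the
names are made up):

        #include <stdatomic.h>
        #include <stdint.h>

        /*
         * Sketch only: a process-shared lock word acquired with an atomic,
         * plus a separately stored owner hint set after acquisition.
         */
        struct shared_lock {
                _Atomic uint32_t lock;          /* ABI-visible futex word */
                _Atomic uint32_t owner_tid;     /* hint only, may briefly lag behind "lock" */
        };

        static int shared_lock_trylock(struct shared_lock *l, uint32_t self_tid)
        {
                uint32_t expected = 0;

                if (!atomic_compare_exchange_strong_explicit(&l->lock, &expected, 1,
                                                             memory_order_acquire,
                                                             memory_order_relaxed))
                        return 0;
                /* A few instructions late is fine: waiters only use this statistically. */
                atomic_store_explicit(&l->owner_tid, self_tid, memory_order_relaxed);
                return 1;
        }

        static void shared_lock_unlock(struct shared_lock *l)
        {
                atomic_store_explicit(&l->owner_tid, 0, memory_order_relaxed);
                atomic_store_explicit(&l->lock, 0, memory_order_release);
        }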

Thoughts ?

Thanks for the feedback,

Mathieu

> 
> Thanks,
> 
>          tglx
> 
>
  
Steven Rostedt Oct. 2, 2023, 4:51 p.m. UTC | #26
On Thu, 28 Sep 2023 15:51:47 +0000
David Laight <David.Laight@ACULAB.COM> wrote:


> > This is when I thought that having an adaptive spinner that could get
> > hints from the kernel via memory mapping would be extremely useful.  
> 
> Did you consider writing a timestamp into the mutex when it was
> acquired - or even as the 'acquired' value?
> A 'moderately synched TSC' should do.
> Then the waiter should be able to tell how long the mutex
> has been held for - and then not spin if it had been held ages.

And what heuristic would you use? In my experience, a "time to spin" picked
for one workload may cause major regressions in another workload.
I came to the conclusion to "hate" heuristics and NACK them whenever
someone suggested adding them to the rt_mutex in the kernel (back before
adaptive mutexes were introduced).

> 
> > The obvious problem with their implementation is that if the owner is
> > sleeping, there's no point in spinning. Worse, the owner may even be
> > waiting for the spinner to get off the CPU before it can run again. But
> > according to Robert, the gain in the general performance greatly
> > outweighed the few times this happened in practice.  
> 
> Unless you can use atomics (ok for bits and linked lists) you
> always have the problem that userspace can't disable interrupts.
> So, unlike the kernel, you can't implement a proper spinlock.

Why do you need to disable interrupts? If you know the owner is running on
the CPU, you know it's not trying to run on the CPU that is acquiring the
lock. Heck, there's normal spin locks outside of PREEMPT_RT that do not
disable interrupts. The only time you need to disable interrupts is if the
interrupt itself takes the spin lock, and that's just to prevent deadlocks.

> 
> I've NFI how CONFIG_RT manages to get anything done with all
> the spinlocks replaced by sleep locks.
Clearly there are spinlocks that are held for far too long.
> But you really do want to spin most of the time.

It spins as long as the owner of the lock is running on the CPU. This is
what we are looking to get from this patch series for user space.
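
As a sketch of how user space could consume the on-cpu hint from this
series (the owner_sched_state() lookup helper is hypothetical):

        #include <linux/futex.h>
        #include <stdint.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Mirrors the layout proposed in this series. */
        struct rseq_sched_state {
                uint32_t version;
                uint32_t state;         /* bitmask of enum rseq_sched_state_flags */
                uint32_t tid;
        };

        #define RSEQ_SCHED_STATE_FLAG_ON_CPU    (1U << 0)

        /* Hypothetical: map a contended lock to its current owner's published state. */
        extern struct rseq_sched_state *owner_sched_state(void *lock);

        /*
         * Spin as long as the owner is on-CPU, otherwise block in futex wait.
         * "busy" is the lock word value meaning "held with waiters".
         */
        static void adaptive_wait(void *lock, uint32_t *futex_word, uint32_t busy)
        {
                while (__atomic_load_n(futex_word, __ATOMIC_ACQUIRE) == busy) {
                        struct rseq_sched_state *owner = owner_sched_state(lock);

                        if (owner && (__atomic_load_n(&owner->state, __ATOMIC_RELAXED) &
                                      RSEQ_SCHED_STATE_FLAG_ON_CPU))
                                continue;       /* owner is running: keep spinning */
                        /* Owner is off-CPU (or unknown): give up the CPU. */
                        syscall(SYS_futex, futex_word, FUTEX_WAIT, busy, NULL, NULL, 0);
                }
        }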

Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
with the sleeping spin locks would start to exponentially slow down with
the more CPUs you had. Once we hit more than 16 CPUs, booting an RT
kernel took tens of minutes when the normal CONFIG_PREEMPT kernel
would only take a couple of minutes. The more CPUs you added, the worse it
became.

Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
the mutex was still running on another CPU. This actually mimics a real
spin lock (because that's exactly what they do, they spin while the owner
is running on a CPU). The difference between a true spin lock and an
rt_mutex was that the spinner would stop spinning if the owner was
preempted (a true spin lock owner could not be preempted).

After applying the adaptive spinning, we were able to scale PREEMPT_RT to
any number of CPUs that the normal kernel could do with just a linear
performance hit.

This is why I'm very much interested in getting the same ability into user
space spin locks.

-- Steve
  
David Laight Oct. 2, 2023, 5:22 p.m. UTC | #27
From: Steven Rostedt
> Sent: 02 October 2023 17:51
> 
> On Thu, 28 Sep 2023 15:51:47 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> 
> > > This is when I thought that having an adaptive spinner that could get
> > > hints from the kernel via memory mapping would be extremely useful.
> >
> > Did you consider writing a timestamp into the mutex when it was
> > acquired - or even as the 'acquired' value?
> > A 'moderately synched TSC' should do.
> > Then the waiter should be able to tell how long the mutex
> > has been held for - and then not spin if it had been held ages.
> 
> And what heuristic would you use. My experience with picking "time to spin"
> may work for one workload but cause major regressions in another workload.
> I came to the conclusion to "hate" heuristics and NACK them whenever
> someone suggested adding them to the rt_mutex in the kernel (back before
> adaptive mutexes were introduced).

Isn't that exactly what an adaptive mutex does?
Spin 'for a bit' before sleeping.

> > > The obvious problem with their implementation is that if the owner is
> > > sleeping, there's no point in spinning. Worse, the owner may even be
> > > waiting for the spinner to get off the CPU before it can run again. But
> > > according to Robert, the gain in the general performance greatly
> > > outweighed the few times this happened in practice.
> >
> > Unless you can use atomics (ok for bits and linked lists) you
> > always have the problem that userspace can't disable interrupts.
> > So, unlike the kernel, you can't implement a proper spinlock.
> 
> Why do you need to disable interrupts? If you know the owner is running on
> the CPU, you know it's not trying to run on the CPU that is acquiring the
> lock. Heck, there's normal spin locks outside of PREEMPT_RT that do not
> disable interrupts. The only time you need to disable interrupts is if the
> interrupt itself takes the spin lock, and that's just to prevent deadlocks.

You need to disable interrupts in order to bound the time the
spinlock is held for.
If all you are doing is a dozen instructions (e.g. to remove an
item from a list) then you really don't want an interrupt coming in
while you have the spinlock held.
It isn't the cost of the ISR - that has to happen sometime, but that
the cpu waiting for the spinlock also takes the cost of the ISR.

A network+softint ISR can run for a long time - I'm sure I've
seen a good fraction of a millisecond.
You really don't want another (or many other) cpu spinning while
that is going on.
Which (to my mind) pretty much means that you always want to
disable interrupts on a spinlock.
If the architecture makes masking ISR expensive then I've seen schemes
that let the hardware interrupt happen, then disable it and rerun later.

> > I've NFI how CONFIG_RT manages to get anything done with all
> > the spinlocks replaced by sleep locks.
> > Clearly there are spinlocks that are held for far too long.
> > But you really do want to spin most of the time.
> 
> It spins as long as the owner of the lock is running on the CPU. This is
> what we are looking to get from this patch series for user space.

I think you'd need to detect that the cpu was in-kernel running an ISR.

But the multithreaded audio app I was 'fixing' basically failed
as soon as it had to sleep on one of the futexes.
The real problem was an ISR running while the mutex was held.
So deciding to sleep because the lock owner isn't running (in user)
would already be delaying things too much.

> 
> Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
> with the sleeping spin locks would start to exponentially slow down with
> the more CPUs you had. Once we hit more than 16 CPUs,  the time to boot a
> kernel took 10s of minutes to boot RT when the normal CONFIG_PREEMPT kernel
> would only take a couple of minutes. The more CPUs you added, the worse it
> became.
> 
> Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
> the mutex was still running on another CPU. This actually mimics a real
> spin lock (because that's exactly what they do, they spin while the owner
> is running on a CPU). The difference between a true spin lock and an
> rt_mutex was that the spinner would stop spinning if the owner was
> preempted (a true spin lock owner could not be preempted).
> 
> After applying the adaptive spinning, we were able to scale PREEMPT_RT to
> any number of CPUs that the normal kernel could do with just a linear
> performance hit.

Sounds like it was spinning for far too long at the best of times.
But analysing these sort of latencies is hard.

	David

> 
> This is why I'm very much interested in getting the same ability into user
> space spin locks.
> 
> -- Steve

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
  
Steven Rostedt Oct. 2, 2023, 5:56 p.m. UTC | #28
On Mon, 2 Oct 2023 17:22:34 +0000
David Laight <David.Laight@ACULAB.COM> wrote:


> > And what heuristic would you use? In my experience, a "time to spin" picked
> > for one workload may cause major regressions in another workload.
> > I came to the conclusion to "hate" heuristics and NACK them whenever
> > someone suggested adding them to the rt_mutex in the kernel (back before
> > adaptive mutexes were introduced).  
> 
> Isn't that exactly what an adaptive mutex does?
> Spin 'for a bit' before sleeping.

But it's not some arbitrary time to spin. Technically, a kernel spin lock
is spinning on the heuristic of ownership. "Spin until the lock is
released" is a heuristic!

> 
> > > > The obvious problem with their implementation is that if the owner is
> > > > sleeping, there's no point in spinning. Worse, the owner may even be
> > > > waiting for the spinner to get off the CPU before it can run again. But
> > > > according to Robert, the gain in the general performance greatly
> > > > outweighed the few times this happened in practice.  
> > >
> > > Unless you can use atomics (ok for bits and linked lists) you
> > > always have the problem that userspace can't disable interrupts.
> > > So, unlike the kernel, you can't implement a proper spinlock.  
> > 
> > Why do you need to disable interrupts? If you know the owner is running on
> > the CPU, you know it's not trying to run on the CPU that is acquiring the
> > lock. Heck, there's normal spin locks outside of PREEMPT_RT that do not
> > disable interrupts. The only time you need to disable interrupts is if the
> > interrupt itself takes the spin lock, and that's just to prevent deadlocks.  
> 
> You need to disable interrupts in order to bound the time the
> spinlock is held for.
> If all you are doing is a dozen instructions (e.g. to remove an
> item from a list) then you really don't want an interrupt coming in
> while you have the spinlock held.

That's just the noise of normal processing. What's the difference between it
happening during spinning and it happening during normal execution?

> It isn't the cost of the ISR - that has to happen sometime, but that
> the cpu waiting for the spinlock also takes the cost of the ISR.

As opposed to just going into the kernel? So it wastes some of its quota.
It's not stopping anything else from running more than normal.

> 
> A network+softint ISR can run for a long time - I'm sure I've
> seen a good fraction of a millisecond.
> You really don't want another (or many other) cpu spinning while
> that is going on.

Why not? The current user-space-only code does that now (and it will even
spin if the owner is preempted). What we are talking about implementing is
a big improvement to what is currently done.

> Which (to my mind) pretty much means that you always want to
> disable interrupts on a spinlock.

The benchmarks say otherwise. Sure, once in a while you may spin longer
because of an interrupt, but that's a very rare occurrence compared to
normal taking of spin locks. Disabling interrupts is an expensive
operation. The savings you get from "not waiting for a softirq to finish"
will be drowned out by the added overhead of disabling interrupts at every
acquire.

> If the architecture makes masking ISR expensive then I've seen schemes
> that let the hardware interrupt happen, then disable it and rerun later.
> 
> > > I've NFI how CONFIG_RT manages to get anything done with all
> > > the spinlocks replaced by sleep locks.
> > > Clearly there are spinlocks that are held for far too long.
> > > But you really do want to spin most of the time.  
> > 
> > It spins as long as the owner of the lock is running on the CPU. This is
> > what we are looking to get from this patch series for user space.  
> 
> I think you'd need to detect that the cpu was in-kernel running an ISR.

For the few times that might happen, it's not worth it.

> 
> But the multithreaded audio app I was 'fixing' basically failed
> as soon as it had to sleep on one of the futexes.
> The real problem was an ISR running while the mutex was held.
> So deciding to sleep because the lock owner isn't running (in user)
> would already be delaying things too much.

That doesn't sound like the use case we are fixing. If your audio app
failed because it had to sleep, that tells me it would fail regardless.

> 
> > 
> > Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
> > with the sleeping spin locks would start to exponentially slow down with
> > the more CPUs you had. Once we hit more than 16 CPUs,  the time to boot a
> > kernel took 10s of minutes to boot RT when the normal CONFIG_PREEMPT kernel
> > would only take a couple of minutes. The more CPUs you added, the worse it
> > became.
> > 
> > Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
> > the mutex was still running on another CPU. This actually mimics a real
> > spin lock (because that's exactly what they do, they spin while the owner
> > is running on a CPU). The difference between a true spin lock and an
> > rt_mutex was that the spinner would stop spinning if the owner was
> > preempted (a true spin lock owner could not be preempted).
> > 
> > After applying the adaptive spinning, we were able to scale PREEMPT_RT to
> > any number of CPUs that the normal kernel could do with just a linear
> > performance hit.  
> 
> Sounds like it was spinning for far too long at the best of times.
> But analysing these sort of latencies is hard.

It wasn't spinning at all! The problem was that all rt_mutexes would
immediately sleep on any contention. This caused a ripple effect that would
increase the time locks were held and that would increase contention which
increased the time more. A very bad feedback loop.

This was all very well traced and studied. That analysis was not hard at
all. We know exactly what the cause was and why adaptive mutexes fixed the
situation. And this is why I'm excited about this current work.

-- Steve
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..7741ff10136a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1311,6 +1311,7 @@  struct task_struct {
 	 * with respect to preemption.
 	 */
 	unsigned long rseq_event_mask;
+	struct rseq_sched_state __user *rseq_sched_state;
 #endif
 
 #ifdef CONFIG_SCHED_MM_CID
@@ -2351,11 +2352,20 @@  static inline void rseq_signal_deliver(struct ksignal *ksig,
 	rseq_handle_notify_resume(ksig, regs);
 }
 
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
+
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (t->rseq_sched_state)
+		__rseq_set_sched_state(t, state);
+}
+
 /* rseq_preempt() requires preemption to be disabled. */
 static inline void rseq_preempt(struct task_struct *t)
 {
 	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
 	rseq_set_notify_resume(t);
+	rseq_set_sched_state(t, 0);
 }
 
 /* rseq_migrate() requires preemption to be disabled. */
@@ -2376,11 +2386,13 @@  static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
+		t->rseq_sched_state = NULL;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_sched_state = current->rseq_sched_state;
 	}
 }
 
@@ -2390,6 +2402,7 @@  static inline void rseq_execve(struct task_struct *t)
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
+	t->rseq_sched_state = NULL;
 }
 
 #else
@@ -2405,6 +2418,9 @@  static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
 }
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+}
 static inline void rseq_preempt(struct task_struct *t)
 {
 }
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..b28588225fa7 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,13 @@  enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_sched_state_flags {
+	/*
+	 * Task is currently running on a CPU if bit is set.
+	 */
+	RSEQ_SCHED_STATE_FLAG_ON_CPU		= (1U << 0),
+};
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -53,6 +60,31 @@  struct rseq_cs {
 	__u64 abort_ip;
 } __attribute__((aligned(4 * sizeof(__u64))));
 
+/*
+ * rseq_sched_state should be aligned on the cache line size.
+ */
+struct rseq_sched_state {
+	/*
+	 * Version of this structure. Populated by the kernel, read by
+	 * user-space.
+	 */
+	__u32 version;
+	/*
+	 * The state is updated by the kernel. Read by user-space with
+	 * single-copy atomicity semantics. This field can be read by any
+	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
+	 * rseq_sched_state_flags. This field is provided as a hint by the
+	 * scheduler, and requires that the page holding this state is
+	 * faulted-in for the state update to be performed by the scheduler.
+	 */
+	__u32 state;
+	/*
+	 * Thread ID associated with the thread registering this structure.
+	 * Initialized by user-space before registration.
+	 */
+	__u32 tid;
+};
+
 /*
  * struct rseq is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line.
@@ -148,6 +180,15 @@  struct rseq {
 	 */
 	__u32 mm_cid;
 
+	__u32 padding1;
+
+	/*
+	 * Restartable sequences sched_state_ptr field. Initialized by
+	 * userspace to the address at which the struct rseq_sched_state is
+	 * located. Read by the kernel on rseq registration.
+	 */
+	__u64 sched_state_ptr;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..e36d6deeae77 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -87,10 +87,12 @@ 
 
 static int rseq_update_cpu_node_id(struct task_struct *t)
 {
+	struct rseq_sched_state __user *rseq_sched_state = t->rseq_sched_state;
 	struct rseq __user *rseq = t->rseq;
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 mm_cid = task_mm_cid(t);
+	u32 sched_state = RSEQ_SCHED_STATE_FLAG_ON_CPU;
 
 	WARN_ON_ONCE((int) mm_cid < 0);
 	if (!user_write_access_begin(rseq, t->rseq_len))
@@ -99,6 +101,7 @@  static int rseq_update_cpu_node_id(struct task_struct *t)
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
+	unsafe_put_user(sched_state, &rseq_sched_state->state, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -339,6 +342,18 @@  void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+/*
+ * Attempt to update rseq scheduler state.
+ */
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	pagefault_disable();
+	(void) put_user(state, &t->rseq_sched_state->state);
+	pagefault_enable();
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
@@ -359,6 +374,29 @@  void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+static int rseq_get_sched_state_ptr(struct rseq __user *rseq, u32 rseq_len,
+				    struct rseq_sched_state __user **_sched_state_ptr)
+{
+	struct rseq_sched_state __user *sched_state_ptr;
+	u64 sched_state_ptr_value;
+	u32 version = 0;
+	int ret;
+
+	if (rseq_len < offsetofend(struct rseq, sched_state_ptr))
+		return 0;
+	ret = get_user(sched_state_ptr_value, &rseq->sched_state_ptr);
+	if (ret)
+		return ret;
+	sched_state_ptr = (struct rseq_sched_state __user *)(unsigned long)sched_state_ptr_value;
+	if (!sched_state_ptr)
+		return 0;
+	ret = put_user(version, &sched_state_ptr->version);
+	if (ret)
+		return ret;
+	*_sched_state_ptr = sched_state_ptr;
+	return 0;
+}
+
 /*
  * sys_rseq - setup restartable sequences for caller thread.
  */
@@ -366,6 +404,7 @@  SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		int, flags, u32, sig)
 {
 	int ret;
+	struct rseq_sched_state __user *sched_state_ptr = NULL;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -383,6 +422,7 @@  SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
+		current->rseq_sched_state = NULL;
 		return 0;
 	}
 
@@ -420,9 +460,12 @@  SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
+	if (rseq_get_sched_state_ptr(rseq, rseq_len, &sched_state_ptr))
+		return -EFAULT;
 	current->rseq = rseq;
 	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
+	current->rseq_sched_state = sched_state_ptr;
 	/*
 	 * If rseq was previously inactive, and has just been
 	 * registered, ensure the cpu_id_start and cpu_id fields