[2/2] Add single-threaded fast path to rand()

Message ID PAWPR08MB898203ABD80D81B91E5398B4832D2@PAWPR08MB8982.eurprd08.prod.outlook.com
State Superseded
Delegated to: Adhemerval Zanella Netto
Headers
Series [1/2] Add random benchmark |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed

Commit Message

Wilco Dijkstra March 18, 2024, 3:20 p.m. UTC
  Improve performance of rand() and __random() by adding a single-threaded fast
path.  Bench-random-lock shows about 5x speedup on Neoverse V1.

OK for commit?

---
  

Comments

Cristian Rodríguez March 18, 2024, 7:24 p.m. UTC | #1
On Mon, Mar 18, 2024 at 12:21 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Improve performance of rand() and __random() by adding a single-threaded fast
> path.  Bench-random-lock shows about 5x speedup on Neoverse V1.
>
> OK for commit?

So the generalized lock-omitting code in __libc_lock_lock didn't
flew.. what a new fast path macro function like (sorry I suck at
naming things) __libc_maybe_lock_lock that only locks if not single
thread ?
  
Wilco Dijkstra March 19, 2024, 3:44 p.m. UTC | #2
Hi Cristian,

> So the generalized lock-omitting code in __libc_lock_lock didn't
> flew..

There were some comments on it. One issue is that it would affect
practically every lock in GLIBC, thus adding code and branches even for
locks that don't benefit from single-threaded optimization.

Hence the idea of first sorting out rand() since it is a frequent performance
complaint (unfortunately).

> what a new fast path macro function like (sorry I suck at
> naming things) __libc_maybe_lock_lock that only locks if not single
> thread ?

Yes despite the huge mess of locking macros in use, adding a new set of
macros may be better overall. We could say that the critical section must not
create a new thread itself. This means we can avoid checking the lock value
and compile the lock macro like:

#define __libc_fast_lock(NAME) if (SINGLE_THREAD_P || !__libc_lock_lock (NAME)) {

Cheers,
Wilco
  
Adhemerval Zanella March 20, 2024, 12:31 p.m. UTC | #3
On 19/03/24 12:44, Wilco Dijkstra wrote:
> Hi Cristian,
> 
>> So the generalized lock-omitting code in __libc_lock_lock didn't
>> flew..
> 
> There were some comments on it. One issue is that it would affect
> practically every lock in GLIBC, thus adding code and branches even for
> locks that don't benefit from single-threaded optimization.
> 
> Hence the idea of first sorting out rand() since it is a frequent performance
> complaint (unfortunately).

I don't have a strong opinion on this rand patch, if this idea is to
have it as an workbench for a possible single-thread lock optimization
it should be fine.  It is just that I don't see much gain in optimizing
such a bad interface (although we still lack a proper userland PRNG).

> 
>> what a new fast path macro function like (sorry I suck at
>> naming things) __libc_maybe_lock_lock that only locks if not single
>> thread ?
> 
> Yes despite the huge mess of locking macros in use, adding a new set of
> macros may be better overall. We could say that the critical section must not
> create a new thread itself. This means we can avoid checking the lock value
> and compile the lock macro like:
> 
> #define __libc_fast_lock(NAME) if (SINGLE_THREAD_P || !__libc_lock_lock (NAME)) {

This sounds reasonable, maybe we use a static inline instead to enforce the
lock type (not sure if this really required).
  
Cristian Rodríguez March 20, 2024, 2:18 p.m. UTC | #4
On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:


> I don't have a strong opinion on this rand patch, if this idea is to
> have it as an workbench for a possible single-thread lock optimization
> it should be fine.  It is just that I don't see much gain in optimizing
> such a bad interface (although we still lack a proper userland PRNG).

Yeah, it should be no surprise this interfaces are bad,
I thought this was common knowledge.

we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
which prety much outperforms even non-CS algorithms in at least 64 bit x86.
but the question of the state remains.global? TLS? how to discard it
in all the appropriate occasions?
  
Adhemerval Zanella March 20, 2024, 2:27 p.m. UTC | #5
On 20/03/24 11:18, Cristian Rodríguez wrote:
> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> 
> 
>> I don't have a strong opinion on this rand patch, if this idea is to
>> have it as an workbench for a possible single-thread lock optimization
>> it should be fine.  It is just that I don't see much gain in optimizing
>> such a bad interface (although we still lack a proper userland PRNG).
> 
> Yeah, it should be no surprise this interfaces are bad,
> I thought this was common knowledge.
> 
> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
> which prety much outperforms even non-CS algorithms in at least 64 bit x86.
> but the question of the state remains.global? TLS? how to discard it
> in all the appropriate occasions?

And this is the arc4random in userspace discussion all over again.
  
Wilco Dijkstra March 20, 2024, 2:28 p.m. UTC | #6
Hi Cristian,

> Yeah, it should be no surprise this interfaces are bad,
> I thought this was common knowledge.

It's bad but we also made it thread-safe via locks rather than add its state to TLS.

> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
> which prety much outperforms even non-CS algorithms in at least 64 bit x86.
> but the question of the state remains.global? TLS? how to discard it
> in all the appropriate occasions?

I believe the current implementation is already way overkill. All you need is a simple
LFSR random generator with 64 bits of state and a decently large cycle. It's not
meant to be a serious random number generator - it's invariably used to generate
randomized inputs for testing and benchmarking. That's all. The complaints are
due to it being too slow due to all the locking.

Cheers,
Wilco
  
Xi Ruoyao March 20, 2024, 2:40 p.m. UTC | #7
On Wed, 2024-03-20 at 14:28 +0000, Wilco Dijkstra wrote:
> Hi Cristian,
> 
> > Yeah, it should be no surprise this interfaces are bad,
> > I thought this was common knowledge.
> 
> It's bad but we also made it thread-safe via locks rather than add its state to TLS.
> 
> > we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
> > which prety much outperforms even non-CS algorithms in at least 64 bit x86.
> > but the question of the state remains.global? TLS? how to discard it
> > in all the appropriate occasions?
> 
> I believe the current implementation is already way overkill. All you need is a simple
> LFSR random generator with 64 bits of state and a decently large cycle. It's not
> meant to be a serious random number generator - it's invariably used to generate
> randomized inputs for testing and benchmarking. That's all. The complaints are
> due to it being too slow due to all the locking.

Yes, we don't need a *CS*PRNG, just a PRNG.  Thus no "oops, the user
space cannot know whether to reseed" stuff.
  
Florian Weimer March 21, 2024, 7:39 a.m. UTC | #8
* Adhemerval Zanella Netto:

> On 20/03/24 11:18, Cristian Rodríguez wrote:
>> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto
>> <adhemerval.zanella@linaro.org> wrote:
>> 
>> 
>>> I don't have a strong opinion on this rand patch, if this idea is to
>>> have it as an workbench for a possible single-thread lock optimization
>>> it should be fine.  It is just that I don't see much gain in optimizing
>>> such a bad interface (although we still lack a proper userland PRNG).
>> 
>> Yeah, it should be no surprise this interfaces are bad,
>> I thought this was common knowledge.
>> 
>> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
>> which prety much outperforms even non-CS algorithms in at least 64 bit x86.
>> but the question of the state remains.global? TLS? how to discard it
>> in all the appropriate occasions?
>
> And this is the arc4random in userspace discussion all over again.

Agreed.  But what has changed that we know now that Linux won't provide
us with vDSO acceleration for arc4random.  So I think it wouldn't be
unreasonable to roll our own.  Right now, the switch to arc4random
provided by glibc is a massive performance regression compared to other
implementations.

Thanks,
Florian
  
Cristian Rodríguez March 21, 2024, 1:33 p.m. UTC | #9
On Thu, Mar 21, 2024 at 4:39 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Adhemerval Zanella Netto:
>
> > On 20/03/24 11:18, Cristian Rodríguez wrote:
> >> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto
> >> <adhemerval.zanella@linaro.org> wrote:
> >>
> >>
> >>> I don't have a strong opinion on this rand patch, if this idea is to
> >>> have it as an workbench for a possible single-thread lock optimization
> >>> it should be fine.  It is just that I don't see much gain in optimizing
> >>> such a bad interface (although we still lack a proper userland PRNG).
> >>
> >> Yeah, it should be no surprise this interfaces are bad,
> >> I thought this was common knowledge.
> >>
> >> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
> >> which prety much outperforms even non-CS algorithms in at least 64 bit x86.
> >> but the question of the state remains.global? TLS? how to discard it
> >> in all the appropriate occasions?
> >
> > And this is the arc4random in userspace discussion all over again.
>
> Agreed.  But what has changed that we know now that Linux won't provide
> us with vDSO acceleration for arc4random.  So I think it wouldn't be
> unreasonable to roll our own.  Right now, the switch to arc4random
> provided by glibc is a massive performance regression compared to other
> implementations.


Afaik the other path that hasn't been tried is rseq no ? could the
kernel provide a random state this way that it is cleared on the right
conditions..
  
Mathieu Desnoyers March 21, 2024, 2:35 p.m. UTC | #10
On 2024-03-21 09:33, Cristian Rodríguez wrote:
> On Thu, Mar 21, 2024 at 4:39 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Adhemerval Zanella Netto:
>>
>>> On 20/03/24 11:18, Cristian Rodríguez wrote:
>>>> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>
>>>>
>>>>> I don't have a strong opinion on this rand patch, if this idea is to
>>>>> have it as an workbench for a possible single-thread lock optimization
>>>>> it should be fine.  It is just that I don't see much gain in optimizing
>>>>> such a bad interface (although we still lack a proper userland PRNG).
>>>>
>>>> Yeah, it should be no surprise this interfaces are bad,
>>>> I thought this was common knowledge.
>>>>
>>>> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
>>>> which prety much outperforms even non-CS algorithms in at least 64 bit x86.
>>>> but the question of the state remains.global? TLS? how to discard it
>>>> in all the appropriate occasions?
>>>
>>> And this is the arc4random in userspace discussion all over again.
>>
>> Agreed.  But what has changed that we know now that Linux won't provide
>> us with vDSO acceleration for arc4random.  So I think it wouldn't be
>> unreasonable to roll our own.  Right now, the switch to arc4random
>> provided by glibc is a massive performance regression compared to other
>> implementations.
> 
> 
> Afaik the other path that hasn't been tried is rseq no ? could the
> kernel provide a random state this way that it is cleared on the right
> conditions..

Letting rseq know about userspace random state would add a lot of coupling
between kernel and userspace. I wonder if we can achieve the speed up you
are looking for using existing rseq/membarrier features.

At the last GNU Cauldron, I created a biased locking prototype for Florian
based on rseq:

https://github.com/compudj/librseq/tree/biased-lock
https://github.com/compudj/librseq/blob/biased-lock/tests/rseq_biased_lock.h
https://github.com/compudj/librseq/blob/biased-lock/tests/rseq_biased_lock.c
https://github.com/compudj/librseq/blob/biased-lock/tests/test_rseq_biased_lock.c

The original purpose of this prototype was to replace the FILE lock
by a biased lock, which acts as a very fast lock (a small rseq assembly
sequence of loads, comparisons, stores) as long as the thread accessing the
FILE is the thread owner thread whom created it.

If another thread attempts to access the FILE, then the internal state machine
of the biased lock transitions from RSEQ_BIASED_LOCK_STATE_ST to
RSEQ_BIASED_LOCK_STATE_MT_STARTED, issues a membarrier RSEQ_PRIVATE_EXPEDITED
(also referred to as "rseq fence") to abort all pre-existing rseq critical
sections, and then transition to RSEQ_BIASED_LOCK_STATE_MT_READY before taking
the lock.

After the biased lock has transitioned to RSEQ_BIASED_LOCK_STATE_MT_READY
it is meant to stay in that state forever. So as long as only the owner
thread uses the lock, it stays fast. This means that even a multi-threaded
process with short-lived FILE that are only used by a single thread in their
lifetime will benefit from the biased lock fast-path.

I suspect we could adapt this scheme to the srand()/rand() lock. Here is how
I see this unfold for both srand() and rand():

   - On first use (if rseq biased lock owner is NULL):
     - use rseq_biased_lock_try_set_fast_thread to set lock owner with a
       compare-and-swap (expecting NULL).
   - Take the rseq biased lock:
     - This rseq biased lock will take care of transitioning to a scheme
       which is MT-aware if it detects that the thread trying to access
       the biased lock is not the owner.

So this scheme applied to both srand() and rand() should take care of making
things really fast as long as only the biased lock owner thread uses
srand()/rand().

Its overhead would be a single system call (membarrier) issued the first
time the biased lock is accessed from a thread which differs from the
first thread which used the lock.

Thoughts ?

Thanks,

Mathieu
  
Szabolcs Nagy March 21, 2024, 3:07 p.m. UTC | #11
The 03/21/2024 10:35, Mathieu Desnoyers wrote:
> On 2024-03-21 09:33, Cristian Rodríguez wrote:
> > Afaik the other path that hasn't been tried is rseq no ? could the
> > kernel provide a random state this way that it is cleared on the right
> > conditions..
> 
> Letting rseq know about userspace random state would add a lot of coupling
> between kernel and userspace. I wonder if we can achieve the speed up you
> are looking for using existing rseq/membarrier features.
> 
> At the last GNU Cauldron, I created a biased locking prototype for Florian
> based on rseq:
...

i think rand() should be a 2 line function, without any
locking, since it is *not* required to be thread-safe nor
crypto quality.

but i guess for glibc it's safer to keep the current
implementation just with single-thread checks.

either way, it is not the best candidate for rseq
optimizations, as that complexity is not justified, to
speed up code with undefined behaviour..
  
Wilco Dijkstra March 21, 2024, 3:18 p.m. UTC | #12
Hi,

> i think rand() should be a 2 line function, without any
> locking, since it is *not* required to be thread-safe nor
> crypto quality.

Absolutely.

> but i guess for glibc it's safer to keep the current
> implementation just with single-thread checks.

I think it is useful to rewrite it using a much simpler implementation
that just has 64 bits of global state. Then you don't need any locks at all,
we could use a relaxed atomic operation for updates or add it to TLS
since its small enough.

> either way, it is not the best candidate for rseq
> optimizations, as that complexity is not justified, to
> speed up code with undefined behaviour..

Agreed. The complexity of the current random algorithm and all the
associated locking is already way out of order. Programming is all about
keeping stuff as simple and fast as possible.

Cheers,
Wilco
  
Adhemerval Zanella March 21, 2024, 3:53 p.m. UTC | #13
On 21/03/24 10:33, Cristian Rodríguez wrote:
> On Thu, Mar 21, 2024 at 4:39 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Adhemerval Zanella Netto:
>>
>>> On 20/03/24 11:18, Cristian Rodríguez wrote:
>>>> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>
>>>>
>>>>> I don't have a strong opinion on this rand patch, if this idea is to
>>>>> have it as an workbench for a possible single-thread lock optimization
>>>>> it should be fine.  It is just that I don't see much gain in optimizing
>>>>> such a bad interface (although we still lack a proper userland PRNG).
>>>>
>>>> Yeah, it should be no surprise this interfaces are bad,
>>>> I thought this was common knowledge.
>>>>
>>>> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md
>>>> which prety much outperforms even non-CS algorithms in at least 64 bit x86.
>>>> but the question of the state remains.global? TLS? how to discard it
>>>> in all the appropriate occasions?
>>>
>>> And this is the arc4random in userspace discussion all over again.
>>
>> Agreed.  But what has changed that we know now that Linux won't provide
>> us with vDSO acceleration for arc4random.  So I think it wouldn't be
>> unreasonable to roll our own.  Right now, the switch to arc4random
>> provided by glibc is a massive performance regression compared to other
>> implementations.
> 
> 
> Afaik the other path that hasn't been tried is rseq no ? could the
> kernel provide a random state this way that it is cleared on the right
> conditions..

If I recall correctly the main issues with initial arc4random implementation,
and which lead us to roll back the chacha20 optimized ones (similar to what
other OS do); was when and how to reset the state (there are other minor
hardening, like to no spill out state on memory that vDSO experiment also
did).  The initial attempt used some heuristics, like when it reached a certain
limits of byes; plus some obvious points (like fork); that might not be 
sufficient for all cases.

And even if arc4random is explicit a non CPRNG, there were some worries that 
users might misuse the interface and thus add some security issues.  I had to
check again, but last time I checked on how OpenBSD uses this interface was 
mainly for hardening (for instance on malloc and other stuff) and to provide 
fast PRNG for simulation and such.
  
Zack Weinberg March 22, 2024, 2:27 p.m. UTC | #14
On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote:
> And even if arc4random is explicit a non CPRNG, there were some worries that 
> users might misuse the interface and thus add some security issues.

No opinion about anything else in this thread, but if we add arc4random at all
it MUST be a CSPRNG.  That's a documented guarantee on all the systems that
do have it, and applications rely on it.

zw
  
Adhemerval Zanella March 22, 2024, 2:46 p.m. UTC | #15
On 22/03/24 11:27, Zack Weinberg wrote:
> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote:
>> And even if arc4random is explicit a non CPRNG, there were some worries that 
>> users might misuse the interface and thus add some security issues.
> 
> No opinion about anything else in this thread, but if we add arc4random at all
> it MUST be a CSPRNG.  That's a documented guarantee on all the systems that
> do have it, and applications rely on it.

Yeah, this is another point of contention where one might consider that a
userland CPRNG that has no feedback from kernel to where/how to properly
reseed might not be considered a CPRNG.
  
Zack Weinberg March 22, 2024, 3:30 p.m. UTC | #16
On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote:
> On 22/03/24 11:27, Zack Weinberg wrote:
>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote:
>>> And even if arc4random is explicit a non CPRNG, there were some worries that 
>>> users might misuse the interface and thus add some security issues.
>> 
>> No opinion about anything else in this thread, but if we add arc4random at all
>> it MUST be a CSPRNG.  That's a documented guarantee on all the systems that
>> do have it, and applications rely on it.
>
> Yeah, this is another point of contention where one might consider that a
> userland CPRNG that has no feedback from kernel to where/how to properly
> reseed might not be considered a CPRNG.

I would describe that as a "CSPRNG with a known bug that makes it unsuitable
for use under some conditions", but not as "not a CSPRNG".  I would only
call it "not a CSPRNG" if the cryptographic primitives were no good
(e.g. RC4 or Xorshift or something even more predictable) or if there was
a way to leak or clone the state *in a single-threaded program that does
not fork*.

On a related note, why is MADV_WIPEONFORK not adequate "feedback from the
kernel"?

zw
  
Adhemerval Zanella March 22, 2024, 6:05 p.m. UTC | #17
On 22/03/24 12:30, Zack Weinberg wrote:
> On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote:
>> On 22/03/24 11:27, Zack Weinberg wrote:
>>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote:
>>>> And even if arc4random is explicit a non CPRNG, there were some worries that 
>>>> users might misuse the interface and thus add some security issues.
>>>
>>> No opinion about anything else in this thread, but if we add arc4random at all
>>> it MUST be a CSPRNG.  That's a documented guarantee on all the systems that
>>> do have it, and applications rely on it.
>>
>> Yeah, this is another point of contention where one might consider that a
>> userland CPRNG that has no feedback from kernel to where/how to properly
>> reseed might not be considered a CPRNG.
> 
> I would describe that as a "CSPRNG with a known bug that makes it unsuitable
> for use under some conditions", but not as "not a CSPRNG".  I would only
> call it "not a CSPRNG" if the cryptographic primitives were no good
> (e.g. RC4 or Xorshift or something even more predictable) or if there was
> a way to leak or clone the state *in a single-threaded program that does
> not fork*.

I tend to agree, but the contention point was really 'that makes it unsuitable
for use under some conditions' was a deal breaker in face that kernel provides
an API with better guarantees.

> 
> On a related note, why is MADV_WIPEONFORK not adequate "feedback from the
> kernel"?

If I recall correctly, the problem was not only state wipe on fork (with
MADV_WIPEONFORK should take care), but rather when the state needs to be
reseed due various situations outside of the userland knowledge (on the
arc4random thread Jason gave us some examples, I don't really recall all
of them by hearth). That's why the idea of providing the arc4random through
a vDSO primitive (where kernel can reseed any time it likes).
  
Mathieu Desnoyers March 22, 2024, 7:47 p.m. UTC | #18
On 2024-03-22 14:05, Adhemerval Zanella Netto wrote:
> 
> 
> On 22/03/24 12:30, Zack Weinberg wrote:
>> On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote:
>>> On 22/03/24 11:27, Zack Weinberg wrote:
>>>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote:
>>>>> And even if arc4random is explicit a non CPRNG, there were some worries that
>>>>> users might misuse the interface and thus add some security issues.
>>>>
>>>> No opinion about anything else in this thread, but if we add arc4random at all
>>>> it MUST be a CSPRNG.  That's a documented guarantee on all the systems that
>>>> do have it, and applications rely on it.
>>>
>>> Yeah, this is another point of contention where one might consider that a
>>> userland CPRNG that has no feedback from kernel to where/how to properly
>>> reseed might not be considered a CPRNG.
>>
>> I would describe that as a "CSPRNG with a known bug that makes it unsuitable
>> for use under some conditions", but not as "not a CSPRNG".  I would only
>> call it "not a CSPRNG" if the cryptographic primitives were no good
>> (e.g. RC4 or Xorshift or something even more predictable) or if there was
>> a way to leak or clone the state *in a single-threaded program that does
>> not fork*.
> 
> I tend to agree, but the contention point was really 'that makes it unsuitable
> for use under some conditions' was a deal breaker in face that kernel provides
> an API with better guarantees.
> 
>>
>> On a related note, why is MADV_WIPEONFORK not adequate "feedback from the
>> kernel"?
> 
> If I recall correctly, the problem was not only state wipe on fork (with
> MADV_WIPEONFORK should take care), but rather when the state needs to be
> reseed due various situations outside of the userland knowledge (on the
> arc4random thread Jason gave us some examples, I don't really recall all
> of them by hearth). That's why the idea of providing the arc4random through
> a vDSO primitive (where kernel can reseed any time it likes).

If the goal is to let userspace know that it needs to reseed due to
various kernel events happening, one way I see we could extend rseq
to support this would be to add a 64-bit "seed generation counter"
in the struct rseq per-thread area which would be incremented by the
kernel when the seed needs to be updated in userspace.

This would allow many userspace PRNG libraries to use this facility
within a process, as there would be no single per-library location the
kernel needs to reset, and would remove all knowledge of the PRNG
internal state from the kernel.

Thoughts ?

Thanks,

Mathieu
  
Cristian Rodríguez March 22, 2024, 10:54 p.m. UTC | #19
On Fri, Mar 22, 2024 at 4:47 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-03-22 14:05, Adhemerval Zanella Netto wrote:
> >
> >
> > On 22/03/24 12:30, Zack Weinberg wrote:
> >> On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote:
> >>> On 22/03/24 11:27, Zack Weinberg wrote:
> >>>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote:

> If the goal is to let userspace know that it needs to reseed due to
> various kernel events happening,

Yes, this is the missing information.. "some event happened that the
kernel knows about
we need to drop the state and start over" being fork, suspend, resume,
whatever future unknown event the kernel might consider.
  
Zack Weinberg March 23, 2024, 2:01 p.m. UTC | #20
On Fri, Mar 22, 2024, at 3:47 PM, Mathieu Desnoyers wrote:
> On 2024-03-22 14:05, Adhemerval Zanella Netto wrote:
>> On 22/03/24 12:30, Zack Weinberg wrote:

>>> I would describe that as a "CSPRNG with a known bug that makes it
>>> unsuitable for use under some conditions", but not as "not a CSPRNG".
...
>> I tend to agree, but the contention point was really 'that makes it
>> unsuitable for use under some conditions' was a deal breaker in face that
>> kernel provides an API with better guarantees.

How strong exactly are the guarantees that OpenBSD provides for its
arc4random?  I don't think we *need* to do any better than that,
although obviously we should if we can.

>>> On a related note, why is MADV_WIPEONFORK not adequate "feedback from the
>>> kernel"?
>> If I recall correctly, the problem was not only state wipe on fork (with
>> MADV_WIPEONFORK should take care), but rather when the state needs to be
>> reseed due various situations outside of the userland knowledge
...
> If the goal is to let userspace know that it needs to reseed due to
> various kernel events happening, one way I see we could extend rseq
> to support this would be to add a 64-bit "seed generation counter"
> in the struct rseq per-thread area which would be incremented by the
> kernel when the seed needs to be updated in userspace.

I don't know hardly anything about rseq.  I think that sounds workable
from libc's side of the fence; the remaining questions I see are

1) Will the kernel take your patch?
2) Is it OK for us to provide an arc4random implementation that uses
this generation counter when available, but, when it's not available,
doesn't reseed on these events that are invisible to user space?

---

Independently, I propose that the existing non-cryptographic PRNGs
(rand(), random(), etc.) should all be changed to run off a thread-local
scrambled-linear generator
(https://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf).  These have
better statistical properties than anything we currently offer, and a
state space that's small enough (256 bits) that it's reasonable for us
to have one per thread, obviating locking concerns.

zw
  
Mathieu Desnoyers March 23, 2024, 3:23 p.m. UTC | #21
On 2024-03-23 10:01, Zack Weinberg wrote:

[...]

> ...
>> If the goal is to let userspace know that it needs to reseed due to
>> various kernel events happening, one way I see we could extend rseq
>> to support this would be to add a 64-bit "seed generation counter"
>> in the struct rseq per-thread area which would be incremented by the
>> kernel when the seed needs to be updated in userspace.
> 
> I don't know hardly anything about rseq.  I think that sounds workable
> from libc's side of the fence; the remaining questions I see are
> 
> 1) Will the kernel take your patch?

I can start by creating a proof-of-concept patch. If there are
use-cases justifying its integration, I don't see why the other
rseq co-maintainers would object.

As maintainer of the Linux kernel rseq system call, I would be OK
with it as long as it fits in the RSEQ design constraints: each new
rseq field should support many users per process (executable and
libraries), and we should try to minimize coupling between kernel
and user-space.

There are a few things I would need to know to create a prototype:

- Do we need a 64-bit counter for this, or is 32-bit enough ?

- What kernel events are we interested in ? I suspect that some are
   global (e.g. sleep, hibernate) and some are per-process (e.g.
   fork/clone). Are there other events I am missing here ?

- At which point would the generation counter be validated ? Would
   that be before generating a PRN or after ? If it's before, then
   what happens to the validity of this PNG if the kernel event
   happens exactly while the PRN is being generated ?

> 2) Is it OK for us to provide an arc4random implementation that uses
> this generation counter when available, but, when it's not available,
> doesn't reseed on these events that are invisible to user space?

That's up to you really. Or you could make this configurable: a user
could request a PRNG with security guarantees, or not. rseq
provides interfaces to query which fields are supported, so depending
on the user needs, your library could either return an error or allow
generating a less-secure random number.

> 
> ---
> 
> Independently, I propose that the existing non-cryptographic PRNGs
> (rand(), random(), etc.) should all be changed to run off a thread-local
> scrambled-linear generator
> (https://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf).  These have
> better statistical properties than anything we currently offer, and a
> state space that's small enough (256 bits) that it's reasonable for us
> to have one per thread, obviating locking concerns.

I've been wondering if doing so would break the PRNG guarantees from a
whole-program (multithreaded) perspective: let's suppose we have many
threads which are synchronized in such a way that the order of execution
of statements across threads is guaranteed to be the same across runs.
Should they expect a global PRNG with the same initial seed to have
the same behavior ? How would the per-thread seeds allow the global
PRNG walk to be reproducible across runs ?

Thanks,

Mathieu
  
Cristian Rodríguez March 24, 2024, 12:59 a.m. UTC | #22
On Sat, Mar 23, 2024 at 11:02 AM Zack Weinberg <zack@owlfolio.org> wrote:

> Independently, I propose that the existing non-cryptographic PRNGs
> (rand(), random(), etc.)

Afaik that's possible .. except the *48 interfaces that are explicit
on what algorithm to use..
  
Florian Weimer March 25, 2024, 6:44 a.m. UTC | #23
* Zack Weinberg:

> On Fri, Mar 22, 2024, at 3:47 PM, Mathieu Desnoyers wrote:
>> On 2024-03-22 14:05, Adhemerval Zanella Netto wrote:
>>> On 22/03/24 12:30, Zack Weinberg wrote:
>
>>>> I would describe that as a "CSPRNG with a known bug that makes it
>>>> unsuitable for use under some conditions", but not as "not a CSPRNG".
> ...
>>> I tend to agree, but the contention point was really 'that makes it
>>> unsuitable for use under some conditions' was a deal breaker in face that
>>> kernel provides an API with better guarantees.
>
> How strong exactly are the guarantees that OpenBSD provides for its
> arc4random?  I don't think we *need* to do any better than that,
> although obviously we should if we can.

I don't think OpenBSD deals with virtualization in this context.  I
don't know their reasons, but the use case must be vanishingly small.  I
don't expect that there are many who worry about key disclosure due to
VM snapshots and live migration, and, at the same time, are fine with
virtualization itself as potential source of leaks.

> Independently, I propose that the existing non-cryptographic PRNGs
> (rand(), random(), etc.) should all be changed to run off a thread-local
> scrambled-linear generator
> (https://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf).  These have
> better statistical properties than anything we currently offer, and a
> state space that's small enough (256 bits) that it's reasonable for us
> to have one per thread, obviating locking concerns.

I think that's only possible if the process has not called srand.

Thanks,
Florian
  
Mathieu Desnoyers March 25, 2024, 2:09 p.m. UTC | #24
On 2024-03-23 11:23, Mathieu Desnoyers wrote:
> On 2024-03-23 10:01, Zack Weinberg wrote:
> 
> [...]
> 
>> ...
>>> If the goal is to let userspace know that it needs to reseed due to
>>> various kernel events happening, one way I see we could extend rseq
>>> to support this would be to add a 64-bit "seed generation counter"
>>> in the struct rseq per-thread area which would be incremented by the
>>> kernel when the seed needs to be updated in userspace.
>>
>> I don't know hardly anything about rseq.  I think that sounds workable
>> from libc's side of the fence; the remaining questions I see are
>>
>> 1) Will the kernel take your patch?
> 
> I can start by creating a proof-of-concept patch. If there are
> use-cases justifying its integration, I don't see why the other
> rseq co-maintainers would object.

Reading the follow-up discussion on the thread, I now understand
that a generation counter is probably not sufficient. Or maybe there
are more than one use-case here ?

If the intended purpose is to have the kernel wipe out the keys
before going to sleep/hibernate, then you'll want an integration
similar to madvise MADV_WIPEONFORK, but with a new semantic that
also wipes on sleep/hibernate and other relevant events.

If the intent is to make sure that random numbers get re-seeded
after those kernel events, then rseq with a generation counter
would work.

Thanks,

Mathieu
  
Adhemerval Zanella March 25, 2024, 5:52 p.m. UTC | #25
On 25/03/24 11:09, Mathieu Desnoyers wrote:
> On 2024-03-23 11:23, Mathieu Desnoyers wrote:
>> On 2024-03-23 10:01, Zack Weinberg wrote:
>>
>> [...]
>>
>>> ...
>>>> If the goal is to let userspace know that it needs to reseed due to
>>>> various kernel events happening, one way I see we could extend rseq
>>>> to support this would be to add a 64-bit "seed generation counter"
>>>> in the struct rseq per-thread area which would be incremented by the
>>>> kernel when the seed needs to be updated in userspace.
>>>
>>> I don't know hardly anything about rseq.  I think that sounds workable
>>> from libc's side of the fence; the remaining questions I see are
>>>
>>> 1) Will the kernel take your patch?
>>
>> I can start by creating a proof-of-concept patch. If there are
>> use-cases justifying its integration, I don't see why the other
>> rseq co-maintainers would object.
> 
> Reading the follow-up discussion on the thread, I now understand
> that a generation counter is probably not sufficient. Or maybe there
> are more than one use-case here ?
> 
> If the intended purpose is to have the kernel wipe out the keys
> before going to sleep/hibernate, then you'll want an integration
> similar to madvise MADV_WIPEONFORK, but with a new semantic that
> also wipes on sleep/hibernate and other relevant events.

I don't fully recall all the details, but I think that's was Jason
idea with the getrandom vDSO strategy. It used a new syscall to allocate 
the state and this syscall used internally a new mmap flag to trigger
the required wipe/reset on the required events. He also wanted to
prevent userland to incorrectly use the syscall, so he moved the
buffer allocation to vDSO as well.

The vDSO implementation used the same cypher as the kernel one
(chacha20, with a extra constraint to avoid leak internal state
on the stack), and subdivides the state region in backets that were
addressed based on thread pointer.  Each bucket kept a generation
counter, which was updated by the kernel. Any failure (either in
generation check or contention on backets) fallback to getrandom
syscall.

> 
> If the intent is to make sure that random numbers get re-seeded
> after those kernel events, then rseq with a generation counter
> would work.
> 
> Thanks,
  

Patch

diff --git a/stdlib/random.c b/stdlib/random.c
index 62f22fac8d58c7977f09c134bf80a797750da645..174603a8915fd8aa4b3ae64d023003c9e2c038f2 100644
--- a/stdlib/random.c
+++ b/stdlib/random.c
@@ -51,6 +51,7 @@ 
    SUCH DAMAGE.*/
 
 #include <libc-lock.h>
+#include <sys/single_threaded.h>
 #include <limits.h>
 #include <stddef.h>
 #include <stdlib.h>
@@ -288,6 +289,12 @@  __random (void)
 {
   int32_t retval;
 
+  if (SINGLE_THREAD_P)
+    {
+      (void) __random_r (&unsafe_state, &retval);
+      return retval;
+    }
+
   __libc_lock_lock (lock);
 
   (void) __random_r (&unsafe_state, &retval);