stdlib: Tuned down tst-arc4random-thread internal parameters
Checks
Context | Check | Description
dj/TryBot-apply_patch | success | Patch applied to master at the time it was sent
dj/TryBot-32bit | success | Build for i686
Commit Message
With the new arc4random implementation, the internal parameters might
require a lot of runtime and/or trigger some contention on older
kernels (which might trigger spurious timeout failures).
Also, since we are now testing getrandom entropy instead of a
userspace RNG, there is not much need for extensive testing.
With this change, tst-arc4random-thread goes from about 1m to
5s on a Ryzen 9 with 5.15.0-41-generic.
Checked on x86_64-linux-gnu.
---
stdlib/tst-arc4random-thread.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
Comments
* Adhemerval Zanella:
> static int
> do_test (void)
> {
> + /* Do not run more threads than the maximum of online CPUs. */
> + long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
> + if (cpus != -1)
> + /* Limit the number to not overload the system. */
> + outer_threads = (cpus / 2) / inner_threads;
Doesn't this make outer_threads on many systems?
By the way, I think we should switch to the standard arc4random_uniform
implementation that doesn't try to conserve bits. It's cheaper to get a
larger number of bits once instead of obtaining 24 bits first, then 8
bits.
Thanks,
Florian
On 27/07/22 09:25, Florian Weimer wrote:
> * Adhemerval Zanella:
>
>> static int
>> do_test (void)
>> {
>> + /* Do not run more threads than the maximum of online CPUs. */
>> + long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
>> + if (cpus != -1)
>> + /* Limit the number to not overload the system. */
>> + outer_threads = (cpus / 2) / inner_threads;
>
> Doesn't this make outer_threads on many systems?
I did not quite follow; do you mean it makes outer_threads large?
>
> By the way, I think we should switch to the standard arc4random_uniform
> implementation that doesn't try to conserve bits. It's cheaper to get a
> larger number of bits once instead of obtaining 24 bits first, then 8
> bits.
Yeah, it should be simpler indeed. But I do not consider this urgent
for 2.36.
* Adhemerval Zanella Netto:
> On 27/07/22 09:25, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> static int
>>> do_test (void)
>>> {
>>> + /* Do not run more threads than the maximum of online CPUs. */
>>> + long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
>>> + if (cpus != -1)
>>> + /* Limit the number to not overload the system. */
>>> + outer_threads = (cpus / 2) / inner_threads;
>>
>> Doesn't this make outer_threads on many systems?
>
> I did not quite follow, make outer_threads large do you mean?
Sorry, make outer_threads *zero*.
>> By the way, I think we should switch to the standard arc4random_uniform
>> implementation that doesn't try to conserve bits. It's cheaper to get a
>> larger number of bits once instead of obtaining 24 bits first, then 8
>> bits.
>
> Yeah, it should simpler indeed. But I do not consider this urgent
> for 2.36.
Well, the standard way probably has more obvious statistical properties
and is harder to screw up. 8-/
Thanks,
Florian
On 27/07/22 09:42, Florian Weimer wrote:
> * Adhemerval Zanella Netto:
>
>> On 27/07/22 09:25, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> static int
>>>> do_test (void)
>>>> {
>>>> + /* Do not run more threads than the maximum of online CPUs. */
>>>> + long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
>>>> + if (cpus != -1)
>>>> + /* Limit the number to not overload the system. */
>>>> + outer_threads = (cpus / 2) / inner_threads;
>>>
>>> Doesn't this make outer_threads on many systems?
>>
>> I did not quite follow, make outer_threads large do you mean?
>
> Sorry, make outer_threads *zero*.
Yeah, it should be "(cpus / 2) / inner_threads ?: 1". I will send
an update.
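For reference, the `?: 1` GNU extension in the quoted fix can be spelled out explicitly; this is a minimal sketch (a hypothetical helper, not the actual patch, which inlines the computation in do_test):

```c
/* Sketch: derive outer_threads from the online CPU count, falling
   back to 1 when (cpus / 2) / inner_threads would be zero.  */
enum { inner_threads = 4 };

static int
compute_outer_threads (long cpus)
{
  if (cpus == -1)
    /* sysconf failed; keep a safe default.  */
    return 1;
  int outer = (int) ((cpus / 2) / inner_threads);
  return outer > 0 ? outer : 1;
}
```

On a 4-core machine this yields 1 instead of 0, which is exactly the case the question above is about.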
>
>>> By the way, I think we should switch to the standard arc4random_uniform
>>> implementation that doesn't try to conserve bits. It's cheaper to get a
>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>> bits.
>>
>> Yeah, it should simpler indeed. But I do not consider this urgent
>> for 2.36.
>
> Well, the standard way probably has more obvious statistical properties
> and is harder to screw up. 8-/
Right, do you consider this a blocker? I think I can send a patch to
use a simpler algorithm.
* Adhemerval Zanella Netto:
>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>> implementation that doesn't try to conserve bits. It's cheaper to get a
>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>> bits.
>>>
>>> Yeah, it should simpler indeed. But I do not consider this urgent
>>> for 2.36.
>>
>> Well, the standard way probably has more obvious statistical properties
>> and is harder to screw up. 8-/
>
> Right, do you consider this a blocker? I think I can send a patch to
> use a simpler algorithm.
I like it because it would minimize risk. It's not a strict blocker.
I'm not sure if I'll be able to work on it this week. Maybe I can
review a patch.
Thanks,
Florian
On 27/07/22 12:36, Florian Weimer wrote:
> * Adhemerval Zanella Netto:
>
>>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>>> implementation that doesn't try to conserve bits. It's cheaper to get a
>>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>>> bits.
>>>>
>>>> Yeah, it should simpler indeed. But I do not consider this urgent
>>>> for 2.36.
>>>
>>> Well, the standard way probably has more obvious statistical properties
>>> and is harder to screw up. 8-/
>>
>> Right, do you consider this a blocker? I think I can send a patch to
>> use a simpler algorithm.
>
> I like it because it would minimize risk. It's not a strict blocker.
> I'm not sure if I'll be able to work on it this week. Maybe I can
> review a patch.
Without trying to be clever, what about something like the below, adapted
from Bitmask with Rejection [1]? It should be fast in most cases since it
avoids the modulo operation, and the last part that tries to reuse the
extra entropy bits is not strictly required.
uint32_t
__arc4random_uniform (uint32_t n)
{
if (n <= 1)
/* There is no valid return value for a zero limit, and 0 is the
only possible result for limit 1. */
return 0;
/* mask is the smallest power of 2 minus 1 which is larger than n. */
int z = __builtin_clz (n);
int bits = CHAR_BIT * sizeof (uint32_t) - z;
uint32_t mask = ~UINT32_C(0) >> z;
while (1)
{
uint32_t value = __arc4random ();
/* Take the low bits; accept the value if it is within the bound. */
uint32_t r = value & mask;
if (r < n)
return r;
/* Otherwise check whether the remaining entropy bits yield a value
within the bound. */
int bits_left = z;
while (bits_left >= bits)
{
value >>= bits;
r = value & mask;
if (r < n)
return r;
bits_left -= bits;
}
}
}
[1] https://www.pcg-random.org/posts/bounded-rands.html
Hi,
On 27/07/2022 at 17:57, Adhemerval Zanella Netto via Libc-alpha wrote:
>
> On 27/07/22 12:36, Florian Weimer wrote:
>> * Adhemerval Zanella Netto:
>>
>>>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>>>> implementation that doesn't try to conserve bits. It's cheaper to get a
>>>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>>>> bits.
>>>>> Yeah, it should simpler indeed. But I do not consider this urgent
>>>>> for 2.36.
>>>> Well, the standard way probably has more obvious statistical properties
>>>> and is harder to screw up. 8-/
>>> Right, do you consider this a blocker? I think I can send a patch to
>>> use a simpler algorithm.
>> I like it because it would minimize risk. It's not a strict blocker.
>> I'm not sure if I'll be able to work on it this week. Maybe I can
>> review a patch.
> Without trying to be clever, what about something like the below, adapted
> from Bitmask with Rejection [1]. It should be fast on most case since it
> avoid the modulo operation, and the last part that tries to reuse the
> extra entropy bits are not strictly required.
>
>
> uint32_t
> __arc4random_uniform (uint32_t n)
> {
> if (n <= 1)
> /* There is no valid return value for a zero limit, and 0 is the
> only possible result for limit 1. */
> return 0;
>
> /* mask is the smallest power of 2 minus 1 which is larger than n. */
> int z = __builtin_clz (n);
For the special case where n is a power of two, I think it should be
__builtin_clz (n - 1).
For example, for n = 8, arc4random_uniform() returns a value up to 7, but
mask would be 0b1111, while 0b111 would have been enough.
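This observation is easy to check directly; the two hypothetical helpers below (not glibc code) compare the masks produced by the two clz arguments:

```c
#include <stdint.h>

/* Mask produced by the proposal as posted: clz of n itself.  */
static uint32_t
mask_clz_n (uint32_t n)
{
  return ~UINT32_C (0) >> __builtin_clz (n);
}

/* Mask with the n - 1 correction: for powers of two it is one bit
   narrower, eliminating the rejection entirely.  */
static uint32_t
mask_clz_n_minus_1 (uint32_t n)
{
  return ~UINT32_C (0) >> __builtin_clz (n - 1);
}
```

For n = 8 the first gives 0xf (four candidate bits, half of them rejected) and the second gives 0x7 (no rejection at all); for non-power-of-two bounds such as n = 10 both agree.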
> int bits = CHAR_BIT * sizeof (uint32_t) - z;
> uint32_t mask = ~UINT32_C(0) >> z;
>
> while (1)
> {
> uint32_t value = __arc4random ();
>
> /* Return is the lower power of 2 minus 1 satisfy the condition. */
> uint32_t r = value & mask;
> if (r < n)
> return r;
>
> /* Otherwise check if remaining bits of entropy provides fits in the
> bound. */
> int bits_left = z;
> while (bits_left >= bits)
> {
> value >>= bits;
> r = value & mask;
> if (r < n)
> return r;
> bits_left -= bits;
> }
> }
> }
>
> [1] https://www.pcg-random.org/posts/bounded-rands.html
Thanks
On 27/07/22 14:10, Yann Droneaud wrote:
> Hi,
>
> On 27/07/2022 at 17:57, Adhemerval Zanella Netto via Libc-alpha wrote:
>>
>> On 27/07/22 12:36, Florian Weimer wrote:
>>> * Adhemerval Zanella Netto:
>>>
>>>>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>>>>> implementation that doesn't try to conserve bits. It's cheaper to get a
>>>>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>>>>> bits.
>>>>>> Yeah, it should simpler indeed. But I do not consider this urgent
>>>>>> for 2.36.
>>>>> Well, the standard way probably has more obvious statistical properties
>>>>> and is harder to screw up. 8-/
>>>> Right, do you consider this a blocker? I think I can send a patch to
>>>> use a simpler algorithm.
>>> I like it because it would minimize risk. It's not a strict blocker.
>>> I'm not sure if I'll be able to work on it this week. Maybe I can
>>> review a patch.
>> Without trying to be clever, what about something like the below, adapted
>> from Bitmask with Rejection [1]. It should be fast on most case since it
>> avoid the modulo operation, and the last part that tries to reuse the
>> extra entropy bits are not strictly required.
>>
>>
>> uint32_t
>> __arc4random_uniform (uint32_t n)
>> {
>> if (n <= 1)
>> /* There is no valid return value for a zero limit, and 0 is the
>> only possible result for limit 1. */
>> return 0;
>>
>> /* mask is the smallest power of 2 minus 1 which is larger than n. */
>> int z = __builtin_clz (n);
>
>
> For the special case where n is a power of two, I think it should be __builtin_clz (n - 1).
>
> For example, for n = 8, arc4random_uniform() returns a value up to 7, but mask would be 0b1111, while 0b111 would have been enough.
Yeah, it makes sense. I will send a patch to use this method.
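For illustration, a sketch of the simplified method with the n - 1 correction applied and the bit-reuse loop dropped. `next_u32` is a stand-in for the internal random source, not a glibc name:

```c
#include <stdint.h>

/* Stand-in for the 32-bit random source; assigned by the caller.  */
static uint32_t (*next_u32) (void);

/* Bitmask with rejection, using clz (n - 1) so that power-of-two
   bounds get a tight mask.  The extra-entropy reuse loop from the
   earlier sketch is omitted.  */
static uint32_t
uniform_sketch (uint32_t n)
{
  if (n <= 1)
    /* No valid result for a zero bound; 0 is the only result for 1.  */
    return 0;
  uint32_t mask = ~UINT32_C (0) >> __builtin_clz (n - 1);
  while (1)
    {
      uint32_t r = next_u32 () & mask;
      if (r < n)
        return r;
    }
}
```

Note that `__builtin_clz` is undefined for a zero argument, which is why the `n <= 1` case must be handled before computing the mask.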
@@ -24,16 +24,19 @@
#include <support/namespace.h>
#include <support/support.h>
#include <support/xthread.h>
+#include <unistd.h>
/* Number of arc4random_buf calls per thread. */
-enum { count_per_thread = 5000 };
+enum { count_per_thread = 2048 };
+
+/* Number of threads in subprocess. */
+enum { fork_threads = 3 };
/* Number of threads computing randomness. */
-enum { inner_threads = 5 };
+enum { inner_threads = 4 };
-/* Number of threads launching other threads. Chosen as to not to
- overload the system. */
-enum { outer_threads = 7 };
+/* Number of threads launching other threads. */
+static int outer_threads = 1;
/* Number of launching rounds performed by the outer threads. */
enum { outer_rounds = 10 };
@@ -331,6 +334,15 @@ do_test_func (const char *fname, void (*func)(unsigned char *, size_t))
static int
do_test (void)
{
+ /* Do not run more threads than the maximum of online CPUs. */
+ long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
+ if (cpus != -1)
+ /* Limit the number to not overload the system. */
+ outer_threads = (cpus / 2) / inner_threads;
+
+ printf ("info: outer_threads=%d inner_threads=%d\n", outer_threads,
+ inner_threads);
+
do_test_func ("arc4random", generate_arc4random);
do_test_func ("arc4random_buf", generate_arc4random_buf);
do_test_func ("arc4random_uniform", generate_arc4random_uniform);