stdlib: Tuned down tst-arc4random-thread internal parameters

Message ID 20220727120600.2003708-1-adhemerval.zanella@linaro.org
State Superseded
Headers
Series stdlib: Tuned down tst-arc4random-thread internal parameters |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Adhemerval Zanella July 27, 2022, 12:06 p.m. UTC
  With new arc4random implementation, the internal parameters might
require a lot of runtime and/or trigger some contention on older
kernels (which might trigger spurious timeout failures).

Also, since we are now testing getrandom entropy instead of an
userspace RNG, there is no much need to extensive testing.

With this change the tst-arc4random-thread goes from about 1m to
5s on a Ryzen 9 with 5.15.0-41-generic.

Checked on x86_64-linux-gnu.
---
 stdlib/tst-arc4random-thread.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)
  

Comments

Florian Weimer July 27, 2022, 12:25 p.m. UTC | #1
* Adhemerval Zanella:

>  static int
>  do_test (void)
>  {
> +  /* Do not run more threads than the maximum of online CPUs.  */
> +  long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
> +  if (cpus != -1)
> +    /* Limit the number to not overload the system.  */
> +    outer_threads = (cpus / 2) / inner_threads;

Doesn't this make outer_threads on many systems?

By the way, I think we should switch to the standard arc4random_uniform
implementation that doesn't try to conserve bits.  It's cheaper to get a
larger number of bits once instead of obtaining 24 bits first, then 8
bits.

Thanks,
Florian
  
Adhemerval Zanella July 27, 2022, 12:29 p.m. UTC | #2
On 27/07/22 09:25, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>>  static int
>>  do_test (void)
>>  {
>> +  /* Do not run more threads than the maximum of online CPUs.  */
>> +  long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
>> +  if (cpus != -1)
>> +    /* Limit the number to not overload the system.  */
>> +    outer_threads = (cpus / 2) / inner_threads;
> 
> Doesn't this make outer_threads on many systems?

I did not quite follow, make outer_threads large do you mean?

> 
> By the way, I think we should switch to the standard arc4random_uniform
> implementation that doesn't try to conserve bits.  It's cheaper to get a
> larger number of bits once instead of obtaining 24 bits first, then 8
> bits.

Yeah, it should simpler indeed.  But I do not consider this urgent
for 2.36.
  
Florian Weimer July 27, 2022, 12:42 p.m. UTC | #3
* Adhemerval Zanella Netto:

> On 27/07/22 09:25, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>>  static int
>>>  do_test (void)
>>>  {
>>> +  /* Do not run more threads than the maximum of online CPUs.  */
>>> +  long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
>>> +  if (cpus != -1)
>>> +    /* Limit the number to not overload the system.  */
>>> +    outer_threads = (cpus / 2) / inner_threads;
>> 
>> Doesn't this make outer_threads on many systems?
>
> I did not quite follow, make outer_threads large do you mean?

Sorry, make outer_threads *zero*.

>> By the way, I think we should switch to the standard arc4random_uniform
>> implementation that doesn't try to conserve bits.  It's cheaper to get a
>> larger number of bits once instead of obtaining 24 bits first, then 8
>> bits.
>
> Yeah, it should simpler indeed.  But I do not consider this urgent
> for 2.36.

Well, the standard way probably has more obvious statistical properties
and is harder to screw up. 8-/

Thanks,
Florian
  
Adhemerval Zanella July 27, 2022, 1 p.m. UTC | #4
On 27/07/22 09:42, Florian Weimer wrote:
> * Adhemerval Zanella Netto:
> 
>> On 27/07/22 09:25, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>>  static int
>>>>  do_test (void)
>>>>  {
>>>> +  /* Do not run more threads than the maximum of online CPUs.  */
>>>> +  long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
>>>> +  if (cpus != -1)
>>>> +    /* Limit the number to not overload the system.  */
>>>> +    outer_threads = (cpus / 2) / inner_threads;
>>>
>>> Doesn't this make outer_threads on many systems?
>>
>> I did not quite follow, make outer_threads large do you mean?
> 
> Sorry, make outer_threads *zero*.

Yeah, it should be "(cpus / 2) / inner_threads ?: 1". I will send
an update.

> 
>>> By the way, I think we should switch to the standard arc4random_uniform
>>> implementation that doesn't try to conserve bits.  It's cheaper to get a
>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>> bits.
>>
>> Yeah, it should simpler indeed.  But I do not consider this urgent
>> for 2.36.
> 
> Well, the standard way probably has more obvious statistical properties
> and is harder to screw up. 8-/

Right, do you consider this a blocker? I think I can send a patch to 
use a simpler algorithm.
  
Florian Weimer July 27, 2022, 3:36 p.m. UTC | #5
* Adhemerval Zanella Netto:

>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>> implementation that doesn't try to conserve bits.  It's cheaper to get a
>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>> bits.
>>>
>>> Yeah, it should simpler indeed.  But I do not consider this urgent
>>> for 2.36.
>> 
>> Well, the standard way probably has more obvious statistical properties
>> and is harder to screw up. 8-/
>
> Right, do you consider this a blocker? I think I can send a patch to 
> use a simpler algorithm.

I like it because it would minimize risk.  It's not a strict blocker.
I'm not sure if I'll be able to work on it this week.  Maybe I can
review a patch.

Thanks,
Florian
  
Adhemerval Zanella July 27, 2022, 3:57 p.m. UTC | #6
On 27/07/22 12:36, Florian Weimer wrote:
> * Adhemerval Zanella Netto:
> 
>>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>>> implementation that doesn't try to conserve bits.  It's cheaper to get a
>>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>>> bits.
>>>>
>>>> Yeah, it should simpler indeed.  But I do not consider this urgent
>>>> for 2.36.
>>>
>>> Well, the standard way probably has more obvious statistical properties
>>> and is harder to screw up. 8-/
>>
>> Right, do you consider this a blocker? I think I can send a patch to 
>> use a simpler algorithm.
> 
> I like it because it would minimize risk.  It's not a strict blocker.
> I'm not sure if I'll be able to work on it this week.  Maybe I can
> review a patch.

Without trying to be clever, what about something like the below, adapted
from Bitmask with Rejection [1].  It should be fast on most case since it
avoid the modulo operation, and the last part that tries to reuse the
extra entropy bits are not strictly required.


uint32_t
__arc4random_uniform (uint32_t n)
{
  if (n <= 1)
    /* There is no valid return value for a zero limit, and 0 is the
       only possible result for limit 1.  */
    return 0;

  /* mask is the smallest power of 2 minus 1 which is larger than n.  */
  int z = __builtin_clz (n);
  int bits = CHAR_BIT * sizeof (uint32_t) - z;
  uint32_t mask = ~UINT32_C(0) >> z;

  while (1)
    {
      uint32_t value = __arc4random ();

      /* Return is the lower power of 2 minus 1 satisfy the condition.  */
      uint32_t r = value & mask;
      if (r < n)
        return r;

      /* Otherwise check if remaining bits of entropy provides fits in the
         bound.  */
      int bits_left = z;
      while (bits_left >= bits)
        {
          value >>= bits;
          r = value & mask;
          if (r < n)
            return r;
          bits_left -= bits;
        }
    }
}

[1] https://www.pcg-random.org/posts/bounded-rands.html
  
Yann Droneaud July 27, 2022, 5:10 p.m. UTC | #7
Hi,

Le 27/07/2022 à 17:57, Adhemerval Zanella Netto via Libc-alpha a écrit :
>
> On 27/07/22 12:36, Florian Weimer wrote:
>> * Adhemerval Zanella Netto:
>>
>>>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>>>> implementation that doesn't try to conserve bits.  It's cheaper to get a
>>>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>>>> bits.
>>>>> Yeah, it should simpler indeed.  But I do not consider this urgent
>>>>> for 2.36.
>>>> Well, the standard way probably has more obvious statistical properties
>>>> and is harder to screw up. 8-/
>>> Right, do you consider this a blocker? I think I can send a patch to
>>> use a simpler algorithm.
>> I like it because it would minimize risk.  It's not a strict blocker.
>> I'm not sure if I'll be able to work on it this week.  Maybe I can
>> review a patch.
> Without trying to be clever, what about something like the below, adapted
> from Bitmask with Rejection [1].  It should be fast on most case since it
> avoid the modulo operation, and the last part that tries to reuse the
> extra entropy bits are not strictly required.
>
>
> uint32_t
> __arc4random_uniform (uint32_t n)
> {
>    if (n <= 1)
>      /* There is no valid return value for a zero limit, and 0 is the
>         only possible result for limit 1.  */
>      return 0;
>
>    /* mask is the smallest power of 2 minus 1 which is larger than n.  */
>    int z = __builtin_clz (n);


For the special case where n is power of two, I think it should be 
__builtin_clz (n - 1).

For example n = 8, arc4random_uniform() returns a value up to 7, but 
mask would be 0x1111, while 0x111 would have been enough.


>    int bits = CHAR_BIT * sizeof (uint32_t) - z;
>    uint32_t mask = ~UINT32_C(0) >> z;
>
>    while (1)
>      {
>        uint32_t value = __arc4random ();
>
>        /* Return is the lower power of 2 minus 1 satisfy the condition.  */
>        uint32_t r = value & mask;
>        if (r < n)
>          return r;
>
>        /* Otherwise check if remaining bits of entropy provides fits in the
>           bound.  */
>        int bits_left = z;
>        while (bits_left >= bits)
>          {
>            value >>= bits;
>            r = value & mask;
>            if (r < n)
>              return r;
>            bits_left -= bits;
>          }
>      }
> }
>
> [1] https://www.pcg-random.org/posts/bounded-rands.html

Thanks
  
Adhemerval Zanella July 27, 2022, 7:18 p.m. UTC | #8
On 27/07/22 14:10, Yann Droneaud wrote:
> Hi,
> 
> Le 27/07/2022 à 17:57, Adhemerval Zanella Netto via Libc-alpha a écrit :
>>
>> On 27/07/22 12:36, Florian Weimer wrote:
>>> * Adhemerval Zanella Netto:
>>>
>>>>>>> By the way, I think we should switch to the standard arc4random_uniform
>>>>>>> implementation that doesn't try to conserve bits.  It's cheaper to get a
>>>>>>> larger number of bits once instead of obtaining 24 bits first, then 8
>>>>>>> bits.
>>>>>> Yeah, it should simpler indeed.  But I do not consider this urgent
>>>>>> for 2.36.
>>>>> Well, the standard way probably has more obvious statistical properties
>>>>> and is harder to screw up. 8-/
>>>> Right, do you consider this a blocker? I think I can send a patch to
>>>> use a simpler algorithm.
>>> I like it because it would minimize risk.  It's not a strict blocker.
>>> I'm not sure if I'll be able to work on it this week.  Maybe I can
>>> review a patch.
>> Without trying to be clever, what about something like the below, adapted
>> from Bitmask with Rejection [1].  It should be fast on most case since it
>> avoid the modulo operation, and the last part that tries to reuse the
>> extra entropy bits are not strictly required.
>>
>>
>> uint32_t
>> __arc4random_uniform (uint32_t n)
>> {
>>    if (n <= 1)
>>      /* There is no valid return value for a zero limit, and 0 is the
>>         only possible result for limit 1.  */
>>      return 0;
>>
>>    /* mask is the smallest power of 2 minus 1 which is larger than n.  */
>>    int z = __builtin_clz (n);
> 
> 
> For the special case where n is power of two, I think it should be __builtin_clz (n - 1).
> 
> For example n = 8, arc4random_uniform() returns a value up to 7, but mask would be 0x1111, while 0x111 would have been enough.

Yeah, it makes sense. I will send a patch to use this method.
  

Patch

diff --git a/stdlib/tst-arc4random-thread.c b/stdlib/tst-arc4random-thread.c
index 3373d4d446..9d25f3526b 100644
--- a/stdlib/tst-arc4random-thread.c
+++ b/stdlib/tst-arc4random-thread.c
@@ -24,16 +24,19 @@ 
 #include <support/namespace.h>
 #include <support/support.h>
 #include <support/xthread.h>
+#include <unistd.h>
 
 /* Number of arc4random_buf calls per thread.  */
-enum { count_per_thread = 5000 };
+enum { count_per_thread = 2048 };
+
+/* Number of threads in subprocess.  */
+enum { fork_threads = 3 };
 
 /* Number of threads computing randomness.  */
-enum { inner_threads = 5 };
+enum { inner_threads = 4 };
 
-/* Number of threads launching other threads.  Chosen as to not to
-   overload the system.  */
-enum { outer_threads = 7 };
+/* Number of threads launching other threads.  */
+static int outer_threads = 1;
 
 /* Number of launching rounds performed by the outer threads.  */
 enum { outer_rounds = 10 };
@@ -331,6 +334,15 @@  do_test_func (const char *fname, void (*func)(unsigned char *, size_t))
 static int
 do_test (void)
 {
+  /* Do not run more threads than the maximum of online CPUs.  */
+  long int cpus = sysconf (_SC_NPROCESSORS_ONLN);
+  if (cpus != -1)
+    /* Limit the number to not overload the system.  */
+    outer_threads = (cpus / 2) / inner_threads;
+
+  printf ("info: outer_threads=%d inner_threads=%d\n", outer_threads,
+	  inner_threads);
+
   do_test_func ("arc4random", generate_arc4random);
   do_test_func ("arc4random_buf", generate_arc4random_buf);
   do_test_func ("arc4random_uniform", generate_arc4random_uniform);