linux: Simplify get_nprocs

Message ID 20210713142632.2813759-1-adhemerval.zanella@linaro.org
State Superseded
Headers
Series linux: Simplify get_nprocs |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit fail Patch caused testsuite regressions

Commit Message

Adhemerval Zanella July 13, 2021, 2:26 p.m. UTC
  This patch simplifies the memory allocation code and uses the sched
routines instead of reimplement it.  This still uses a stack
allocation buffer, so it can be used on malloc initialization code.

Linux currently supports at maximum of 4096 cpus for most architectures:

$ find -iname Kconfig | xargs git grep -A10 -w NR_CPUS | grep -w range
arch/alpha/Kconfig-	range 2 32
arch/arc/Kconfig-	range 2 4096
arch/arm/Kconfig-	range 2 16 if DEBUG_KMAP_LOCAL
arch/arm/Kconfig-	range 2 32 if !DEBUG_KMAP_LOCAL
arch/arm64/Kconfig-	range 2 4096
arch/csky/Kconfig-	range 2 32
arch/hexagon/Kconfig-	range 2 6 if SMP
arch/ia64/Kconfig-	range 2 4096
arch/mips/Kconfig-	range 2 256
arch/openrisc/Kconfig-	range 2 32
arch/parisc/Kconfig-	range 2 32
arch/riscv/Kconfig-	range 2 32
arch/s390/Kconfig-	range 2 512
arch/sh/Kconfig-	range 2 32
arch/sparc/Kconfig-	range 2 32 if SPARC32
arch/sparc/Kconfig-	range 2 4096 if SPARC64
arch/um/Kconfig-	range 1 1
arch/x86/Kconfig-# [NR_CPUS_RANGE_BEGIN ... NR_CPUS_RANGE_END] range.
arch/x86/Kconfig-	range NR_CPUS_RANGE_BEGIN NR_CPUS_RANGE_END
arch/xtensa/Kconfig-	range 2 32

With x86 supporting 8192:

arch/x86/Kconfig
 976 config NR_CPUS_RANGE_END
 977         int
 978         depends on X86_64
 979         default 8192 if  SMP && CPUMASK_OFFSTACK
 980         default  512 if  SMP && !CPUMASK_OFFSTACK
 981         default    1 if !SMP

So using a maximum of 32k cpu should cover all cases (and I would
expect once we start to have many more CPUs that Linux would provide
a more straightforward way to query for such information).

Checked on x86_64-linux-gnu and i686-linux-gnu.
---
 sysdeps/unix/sysv/linux/getsysstats.c | 64 ++++++---------------------
 1 file changed, 14 insertions(+), 50 deletions(-)
  

Comments

Florian Weimer July 13, 2021, 2:38 p.m. UTC | #1
* Adhemerval Zanella:

> This patch simplifies the memory allocation code and uses the sched
> routines instead of reimplement it.  This still uses a stack
> allocation buffer, so it can be used on malloc initialization code.

Could you add a run-time test for that limit?  With the added test, I
think this is fine.

Thanks,
Florian
  
Adhemerval Zanella July 13, 2021, 6:28 p.m. UTC | #2
On 13/07/2021 11:38, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> This patch simplifies the memory allocation code and uses the sched
>> routines instead of reimplement it.  This still uses a stack
>> allocation buffer, so it can be used on malloc initialization code.
> 
> Could you add a run-time test for that limit?  With the added test, I
> think this is fine.

What kind of test?  To check if on a larger system with more than 32k
logical CPU the value is capped?
  
Florian Weimer July 13, 2021, 6:39 p.m. UTC | #3
* Adhemerval Zanella:

> On 13/07/2021 11:38, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> This patch simplifies the memory allocation code and uses the sched
>>> routines instead of reimplement it.  This still uses a stack
>>> allocation buffer, so it can be used on malloc initialization code.
>> 
>> Could you add a run-time test for that limit?  With the added test, I
>> think this is fine.
>
> What kind of test?  To check if on a larger system with more than 32k
> logical CPU the value is capped?

Sorry, no, to check that sched_getaffiny with a 4K buffer succeeds.

Thanks,
Florian
  
Adhemerval Zanella July 13, 2021, 7:03 p.m. UTC | #4
On 13/07/2021 15:39, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 13/07/2021 11:38, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> This patch simplifies the memory allocation code and uses the sched
>>>> routines instead of reimplement it.  This still uses a stack
>>>> allocation buffer, so it can be used on malloc initialization code.
>>>
>>> Could you add a run-time test for that limit?  With the added test, I
>>> think this is fine.
>>
>> What kind of test?  To check if on a larger system with more than 32k
>> logical CPU the value is capped?
> 
> Sorry, no, to check that sched_getaffiny with a 4K buffer succeeds.

Ah right, I will work on it.
  
Adhemerval Zanella July 13, 2021, 7:04 p.m. UTC | #5
On 13/07/2021 16:03, Adhemerval Zanella wrote:
> 
> 
> On 13/07/2021 15:39, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> On 13/07/2021 11:38, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>> This patch simplifies the memory allocation code and uses the sched
>>>>> routines instead of reimplement it.  This still uses a stack
>>>>> allocation buffer, so it can be used on malloc initialization code.
>>>>
>>>> Could you add a run-time test for that limit?  With the added test, I
>>>> think this is fine.
>>>
>>> What kind of test?  To check if on a larger system with more than 32k
>>> logical CPU the value is capped?
>>
>> Sorry, no, to check that sched_getaffiny with a 4K buffer succeeds.
> 
> Ah right, I will work on it.
> 

Although currently we already call get_nprocs on some internal usages
(malloc for instance) and I haven't see any regression with my change.
Do we really need to add an explicit test for it?
  
Florian Weimer July 13, 2021, 7:07 p.m. UTC | #6
* Adhemerval Zanella:

> Although currently we already call get_nprocs on some internal usages
> (malloc for instance) and I haven't see any regression with my change.
> Do we really need to add an explicit test for it?

Yes, we want to know if such large systems arrive.  Think of it as a
reminder for something that might happen twenty years from now.

Thanks,
Florian
  
Adhemerval Zanella July 13, 2021, 7:15 p.m. UTC | #7
On 13/07/2021 16:07, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> Although currently we already call get_nprocs on some internal usages
>> (malloc for instance) and I haven't see any regression with my change.
>> Do we really need to add an explicit test for it?
> 
> Yes, we want to know if such large systems arrive.  Think of it as a
> reminder for something that might happen twenty years from now.

Right, but what exactly would we be exercising that we already do on
malloc initialization, for instance (that would call sched_getaffiny 
with the 4k stack allocated buffer)?
  
Florian Weimer July 13, 2021, 7:17 p.m. UTC | #8
* Adhemerval Zanella:

> On 13/07/2021 16:07, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> Although currently we already call get_nprocs on some internal usages
>>> (malloc for instance) and I haven't see any regression with my change.
>>> Do we really need to add an explicit test for it?
>> 
>> Yes, we want to know if such large systems arrive.  Think of it as a
>> reminder for something that might happen twenty years from now.
>
> Right, but what exactly would we be exercising that we already do on
> malloc initialization, for instance (that would call sched_getaffiny 
> with the 4k stack allocated buffer)?

malloc initialization will still succeed if sched_getaffiny fails, so
we'll never learn about it unless we have a test for it.

Thanks,
Florian
  

Patch

diff --git a/sysdeps/unix/sysv/linux/getsysstats.c b/sysdeps/unix/sysv/linux/getsysstats.c
index 1391e360b8..e752b0a111 100644
--- a/sysdeps/unix/sysv/linux/getsysstats.c
+++ b/sysdeps/unix/sysv/linux/getsysstats.c
@@ -29,61 +29,25 @@ 
 #include <sys/sysinfo.h>
 #include <sysdep.h>
 
-/* Compute the population count of the entire array.  */
-static int
-__get_nprocs_count (const unsigned long int *array, size_t length)
-{
-  int count = 0;
-  for (size_t i = 0; i < length; ++i)
-    if (__builtin_add_overflow (count,  __builtin_popcountl (array[i]),
-				&count))
-      return INT_MAX;
-  return count;
-}
-
-/* __get_nprocs with a large buffer.  */
-static int
-__get_nprocs_large (void)
-{
-  /* This code cannot use scratch_buffer because it is used during
-     malloc initialization.  */
-  size_t pagesize = GLRO (dl_pagesize);
-  unsigned long int *page = __mmap (0, pagesize, PROT_READ | PROT_WRITE,
-				    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
-  if (page == MAP_FAILED)
-    return 2;
-  int r = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0, pagesize, page);
-  int count;
-  if (r > 0)
-    count = __get_nprocs_count (page, pagesize / sizeof (unsigned long int));
-  else if (r == -EINVAL)
-    /* One page is still not enough to store the bits.  A more-or-less
-       arbitrary value.  This assumes t hat such large systems never
-       happen in practice.  */
-    count = GLRO (dl_pagesize) * CHAR_BIT;
-  else
-    count = 2;
-  __munmap (page, GLRO (dl_pagesize));
-  return count;
-}
-
 int
 __get_nprocs (void)
 {
-  /* Fast path for most systems.  The kernel expects a buffer size
-     that is a multiple of 8.  */
-  unsigned long int small_buffer[1024 / CHAR_BIT / sizeof (unsigned long int)];
-  int r = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0,
-				 sizeof (small_buffer), small_buffer);
+  /* This cannot use malloc because it is used on malloc initialization.  */
+  enum { max_num_cpus = 32768 };
+  size_t cpu_bits_size = CPU_ALLOC_SIZE (max_num_cpus);
+  __cpu_mask cpu_bits[cpu_bits_size / sizeof (__cpu_mask)];
+  int r = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0, cpu_bits_size,
+				 cpu_bits);
   if (r > 0)
-    return __get_nprocs_count (small_buffer, r / sizeof (unsigned long int));
+    return CPU_COUNT_S (cpu_bits_size, (cpu_set_t*) cpu_bits);
   else if (r == -EINVAL)
-    /* The kernel requests a larger buffer to store the data.  */
-    return __get_nprocs_large ();
-  else
-    /* Some other error.  2 is conservative (not a uniprocessor
-       system, so atomics are needed). */
-    return 2;
+    /* The input buffer is still not enough to store the number of cpus.  This
+       is an arbitrary values assuming such systems should be rare and there
+       is no offline cpus.  */
+    return max_num_cpus;
+  /* Some other error.  2 is conservative (not a uniprocessor system, so
+     atomics are needed). */
+  return 2;
 }
 libc_hidden_def (__get_nprocs)
 weak_alias (__get_nprocs, get_nprocs)