linux: Simplify get_nprocs
Checks
Commit Message
This patch simplifies the memory allocation code and uses the sched
routines instead of reimplement it. This still uses a stack
allocation buffer, so it can be used on malloc initialization code.
Linux currently supports at maximum of 4096 cpus for most architectures:
$ find -iname Kconfig | xargs git grep -A10 -w NR_CPUS | grep -w range
arch/alpha/Kconfig- range 2 32
arch/arc/Kconfig- range 2 4096
arch/arm/Kconfig- range 2 16 if DEBUG_KMAP_LOCAL
arch/arm/Kconfig- range 2 32 if !DEBUG_KMAP_LOCAL
arch/arm64/Kconfig- range 2 4096
arch/csky/Kconfig- range 2 32
arch/hexagon/Kconfig- range 2 6 if SMP
arch/ia64/Kconfig- range 2 4096
arch/mips/Kconfig- range 2 256
arch/openrisc/Kconfig- range 2 32
arch/parisc/Kconfig- range 2 32
arch/riscv/Kconfig- range 2 32
arch/s390/Kconfig- range 2 512
arch/sh/Kconfig- range 2 32
arch/sparc/Kconfig- range 2 32 if SPARC32
arch/sparc/Kconfig- range 2 4096 if SPARC64
arch/um/Kconfig- range 1 1
arch/x86/Kconfig-# [NR_CPUS_RANGE_BEGIN ... NR_CPUS_RANGE_END] range.
arch/x86/Kconfig- range NR_CPUS_RANGE_BEGIN NR_CPUS_RANGE_END
arch/xtensa/Kconfig- range 2 32
With x86 supporting 8192:
arch/x86/Kconfig
976 config NR_CPUS_RANGE_END
977 int
978 depends on X86_64
979 default 8192 if SMP && CPUMASK_OFFSTACK
980 default 512 if SMP && !CPUMASK_OFFSTACK
981 default 1 if !SMP
So using a maximum of 32k cpu should cover all cases (and I would
expect once we start to have many more CPUs that Linux would provide
a more straightforward way to query for such information).
Checked on x86_64-linux-gnu and i686-linux-gnu.
---
sysdeps/unix/sysv/linux/getsysstats.c | 64 ++++++---------------------
1 file changed, 14 insertions(+), 50 deletions(-)
Comments
* Adhemerval Zanella:
> This patch simplifies the memory allocation code and uses the sched
> routines instead of reimplement it. This still uses a stack
> allocation buffer, so it can be used on malloc initialization code.
Could you add a run-time test for that limit? With the added test, I
think this is fine.
Thanks,
Florian
On 13/07/2021 11:38, Florian Weimer wrote:
> * Adhemerval Zanella:
>
>> This patch simplifies the memory allocation code and uses the sched
>> routines instead of reimplement it. This still uses a stack
>> allocation buffer, so it can be used on malloc initialization code.
>
> Could you add a run-time test for that limit? With the added test, I
> think this is fine.
What kind of test? To check if on a larger system with more than 32k
logical CPU the value is capped?
* Adhemerval Zanella:
> On 13/07/2021 11:38, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> This patch simplifies the memory allocation code and uses the sched
>>> routines instead of reimplement it. This still uses a stack
>>> allocation buffer, so it can be used on malloc initialization code.
>>
>> Could you add a run-time test for that limit? With the added test, I
>> think this is fine.
>
> What kind of test? To check if on a larger system with more than 32k
> logical CPU the value is capped?
Sorry, no, to check that sched_getaffiny with a 4K buffer succeeds.
Thanks,
Florian
On 13/07/2021 15:39, Florian Weimer wrote:
> * Adhemerval Zanella:
>
>> On 13/07/2021 11:38, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> This patch simplifies the memory allocation code and uses the sched
>>>> routines instead of reimplement it. This still uses a stack
>>>> allocation buffer, so it can be used on malloc initialization code.
>>>
>>> Could you add a run-time test for that limit? With the added test, I
>>> think this is fine.
>>
>> What kind of test? To check if on a larger system with more than 32k
>> logical CPU the value is capped?
>
> Sorry, no, to check that sched_getaffiny with a 4K buffer succeeds.
Ah right, I will work on it.
On 13/07/2021 16:03, Adhemerval Zanella wrote:
>
>
> On 13/07/2021 15:39, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> On 13/07/2021 11:38, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>> This patch simplifies the memory allocation code and uses the sched
>>>>> routines instead of reimplement it. This still uses a stack
>>>>> allocation buffer, so it can be used on malloc initialization code.
>>>>
>>>> Could you add a run-time test for that limit? With the added test, I
>>>> think this is fine.
>>>
>>> What kind of test? To check if on a larger system with more than 32k
>>> logical CPU the value is capped?
>>
>> Sorry, no, to check that sched_getaffiny with a 4K buffer succeeds.
>
> Ah right, I will work on it.
>
Although currently we already call get_nprocs on some internal usages
(malloc for instance) and I haven't see any regression with my change.
Do we really need to add an explicit test for it?
* Adhemerval Zanella:
> Although currently we already call get_nprocs on some internal usages
> (malloc for instance) and I haven't see any regression with my change.
> Do we really need to add an explicit test for it?
Yes, we want to know if such large systems arrive. Think of it as a
reminder for something that might happen twenty years from now.
Thanks,
Florian
On 13/07/2021 16:07, Florian Weimer wrote:
> * Adhemerval Zanella:
>
>> Although currently we already call get_nprocs on some internal usages
>> (malloc for instance) and I haven't see any regression with my change.
>> Do we really need to add an explicit test for it?
>
> Yes, we want to know if such large systems arrive. Think of it as a
> reminder for something that might happen twenty years from now.
Right, but what exactly would we be exercising that we already do on
malloc initialization, for instance (that would call sched_getaffiny
with the 4k stack allocated buffer)?
* Adhemerval Zanella:
> On 13/07/2021 16:07, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> Although currently we already call get_nprocs on some internal usages
>>> (malloc for instance) and I haven't see any regression with my change.
>>> Do we really need to add an explicit test for it?
>>
>> Yes, we want to know if such large systems arrive. Think of it as a
>> reminder for something that might happen twenty years from now.
>
> Right, but what exactly would we be exercising that we already do on
> malloc initialization, for instance (that would call sched_getaffiny
> with the 4k stack allocated buffer)?
malloc initialization will still succeed if sched_getaffiny fails, so
we'll never learn about it unless we have a test for it.
Thanks,
Florian
@@ -29,61 +29,25 @@
#include <sys/sysinfo.h>
#include <sysdep.h>
-/* Compute the population count of the entire array. */
-static int
-__get_nprocs_count (const unsigned long int *array, size_t length)
-{
- int count = 0;
- for (size_t i = 0; i < length; ++i)
- if (__builtin_add_overflow (count, __builtin_popcountl (array[i]),
- &count))
- return INT_MAX;
- return count;
-}
-
-/* __get_nprocs with a large buffer. */
-static int
-__get_nprocs_large (void)
-{
- /* This code cannot use scratch_buffer because it is used during
- malloc initialization. */
- size_t pagesize = GLRO (dl_pagesize);
- unsigned long int *page = __mmap (0, pagesize, PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
- if (page == MAP_FAILED)
- return 2;
- int r = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0, pagesize, page);
- int count;
- if (r > 0)
- count = __get_nprocs_count (page, pagesize / sizeof (unsigned long int));
- else if (r == -EINVAL)
- /* One page is still not enough to store the bits. A more-or-less
- arbitrary value. This assumes t hat such large systems never
- happen in practice. */
- count = GLRO (dl_pagesize) * CHAR_BIT;
- else
- count = 2;
- __munmap (page, GLRO (dl_pagesize));
- return count;
-}
-
int
__get_nprocs (void)
{
- /* Fast path for most systems. The kernel expects a buffer size
- that is a multiple of 8. */
- unsigned long int small_buffer[1024 / CHAR_BIT / sizeof (unsigned long int)];
- int r = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0,
- sizeof (small_buffer), small_buffer);
+ /* This cannot use malloc because it is used on malloc initialization. */
+ enum { max_num_cpus = 32768 };
+ size_t cpu_bits_size = CPU_ALLOC_SIZE (max_num_cpus);
+ __cpu_mask cpu_bits[cpu_bits_size / sizeof (__cpu_mask)];
+ int r = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0, cpu_bits_size,
+ cpu_bits);
if (r > 0)
- return __get_nprocs_count (small_buffer, r / sizeof (unsigned long int));
+ return CPU_COUNT_S (cpu_bits_size, (cpu_set_t*) cpu_bits);
else if (r == -EINVAL)
- /* The kernel requests a larger buffer to store the data. */
- return __get_nprocs_large ();
- else
- /* Some other error. 2 is conservative (not a uniprocessor
- system, so atomics are needed). */
- return 2;
+ /* The input buffer is still not enough to store the number of cpus. This
+ is an arbitrary values assuming such systems should be rare and there
+ is no offline cpus. */
+ return max_num_cpus;
+ /* Some other error. 2 is conservative (not a uniprocessor system, so
+ atomics are needed). */
+ return 2;
}
libc_hidden_def (__get_nprocs)
weak_alias (__get_nprocs, get_nprocs)