malloc: Optimize the number of arenas for better application performance
Commit Message
On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
between glibc-2.28 and glibc-2.36.
Comparing the code, I found that two commits cause the performance
degradation:
11a02b035b46 (misc: Add __get_nprocs_sched)
97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
garbage [BZ #28850])
These two patches modify the default behavior.
However, my machine has 96 cores and I have 91 of them bound.
This suggests that the current way of computing the number of arenas is
not optimal, so I roll back some of the code introduced by 11a02b035
(misc: Add __get_nprocs_sched).
---
malloc/arena.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
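For reviewers without the surrounding source at hand, the question is which processor count feeds NARENAS_FROM_NCORES when arena_get2 derives the arena limit. The standalone sketch below contrasts the two counts; the NARENAS_FROM_NCORES definition is reproduced from malloc/arena.c for context, while the rest of the program is illustrative only and is not part of the patch.

/* Contrast the two processor counts that arena_get2 could use to
   derive the arena limit.  Illustrative only.  */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/sysinfo.h>

/* From malloc/arena.c: 2 arenas per core on 32-bit, 8 on 64-bit.  */
#define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))

int
main (void)
{
  /* Processors online in the system: what get_nprocs reports and what
     the proposed patch would feed into the limit.  */
  int online = get_nprocs ();

  /* Processors in the caller's affinity mask: what the affinity-based
     __get_nprocs_sched count corresponds to.  */
  cpu_set_t set;
  int bound = 0;
  if (sched_getaffinity (0, sizeof set, &set) == 0)
    bound = CPU_COUNT (&set);

  printf ("online cores: %d -> arena limit %d\n",
          online, NARENAS_FROM_NCORES (online));
  printf ("bound cores:  %d -> arena limit %d\n",
          bound, NARENAS_FROM_NCORES (bound));
  return 0;
}

On the reporter's configuration the two counts differ only slightly (96 online vs. 91 bound cores), so the computed limit is nearly the same either way; the replies below probe where the behavioral difference actually comes from.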
Comments
* Yang Yanchao via Libc-alpha:
> On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
> between glibc-2.28 and glibc-2.36.
> Comparing the code, I found that two commits cause the performance
> degradation:
> 11a02b035b46 (misc: Add __get_nprocs_sched)
> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
> garbage [BZ #28850])
>
> These two patches modify the default behavior.
> However, my machine has 96 cores and I have 91 of them bound.
> This suggests that the current way of computing the number of arenas is
> not optimal, so I roll back some of the code introduced by 11a02b035
> (misc: Add __get_nprocs_sched).
>
> ---
> malloc/arena.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 0a684a720d..a1ee7928d3 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
> narenas_limit = mp_.arena_max;
> else if (narenas > mp_.arena_test)
> {
> - int n = __get_nprocs_sched ();
> + int n = __get_nprocs ();
>
> if (n >= 1)
> narenas_limit = NARENAS_FROM_NCORES (n);
How many threads does tpcc-mysql create?
I wonder if all threads get their own arena with the larger count, and
there is some arena sharing with the smaller count.
Thanks,
Florian
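One way to answer that empirically (a sketch, not something proposed in the thread) is to dump the allocator's per-arena statistics from inside the benchmark process: glibc's malloc_info writes XML with one <heap> element per arena, so counting those shows how many arenas the worker threads actually created and whether they share them.

/* Dump glibc malloc statistics to stdout.  The XML emitted by
   malloc_info contains one <heap> element per arena, which makes it
   easy to count the arenas a workload really uses.  */
#include <malloc.h>
#include <stdio.h>

int
main (void)
{
  /* In a real experiment this would be triggered from inside the
     application (e.g. from a monitoring thread or at exit) after the
     worker threads have been running for a while.  */
  if (malloc_info (0, stdout) != 0)
    perror ("malloc_info");
  return 0;
}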
On 28/06/2022 16:48, Florian Weimer wrote:
> * Yang Yanchao via Libc-alpha:
>
>> On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
>> between glibc-2.28 and glibc-2.36.
>> Comparing the code, I found that two commits cause the performance
>> degradation:
>> 11a02b035b46 (misc: Add __get_nprocs_sched)
>> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
>> garbage [BZ #28850])
>>
>> These two patches modify the default behavior.
>> However, my machine has 96 cores and I have 91 of them bound.
>> This suggests that the current way of computing the number of arenas is
>> not optimal, so I roll back some of the code introduced by 11a02b035
>> (misc: Add __get_nprocs_sched).
>>
>> ---
>> malloc/arena.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/malloc/arena.c b/malloc/arena.c
>> index 0a684a720d..a1ee7928d3 100644
>> --- a/malloc/arena.c
>> +++ b/malloc/arena.c
>> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>> narenas_limit = mp_.arena_max;
>> else if (narenas > mp_.arena_test)
>> {
>> - int n = __get_nprocs_sched ();
>> + int n = __get_nprocs ();
>>
>> if (n >= 1)
>> narenas_limit = NARENAS_FROM_NCORES (n);
>
> How many threads does tpcc-mysql create?
>
> I wonder if all threads get their own arena with the larger count, and
> there is some arena sharing with the smaller count.
A simple test to determine this could be to repeat the test with
different values of arena_max to see if there's a trend.
Siddhesh
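For anyone running that experiment: the arena limit can be varied per run without rebuilding glibc, either with the GLIBC_TUNABLES=glibc.malloc.arena_max=N environment variable or, in a harness you control, with mallopt before the process goes multi-threaded. The snippet below only illustrates the latter; the value 8 is arbitrary.

/* Pin the arena limit for one experimental run.  Equivalent to setting
   GLIBC_TUNABLES=glibc.malloc.arena_max=8 in the environment, but done
   from inside the process; most predictable when done before any
   threads are created.  */
#include <malloc.h>
#include <stdio.h>

int
main (void)
{
  /* mallopt returns 1 on success and 0 on failure.  */
  if (mallopt (M_ARENA_MAX, 8) != 1)
    fprintf (stderr, "mallopt (M_ARENA_MAX) failed\n");

  /* ... run the benchmark workload here ...  */
  return 0;
}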
> On 28 Jun 2022, at 06:40, Yang Yanchao <yangyanchao6@huawei.com> wrote:
>
> On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2% between glibc-2.28 and glibc-2.36.
> Comparing the code, I found that two commits cause the performance degradation:
> 11a02b035b46 (misc: Add __get_nprocs_sched)
> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with garbage [BZ #28850])
>
> These two patches modify the default behavior.
> However, my machine has 96 cores and I have 91 of them bound.
> This suggests that the current way of computing the number of arenas is not optimal, so I roll back some of the code introduced by 11a02b035 (misc: Add __get_nprocs_sched).
>
> ---
> malloc/arena.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 0a684a720d..a1ee7928d3 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
> narenas_limit = mp_.arena_max;
> else if (narenas > mp_.arena_test)
> {
> - int n = __get_nprocs_sched ();
> + int n = __get_nprocs ();
>
> if (n >= 1)
> narenas_limit = NARENAS_FROM_NCORES (n);
> --
> 2.33.0
In fact, 11a02b035b46 only changed __get_nprocs_sched to call
__get_nprocs; 33099d72e41c was the commit that actually changed
__get_nprocs to use sched_getaffinity.
I think it makes sense to go back to the old behavior, since this change
was motivated mainly by avoiding the malloc call during arena
initialization. __get_nprocs does not call malloc through opendir
anymore, so it should be safe.
Yang Yanchao <yangyanchao6@huawei.com> writes:
> However, my machine has 96 cores and I have 91 of them bound.
One benchmark on one uncommon configuration is not sufficient reason to
change a core tunable. What about other platforms? Other benchmarks?
Other percentages of cores scheduled?
I would reject this patch based solely on the lack of data backing up
your claims.
> - int n = __get_nprocs_sched ();
> + int n = __get_nprocs ();
I've heard complaints about how our code leads to hundreds of arenas on
processes scheduled on only two CPUs. I think using the number of
*schedulable* cores makes more sense than using the number of *unusable*
cores.
I think this change warrants more research.
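(For scale, assuming the 64-bit factor of 8 in NARENAS_FROM_NCORES quoted earlier: a process confined to 2 CPUs is limited to 2 * 8 = 16 arenas under the schedulable count, while the online count on a 96-core host allows 96 * 8 = 768 -- the kind of arena and address-space blow-up the complaints above refer to.)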
> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>
> Yang Yanchao <yangyanchao6@huawei.com> writes:
>> However, my machine has 96 cores and I have 91 of them bound.
>
> One benchmark on one uncommon configuration is not sufficient reason to
> change a core tunable. What about other platforms? Other benchmarks?
> Other percentages of cores scheduled?
>
> I would reject this patch based solely on the lack of data backing up
> your claims.
>
>> - int n = __get_nprocs_sched ();
>> + int n = __get_nprocs ();
>
> I've heard complaints about how our code leads to hundreds of arenas on
> processes scheduled on only two CPUs. I think using the number of
> *schedulable* cores makes more sense than using the number of *unusable*
> cores.
>
> I think this change warrants more research.
I think this patch makes sense, mainly because we changed to using the
schedulable cores without much thought either. Maybe we can revert to
the previous semantics and then investigate whether using the
schedulable number makes more sense.
>> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>>
>> Yang Yanchao <yangyanchao6@huawei.com> writes:
>>> However, my machine has 96 cores and I have 91 of them bound.
>>
>> One benchmark on one uncommon configuration is not sufficient reason to
>> change a core tunable. What about other platforms? Other benchmarks?
>> Other percentages of cores scheduled?
>>
>> I would reject this patch based solely on the lack of data backing up
>> your claims.
>>
>>> - int n = __get_nprocs_sched ();
>>> + int n = __get_nprocs ();
>>
>> I've heard complaints about how our code leads to hundreds of arenas on
>> processes scheduled on only two CPUs. I think using the number of
>> *schedulable* cores makes more sense than using the number of *unusable*
>> cores.
>>
>> I think this change warrants more research.
>
> I think this patch makes sense, mainly because we changed to using the
> schedulable cores without much thought either. Maybe we can revert to
> the previous semantics and then investigate whether using the
> schedulable number makes more sense.
>
Agreed. The variable narenas_limit is initialized only once, so in a
scenario where the CPU affinity is adjusted dynamically,
__get_nprocs_sched is not a good choice. In my opinion, we should first
use __get_nprocs as a static value (the old default behavior) and let
users adjust the arena count via the glibc malloc tunables (e.g.
glibc.malloc.arena_max).
We can also do more research and testing to optimize the default arena
count.
On 29/06/2022 08:07, Qingqing Li via Libc-alpha wrote:
>>> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>>>
>>> Yang Yanchao <yangyanchao6@huawei.com> writes:
>>>> However, my machine has 96 cores and I have 91 of them bound.
>>>
>>> One benchmark on one uncommon configuration is not sufficient reason to
>>> change a core tunable. What about other platforms? Other benchmarks?
>>> Other percentages of cores scheduled?
>>>
>>> I would reject this patch based solely on the lack of data backing up
>>> your claims.
>>>
>>>> - int n = __get_nprocs_sched ();
>>>> + int n = __get_nprocs ();
>>>
>>> I've heard complaints about how our code leads to hundreds of arenas on
>>> processes scheduled on only two CPUs. I think using the number of
>>> *schedulable* cores makes more sense than using the number of *unusable*
>>> cores.
>>>
>>> I think this change warrants more research.
Agreed, but the Huawei system use case appears to suggest that at least
that system prefers the superset, which is a data point against this
change. The original change[1] doesn't appear to provide any extensive
research (as Adhemerval admitted below too), so that seems sufficient
grounds to revert a change that has regressed performance.
Of course, if you or anyone else has data points that strongly suggest
*better* overall performance with __get_nprocs_sched (I am aware of the
complaints about hundreds of arenas and the resulting address space
usage; on many occasions in the past I've been able to resolve them by
showing that the code is nearly 20-30% faster because of that), then
maybe there's scope to explore the possibility of having
architecture-specific defaults.
>> I think this patch makes sense, mainly because we changed to using the
>> schedulable cores without much thought either. Maybe we can revert to
>> the previous semantics and then investigate whether using the
>> schedulable number makes more sense.
>>
> Agreed. The variable narenas_limit is initialized only once, so in a
> scenario where the CPU affinity is adjusted dynamically,
> __get_nprocs_sched is not a good choice. In my opinion, we should first
> use __get_nprocs as a static value (the old default behavior) and let
> users adjust the arena count via the glibc malloc tunables (e.g.
> glibc.malloc.arena_max).
> We can also do more research and testing to optimize the default arena
> count.
I'm not strongly opposed to reverting (see above for more nuance) but
would like us to use this opportunity to at least track a project to
improve the arenas heuristic. Could you or Adhemerval please add a bug
report on sourceware to study the performance impact of arena counts and
their relationship to core counts on different architectures/systems?
Thanks,
Siddhesh
[1]
https://patchwork.sourceware.org/project/glibc/patch/20210907122259.79800-2-adhemerval.zanella@linaro.org/
On 2022/6/29 13:25, Siddhesh Poyarekar wrote:
> I'm not strongly opposed to reverting (see above for more nuance) but
> would like us to use this opportunity to at least track a project to
> improve the arenas heuristic. Could you or Adhemerval please add a bug
> report on sourceware to study the performance impact of arena counts and
> their relationship to core counts on different architectures/systems?
Thanks for the attention paid to this issue. I have created a bug report
on sourceware (https://sourceware.org/bugzilla/show_bug.cgi?id=29296).
Based on openEuler 22.03 LTS, we tested different application scenarios
(Ceph, HBase, HiBench, Spark SQL, Hive SQL). Only tpcc-mysql shows this
performance degradation, because only tpcc-mysql has core binding
configured.
I am now testing different arena counts and their relationship to core
counts on x86_64 and aarch64. The tpcc-mysql tests are slow to run and
will take several days.
@@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
narenas_limit = mp_.arena_max;
else if (narenas > mp_.arena_test)
{
- int n = __get_nprocs_sched ();
+ int n = __get_nprocs ();

if (n >= 1)
narenas_limit = NARENAS_FROM_NCORES (n);