malloc: Optimize the number of arenas for better application performance

Message ID 4ffb458f-9c32-f0e5-8503-44838d87defe@huawei.com
State Failed CI
Series malloc: Optimize the number of arenas for better application performance

Checks

Context Check Description
dj/TryBot-apply_patch fail Patch failed to apply to master at the time it was sent
dj/TryBot-32bit fail Patch series failed to apply

Commit Message

Yang Yanchao June 28, 2022, 9:40 a.m. UTC
  On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
between glibc-2.28 and glibc-2.36.
Comparing the code, I found that two commits cause the performance
degradation:
11a02b035b46 (misc: Add __get_nprocs_sched)
97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
garbage [BZ #28850])

These two patches change the default behavior.
However, my machine has 96 cores and the workload is bound to 91 of them.
This suggests that the current way of computing the arena limit is not optimal.
So I roll back part of the code introduced by 11a02b035b46 (misc: Add
__get_nprocs_sched).

---
  malloc/arena.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
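
For illustration, here is a minimal standalone sketch (using the public
sysconf and sched_getaffinity interfaces, which correspond roughly to
__get_nprocs and __get_nprocs_sched) of how the two CPU counts can differ
on a machine configured like the one above:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* All online CPUs, roughly what __get_nprocs reports.  */
  long online = sysconf (_SC_NPROCESSORS_ONLN);

  /* CPUs this process is allowed to run on, roughly what
     __get_nprocs_sched reports via the affinity mask.  */
  cpu_set_t set;
  long schedulable = -1;
  if (sched_getaffinity (0, sizeof (set), &set) == 0)
    schedulable = CPU_COUNT (&set);

  /* On the machine above: online == 96, schedulable == 91.  */
  printf ("online: %ld, schedulable: %ld\n", online, schedulable);
  return 0;
}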
  

Comments

Florian Weimer June 28, 2022, 11:18 a.m. UTC | #1
* Yang Yanchao via Libc-alpha:

> On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
> between glibc-2.28 and glibc-2.36.
> Comparing the code, I found that two commits cause the performance
> degradation:
> 11a02b035b46 (misc: Add __get_nprocs_sched)
> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
> garbage [BZ #28850])
>
> These two patches change the default behavior.
> However, my machine has 96 cores and the workload is bound to 91 of them.
> This suggests that the current way of computing the arena limit is not optimal.
> So I roll back part of the code introduced by 11a02b035b46 (misc: Add
> __get_nprocs_sched).
>
> ---
>   malloc/arena.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 0a684a720d..a1ee7928d3 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>               narenas_limit = mp_.arena_max;
>             else if (narenas > mp_.arena_test)
>               {
> -              int n = __get_nprocs_sched ();
> +              int n = __get_nprocs ();
>
>                 if (n >= 1)
>                   narenas_limit = NARENAS_FROM_NCORES (n);

How many threads does tpcc-mysql create?

I wonder if all threads get their own arena with the larger count, and
there is some arena sharing with the smaller count.

Thanks,
Florian
  
Siddhesh Poyarekar June 28, 2022, 12:38 p.m. UTC | #2
On 28/06/2022 16:48, Florian Weimer wrote:
> * Yang Yanchao via Libc-alpha:
> 
>> On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
>> between glibc-2.28 and glibc-2.36.
>> Comparing the code, I found that two commits cause the performance
>> degradation:
>> 11a02b035b46 (misc: Add __get_nprocs_sched)
>> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
>> garbage [BZ #28850])
>>
>> These two patches change the default behavior.
>> However, my machine has 96 cores and the workload is bound to 91 of them.
>> This suggests that the current way of computing the arena limit is not optimal.
>> So I roll back part of the code introduced by 11a02b035b46 (misc: Add
>> __get_nprocs_sched).
>>
>> ---
>>    malloc/arena.c | 2 +-
>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/malloc/arena.c b/malloc/arena.c
>> index 0a684a720d..a1ee7928d3 100644
>> --- a/malloc/arena.c
>> +++ b/malloc/arena.c
>> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>>                narenas_limit = mp_.arena_max;
>>              else if (narenas > mp_.arena_test)
>>                {
>> -              int n = __get_nprocs_sched ();
>> +              int n = __get_nprocs ();
>>
>>                  if (n >= 1)
>>                    narenas_limit = NARENAS_FROM_NCORES (n);
> 
> How many threads does tpcc-mysql create?
> 
> I wonder if all threads get their own arena with the larger count, and
> there is some arena sharing with the smaller count.

A simple test to determine this could be to repeat the benchmark with
different values of arena_max and see if there is a trend.
Siddhesh
  
Adhemerval Zanella June 28, 2022, 1:35 p.m. UTC | #3
> On 28 Jun 2022, at 06:40, Yang Yanchao <yangyanchao6@huawei.com> wrote:
> 
> On the Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2% between glibc-2.28 and glibc-2.36.
> Comparing the code, I found that two commits cause the performance degradation:
> 11a02b035b46 (misc: Add __get_nprocs_sched)
> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with garbage [BZ #28850])
> 
> These two patches change the default behavior.
> However, my machine has 96 cores and the workload is bound to 91 of them.
> This suggests that the current way of computing the arena limit is not optimal.
> So I roll back part of the code introduced by 11a02b035b46 (misc: Add __get_nprocs_sched).
> 
> ---
> malloc/arena.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 0a684a720d..a1ee7928d3 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>             narenas_limit = mp_.arena_max;
>           else if (narenas > mp_.arena_test)
>             {
> -              int n = __get_nprocs_sched ();
> +              int n = __get_nprocs ();
> 
>               if (n >= 1)
>                 narenas_limit = NARENAS_FROM_NCORES (n);
> -- 
> 2.33.0

In fact, 11a02b035b46 only changed malloc to call the new __get_nprocs_sched;
33099d72e41c was the commit that actually changed __get_nprocs to use
sched_getaffinity.

I think it makes sense to go back to the old behavior, since this change
was motivated mainly by avoiding the malloc call during arena initialization.
__get_nprocs does not call malloc through opendir anymore, so it should
be safe.
  
DJ Delorie June 28, 2022, 6:56 p.m. UTC | #4
Yang Yanchao <yangyanchao6@huawei.com> writes:
> However, my machine has 96 cores and the workload is bound to 91 of them.

One benchmark on one uncommon configuration is not sufficient reason to
change a core tunable.  What about other platforms?  Other benchmarks?
Other percentages of cores scheduled?

I would reject this patch based solely on the lack of data backing up
your claims.

> -              int n = __get_nprocs_sched ();
> +              int n = __get_nprocs ();

I've heard complaints about how our code leads to hundreds of arenas on
processes scheduled on only two CPUs.  I think using the number of
*schedulable* cores makes more sense than a count that also includes
cores that are *unusable* to the process.

I think this change warrants more research.
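
To make the scale of the difference concrete: assuming the
NARENAS_FROM_NCORES definition currently in malloc/arena.c (8 arenas per
core on 64-bit targets, 2 on 32-bit), a process pinned to 2 of 96 cores
ends up with very different limits depending on which count is used:

#include <stdio.h>

/* Assumed to match the macro in malloc/arena.c.  */
#define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))

int
main (void)
{
  /* 64-bit target, 96-core machine, affinity mask of 2 CPUs.  */
  printf ("limit from all cores:         %d\n", NARENAS_FROM_NCORES (96)); /* 768 */
  printf ("limit from schedulable cores: %d\n", NARENAS_FROM_NCORES (2));  /* 16 */
  return 0;
}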
  
Adhemerval Zanella June 28, 2022, 7:17 p.m. UTC | #5
> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
> 
> Yang Yanchao <yangyanchao6@huawei.com> writes:
>> However, my machine has 96 cores and the workload is bound to 91 of them.
> 
> One benchmark on one uncommon configuration is not sufficient reason to
> change a core tunable.  What about other platforms?  Other benchmarks?
> Other percentages of cores scheduled?
> 
> I would reject this patch based solely on the lack of data backing up
> your claims.
> 
>> -              int n = __get_nprocs_sched ();
>> +              int n = __get_nprocs ();
> 
> I've heard complaints about how our code leads to hundreds of arenas on
> processes scheduled on only two CPUs.  I think using the number of
> *schedulable* cores makes more sense than a count that also includes
> cores that are *unusable* to the process.
> 
> I think this change warrants more research.

I think this patch makes sense, mainly because we changed to use the
schedulable cores without much thought either.  Maybe we can revert
to the previous semantics and then investigate whether using the
schedulable number makes more sense.
  
Qingqing Li June 29, 2022, 2:37 a.m. UTC | #6
>> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>>
>> Yang Yanchao <yangyanchao6@huawei.com> writes:
>>> However, my machine has 96 cores and the workload is bound to 91 of them.
>>
>> One benchmark on one uncommon configuration is not sufficient reason to
>> change a core tunable.  What about other platforms?  Other benchmarks?
>> Other percentages of cores scheduled?
>>
>> I would reject this patch based solely on the lack of data backing up
>> your claims.
>>
>>> -              int n = __get_nprocs_sched ();
>>> +              int n = __get_nprocs ();
>>
>> I've heard complaints about how our code leads to hundreds of arenas on
>> processes scheduled on only two CPUs.  I think using the number of
>> *schedulable* cores makes more sense than a count that also includes
>> cores that are *unusable* to the process.
>>
>> I think this change warrants more research.
> 
> I think this patch makes sense, mainly because we changed to use the
> schedulable cores without much thought either.  Maybe we can revert
> to the previous semantics and then investigate whether using the
> schedulable number makes more sense.
> 
Agreed.  The variable narenas_limit is initialized only once, so in a scenario where
the CPU affinity is adjusted dynamically, __get_nprocs_sched is not a good choice.
In my opinion we should first use __get_nprocs as a static value (the old default
behavior), and let users adjust the arena number via the glibc.malloc.arena_max tunable.
Also, we can do more research and testing to optimize the default arena number.
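
As a small illustration of the dynamic-affinity concern, a sketch using only
public APIs (a limit cached from the first, larger count would not follow the
second value):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int
schedulable_cpus (void)
{
  cpu_set_t set;
  if (sched_getaffinity (0, sizeof (set), &set) != 0)
    return -1;
  return CPU_COUNT (&set);
}

int
main (void)
{
  printf ("before: %d schedulable CPUs\n", schedulable_cpus ());

  /* Restrict ourselves to CPU 0; an arena limit derived from the
     earlier, larger count is now stale.  */
  cpu_set_t one;
  CPU_ZERO (&one);
  CPU_SET (0, &one);
  if (sched_setaffinity (0, sizeof (one), &one) == 0)
    printf ("after:  %d schedulable CPUs\n", schedulable_cpus ());

  return 0;
}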
  
Siddhesh Poyarekar June 29, 2022, 5:25 a.m. UTC | #7
On 29/06/2022 08:07, Qingqing Li via Libc-alpha wrote:
>>> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>>>
>>> Yang Yanchao <yangyanchao6@huawei.com> writes:
>>>> However, my machine has 96 cores and the workload is bound to 91 of them.
>>>
>>> One benchmark on one uncommon configuration is not sufficient reason to
>>> change a core tunable.  What about other platforms?  Other benchmarks?
>>> Other percentages of cores scheduled?
>>>
>>> I would reject this patch based solely on the lack of data backing up
>>> your claims.
>>>
>>>> -              int n = __get_nprocs_sched ();
>>>> +              int n = __get_nprocs ();
>>>
>>> I've heard complaints about how our code leads to hundreds of arenas on
>>> processes scheduled on only two CPUs.  I think using the number of
>>> *schedulable* cores makes more sense than a count that also includes
>>> cores that are *unusable* to the process.
>>>
>>> I think this change warrants more research.

Agreed, but the Huawei system use case appears to suggest that at least
that system prefers the superset, which is a data point against the switch
to __get_nprocs_sched.  The original change[1] doesn't appear to provide any
extensive research (as Adhemerval admitted below too), so that seems
sufficient grounds to revert a change that has regressed performance.

Of course, if you or anyone else has data points that strongly suggest 
*better* overall performance with __get_nprocs_sched (I am aware of 
complaints about hundreds of arenas and address space usage and on many 
occasions in the past I've been able to successfully resolve them by 
showing that their code is nearly 20-30% faster because of that) then 
maybe there's scope to explore the possibility of having 
architecture-specific defaults.

>> I think this patch makes sense, mainly because we changed to use the
>> schedulable cores without much thought either.  Maybe we can revert
>> to the previous semantics and then investigate whether using the
>> schedulable number makes more sense.
>>
> Agreed.  The variable narenas_limit is initialized only once, so in a scenario where
> the CPU affinity is adjusted dynamically, __get_nprocs_sched is not a good choice.
> In my opinion we should first use __get_nprocs as a static value (the old default
> behavior), and let users adjust the arena number via the glibc.malloc.arena_max tunable.
> Also, we can do more research and testing to optimize the default arena number.

I'm not strongly opposed to reverting (see above for more nuance) but 
would like us to use this opportunity to at least track a project to 
improve the arenas heuristic.  Could you or Adhemerval please add a bug 
report on sourceware to study the performance impact of arena counts and 
their relationship to core counts on different architectures/systems?

Thanks,
Siddhesh

[1] 
https://patchwork.sourceware.org/project/glibc/patch/20210907122259.79800-2-adhemerval.zanella@linaro.org/
  
Yang Yanchao June 29, 2022, 8:05 a.m. UTC | #8
On 2022/6/29 13:25, Siddhesh Poyarekar wrote:
> I'm not strongly opposed to reverting (see above for more nuance) but 
> would like us to use this opportunity to at least track a project to 
> improve the arenas heuristic.  Could you or Adhemerval please add a bug 
> report on sourceware to study the performance impact of arena counts and 
> their relationship to core counts on different architectures/systems?

Thanks for the attention paid to this issue.  I created a bug report on
sourceware (https://sourceware.org/bugzilla/show_bug.cgi?id=29296).

Based on openEuler 22.03 LTS, we tested different application
scenarios (Ceph, HBase, HiBench, Spark SQL, Hive SQL).  Only tpcc-mysql
shows this performance degradation, because only tpcc-mysql has core
binding configured.

I'm now testing different arena counts and their relationship to core counts
on x86_64 and aarch64.  The tpcc-mysql tests are slow to run and will
take several days.
  

Patch

diff --git a/malloc/arena.c b/malloc/arena.c
index 0a684a720d..a1ee7928d3 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -937,7 +937,7 @@  arena_get2 (size_t size, mstate avoid_arena)
              narenas_limit = mp_.arena_max;
            else if (narenas > mp_.arena_test)
              {
-              int n = __get_nprocs_sched ();
+              int n = __get_nprocs ();

                if (n >= 1)
                  narenas_limit = NARENAS_FROM_NCORES (n);