[v3,3/5] malloc: Arena is not needed for tcache path in free()

Message ID 20240829062732.1663342-4-wangyang.guo@intel.com (mailing list archive)
State Superseded
Delegated to: Florian Weimer
Headers
Series malloc: TCACHE improvement for free and calloc |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed

Commit Message

Wangyang Guo Aug. 29, 2024, 6:27 a.m. UTC
  Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
defers arena deference to _int_free_chunk() thus accelerate tcache path.
When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().

Result of bench-malloc-thread benchmark

Test Platform: Xeon-8380
Ratio: New / Original time_per_iteration (Lower is Better)

Threads#   | Ratio
-----------|------
1 thread   | 0.994
4 threads  | 0.968

The data shows it can brings 3% performance gain in multi-thread scenario.

---
Changes in v2:
- _int_free_check() should be put outside of USE_TCACHE.
- Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
---
 malloc/malloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
  

Comments

H.J. Lu Nov. 25, 2024, 10:34 a.m. UTC | #1
On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>
> Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
> defers arena deference to _int_free_chunk() thus accelerate tcache path.
> When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
>
> Result of bench-malloc-thread benchmark
>
> Test Platform: Xeon-8380
> Ratio: New / Original time_per_iteration (Lower is Better)
>
> Threads#   | Ratio
> -----------|------
> 1 thread   | 0.994
> 4 threads  | 0.968
>
> The data shows it can brings 3% performance gain in multi-thread scenario.
>
> ---
> Changes in v2:
> - _int_free_check() should be put outside of USE_TCACHE.
> - Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
> ---
>  malloc/malloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 264f35e1a3..efb5292e9f 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>  {
>    mchunkptr next;
>
> +  if (av == NULL)
> +    av = arena_for_chunk (p);
> +
>    do_check_chunk (av, p);
>
>    if (chunk_is_mmapped (p))
> @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>        /* Mark the chunk as belonging to the library again.  */
>        (void)tag_region (chunk2mem (p), memsize (p));
>
> -      ar_ptr = arena_for_chunk (p);
>        INTERNAL_SIZE_T size = chunksize (p);
> -      _int_free_check (ar_ptr, p, size);
> +      /* av is not needed for _int_free_check in non-DEBUG mode,
> +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
> +      _int_free_check (NULL, p, size);

If _int_free_check doesn't need av, can you push it further down?

>
>  #if USE_TCACHE
>        if (tcache_free (p, size))
> @@ -3458,6 +3462,8 @@ __libc_free (void *mem)
>           return;
>         }
>  #endif
> +
> +      ar_ptr = arena_for_chunk (p);
>        _int_free_chunk (ar_ptr, p, size, 0);
>      }
>
> --
> 2.43.0
>
  
H.J. Lu Nov. 25, 2024, 11:58 a.m. UTC | #2
On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> >
> > Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
> > defers arena deference to _int_free_chunk() thus accelerate tcache path.
> > When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
> >
> > Result of bench-malloc-thread benchmark
> >
> > Test Platform: Xeon-8380
> > Ratio: New / Original time_per_iteration (Lower is Better)
> >
> > Threads#   | Ratio
> > -----------|------
> > 1 thread   | 0.994
> > 4 threads  | 0.968
> >
> > The data shows it can brings 3% performance gain in multi-thread scenario.
> >
> > ---
> > Changes in v2:
> > - _int_free_check() should be put outside of USE_TCACHE.
> > - Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
> > ---
> >  malloc/malloc.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/malloc/malloc.c b/malloc/malloc.c
> > index 264f35e1a3..efb5292e9f 100644
> > --- a/malloc/malloc.c
> > +++ b/malloc/malloc.c
> > @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
> >  {
> >    mchunkptr next;
> >
> > +  if (av == NULL)
> > +    av = arena_for_chunk (p);
> > +
> >    do_check_chunk (av, p);
> >
> >    if (chunk_is_mmapped (p))
> > @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
> >        /* Mark the chunk as belonging to the library again.  */
> >        (void)tag_region (chunk2mem (p), memsize (p));
> >
> > -      ar_ptr = arena_for_chunk (p);
> >        INTERNAL_SIZE_T size = chunksize (p);
> > -      _int_free_check (ar_ptr, p, size);
> > +      /* av is not needed for _int_free_check in non-DEBUG mode,
> > +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
> > +      _int_free_check (NULL, p, size);
>
> If _int_free_check doesn't need av, can you push it further down?
>

Something like this?
  
Wangyang Guo Nov. 26, 2024, 2:55 a.m. UTC | #3
On 11/25/2024 7:58 PM, H.J. Lu wrote:

> On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu<hjl.tools@gmail.com> wrote:
>> On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo<wangyang.guo@intel.com> wrote:
>>> Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
>>> defers arena deference to _int_free_chunk() thus accelerate tcache path.
>>> When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
>>>
>>> Result of bench-malloc-thread benchmark
>>>
>>> Test Platform: Xeon-8380
>>> Ratio: New / Original time_per_iteration (Lower is Better)
>>>
>>> Threads#   | Ratio
>>> -----------|------
>>> 1 thread   | 0.994
>>> 4 threads  | 0.968
>>>
>>> The data shows it can brings 3% performance gain in multi-thread scenario.
>>>
>>> ---
>>> Changes in v2:
>>> - _int_free_check() should be put outside of USE_TCACHE.
>>> - Link to v1:https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
>>> ---
>>>   malloc/malloc.c | 10 ++++++++--
>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/malloc/malloc.c b/malloc/malloc.c
>>> index 264f35e1a3..efb5292e9f 100644
>>> --- a/malloc/malloc.c
>>> +++ b/malloc/malloc.c
>>> @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>>>   {
>>>     mchunkptr next;
>>>
>>> +  if (av == NULL)
>>> +    av = arena_for_chunk (p);
>>> +
>>>     do_check_chunk (av, p);
>>>
>>>     if (chunk_is_mmapped (p))
>>> @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>>>         /* Mark the chunk as belonging to the library again.  */
>>>         (void)tag_region (chunk2mem (p), memsize (p));
>>>
>>> -      ar_ptr = arena_for_chunk (p);
>>>         INTERNAL_SIZE_T size = chunksize (p);
>>> -      _int_free_check (ar_ptr, p, size);
>>> +      /* av is not needed for _int_free_check in non-DEBUG mode,
>>> +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
>>> +      _int_free_check (NULL, p, size);
>> If _int_free_check doesn't need av, can you push it further down?
>>
> Something like this?
The problem is _init_free is also used in other places. We can not make 
sure `av` always got from `arena_for_chunk (p)`.
  
H.J. Lu Nov. 26, 2024, 3:23 a.m. UTC | #4
On Tue, Nov 26, 2024, 10:55 AM Guo, Wangyang <wangyang.guo@intel.com> wrote:

> On 11/25/2024 7:58 PM, H.J. Lu wrote:
>
> On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu <hjl.tools@gmail.com> <hjl.tools@gmail.com> wrote:
>
> On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo <wangyang.guo@intel.com> <wangyang.guo@intel.com> wrote:
>
> Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
> defers arena deference to _int_free_chunk() thus accelerate tcache path.
> When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
>
> Result of bench-malloc-thread benchmark
>
> Test Platform: Xeon-8380
> Ratio: New / Original time_per_iteration (Lower is Better)
>
> Threads#   | Ratio
> -----------|------
> 1 thread   | 0.994
> 4 threads  | 0.968
>
> The data shows it can brings 3% performance gain in multi-thread scenario.
>
> ---
> Changes in v2:
> - _int_free_check() should be put outside of USE_TCACHE.
> - Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
> ---
>  malloc/malloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 264f35e1a3..efb5292e9f 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>  {
>    mchunkptr next;
>
> +  if (av == NULL)
> +    av = arena_for_chunk (p);
> +
>    do_check_chunk (av, p);
>
>    if (chunk_is_mmapped (p))
> @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>        /* Mark the chunk as belonging to the library again.  */
>        (void)tag_region (chunk2mem (p), memsize (p));
>
> -      ar_ptr = arena_for_chunk (p);
>        INTERNAL_SIZE_T size = chunksize (p);
> -      _int_free_check (ar_ptr, p, size);
> +      /* av is not needed for _int_free_check in non-DEBUG mode,
> +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
> +      _int_free_check (NULL, p, size);
>
> If _int_free_check doesn't need av, can you push it further down?
>
>
> Something like this?
>
> The problem is _init_free is also used in other places. We can not make
> sure `av` always got from `arena_for_chunk (p)`.
>

Have you looked at my change?
  
Wangyang Guo Nov. 26, 2024, 3:41 a.m. UTC | #5
On 11/26/2024 11:23 AM, H.J. Lu wrote:

>
>
> On Tue, Nov 26, 2024, 10:55 AM Guo, Wangyang <wangyang.guo@intel.com> 
> wrote:
>
>     On 11/25/2024 7:58 PM, H.J. Lu wrote:
>
>>     On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu<hjl.tools@gmail.com> <mailto:hjl.tools@gmail.com> wrote:
>>>     On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo<wangyang.guo@intel.com> <mailto:wangyang.guo@intel.com> wrote:
>>>>     Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
>>>>     defers arena deference to _int_free_chunk() thus accelerate tcache path.
>>>>     When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
>>>>
>>>>     Result of bench-malloc-thread benchmark
>>>>
>>>>     Test Platform: Xeon-8380
>>>>     Ratio: New / Original time_per_iteration (Lower is Better)
>>>>
>>>>     Threads#   | Ratio
>>>>     -----------|------
>>>>     1 thread   | 0.994
>>>>     4 threads  | 0.968
>>>>
>>>>     The data shows it can brings 3% performance gain in multi-thread scenario.
>>>>
>>>>     ---
>>>>     Changes in v2:
>>>>     - _int_free_check() should be put outside of USE_TCACHE.
>>>>     - Link to v1:https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
>>>>     ---
>>>>       malloc/malloc.c | 10 ++++++++--
>>>>       1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>>     diff --git a/malloc/malloc.c b/malloc/malloc.c
>>>>     index 264f35e1a3..efb5292e9f 100644
>>>>     --- a/malloc/malloc.c
>>>>     +++ b/malloc/malloc.c
>>>>     @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>>>>       {
>>>>         mchunkptr next;
>>>>
>>>>     +  if (av == NULL)
>>>>     +    av = arena_for_chunk (p);
>>>>     +
>>>>         do_check_chunk (av, p);
>>>>
>>>>         if (chunk_is_mmapped (p))
>>>>     @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>>>>             /* Mark the chunk as belonging to the library again.  */
>>>>             (void)tag_region (chunk2mem (p), memsize (p));
>>>>
>>>>     -      ar_ptr = arena_for_chunk (p);
>>>>             INTERNAL_SIZE_T size = chunksize (p);
>>>>     -      _int_free_check (ar_ptr, p, size);
>>>>     +      /* av is not needed for _int_free_check in non-DEBUG mode,
>>>>     +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
>>>>     +      _int_free_check (NULL, p, size);
>>>     If _int_free_check doesn't need av, can you push it further down?
>>>
>>     Something like this?
>     The problem is _init_free is also used in other places. We can not
>     make sure `av` always got from `arena_for_chunk (p)`.
>
>
> Have you looked at my change?

Your patch try to remove `av` parameters in _int_free_check:

@@ -4684,7 +4693,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)

    size = chunksize (p);

-  _int_free_check (av, p, size);
+  _int_free_check (p, size);

And fetch `av` by `arena_for_chunk (p)`.

+static inline void +do_check_inuse_chunk_fast (mstate av, mchunkptr p) 
+{ + mstate av = arena_for_chunk (p); + do_check_inuse_chunk (av, p); +}

My point is for _int_free(av, p, have_lock), if `av` is not equals to 
`arena_for_chunk (p)`, then we can not simply omit it from 
`_int_free_check`.
  
H.J. Lu Nov. 26, 2024, 6:20 a.m. UTC | #6
On Tue, Nov 26, 2024, 11:41 AM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>
> On 11/26/2024 11:23 AM, H.J. Lu wrote:
>
>
>
> On Tue, Nov 26, 2024, 10:55 AM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>>
>> On 11/25/2024 7:58 PM, H.J. Lu wrote:
>>
>> On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>>
>> Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
>> defers arena deference to _int_free_chunk() thus accelerate tcache path.
>> When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
>>
>> Result of bench-malloc-thread benchmark
>>
>> Test Platform: Xeon-8380
>> Ratio: New / Original time_per_iteration (Lower is Better)
>>
>> Threads#   | Ratio
>> -----------|------
>> 1 thread   | 0.994
>> 4 threads  | 0.968
>>
>> The data shows it can brings 3% performance gain in multi-thread scenario.
>>
>> ---
>> Changes in v2:
>> - _int_free_check() should be put outside of USE_TCACHE.
>> - Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
>> ---
>>  malloc/malloc.c | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/malloc/malloc.c b/malloc/malloc.c
>> index 264f35e1a3..efb5292e9f 100644
>> --- a/malloc/malloc.c
>> +++ b/malloc/malloc.c
>> @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>>  {
>>    mchunkptr next;
>>
>> +  if (av == NULL)
>> +    av = arena_for_chunk (p);
>> +
>>    do_check_chunk (av, p);
>>
>>    if (chunk_is_mmapped (p))
>> @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>>        /* Mark the chunk as belonging to the library again.  */
>>        (void)tag_region (chunk2mem (p), memsize (p));
>>
>> -      ar_ptr = arena_for_chunk (p);
>>        INTERNAL_SIZE_T size = chunksize (p);
>> -      _int_free_check (ar_ptr, p, size);
>> +      /* av is not needed for _int_free_check in non-DEBUG mode,
>> +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
>> +      _int_free_check (NULL, p, size);
>>
>> If _int_free_check doesn't need av, can you push it further down?
>>
>> Something like this?
>>
>> The problem is _init_free is also used in other places. We can not make sure `av` always got from `arena_for_chunk (p)`.
>
>
> Have you looked at my change?
>
> Your patch try to remove `av` parameters in _int_free_check:
>
> @@ -4684,7 +4693,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>
>    size = chunksize (p);
>
> -  _int_free_check (av, p, size);
> +  _int_free_check (p, size);
>
> And fetch `av` by `arena_for_chunk (p)`.
>
> +static inline void +do_check_inuse_chunk_fast (mstate av, mchunkptr p) +{ + mstate av = arena_for_chunk (p); + do_check_inuse_chunk (av, p); +}
>
> My point is for _int_free(av, p, have_lock), if `av` is not equals to `arena_for_chunk (p)`, then we can not simply omit it from `_int_free_check`.

Is inlining _int_free good enough for performance improvement?

H.J.
  
Wangyang Guo Nov. 26, 2024, 6:27 a.m. UTC | #7
On 11/26/2024 2:20 PM, H.J. Lu wrote:

> On Tue, Nov 26, 2024, 11:41 AM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>> On 11/26/2024 11:23 AM, H.J. Lu wrote:
>>
>>
>>
>> On Tue, Nov 26, 2024, 10:55 AM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>>> On 11/25/2024 7:58 PM, H.J. Lu wrote:
>>>
>>> On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>> On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>>>
>>> Arena is not needed for _int_free_check() in non-DEBUG mode. This commit
>>> defers arena deference to _int_free_chunk() thus accelerate tcache path.
>>> When DEBUG enabled, arena can be obtained from p in do_check_inuse_chunk().
>>>
>>> Result of bench-malloc-thread benchmark
>>>
>>> Test Platform: Xeon-8380
>>> Ratio: New / Original time_per_iteration (Lower is Better)
>>>
>>> Threads#   | Ratio
>>> -----------|------
>>> 1 thread   | 0.994
>>> 4 threads  | 0.968
>>>
>>> The data shows it can brings 3% performance gain in multi-thread scenario.
>>>
>>> ---
>>> Changes in v2:
>>> - _int_free_check() should be put outside of USE_TCACHE.
>>> - Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
>>> ---
>>>   malloc/malloc.c | 10 ++++++++--
>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/malloc/malloc.c b/malloc/malloc.c
>>> index 264f35e1a3..efb5292e9f 100644
>>> --- a/malloc/malloc.c
>>> +++ b/malloc/malloc.c
>>> @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>>>   {
>>>     mchunkptr next;
>>>
>>> +  if (av == NULL)
>>> +    av = arena_for_chunk (p);
>>> +
>>>     do_check_chunk (av, p);
>>>
>>>     if (chunk_is_mmapped (p))
>>> @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>>>         /* Mark the chunk as belonging to the library again.  */
>>>         (void)tag_region (chunk2mem (p), memsize (p));
>>>
>>> -      ar_ptr = arena_for_chunk (p);
>>>         INTERNAL_SIZE_T size = chunksize (p);
>>> -      _int_free_check (ar_ptr, p, size);
>>> +      /* av is not needed for _int_free_check in non-DEBUG mode,
>>> +        in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
>>> +      _int_free_check (NULL, p, size);
>>>
>>> If _int_free_check doesn't need av, can you push it further down?
>>>
>>> Something like this?
>>>
>>> The problem is _init_free is also used in other places. We can not make sure `av` always got from `arena_for_chunk (p)`.
>>
>> Have you looked at my change?
>>
>> Your patch try to remove `av` parameters in _int_free_check:
>>
>> @@ -4684,7 +4693,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>>
>>     size = chunksize (p);
>>
>> -  _int_free_check (av, p, size);
>> +  _int_free_check (p, size);
>>
>> And fetch `av` by `arena_for_chunk (p)`.
>>
>> +static inline void +do_check_inuse_chunk_fast (mstate av, mchunkptr p) +{ + mstate av = arena_for_chunk (p); + do_check_inuse_chunk (av, p); +}
>>
>> My point is for _int_free(av, p, have_lock), if `av` is not equals to `arena_for_chunk (p)`, then we can not simply omit it from `_int_free_check`.
> Is inlining _int_free good enough for performance improvement?
>
> H.J.
The major performance gain comes from inlining. I think that is good 
enough and can drop this patch.
  
Wangyang Guo Nov. 26, 2024, 7:41 a.m. UTC | #8
On 11/26/2024 2:27 PM, Guo, Wangyang wrote:

> On 11/26/2024 2:20 PM, H.J. Lu wrote:
>
>> On Tue, Nov 26, 2024, 11:41 AM Guo, Wangyang <wangyang.guo@intel.com> 
>> wrote:
>>> On 11/26/2024 11:23 AM, H.J. Lu wrote:
>>>
>>>
>>>
>>> On Tue, Nov 26, 2024, 10:55 AM Guo, Wangyang 
>>> <wangyang.guo@intel.com> wrote:
>>>> On 11/25/2024 7:58 PM, H.J. Lu wrote:
>>>>
>>>> On Mon, Nov 25, 2024 at 6:34 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> On Thu, Aug 29, 2024 at 2:31 PM Wangyang Guo 
>>>> <wangyang.guo@intel.com> wrote:
>>>>
>>>> Arena is not needed for _int_free_check() in non-DEBUG mode. This 
>>>> commit
>>>> defers arena deference to _int_free_chunk() thus accelerate tcache 
>>>> path.
>>>> When DEBUG enabled, arena can be obtained from p in 
>>>> do_check_inuse_chunk().
>>>>
>>>> Result of bench-malloc-thread benchmark
>>>>
>>>> Test Platform: Xeon-8380
>>>> Ratio: New / Original time_per_iteration (Lower is Better)
>>>>
>>>> Threads#   | Ratio
>>>> -----------|------
>>>> 1 thread   | 0.994
>>>> 4 threads  | 0.968
>>>>
>>>> The data shows it can brings 3% performance gain in multi-thread 
>>>> scenario.
>>>>
>>>> ---
>>>> Changes in v2:
>>>> - _int_free_check() should be put outside of USE_TCACHE.
>>>> - Link to v1: 
>>>> https://sourceware.org/pipermail/libc-alpha/2024-August/159360.html
>>>> ---
>>>>   malloc/malloc.c | 10 ++++++++--
>>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/malloc/malloc.c b/malloc/malloc.c
>>>> index 264f35e1a3..efb5292e9f 100644
>>>> --- a/malloc/malloc.c
>>>> +++ b/malloc/malloc.c
>>>> @@ -2143,6 +2143,9 @@ do_check_inuse_chunk (mstate av, mchunkptr p)
>>>>   {
>>>>     mchunkptr next;
>>>>
>>>> +  if (av == NULL)
>>>> +    av = arena_for_chunk (p);
>>>> +
>>>>     do_check_chunk (av, p);
>>>>
>>>>     if (chunk_is_mmapped (p))
>>>> @@ -3447,9 +3450,10 @@ __libc_free (void *mem)
>>>>         /* Mark the chunk as belonging to the library again.  */
>>>>         (void)tag_region (chunk2mem (p), memsize (p));
>>>>
>>>> -      ar_ptr = arena_for_chunk (p);
>>>>         INTERNAL_SIZE_T size = chunksize (p);
>>>> -      _int_free_check (ar_ptr, p, size);
>>>> +      /* av is not needed for _int_free_check in non-DEBUG mode,
>>>> +        in DEBUG mode, av will fetch from p in 
>>>> do_check_inuse_chunk.  */
>>>> +      _int_free_check (NULL, p, size);
>>>>
>>>> If _int_free_check doesn't need av, can you push it further down?
>>>>
>>>> Something like this?
>>>>
>>>> The problem is _init_free is also used in other places. We can not 
>>>> make sure `av` always got from `arena_for_chunk (p)`.
>>>
>>> Have you looked at my change?
>>>
>>> Your patch try to remove `av` parameters in _int_free_check:
>>>
>>> @@ -4684,7 +4693,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>>>
>>>     size = chunksize (p);
>>>
>>> -  _int_free_check (av, p, size);
>>> +  _int_free_check (p, size);
>>>
>>> And fetch `av` by `arena_for_chunk (p)`.
>>>
>>> +static inline void +do_check_inuse_chunk_fast (mstate av, mchunkptr 
>>> p) +{ + mstate av = arena_for_chunk (p); + do_check_inuse_chunk (av, 
>>> p); +}
>>>
>>> My point is for _int_free(av, p, have_lock), if `av` is not equals 
>>> to `arena_for_chunk (p)`, then we can not simply omit it from 
>>> `_int_free_check`.
>> Is inlining _int_free good enough for performance improvement?
>>
>> H.J.
> The major performance gain comes from inlining. I think that is good 
> enough and can drop this patch.
New patchset has been sent out.
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 264f35e1a3..efb5292e9f 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2143,6 +2143,9 @@  do_check_inuse_chunk (mstate av, mchunkptr p)
 {
   mchunkptr next;
 
+  if (av == NULL)
+    av = arena_for_chunk (p);
+
   do_check_chunk (av, p);
 
   if (chunk_is_mmapped (p))
@@ -3447,9 +3450,10 @@  __libc_free (void *mem)
       /* Mark the chunk as belonging to the library again.  */
       (void)tag_region (chunk2mem (p), memsize (p));
 
-      ar_ptr = arena_for_chunk (p);
       INTERNAL_SIZE_T size = chunksize (p);
-      _int_free_check (ar_ptr, p, size);
+      /* av is not needed for _int_free_check in non-DEBUG mode,
+	 in DEBUG mode, av will fetch from p in do_check_inuse_chunk.  */
+      _int_free_check (NULL, p, size);
 
 #if USE_TCACHE
       if (tcache_free (p, size))
@@ -3458,6 +3462,8 @@  __libc_free (void *mem)
 	  return;
 	}
 #endif
+
+      ar_ptr = arena_for_chunk (p);
       _int_free_chunk (ar_ptr, p, size, 0);
     }