[updated] malloc per-thread cache ready for review

  Inline comments and updated patch attached...

Florian Weimer <fweimer@redhat.com> writes:
> These type names remind me of Turbo Vision.  I don't think we use the 
> CamelCase convention in glibc. :)

DeCamelCased

(I did a lot of my early programming in TurboC)

> I would like to see some discussion of the per-thread overhead the 
> thread cache introduces, maybe along the documentation of the tunables, 
> both the static overhead (due to new per-thread state) and some bounds 
> on the additional per-thread allocation retention.

I put some math in the tunables docs, and also see
https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html where this
was discussed upstream.
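
For readers without the tunables docs to hand, the two quantities asked about can be estimated roughly as follows.  This is only a sketch: the struct layout, the names, and the assumption that bin i holds chunks of size (minimum chunk + i * alignment) are illustrative and not copied from the patch.  The 64 bins and the fill count of 7 come from the defines quoted elsewhere in this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the quoted defines; everything else is assumed.  */
#define SKETCH_MAX_BINS   64   /* number of cache bins */
#define SKETCH_FILL_COUNT 7    /* chunks a bin may retain */

/* Hypothetical per-thread state: one counter and one list head per bin.  */
struct sketch_entry { struct sketch_entry *next; };
struct sketch_tcache
{
  char counts[SKETCH_MAX_BINS];
  struct sketch_entry *entries[SKETCH_MAX_BINS];
};

/* Static overhead: the per-thread structure itself.  */
static size_t
sketch_static_overhead (void)
{
  return sizeof (struct sketch_tcache);
}

/* Upper bound on memory retained by one thread when every bin is full.
   Assumes bin I caches chunks of MIN_CHUNK + I * ALIGNMENT bytes.  */
static size_t
sketch_max_retained (size_t min_chunk, size_t alignment)
{
  assert (min_chunk > 0 && alignment > 0);
  size_t total = 0;
  for (size_t i = 0; i < SKETCH_MAX_BINS; i++)
    total += SKETCH_FILL_COUNT * (min_chunk + i * alignment);
  return total;
}
```

Under these assumptions, sketch_max_retained (32, 16) comes to 240128 bytes (a bit under 235 KiB) per thread, and the static part is 64 bytes of counters plus 64 pointers.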

"Carlos O'Donell" <carlos@redhat.com> writes:
> - Adding legacy environment variables. Not OK.

Removed.

> - Missing manual documentation for new tcache tunables. Not OK.
> - Missing manual documentation for new probes. Not OK.

Added.

>   The malloc tunables should all be 'security_level: SXID_IGNORE'.

Done.

>   In _int_malloc we have some code duplication where tcache and the normal
>   code both do the same thing e.g. remove a block. Can we avoid some of
>   this duplication with a first refactoring patch?

I pulled out the common parts.  It amounted to only a few lines, but
they're complex enough to warrant it.

> Must be -DUSE_TCACHE=1, please follow -Wundef rules:

Done.

>> +#if USE_TCACHE
>> +/* We want 64 entries.  This is an arbitrary limit, which tunables can reduce.  */
>> +# define MAX_TCACHE_SIZE	(MALLOC_ALIGNMENT * 63)
>
> Why use 63? Is it because 0 is reserved for the empty zero byte set? If so then just 
> say "We want 63 non-zero sized entries."?

I changed it to 64 and tweaked the macros so that all 64 bins are used
(and re-audited them to make sure they were all consistent), which means
the largest cacheable size is just over a power of two now (1032 bytes
on 64-bit and 516 bytes on 32-bit).  Not sure how having this new
power-of-two size included will affect the benchmarks, though -
previously, the limit was just below those powers-of-two.
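
A sketch of index arithmetic that reproduces those two numbers.  The formulas and the parameter values (size-field width, alignment, minimum chunk size) are assumptions chosen to match the 1032 and 516 figures, and all the sk_ names are hypothetical; the macros in the patch may be written differently.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_BINS 64

/* Assumed per-target constants; not taken from the patch.  */
struct sk_params
{
  size_t size_sz;    /* width of the chunk size field */
  size_t alignment;  /* malloc alignment */
  size_t minsize;    /* smallest chunk size */
};

static const struct sk_params sk_lp64  = { 8, 16, 32 };
static const struct sk_params sk_ilp32 = { 4, 8, 16 };

/* Largest usable (user-visible) size served by bin IDX.  */
static size_t
sk_tidx2usize (const struct sk_params *p, size_t idx)
{
  assert (idx < SK_MAX_BINS);
  return idx * p->alignment + p->minsize - p->size_sz;
}

/* Bin index for a chunk of CSIZE bytes (header included), rounding up.  */
static size_t
sk_csize2tidx (const struct sk_params *p, size_t csize)
{
  assert (csize >= p->minsize);
  return (csize - p->minsize + p->alignment - 1) / p->alignment;
}
```

With these values the last bin (index 63) tops out at 63 * 16 + 32 - 8 = 1032 bytes on the 64-bit parameters and 63 * 8 + 16 - 4 = 516 bytes on the 32-bit ones, and bin 0 is the smallest real chunk rather than a reserved empty slot.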

>> +# define TCACHE_IDX		((MAX_TCACHE_SIZE / MALLOC_ALIGNMENT) + 1)
>
> Since this is the maximum index can we call this TCACHE_MAX_IDX? 

It's not a max, it's a size.  I renamed it to TCACHE_MAX_BINS.

>> +# define size2tidx_(bytes)	(((bytes) + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
>
> Why the trailing underscore?

Explanation elided, as I ended up removing it anyway.

>> +/* Rounds up, so...
>> +   idx 0   bytes 0
>> +   idx 1   bytes 1..8
>> +   idx 2   bytes 9..16
>> +   etc.  */
>
> This doesn't look correct for a 64-bit x86_64 system (maybe 32-bit?).
>
> Could you please adjust the comment for 64-bit x86_64 where MALLOC_ALIGNMENT
> should be 16 bytes, and indicate the comment is for that given machine.

Tweaked.

>> +/* This is another arbitrary limit, which tunables can change.  */
>> +# define TCACHE_FILL_COUNT 7
>
> Comment should describe in more detail what a developer should need to
> know about this.

Done.  The tunables docs have more info too.

>> +  /* Maximum number of chunks to remove from the unsorted list, which
>> +     don't match.  */
>
> Which don't match what?

Don't match the size of the cache bin we're trying to fill.  Edited.

>> +typedef struct TCacheEntry {
>> +  struct TCacheEntry *next;
>> +} TCacheEntry;
>
> Needs explaining in detail that these apparently incomplete types are layered
> on top of a chunk.

Done.

>> +/* There is one of these for each thread, which contains the
>> +   per-thread cache (hence "TCache").  Keeping overall size low is
>> +   mildly important.  Note that COUNTS and ENTRIES are redundant, this
>> +   is for performance reasons.  */
>
> Why do you say they are redundant? Because you could in theory have a thread
> local pointer to TCacheEntry and walk that list when a malloc request arrives?
> If so, please expand the comment to cover that.

Yes.  Done.

>> +static void
>> +tcache_put (mchunkptr chunk, size_t tc_idx)
>> +{
>> +  TCacheEntry *e = (TCacheEntry *) chunk2mem (chunk);
>
> Can we assert the safety of tc_idx? assert (tc_idx < TCACHE_IDX); ?

I noted in a comment that it's the caller's responsibility to be safe,
and the code is only ever used inside suitable conditionals for index
and count, but I added asserts anyway.
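
Roughly, the put/get pair with the asserts looks like the sketch below.  It is a simplified stand-in and not the patch text: the sk_ names are hypothetical, the cache is passed explicitly instead of living in thread-local storage, and the functions take the user pointer directly rather than converting from a chunk.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_BINS   64
#define SK_FILL_COUNT 7

/* The entry is overlaid on the user area of a free chunk, so it costs
   no extra memory; only the link pointer is stored there.  */
typedef struct sk_entry
{
  struct sk_entry *next;
} sk_entry;

/* COUNTS duplicates what walking ENTRIES would tell us, but avoids
   the walk on the hot path.  */
typedef struct
{
  char counts[SK_MAX_BINS];
  sk_entry *entries[SK_MAX_BINS];
} sk_cache;

/* Push MEM onto bin IDX.  The caller is still expected to have checked
   the index and the count; the asserts only catch mistakes.  */
static void
sk_put (sk_cache *tc, void *mem, size_t idx)
{
  assert (idx < SK_MAX_BINS);
  assert (tc->counts[idx] < SK_FILL_COUNT);
  sk_entry *e = mem;
  e->next = tc->entries[idx];
  tc->entries[idx] = e;
  ++tc->counts[idx];
}

/* Pop the most recently cached block from bin IDX.  */
static void *
sk_get (sk_cache *tc, size_t idx)
{
  assert (idx < SK_MAX_BINS);
  assert (tc->entries[idx] != NULL);
  sk_entry *e = tc->entries[idx];
  tc->entries[idx] = e->next;
  --tc->counts[idx];
  return e;
}
```

Each bin behaves as a LIFO list, so the block freed last is the one handed out first.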

> Please add a comment here that in a low-memory condition we will have
> victim == NULL and tcache == NULL and keep calling tcache_init
> at every opportunity to eventually get enough memory for the tcache.

Done.  In such a condition, there won't be enough memory to do much of
anything, though...

>> +#if USE_TCACHE
>> +  INTERNAL_SIZE_T tcache_nb = 0;
>> +  size_t tc_idx = csize2tidx (nb);
>> +  if (tcache && tc_idx < mp_.tcache_max)
>> +    tcache_nb = nb;
>> +  int return_cached = 0;
>> +
>> +  tcache_unsorted_count = 0;
>
> We selectively apply this limit to just unsorted, but why not all the
> other places where we iterate filling the cache? Because in the other
> cases we have perfect size matches we know we want?

I noted in the tunables doc that scanning the other lists is bounded,
since the other lists always have known sizes.  The unsorted list may be
unbounded, and the non-matching chunks we find have to be processed and
sorted, which makes that expensive.  Hence, I added a user-definable
limit here just for that list.
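
To illustrate the idea, here is a toy model of that bounded scan.  It is a sketch only: the unsorted list is modelled as a plain array of chunk sizes, the names are hypothetical, and treating a limit of 0 as "no limit" is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

struct sk_scan_result
{
  size_t cached;    /* matching chunks moved into the cache bin */
  size_t sorted;    /* non-matching chunks that had to be sorted */
  size_t examined;  /* total chunks looked at */
};

/* Walk SIZES looking for chunks of size WANT.  Stop once the bin is
   full (FILL_COUNT matches) or once UNSORTED_LIMIT non-matching chunks
   have been processed.  */
static struct sk_scan_result
sk_scan_unsorted (const size_t *sizes, size_t n, size_t want,
                  size_t fill_count, size_t unsorted_limit)
{
  struct sk_scan_result r = { 0, 0, 0 };
  assert (sizes != NULL || n == 0);
  for (size_t i = 0; i < n && r.cached < fill_count; i++)
    {
      r.examined++;
      if (sizes[i] == want)
        r.cached++;
      else
        {
          /* The expensive case: the chunk must be sorted into its
             proper bin before the scan can move on.  */
          r.sorted++;
          if (unsorted_limit > 0 && r.sorted >= unsorted_limit)
            break;
        }
    }
  return r;
}
```

With a limit of 2, a scan over sizes { 48, 32, 64, 32, ... } looking for 32 stops after the third chunk, having cached one match and sorted two others; without a limit it would keep going to the end of the list.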
