[updated] malloc per-thread cache ready for review
Commit Message
> ChangeLog?
Included.
> > + tcache_unsorted_limit {
> > + type: SIZE_T
> > + security_level: SXID_IGNORE
> > + }
>
> Why does this need to be inherited by children of setxid processes? I
> guess the fact that the remaining malloc tunables are inherited is one
> probable reason, but those also need to be reviewed and brought down to
> SXID_ERASE if necessary.
IIRC Florian asked for that setting; I don't know specifically what the
difference is but if you two want to fight it out, I'll get some popcorn ;-)
> This needs to use the new TUNABLE_CALLBACK_FNDECL.
Fixed. My git repo has been munging my merges; it kept reverting that
trunk DL_* patch :-(
> This needs to use the new TUNABLE_GET macro.
Likewise.
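(For reference, the arena.c hunk in the attached patch now uses both,
e.g. for tcache_max:

  TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
  TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));

and the same pattern for tcache_count and tcache_unsorted_limit.)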
> > +/* We overlay this structure on the user-data portion of a chunk when
> > + the chunk is stored in the per-thread cache. */
> > +typedef struct tcache_entry {
>
> Incorrect formatting.
You weren't specific here, so I had to guess which formatting you were
referring to, as the style I used matches code elsewhere in malloc...
> > +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>
> __attribute__ should be on its own line.
Likewise, I was copying the formatting used elsewhere in malloc. What
is the correct formatting for that line? Do I need to fix the other
existing instances also?
> > + for (i = 0; i < TCACHE_MAX_BINS; ++i) {
> > + while (tcache_tmp->entries[i])
> > + {
> > + tcache_entry *e = tcache_tmp->entries[i];
> > + tcache_tmp->entries[i] = e->next;
> > + __libc_free (e);
> > + }
>
> Incorrect formatting.
Not sure what formatting you're referring to here, unless it's the tabs
being messed up by the "> >"...
> Avoid implicit bool conversion for victim.
Fixed.
> Indent 1 space.
Fixed.
> Implicit boolean conversion.
Fixed.
> Unnecessary braces.
Removed, although there are pre-existing cases of that throughout the
code... at least, when the "if" clause is multi-line.
> > + if (value >= 0 && value <= MAX_TCACHE_SIZE)
>
> This validation should be done in tunables by setting the minval and
> maxval values in dl-tunables.list.
I don't see how I can put MAX_TCACHE_SIZE in dl-tunables.list. Should I
hard-code the default value, and also include the test here? Or is
there no other use of min/max in dl-tunables.list?
Revised patch attached.
2017-06-27 DJ Delorie <dj@delorie.com>
* config.make.in: Enable experimental malloc option.
* configure.ac: Likewise.
* configure: Regenerate.
* malloc/Makefile: Likewise.
* malloc/malloc.c: Add per-thread cache (tcache).
(tcache_put): New.
(tcache_get): New.
(tcache_thread_freeres): New.
(tcache_init): New.
(__libc_malloc): Use cached chunks if available.
(__libc_free): Initialize tcache if needed.
(__libc_realloc): Likewise.
(__libc_calloc): Likewise.
(_int_malloc): Prefill tcache when appropriate.
(_int_free): Likewise.
(do_set_tcache_max): New.
(do_set_tcache_count): New.
(do_set_tcache_unsorted_limit): New.
* manual/probes.texi: Document new probes.
* malloc/arena.c: Add new tcache tunables.
* elf/dl-tunables.list: Likewise.
* manual/tunables.texi: Document them.
Comments
On Wednesday 28 June 2017 02:12 AM, DJ Delorie wrote:
>> Why does this need to be inherited by children of setxid processes? I
>> guess the fact that the remaining malloc tunables are inherited is one
>> probable reason, but those also need to be reviewed and brought down to
>> SXID_ERASE if necessary.
>
> IIRC Florian asked for that setting; I don't know specifically what the
> difference is but if you two want to fight it out, I'll get some popcorn ;-)
Haha, interesting, because I remember Florian and I had this discussion
about the utility of SXID_IGNORE six months ago when I designed the
scope control for the envvars. Let's have Florian explain this.
>>> +/* We overlay this structure on the user-data portion of a chunk when
>>> + the chunk is stored in the per-thread cache. */
>>> +typedef struct tcache_entry {
>>
>> Incorrect formatting.
>
> You weren't specific here, so I had to guess which formatting you were
> referring to, as the style I used matches code elsewhere in malloc...
The braces should be on the next line:
typedef struct tcache_entry
{
...
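That is, the full declaration from the patch would read:

  typedef struct tcache_entry
  {
    struct tcache_entry *next;
  } tcache_entry;

and likewise for tcache_perthread_struct.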
>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>
>> __attribute__ should be on its own line.
>
> Likewise, I was copying the formatting used elsewhere in malloc. What
> is the correct formatting for that line? Do I need to fix the other
> existing instances also?
Sure, but in a separate (obvious) patch. The malloc code was in a
completely different format when it was ported in from dlmalloc and
since then it has been a bit of a mishmash that needs a cleanup.
>>> + for (i = 0; i < TCACHE_MAX_BINS; ++i) {
>>> + while (tcache_tmp->entries[i])
>>> + {
>>> + tcache_entry *e = tcache_tmp->entries[i];
>>> + tcache_tmp->entries[i] = e->next;
>>> + __libc_free (e);
>>> + }
>>
>> Incorrect formatting.
>
> Not sure what formatting you're referring to here, unless it's the tabs
> being messed up by the "> >"...
Opening brace for the for loop should be on the next line and everything
else indented accordingly.
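Something like this, with the body itself unchanged:

  for (i = 0; i < TCACHE_MAX_BINS; ++i)
    {
      while (tcache_tmp->entries[i])
        {
          tcache_entry *e = tcache_tmp->entries[i];
          tcache_tmp->entries[i] = e->next;
          __libc_free (e);
        }
    }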
>> This validation should be done in tunables by setting the minval and
>> maxval values in dl-tunables.list.
>
> I don't see how I can put MAX_TCACHE_SIZE in dl-tunables.list. Should I
> hard-code the default value, and also include the test here? Or is
> there no other use of min/max in dl-tunables.list?
If you hardcode the value in dl-tunables.list, you won't need the test,
since __init_tunables will do that validation for you. However, I
noticed that MAX_TCACHE_SIZE is a function of a bunch of other macros,
and unless those are constant across architectures it won't go into
tunables as is. I guess you can leave it as it is and I'll fix it up
later.
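(For illustration, if the limit were a plain constant, it would just be
a matter of adding minval/maxval to the dl-tunables.list entry,
something like:

  tcache_max {
    type: SIZE_T
    maxval: 1032
  }

where 1032 is the 64-bit MAX_TCACHE_SIZE mentioned in the manual text
below; since that value is derived from MALLOC_ALIGNMENT, MINSIZE and
SIZE_SZ, it differs across architectures, which is why it can't simply
be hard-coded.)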
I think this is OK with the couple of formatting fixes above and with
the SXID_IGNORE removed, which we can then discuss and add back if
necessary as an add-on change.
Thanks,
Siddhesh
>
> Revised patch attached.
>
> 2017-06-27 DJ Delorie <dj@delorie.com>
>
> * config.make.in: Enable experimental malloc option.
> * configure.ac: Likewise.
> * configure: Regenerate.
> * malloc/Makefile: Likewise.
> * malloc/malloc.c: Add per-thread cache (tcache).
> (tcache_put): New.
> (tcache_get): New.
> (tcache_thread_freeres): New.
> (tcache_init): New.
> (__libc_malloc): Use cached chunks if available.
> (__libc_free): Initialize tcache if needed.
> (__libc_realloc): Likewise.
> (__libc_calloc): Likewise.
> (_int_malloc): Prefill tcache when appropriate.
> (_int_free): Likewise.
> (do_set_tcache_max): New.
> (do_set_tcache_count): New.
> (do_set_tcache_unsorted_limit): New.
> * manual/probes.texi: Document new probes.
> * malloc/arena.c: Add new tcache tunables.
> * elf/dl-tunables.list: Likewise.
> * manual/tunables.texi: Document them.
>
> diff -x .git -x po -r -U3 -p glibc.pristine/config.make.in glibc.djmalloc-tcache/config.make.in
> --- glibc.pristine/config.make.in 2017-06-20 12:54:22.636484566 -0400
> +++ glibc.djmalloc-tcache/config.make.in 2017-06-26 17:32:41.164888977 -0400
> @@ -78,6 +78,8 @@ multi-arch = @multi_arch@
>
> mach-interface-list = @mach_interface_list@
>
> +experimental-malloc = @experimental_malloc@
> +
> nss-crypt = @libc_cv_nss_crypt@
> static-nss-crypt = @libc_cv_static_nss_crypt@
>
> diff -x .git -x po -r -U3 -p glibc.pristine/configure glibc.djmalloc-tcache/configure
> --- glibc.pristine/configure 2017-06-20 12:54:22.637484566 -0400
> +++ glibc.djmalloc-tcache/configure 2017-06-26 17:32:41.163888976 -0400
> @@ -674,6 +674,7 @@ build_obsolete_nsl
> link_obsolete_rpc
> libc_cv_static_nss_crypt
> libc_cv_nss_crypt
> +experimental_malloc
> enable_werror
> all_warnings
> force_install
> @@ -779,6 +780,7 @@ enable_kernel
> enable_all_warnings
> enable_werror
> enable_multi_arch
> +enable_experimental_malloc
> enable_nss_crypt
> enable_obsolete_rpc
> enable_obsolete_nsl
> @@ -1450,6 +1452,8 @@ Optional Features:
> --disable-werror do not build with -Werror
> --enable-multi-arch enable single DSO with optimizations for multiple
> architectures
> + --disable-experimental-malloc
> + disable experimental malloc features
> --enable-nss-crypt enable libcrypt to use nss
> --enable-obsolete-rpc build and install the obsolete RPC code for
> link-time usage
> @@ -3522,6 +3526,15 @@ else
> fi
>
>
> +# Check whether --enable-experimental-malloc was given.
> +if test "${enable_experimental_malloc+set}" = set; then :
> + enableval=$enable_experimental_malloc; experimental_malloc=$enableval
> +else
> + experimental_malloc=yes
> +fi
> +
> +
> +
> # Check whether --enable-nss-crypt was given.
> if test "${enable_nss_crypt+set}" = set; then :
> enableval=$enable_nss_crypt; nss_crypt=$enableval
> diff -x .git -x po -r -U3 -p glibc.pristine/configure.ac glibc.djmalloc-tcache/configure.ac
> --- glibc.pristine/configure.ac 2017-06-20 12:54:22.638484567 -0400
> +++ glibc.djmalloc-tcache/configure.ac 2017-06-26 17:32:41.155888970 -0400
> @@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
> [multi_arch=$enableval],
> [multi_arch=default])
>
> +AC_ARG_ENABLE([experimental-malloc],
> + AC_HELP_STRING([--disable-experimental-malloc],
> + [disable experimental malloc features]),
> + [experimental_malloc=$enableval],
> + [experimental_malloc=yes])
> +AC_SUBST(experimental_malloc)
> +
> AC_ARG_ENABLE([nss-crypt],
> AC_HELP_STRING([--enable-nss-crypt],
> [enable libcrypt to use nss]),
> diff -x .git -x po -r -U3 -p glibc.pristine/elf/dl-tunables.list glibc.djmalloc-tcache/elf/dl-tunables.list
> --- glibc.pristine/elf/dl-tunables.list 2017-06-23 20:49:10.174148477 -0400
> +++ glibc.djmalloc-tcache/elf/dl-tunables.list 2017-06-26 17:32:41.154888970 -0400
> @@ -76,6 +76,18 @@ glibc {
> minval: 1
> security_level: SXID_IGNORE
> }
> + tcache_max {
> + type: SIZE_T
> + security_level: SXID_IGNORE
> + }
> + tcache_count {
> + type: SIZE_T
> + security_level: SXID_IGNORE
> + }
> + tcache_unsorted_limit {
> + type: SIZE_T
> + security_level: SXID_IGNORE
> + }
> }
> tune {
> hwcap_mask {
> diff -x .git -x po -r -U3 -p glibc.pristine/malloc/Makefile glibc.djmalloc-tcache/malloc/Makefile
> --- glibc.pristine/malloc/Makefile 2017-06-23 20:49:10.199148493 -0400
> +++ glibc.djmalloc-tcache/malloc/Makefile 2017-06-26 17:32:41.153888968 -0400
> @@ -189,6 +189,11 @@ tst-malloc-usable-static-ENV = $(tst-mal
> tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
> tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
>
> +ifeq ($(experimental-malloc),yes)
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=1
> +else
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=0
> +endif
> # Uncomment this for test releases. For public releases it is too expensive.
> #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
>
> diff -x .git -x po -r -U3 -p glibc.pristine/malloc/arena.c glibc.djmalloc-tcache/malloc/arena.c
> --- glibc.pristine/malloc/arena.c 2017-06-20 12:54:22.645484573 -0400
> +++ glibc.djmalloc-tcache/malloc/arena.c 2017-06-26 17:49:33.701608305 -0400
> @@ -236,6 +236,11 @@ TUNABLE_CALLBACK_FNDECL (set_perturb_byt
> TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
> TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
> TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
> +#if USE_TCACHE
> +TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
> +#endif
> #else
> /* Initialization routine. */
> #include <string.h>
> @@ -322,6 +327,12 @@ ptmalloc_init (void)
> TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max));
> TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max));
> TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test));
> +#if USE_TCACHE
> + TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));
> + TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count));
> + TUNABLE_GET (tcache_unsorted_limit, size_t,
> + TUNABLE_CALLBACK (set_tcache_unsorted_limit));
> +#endif
> __libc_lock_unlock (main_arena.mutex);
> #else
> const char *s = NULL;
> diff -x .git -x po -r -U3 -p glibc.pristine/malloc/malloc.c glibc.djmalloc-tcache/malloc/malloc.c
> --- glibc.pristine/malloc/malloc.c 2017-05-03 16:26:47.559839224 -0400
> +++ glibc.djmalloc-tcache/malloc/malloc.c 2017-06-26 17:32:41.151888967 -0400
> @@ -296,6 +296,30 @@ __malloc_assert (const char *assertion,
> }
> #endif
>
> +#if USE_TCACHE
> +/* We want 64 entries. This is an arbitrary limit, which tunables can reduce. */
> +# define TCACHE_MAX_BINS 64
> +# define MAX_TCACHE_SIZE tidx2usize (TCACHE_MAX_BINS-1)
> +
> +/* Only used to pre-fill the tunables. */
> +# define tidx2usize(idx) (((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
> +
> +/* When "x" is from chunksize(). */
> +# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
> +/* When "x" is a user-provided size. */
> +# define usize2tidx(x) csize2tidx (request2size (x))
> +
> +/* With rounding and alignment, the bins are...
> + idx 0 bytes 0..24 (64-bit) or 0..12 (32-bit)
> + idx 1 bytes 25..40 or 13..20
> + idx 2 bytes 41..56 or 21..28
> + etc. */
> +
> +/* This is another arbitrary limit, which tunables can change. Each
> + tcache bin will hold at most this number of chunks. */
> +# define TCACHE_FILL_COUNT 7
> +#endif
> +
>
> /*
> REALLOC_ZERO_BYTES_FREES should be set if a call to
> @@ -1712,6 +1736,17 @@ struct malloc_par
>
> /* First address handed out by MORECORE/sbrk. */
> char *sbrk_base;
> +
> +#if USE_TCACHE
> + /* Maximum number of buckets to use. */
> + size_t tcache_bins;
> + size_t tcache_max_bytes;
> + /* Maximum number of chunks in each bucket. */
> + size_t tcache_count;
> + /* Maximum number of chunks to remove from the unsorted list, which
> + aren't used to prefill the cache. */
> + size_t tcache_unsorted_limit;
> +#endif
> };
>
> /* There are several instances of this struct ("arenas") in this
> @@ -1750,6 +1785,13 @@ static struct malloc_par mp_ =
> .trim_threshold = DEFAULT_TRIM_THRESHOLD,
> #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
> .arena_test = NARENAS_FROM_NCORES (1)
> +#if USE_TCACHE
> + ,
> + .tcache_count = TCACHE_FILL_COUNT,
> + .tcache_bins = TCACHE_MAX_BINS,
> + .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
> + .tcache_unsorted_limit = 0 /* No limit. */
> +#endif
> };
>
> /* Maximum size of memory handled in fastbins. */
> @@ -2875,6 +2917,121 @@ mremap_chunk (mchunkptr p, size_t new_si
>
> /*------------------------ Public wrappers. --------------------------------*/
>
> +#if USE_TCACHE
> +
> +/* We overlay this structure on the user-data portion of a chunk when
> + the chunk is stored in the per-thread cache. */
> +typedef struct tcache_entry {
> + struct tcache_entry *next;
> +} tcache_entry;
> +
> +/* There is one of these for each thread, which contains the
> + per-thread cache (hence "tcache_perthread_struct"). Keeping
> + overall size low is mildly important. Note that COUNTS and ENTRIES
> + are redundant (we could have just counted the linked list each
> + time), this is for performance reasons. */
> +typedef struct tcache_perthread_struct {
> + char counts[TCACHE_MAX_BINS];
> + tcache_entry *entries[TCACHE_MAX_BINS];
> +} tcache_perthread_struct;
> +
> +static __thread char tcache_shutting_down = 0;
> +static __thread tcache_perthread_struct *tcache = NULL;
> +
> +/* Caller must ensure that we know tc_idx is valid and there's room
> + for more chunks. */
> +static void
> +tcache_put (mchunkptr chunk, size_t tc_idx)
> +{
> + tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
> + assert (tc_idx < TCACHE_MAX_BINS);
> + e->next = tcache->entries[tc_idx];
> + tcache->entries[tc_idx] = e;
> + ++(tcache->counts[tc_idx]);
> +}
> +
> +/* Caller must ensure that we know tc_idx is valid and there's
> + available chunks to remove. */
> +static void *
> +tcache_get (size_t tc_idx)
> +{
> + tcache_entry *e = tcache->entries[tc_idx];
> + assert (tc_idx < TCACHE_MAX_BINS);
> + assert (tcache->entries[tc_idx] > 0);
> + tcache->entries[tc_idx] = e->next;
> + --(tcache->counts[tc_idx]);
> + return (void *) e;
> +}
> +
> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
> +tcache_thread_freeres (void)
> +{
> + int i;
> + tcache_perthread_struct *tcache_tmp = tcache;
> +
> + if (!tcache)
> + return;
> +
> + tcache = NULL;
> +
> + for (i = 0; i < TCACHE_MAX_BINS; ++i) {
> + while (tcache_tmp->entries[i])
> + {
> + tcache_entry *e = tcache_tmp->entries[i];
> + tcache_tmp->entries[i] = e->next;
> + __libc_free (e);
> + }
> + }
> +
> + __libc_free (tcache_tmp);
> +
> + tcache_shutting_down = 1;
> +}
> +text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
> +
> +static void
> +tcache_init(void)
> +{
> + mstate ar_ptr;
> + void *victim = 0;
> + const size_t bytes = sizeof (tcache_perthread_struct);
> +
> + if (tcache_shutting_down)
> + return;
> +
> + arena_get (ar_ptr, bytes);
> + victim = _int_malloc (ar_ptr, bytes);
> + if (!victim && ar_ptr != NULL)
> + {
> + ar_ptr = arena_get_retry (ar_ptr, bytes);
> + victim = _int_malloc (ar_ptr, bytes);
> + }
> +
> +
> + if (ar_ptr != NULL)
> + __libc_lock_unlock (ar_ptr->mutex);
> +
> + /* In a low memory situation, we may not be able to allocate memory
> + - in which case, we just keep trying later. However, we
> + typically do this very early, so either there is sufficient
> + memory, or there isn't enough memory to do non-trivial
> + allocations anyway. */
> + if (victim)
> + {
> + tcache = (tcache_perthread_struct *) victim;
> + memset (tcache, 0, sizeof (tcache_perthread_struct));
> + }
> +
> +}
> +
> +#define MAYBE_INIT_TCACHE() \
> + if (__glibc_unlikely (tcache == NULL)) \
> + tcache_init();
> +
> +#else
> +#define MAYBE_INIT_TCACHE()
> +#endif
> +
> void *
> __libc_malloc (size_t bytes)
> {
> @@ -2885,6 +3042,21 @@ __libc_malloc (size_t bytes)
> = atomic_forced_read (__malloc_hook);
> if (__builtin_expect (hook != NULL, 0))
> return (*hook)(bytes, RETURN_ADDRESS (0));
> +#if USE_TCACHE
> + /* int_free also calls request2size, be careful to not pad twice. */
> + size_t tbytes = request2size (bytes);
> + size_t tc_idx = csize2tidx (tbytes);
> +
> + MAYBE_INIT_TCACHE ();
> +
> + if (tc_idx < mp_.tcache_bins
> + && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
> + && tcache
> + && tcache->entries[tc_idx] != NULL)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
>
> arena_get (ar_ptr, bytes);
>
> @@ -2944,6 +3116,8 @@ __libc_free (void *mem)
> return;
> }
>
> + MAYBE_INIT_TCACHE ();
> +
> ar_ptr = arena_for_chunk (p);
> _int_free (ar_ptr, p, 0);
> }
> @@ -2981,7 +3155,10 @@ __libc_realloc (void *oldmem, size_t byt
> if (chunk_is_mmapped (oldp))
> ar_ptr = NULL;
> else
> - ar_ptr = arena_for_chunk (oldp);
> + {
> + MAYBE_INIT_TCACHE ();
> + ar_ptr = arena_for_chunk (oldp);
> + }
>
> /* Little security check which won't hurt performance: the allocator
> never wrapps around at the end of the address space. Therefore
> @@ -3206,6 +3383,8 @@ __libc_calloc (size_t n, size_t elem_siz
>
> sz = bytes;
>
> + MAYBE_INIT_TCACHE ();
> +
> arena_get (av, sz);
> if (av)
> {
> @@ -3336,6 +3515,10 @@ _int_malloc (mstate av, size_t bytes)
> mchunkptr fwd; /* misc temp for linking */
> mchunkptr bck; /* misc temp for linking */
>
> +#if USE_TCACHE
> + size_t tcache_unsorted_count; /* count of unsorted chunks processed */
> +#endif
> +
> const char *errstr = NULL;
>
> /*
> @@ -3365,19 +3548,22 @@ _int_malloc (mstate av, size_t bytes)
> can try it without checking, which saves some time on this fast path.
> */
>
> +#define REMOVE_FB(fb, victim, pp) \
> + do \
> + { \
> + victim = pp; \
> + if (victim == NULL) \
> + break; \
> + } \
> + while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
> + != victim); \
> +
> if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
> {
> idx = fastbin_index (nb);
> mfastbinptr *fb = &fastbin (av, idx);
> mchunkptr pp = *fb;
> - do
> - {
> - victim = pp;
> - if (victim == NULL)
> - break;
> - }
> - while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
> - != victim);
> + REMOVE_FB (fb, victim, pp);
> if (victim != 0)
> {
> if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
> @@ -3388,6 +3574,26 @@ _int_malloc (mstate av, size_t bytes)
> return NULL;
> }
> check_remalloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> + /* While we're here, if we see other chunks of the same size,
> + stash them in the tcache. */
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + {
> + mchunkptr tc_victim;
> +
> + /* While bin not empty and tcache not full, copy chunks over. */
> + while (tcache->counts[tc_idx] < mp_.tcache_count
> + && (pp = *fb) != NULL)
> + {
> + REMOVE_FB (fb, tc_victim, pp);
> + if (tc_victim != 0)
> + {
> + tcache_put (tc_victim, tc_idx);
> + }
> + }
> + }
> +#endif
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> @@ -3426,6 +3632,32 @@ _int_malloc (mstate av, size_t bytes)
> if (av != &main_arena)
> set_non_main_arena (victim);
> check_malloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> + /* While we're here, if we see other chunks of the same size,
> + stash them in the tcache. */
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + {
> + mchunkptr tc_victim;
> +
> + /* While bin not empty and tcache not full, copy chunks over. */
> + while (tcache->counts[tc_idx] < mp_.tcache_count
> + && (tc_victim = last (bin)) != bin)
> + {
> + if (tc_victim != 0)
> + {
> + bck = tc_victim->bk;
> + set_inuse_bit_at_offset (tc_victim, nb);
> + if (av != &main_arena)
> + set_non_main_arena (tc_victim);
> + bin->bk = bck;
> + bck->fd = bin;
> +
> + tcache_put (tc_victim, tc_idx);
> + }
> + }
> + }
> +#endif
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> @@ -3464,6 +3696,16 @@ _int_malloc (mstate av, size_t bytes)
> otherwise need to expand memory to service a "small" request.
> */
>
> +#if USE_TCACHE
> + INTERNAL_SIZE_T tcache_nb = 0;
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + tcache_nb = nb;
> + int return_cached = 0;
> +
> + tcache_unsorted_count = 0;
> +#endif
> +
> for (;; )
> {
> int iters = 0;
> @@ -3524,10 +3766,26 @@ _int_malloc (mstate av, size_t bytes)
> set_inuse_bit_at_offset (victim, size);
> if (av != &main_arena)
> set_non_main_arena (victim);
> +#if USE_TCACHE
> + /* Fill cache first, return to user only if cache fills.
> + We may return one of these chunks later. */
> + if (tcache_nb
> + && tcache->counts[tc_idx] < mp_.tcache_count)
> + {
> + tcache_put (victim, tc_idx);
> + return_cached = 1;
> + continue;
> + }
> + else
> + {
> +#endif
> check_malloced_chunk (av, victim, nb);
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> +#if USE_TCACHE
> + }
> +#endif
> }
>
> /* place chunk in bin */
> @@ -3594,11 +3852,31 @@ _int_malloc (mstate av, size_t bytes)
> fwd->bk = victim;
> bck->fd = victim;
>
> +#if USE_TCACHE
> + /* If we've processed as many chunks as we're allowed while
> + filling the cache, return one of the cached ones. */
> + ++tcache_unsorted_count;
> + if (return_cached
> + && mp_.tcache_unsorted_limit > 0
> + && tcache_unsorted_count > mp_.tcache_unsorted_limit)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
> +
> #define MAX_ITERS 10000
> if (++iters >= MAX_ITERS)
> break;
> }
>
> +#if USE_TCACHE
> + /* If all the small chunks we found ended up cached, return one now. */
> + if (return_cached)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
> +
> /*
> If a large request, scan through the chunks of current bin in
> sorted order to find smallest that fits. Use the skip list for this.
> @@ -3884,6 +4162,20 @@ _int_free (mstate av, mchunkptr p, int h
>
> check_inuse_chunk(av, p);
>
> +#if USE_TCACHE
> + {
> + size_t tc_idx = csize2tidx (size);
> +
> + if (tcache
> + && tc_idx < mp_.tcache_bins
> + && tcache->counts[tc_idx] < mp_.tcache_count)
> + {
> + tcache_put (p, tc_idx);
> + return;
> + }
> + }
> +#endif
> +
> /*
> If eligible, place chunk on a fastbin so it can be found
> and used quickly in malloc.
> @@ -4845,6 +5137,38 @@ do_set_arena_max (size_t value)
> return 1;
> }
>
> +#if USE_TCACHE
> +static inline int
> +__always_inline
> +do_set_tcache_max (size_t value)
> +{
> + if (value >= 0 && value <= MAX_TCACHE_SIZE)
> + {
> + LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
> + mp_.tcache_max_bytes = value;
> + mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
> + }
> + return 1;
> +}
> +
> +static inline int
> +__always_inline
> +do_set_tcache_count (size_t value)
> +{
> + LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
> + mp_.tcache_count = value;
> + return 1;
> +}
> +
> +static inline int
> +__always_inline
> +do_set_tcache_unsorted_limit (size_t value)
> +{
> + LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
> + mp_.tcache_unsorted_limit = value;
> + return 1;
> +}
> +#endif
>
> int
> __libc_mallopt (int param_number, int value)
> diff -x .git -x po -r -U3 -p glibc.pristine/manual/probes.texi glibc.djmalloc-tcache/manual/probes.texi
> --- glibc.pristine/manual/probes.texi 2017-01-10 17:02:06.305343587 -0500
> +++ glibc.djmalloc-tcache/manual/probes.texi 2017-06-26 17:32:35.785884967 -0400
> @@ -231,6 +231,25 @@ dynamic brk/mmap thresholds. Argument @
> the adjusted mmap and trim thresholds, respectively.
> @end deftp
>
> +@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_max}
> +tunable is set. Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_count}
> +tunable is set. Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the
> +@code{glibc.malloc.tcache_unsorted_limit} tunable is set. Argument
> +@var{$arg1} is the requested value, and @var{$arg2} is the previous
> +value of this tunable.
> +@end deftp
> +
> @node Mathematical Function Probes
> @section Mathematical Function Probes
>
> diff -x .git -x po -r -U3 -p glibc.pristine/manual/tunables.texi glibc.djmalloc-tcache/manual/tunables.texi
> --- glibc.pristine/manual/tunables.texi 2017-06-23 20:49:10.201148494 -0400
> +++ glibc.djmalloc-tcache/manual/tunables.texi 2017-06-26 17:35:07.952998383 -0400
> @@ -193,6 +193,39 @@ systems the limit is twice the number of
> is 8 times the number of cores online.
> @end deftp
>
> +@deftp Tunable glibc.malloc.tcache_max
> +The maximum size of a request (in bytes) which may be met via the
> +per-thread cache. The default (and maximum) value is 1032 bytes on
> +64-bit systems and 516 bytes on 32-bit systems.
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_count
> +The maximum number of chunks of each size to cache. The default is 7.
> +There is no upper limit, other than available system memory. Note
> +that chunks are rounded up to malloc's guaranteed alignment - this
> +count is per rounded size, not per user-provided size.
> +
> +The approximate maximum overhead of the per-thread cache (for each
> +thread, of course) is thus @code{glibc.malloc.tcache_max} (in bins,
> +max 64 bins) times @code{glibc.malloc.tcache_count} times the size for
> +each bin. With defaults, this is about 236 KB on 64-bit systems and
> +118 KB on 32-bit systems.
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_unsorted_limit
> +When the user requests memory and the request cannot be met via the
> +per-thread cache, the arenas are used to meet the request. At this
> +time, additional chunks will be moved from existing arena lists to
> +pre-fill the corresponding cache. While copies from the fastbins,
> +smallbins, and regular bins are bounded and predictable due to the bin
> +sizes, copies from the unsorted bin are not bounded, and incur
> +additional time penalties as they need to be sorted as they're
> +scanned. To make scanning the unsorted list more predictable and
> +bounded, the user may set this tunable to limit the number of blocks
> +that are scanned from the unsorted list while searching for chunks to
> +pre-fill the per-thread cache with. The default, or when set to zero,
> +is no limit.
> +
> @node Hardware Capability Tunables
> @section Hardware Capability Tunables
> @cindex hardware capability tunables
>
On 06/28/2017 09:28 AM, Siddhesh Poyarekar wrote:
> On Wednesday 28 June 2017 02:12 AM, DJ Delorie wrote:
>>> Why does this need to be inherited by children of setxid processes? I
>>> guess the fact that the remaining malloc tunables are inherited is one
>>> probable reason, but those also need to be reviewed and brought down to
>>> SXID_ERASE if necessary.
>>
>> IIRC Florian asked for that setting; I don't know specifically what the
>> difference is but if you two want to fight it out, I'll get some popcorn ;-)
>
> Haha, interesting, because I remember Florian and I had this discussion
> about the utility of SXID_IGNORE six months ago when I designed the
> scope control for the envvars. Let's have Florian explain this.
I don't recall discussing this. I think we should use SXID_ERASE for
new tunables, and revisit that decision only when necessary.
Thanks,
Florian