[updated] malloc per-thread cache ready for review
Commit Message
Siddhesh Poyarekar <siddhesh@gotplt.org> writes:
> haha, interesting, because I remember Florian and I had this discussion
> about the utility of SXID_IGNORE six months ago when I designed the
> scope control for the envvars. Let's have Florian explain this.
Changed to SXID_ERASE.
> The braces should be on the next line:
Fixed.
>>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>>
>>> __attribute__ should be on its own line.
>>
>> Likewise, I was copying the formatting used elsewhere in malloc. What
>> is the correct formatting for that line? Do I need to fix the other
>> existing instances also?
>
> Sure, but in a separate (obvious) patch. The malloc code was in a
> completely different format when it was ported in from dlmalloc and
> since then it has been a bit of a mishmash that needs a cleanup.
I scanned all of glibc, and the same-line syntax seems to be nearly
ubiquitous, with only one example of it on a separate line. I couldn't
find any mention of attributes in the GNU or GLIBC coding standards
either...
I'm certainly willing to do a global tweak patch later, but... is this
documented somewhere?
> Opening brace for the for loop should be on the next line and everything
> else indented accordingly.
Fixed.
> However, I noticed that it is a function of a bunch of other macros
> and unless those are constant across architectures it won't go into
> tunables as is.
They're different for 64-bit vs 32-bit at least.
Tweaked patch attached.
Comments
On Thursday 29 June 2017 06:46 AM, DJ Delorie wrote:
> Siddhesh Poyarekar <siddhesh@gotplt.org> writes:
>> haha, interesting, because I remember Florian and I had this discussion
>> about the utility of SXID_IGNORE six months ago when I designed the
>> scope control for the envvars. Let's have Florian explain this.
>
> Changed to SXID_ERASE.
>
>> The braces should be on the next line:
>
> Fixed.
>
>>>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>>>
>>>> __attribute__ should be on its own line.
>>>
>>> Likewise, I was copying the formatting used elsewhere in malloc. What
>>> is the correct formatting for that line? Do I need to fix the other
>>> existing instances also?
>>
>> Sure, but in a separate (obvious) patch. The malloc code was in a
>> completely different format when it was ported in from dlmalloc and
>> since then it has been a bit of a mishmash that needs a cleanup.
>
> I scanned all of glibc, and the same-line syntax seems to be nearly
> ubiquitous, with only one example of it on a separate line. I couldn't
> find any mention of attributes in the GNU or GLIBC coding standards
> either...
>
> I'm certainly willing to do a global tweak patch later, but... is this
> documented somewhere?
I thought it should have been but now that you've forced me to think
harder, my impression may be due to the way macro sections were written
in some math functions.
Looks good to me, please commit.
Thanks,
Siddhesh
>
>> Opening brace for the for loop should be on the next line and everything
>> else indented accordingly.
>
> Fixed.
>
>> However, I noticed that it is a function of a bunch of other macros
>> and unless those are constant across architectures it won't go into
>> tunables as is.
>
> They're different for 64-bit vs 32-bit at least.
>
> Tweaked patch attached.
>
> diff -x .git -x po -rup glibc.pristine/config.make.in glibc.djmalloc-tcache/config.make.in
> --- glibc.pristine/config.make.in 2017-06-20 12:54:22.636484566 -0400
> +++ glibc.djmalloc-tcache/config.make.in 2017-06-26 17:32:41.164888977 -0400
> @@ -78,6 +78,8 @@ multi-arch = @multi_arch@
>
> mach-interface-list = @mach_interface_list@
>
> +experimental-malloc = @experimental_malloc@
> +
> nss-crypt = @libc_cv_nss_crypt@
> static-nss-crypt = @libc_cv_static_nss_crypt@
>
> diff -x .git -x po -rup glibc.pristine/configure glibc.djmalloc-tcache/configure
> --- glibc.pristine/configure 2017-06-20 12:54:22.637484566 -0400
> +++ glibc.djmalloc-tcache/configure 2017-06-26 17:32:41.163888976 -0400
> @@ -674,6 +674,7 @@ build_obsolete_nsl
> link_obsolete_rpc
> libc_cv_static_nss_crypt
> libc_cv_nss_crypt
> +experimental_malloc
> enable_werror
> all_warnings
> force_install
> @@ -779,6 +780,7 @@ enable_kernel
> enable_all_warnings
> enable_werror
> enable_multi_arch
> +enable_experimental_malloc
> enable_nss_crypt
> enable_obsolete_rpc
> enable_obsolete_nsl
> @@ -1450,6 +1452,8 @@ Optional Features:
> --disable-werror do not build with -Werror
> --enable-multi-arch enable single DSO with optimizations for multiple
> architectures
> + --disable-experimental-malloc
> + disable experimental malloc features
> --enable-nss-crypt enable libcrypt to use nss
> --enable-obsolete-rpc build and install the obsolete RPC code for
> link-time usage
> @@ -3522,6 +3526,15 @@ else
> fi
>
>
> +# Check whether --enable-experimental-malloc was given.
> +if test "${enable_experimental_malloc+set}" = set; then :
> + enableval=$enable_experimental_malloc; experimental_malloc=$enableval
> +else
> + experimental_malloc=yes
> +fi
> +
> +
> +
> # Check whether --enable-nss-crypt was given.
> if test "${enable_nss_crypt+set}" = set; then :
> enableval=$enable_nss_crypt; nss_crypt=$enableval
> diff -x .git -x po -rup glibc.pristine/configure.ac glibc.djmalloc-tcache/configure.ac
> --- glibc.pristine/configure.ac 2017-06-20 12:54:22.638484567 -0400
> +++ glibc.djmalloc-tcache/configure.ac 2017-06-26 17:32:41.155888970 -0400
> @@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
> [multi_arch=$enableval],
> [multi_arch=default])
>
> +AC_ARG_ENABLE([experimental-malloc],
> + AC_HELP_STRING([--disable-experimental-malloc],
> + [disable experimental malloc features]),
> + [experimental_malloc=$enableval],
> + [experimental_malloc=yes])
> +AC_SUBST(experimental_malloc)
> +
> AC_ARG_ENABLE([nss-crypt],
> AC_HELP_STRING([--enable-nss-crypt],
> [enable libcrypt to use nss]),
> diff -x .git -x po -rup glibc.pristine/elf/dl-tunables.list glibc.djmalloc-tcache/elf/dl-tunables.list
> --- glibc.pristine/elf/dl-tunables.list 2017-06-23 20:49:10.174148477 -0400
> +++ glibc.djmalloc-tcache/elf/dl-tunables.list 2017-06-28 21:01:06.748079482 -0400
> @@ -76,6 +76,18 @@ glibc {
> minval: 1
> security_level: SXID_IGNORE
> }
> + tcache_max {
> + type: SIZE_T
> + security_level: SXID_ERASE
> + }
> + tcache_count {
> + type: SIZE_T
> + security_level: SXID_ERASE
> + }
> + tcache_unsorted_limit {
> + type: SIZE_T
> + security_level: SXID_ERASE
> + }
> }
> tune {
> hwcap_mask {
> diff -x .git -x po -rup glibc.pristine/malloc/Makefile glibc.djmalloc-tcache/malloc/Makefile
> --- glibc.pristine/malloc/Makefile 2017-06-23 20:49:10.199148493 -0400
> +++ glibc.djmalloc-tcache/malloc/Makefile 2017-06-26 17:32:41.153888968 -0400
> @@ -189,6 +189,11 @@ tst-malloc-usable-static-ENV = $(tst-mal
> tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
> tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
>
> +ifeq ($(experimental-malloc),yes)
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=1
> +else
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=0
> +endif
> # Uncomment this for test releases. For public releases it is too expensive.
> #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
>
> diff -x .git -x po -rup glibc.pristine/malloc/arena.c glibc.djmalloc-tcache/malloc/arena.c
> --- glibc.pristine/malloc/arena.c 2017-06-20 12:54:22.645484573 -0400
> +++ glibc.djmalloc-tcache/malloc/arena.c 2017-06-26 17:49:33.701608305 -0400
> @@ -236,6 +236,11 @@ TUNABLE_CALLBACK_FNDECL (set_perturb_byt
> TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
> TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
> TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
> +#if USE_TCACHE
> +TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
> +#endif
> #else
> /* Initialization routine. */
> #include <string.h>
> @@ -322,6 +327,12 @@ ptmalloc_init (void)
> TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max));
> TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max));
> TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test));
> +#if USE_TCACHE
> + TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));
> + TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count));
> + TUNABLE_GET (tcache_unsorted_limit, size_t,
> + TUNABLE_CALLBACK (set_tcache_unsorted_limit));
> +#endif
> __libc_lock_unlock (main_arena.mutex);
> #else
> const char *s = NULL;
> diff -x .git -x po -rup glibc.pristine/malloc/malloc.c glibc.djmalloc-tcache/malloc/malloc.c
> --- glibc.pristine/malloc/malloc.c 2017-05-03 16:26:47.559839224 -0400
> +++ glibc.djmalloc-tcache/malloc/malloc.c 2017-06-28 21:02:22.059151815 -0400
> @@ -296,6 +296,30 @@ __malloc_assert (const char *assertion,
> }
> #endif
>
> +#if USE_TCACHE
> +/* We want 64 entries. This is an arbitrary limit, which tunables can reduce. */
> +# define TCACHE_MAX_BINS 64
> +# define MAX_TCACHE_SIZE tidx2usize (TCACHE_MAX_BINS-1)
> +
> +/* Only used to pre-fill the tunables. */
> +# define tidx2usize(idx) (((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
> +
> +/* When "x" is from chunksize(). */
> +# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
> +/* When "x" is a user-provided size. */
> +# define usize2tidx(x) csize2tidx (request2size (x))
> +
> +/* With rounding and alignment, the bins are...
> + idx 0 bytes 0..24 (64-bit) or 0..12 (32-bit)
> + idx 1 bytes 25..40 or 13..20
> + idx 2 bytes 41..56 or 21..28
> + etc. */
> +
> +/* This is another arbitrary limit, which tunables can change. Each
> + tcache bin will hold at most this number of chunks. */
> +# define TCACHE_FILL_COUNT 7
> +#endif
> +
>
> /*
> REALLOC_ZERO_BYTES_FREES should be set if a call to
> @@ -1712,6 +1736,17 @@ struct malloc_par
>
> /* First address handed out by MORECORE/sbrk. */
> char *sbrk_base;
> +
> +#if USE_TCACHE
> + /* Maximum number of buckets to use. */
> + size_t tcache_bins;
> + size_t tcache_max_bytes;
> + /* Maximum number of chunks in each bucket. */
> + size_t tcache_count;
> + /* Maximum number of chunks to remove from the unsorted list, which
> + aren't used to prefill the cache. */
> + size_t tcache_unsorted_limit;
> +#endif
> };
>
> /* There are several instances of this struct ("arenas") in this
> @@ -1750,6 +1785,13 @@ static struct malloc_par mp_ =
> .trim_threshold = DEFAULT_TRIM_THRESHOLD,
> #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
> .arena_test = NARENAS_FROM_NCORES (1)
> +#if USE_TCACHE
> + ,
> + .tcache_count = TCACHE_FILL_COUNT,
> + .tcache_bins = TCACHE_MAX_BINS,
> + .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
> + .tcache_unsorted_limit = 0 /* No limit. */
> +#endif
> };
>
> /* Maximum size of memory handled in fastbins. */
> @@ -2875,6 +2917,124 @@ mremap_chunk (mchunkptr p, size_t new_si
>
> /*------------------------ Public wrappers. --------------------------------*/
>
> +#if USE_TCACHE
> +
> +/* We overlay this structure on the user-data portion of a chunk when
> + the chunk is stored in the per-thread cache. */
> +typedef struct tcache_entry
> +{
> + struct tcache_entry *next;
> +} tcache_entry;
> +
> +/* There is one of these for each thread, which contains the
> + per-thread cache (hence "tcache_perthread_struct"). Keeping
> + overall size low is mildly important. Note that COUNTS and ENTRIES
> + are redundant (we could have just counted the linked list each
> + time), this is for performance reasons. */
> +typedef struct tcache_perthread_struct
> +{
> + char counts[TCACHE_MAX_BINS];
> + tcache_entry *entries[TCACHE_MAX_BINS];
> +} tcache_perthread_struct;
> +
> +static __thread char tcache_shutting_down = 0;
> +static __thread tcache_perthread_struct *tcache = NULL;
> +
> +/* Caller must ensure that we know tc_idx is valid and there's room
> + for more chunks. */
> +static void
> +tcache_put (mchunkptr chunk, size_t tc_idx)
> +{
> + tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
> + assert (tc_idx < TCACHE_MAX_BINS);
> + e->next = tcache->entries[tc_idx];
> + tcache->entries[tc_idx] = e;
> + ++(tcache->counts[tc_idx]);
> +}
> +
> +/* Caller must ensure that we know tc_idx is valid and there's
> + available chunks to remove. */
> +static void *
> +tcache_get (size_t tc_idx)
> +{
> + tcache_entry *e = tcache->entries[tc_idx];
> + assert (tc_idx < TCACHE_MAX_BINS);
> + assert (tcache->entries[tc_idx] > 0);
> + tcache->entries[tc_idx] = e->next;
> + --(tcache->counts[tc_idx]);
> + return (void *) e;
> +}
> +
> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
> +tcache_thread_freeres (void)
> +{
> + int i;
> + tcache_perthread_struct *tcache_tmp = tcache;
> +
> + if (!tcache)
> + return;
> +
> + tcache = NULL;
> +
> + for (i = 0; i < TCACHE_MAX_BINS; ++i)
> + {
> + while (tcache_tmp->entries[i])
> + {
> + tcache_entry *e = tcache_tmp->entries[i];
> + tcache_tmp->entries[i] = e->next;
> + __libc_free (e);
> + }
> + }
> +
> + __libc_free (tcache_tmp);
> +
> + tcache_shutting_down = 1;
> +}
> +text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
> +
> +static void
> +tcache_init(void)
> +{
> + mstate ar_ptr;
> + void *victim = 0;
> + const size_t bytes = sizeof (tcache_perthread_struct);
> +
> + if (tcache_shutting_down)
> + return;
> +
> + arena_get (ar_ptr, bytes);
> + victim = _int_malloc (ar_ptr, bytes);
> + if (!victim && ar_ptr != NULL)
> + {
> + ar_ptr = arena_get_retry (ar_ptr, bytes);
> + victim = _int_malloc (ar_ptr, bytes);
> + }
> +
> +
> + if (ar_ptr != NULL)
> + __libc_lock_unlock (ar_ptr->mutex);
> +
> + /* In a low memory situation, we may not be able to allocate memory
> + - in which case, we just keep trying later. However, we
> + typically do this very early, so either there is sufficient
> + memory, or there isn't enough memory to do non-trivial
> + allocations anyway. */
> + if (victim)
> + {
> + tcache = (tcache_perthread_struct *) victim;
> + memset (tcache, 0, sizeof (tcache_perthread_struct));
> + }
> +
> +}
> +
> +#define MAYBE_INIT_TCACHE() \
> + if (__glibc_unlikely (tcache == NULL)) \
> + tcache_init();
> +
> +#else
> +#define MAYBE_INIT_TCACHE()
> +#endif
> +
> void *
> __libc_malloc (size_t bytes)
> {
> @@ -2885,6 +3045,21 @@ __libc_malloc (size_t bytes)
> = atomic_forced_read (__malloc_hook);
> if (__builtin_expect (hook != NULL, 0))
> return (*hook)(bytes, RETURN_ADDRESS (0));
> +#if USE_TCACHE
> + /* int_free also calls request2size, be careful to not pad twice. */
> + size_t tbytes = request2size (bytes);
> + size_t tc_idx = csize2tidx (tbytes);
> +
> + MAYBE_INIT_TCACHE ();
> +
> + if (tc_idx < mp_.tcache_bins
> + && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
> + && tcache
> + && tcache->entries[tc_idx] != NULL)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
>
> arena_get (ar_ptr, bytes);
>
> @@ -2944,6 +3119,8 @@ __libc_free (void *mem)
> return;
> }
>
> + MAYBE_INIT_TCACHE ();
> +
> ar_ptr = arena_for_chunk (p);
> _int_free (ar_ptr, p, 0);
> }
> @@ -2981,7 +3158,10 @@ __libc_realloc (void *oldmem, size_t byt
> if (chunk_is_mmapped (oldp))
> ar_ptr = NULL;
> else
> - ar_ptr = arena_for_chunk (oldp);
> + {
> + MAYBE_INIT_TCACHE ();
> + ar_ptr = arena_for_chunk (oldp);
> + }
>
> /* Little security check which won't hurt performance: the allocator
> never wrapps around at the end of the address space. Therefore
> @@ -3206,6 +3386,8 @@ __libc_calloc (size_t n, size_t elem_siz
>
> sz = bytes;
>
> + MAYBE_INIT_TCACHE ();
> +
> arena_get (av, sz);
> if (av)
> {
> @@ -3336,6 +3518,10 @@ _int_malloc (mstate av, size_t bytes)
> mchunkptr fwd; /* misc temp for linking */
> mchunkptr bck; /* misc temp for linking */
>
> +#if USE_TCACHE
> + size_t tcache_unsorted_count; /* count of unsorted chunks processed */
> +#endif
> +
> const char *errstr = NULL;
>
> /*
> @@ -3365,19 +3551,22 @@ _int_malloc (mstate av, size_t bytes)
> can try it without checking, which saves some time on this fast path.
> */
>
> +#define REMOVE_FB(fb, victim, pp) \
> + do \
> + { \
> + victim = pp; \
> + if (victim == NULL) \
> + break; \
> + } \
> + while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
> + != victim); \
> +
> if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
> {
> idx = fastbin_index (nb);
> mfastbinptr *fb = &fastbin (av, idx);
> mchunkptr pp = *fb;
> - do
> - {
> - victim = pp;
> - if (victim == NULL)
> - break;
> - }
> - while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
> - != victim);
> + REMOVE_FB (fb, victim, pp);
> if (victim != 0)
> {
> if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
> @@ -3388,6 +3577,26 @@ _int_malloc (mstate av, size_t bytes)
> return NULL;
> }
> check_remalloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> + /* While we're here, if we see other chunks of the same size,
> + stash them in the tcache. */
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + {
> + mchunkptr tc_victim;
> +
> + /* While bin not empty and tcache not full, copy chunks over. */
> + while (tcache->counts[tc_idx] < mp_.tcache_count
> + && (pp = *fb) != NULL)
> + {
> + REMOVE_FB (fb, tc_victim, pp);
> + if (tc_victim != 0)
> + {
> + tcache_put (tc_victim, tc_idx);
> + }
> + }
> + }
> +#endif
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> @@ -3426,6 +3635,32 @@ _int_malloc (mstate av, size_t bytes)
> if (av != &main_arena)
> set_non_main_arena (victim);
> check_malloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> + /* While we're here, if we see other chunks of the same size,
> + stash them in the tcache. */
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + {
> + mchunkptr tc_victim;
> +
> + /* While bin not empty and tcache not full, copy chunks over. */
> + while (tcache->counts[tc_idx] < mp_.tcache_count
> + && (tc_victim = last (bin)) != bin)
> + {
> + if (tc_victim != 0)
> + {
> + bck = tc_victim->bk;
> + set_inuse_bit_at_offset (tc_victim, nb);
> + if (av != &main_arena)
> + set_non_main_arena (tc_victim);
> + bin->bk = bck;
> + bck->fd = bin;
> +
> + tcache_put (tc_victim, tc_idx);
> + }
> + }
> + }
> +#endif
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> @@ -3464,6 +3699,16 @@ _int_malloc (mstate av, size_t bytes)
> otherwise need to expand memory to service a "small" request.
> */
>
> +#if USE_TCACHE
> + INTERNAL_SIZE_T tcache_nb = 0;
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + tcache_nb = nb;
> + int return_cached = 0;
> +
> + tcache_unsorted_count = 0;
> +#endif
> +
> for (;; )
> {
> int iters = 0;
> @@ -3524,10 +3769,26 @@ _int_malloc (mstate av, size_t bytes)
> set_inuse_bit_at_offset (victim, size);
> if (av != &main_arena)
> set_non_main_arena (victim);
> +#if USE_TCACHE
> + /* Fill cache first, return to user only if cache fills.
> + We may return one of these chunks later. */
> + if (tcache_nb
> + && tcache->counts[tc_idx] < mp_.tcache_count)
> + {
> + tcache_put (victim, tc_idx);
> + return_cached = 1;
> + continue;
> + }
> + else
> + {
> +#endif
> check_malloced_chunk (av, victim, nb);
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> +#if USE_TCACHE
> + }
> +#endif
> }
>
> /* place chunk in bin */
> @@ -3594,11 +3855,31 @@ _int_malloc (mstate av, size_t bytes)
> fwd->bk = victim;
> bck->fd = victim;
>
> +#if USE_TCACHE
> + /* If we've processed as many chunks as we're allowed while
> + filling the cache, return one of the cached ones. */
> + ++tcache_unsorted_count;
> + if (return_cached
> + && mp_.tcache_unsorted_limit > 0
> + && tcache_unsorted_count > mp_.tcache_unsorted_limit)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
> +
> #define MAX_ITERS 10000
> if (++iters >= MAX_ITERS)
> break;
> }
>
> +#if USE_TCACHE
> + /* If all the small chunks we found ended up cached, return one now. */
> + if (return_cached)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
> +
> /*
> If a large request, scan through the chunks of current bin in
> sorted order to find smallest that fits. Use the skip list for this.
> @@ -3884,6 +4165,20 @@ _int_free (mstate av, mchunkptr p, int h
>
> check_inuse_chunk(av, p);
>
> +#if USE_TCACHE
> + {
> + size_t tc_idx = csize2tidx (size);
> +
> + if (tcache
> + && tc_idx < mp_.tcache_bins
> + && tcache->counts[tc_idx] < mp_.tcache_count)
> + {
> + tcache_put (p, tc_idx);
> + return;
> + }
> + }
> +#endif
> +
> /*
> If eligible, place chunk on a fastbin so it can be found
> and used quickly in malloc.
> @@ -4845,6 +5140,38 @@ do_set_arena_max (size_t value)
> return 1;
> }
>
> +#if USE_TCACHE
> +static inline int
> +__always_inline
> +do_set_tcache_max (size_t value)
> +{
> + if (value >= 0 && value <= MAX_TCACHE_SIZE)
> + {
> + LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
> + mp_.tcache_max_bytes = value;
> + mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
> + }
> + return 1;
> +}
> +
> +static inline int
> +__always_inline
> +do_set_tcache_count (size_t value)
> +{
> + LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
> + mp_.tcache_count = value;
> + return 1;
> +}
> +
> +static inline int
> +__always_inline
> +do_set_tcache_unsorted_limit (size_t value)
> +{
> + LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
> + mp_.tcache_unsorted_limit = value;
> + return 1;
> +}
> +#endif
>
> int
> __libc_mallopt (int param_number, int value)
> diff -x .git -x po -rup glibc.pristine/manual/probes.texi glibc.djmalloc-tcache/manual/probes.texi
> --- glibc.pristine/manual/probes.texi 2017-01-10 17:02:06.305343587 -0500
> +++ glibc.djmalloc-tcache/manual/probes.texi 2017-06-26 17:32:35.785884967 -0400
> @@ -231,6 +231,25 @@ dynamic brk/mmap thresholds. Argument @
> the adjusted mmap and trim thresholds, respectively.
> @end deftp
>
> +@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_max}
> +tunable is set. Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_count}
> +tunable is set. Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the
> +@code{glibc.malloc.tcache_unsorted_limit} tunable is set. Argument
> +@var{$arg1} is the requested value, and @var{$arg2} is the previous
> +value of this tunable.
> +@end deftp
> +
> @node Mathematical Function Probes
> @section Mathematical Function Probes
>
> diff -x .git -x po -rup glibc.pristine/manual/tunables.texi glibc.djmalloc-tcache/manual/tunables.texi
> --- glibc.pristine/manual/tunables.texi 2017-06-23 20:49:10.201148494 -0400
> +++ glibc.djmalloc-tcache/manual/tunables.texi 2017-06-26 17:35:07.952998383 -0400
> @@ -193,6 +193,39 @@ systems the limit is twice the number of
> is 8 times the number of cores online.
> @end deftp
>
> +@deftp Tunable glibc.malloc.tcache_max
> +The maximum size of a request (in bytes) which may be met via the
> +per-thread cache. The default (and maximum) value is 1032 bytes on
> +64-bit systems and 516 bytes on 32-bit systems.
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_count
> +The maximum number of chunks of each size to cache. The default is 7.
> +There is no upper limit, other than available system memory. Note
> +that chunks are rounded up to malloc's guaranteed alignment - this
> +count is per rounded size, not per user-provided size.
> +
> +The approximate maximum overhead of the per-thread cache (for each
> +thread, of course) is thus @code{glibc.malloc.tcache_max} (in bins,
> +max 64 bins) times @code{glibc.malloc.tcache_count} times the size for
> +each bin. With defaults, this is about 236 KB on 64-bit systems and
> +118 KB on 32-bit systems.
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_unsorted_limit
> +When the user requests memory and the request cannot be met via the
> +per-thread cache, the arenas are used to meet the request. At this
> +time, additional chunks will be moved from existing arena lists to
> +pre-fill the corresponding cache. While copies from the fastbins,
> +smallbins, and regular bins are bounded and predictable due to the bin
> +sizes, copies from the unsorted bin are not bounded, and incur
> +additional time penalties as they need to be sorted as they're
> +scanned. To make scanning the unsorted list more predictable and
> +bounded, the user may set this tunable to limit the number of blocks
> +that are scanned from the unsorted list while searching for chunks to
> +pre-fill the per-thread cache with. The default, or when set to zero,
> +is no limit.
> +
> @node Hardware Capability Tunables
> @section Hardware Capability Tunables
> @cindex hardware capability tunables
>
On 06/28/2017 09:35 PM, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 06:46 AM, DJ Delorie wrote:
>> Siddhesh Poyarekar <siddhesh@gotplt.org> writes:
>>> haha, interesting, because I remember Florian and I had this discussion
>>> about the utility of SXID_IGNORE six months ago when I designed the
>>> scope control for the envvars. Let's have Florian explain this.
>>
>> Changed to SXID_ERASE.
>>
>>> The braces should be on the next line:
>>
>> Fixed.
>>
>>>>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>>>>
>>>>> __attribute__ should be on its own line.
>>>>
>>>> Likewise, I was copying the formatting used elsewhere in malloc. What
>>>> is the correct formatting for that line? Do I need to fix the other
>>>> existing instances also?
>>>
>>> Sure, but in a separate (obvious) patch. The malloc code was in a
>>> completely different format when it was ported in from dlmalloc and
>>> since then it has been a bit of a mishmash that needs a cleanup.
>>
>> I scanned all of glibc, and the same-line syntax seems to be nearly
>> ubiquitous, with only one example of it on a separate line. I couldn't
>> find any mention of attributes in the GNU or GLIBC coding standards
>> either...
>>
>> I'm certainly willing to do a global tweak patch later, but... is this
>> documented somewhere?
>
> I thought it should have been but now that you've forced me to think
> harder, my impression may be due to the way macro sections were written
> in some math functions.
>
> Looks good to me, please commit.
I would like to do one more pass of review on this final patch just to
make sure we didn't miss anything.
On 06/30/2017 04:25 PM, Carlos O'Donell wrote:
> On 06/28/2017 09:35 PM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 06:46 AM, DJ Delorie wrote:
>>> Siddhesh Poyarekar <siddhesh@gotplt.org> writes:
>>>> haha, interesting, because I remember Florian and I had this discussion
>>>> about the utility of SXID_IGNORE six months ago when I designed the
>>>> scope control for the envvars. Let's have Florian explain this.
>>>
>>> Changed to SXID_ERASE.
>>>
>>>> The braces should be on the next line:
>>>
>>> Fixed.
>>>
>>>>>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>>>>>
>>>>>> __attribute__ should be on its own line.
>>>>>
>>>>> Likewise, I was copying the formatting used elsewhere in malloc. What
>>>>> is the correct formatting for that line? Do I need to fix the other
>>>>> existing instances also?
>>>>
>>>> Sure, but in a separate (obvious) patch. The malloc code was in a
>>>> completely different format when it was ported in from dlmalloc and
>>>> since then it has been a bit of a mishmash that needs a cleanup.
>>>
>>> I scanned all of glibc, and the same-line syntax seems to be nearly
>>> ubiquitous, with only one example of it on a separate line. I couldn't
>>> find any mention of attributes in the GNU or GLIBC coding standards
>>> either...
>>>
>>> I'm certainly willing to do a global tweak patch later, but... is this
>>> documented somewhere?
>>
>> I thought it should have been but now that you've forced me to think
>> harder, my impression may be due to the way macro sections were written
>> in some math functions.
>>
>> Looks good to me, please commit.
>
> I would like to do one more pass of review on this final patch just to
> make sure we didn't miss anything.
Added to release blockers:
https://sourceware.org/glibc/wiki/Release/2.26#Release_blockers.3F
The thread-local cache shows significant performance gains in the measured
workloads, and I'd like to see those gains reach users.
The obvious next step is probably more deterministic memory usage
from glibc's malloc, but that's a later project.
DJ,
OK to commit with:
* Two doc changes (see review below).
* Fix the 'to appease gcc' issue (see review below).
> diff -x .git -x po -rup glibc.pristine/config.make.in glibc.djmalloc-tcache/config.make.in
> --- glibc.pristine/config.make.in 2017-06-20 12:54:22.636484566 -0400
> +++ glibc.djmalloc-tcache/config.make.in 2017-06-26 17:32:41.164888977 -0400
> @@ -78,6 +78,8 @@ multi-arch = @multi_arch@
>
> mach-interface-list = @mach_interface_list@
>
> +experimental-malloc = @experimental_malloc@
> +
OK.
> nss-crypt = @libc_cv_nss_crypt@
> static-nss-crypt = @libc_cv_static_nss_crypt@
>
> diff -x .git -x po -rup glibc.pristine/configure.ac glibc.djmalloc-tcache/configure.ac
> --- glibc.pristine/configure.ac 2017-06-20 12:54:22.638484567 -0400
> +++ glibc.djmalloc-tcache/configure.ac 2017-06-26 17:32:41.155888970 -0400
> @@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
> [multi_arch=$enableval],
> [multi_arch=default])
>
> +AC_ARG_ENABLE([experimental-malloc],
> + AC_HELP_STRING([--disable-experimental-malloc],
> + [disable experimental malloc features]),
> + [experimental_malloc=$enableval],
> + [experimental_malloc=yes])
> +AC_SUBST(experimental_malloc)
> +
OK. Enabled by default is correct. We see this as universally helpful, and
expect it to be disabled only if issues are discovered (not that we've seen any).
> AC_ARG_ENABLE([nss-crypt],
> AC_HELP_STRING([--enable-nss-crypt],
> [enable libcrypt to use nss]),
> diff -x .git -x po -rup glibc.pristine/elf/dl-tunables.list glibc.djmalloc-tcache/elf/dl-tunables.list
> --- glibc.pristine/elf/dl-tunables.list 2017-06-23 20:49:10.174148477 -0400
> +++ glibc.djmalloc-tcache/elf/dl-tunables.list 2017-06-28 21:01:06.748079482 -0400
> @@ -76,6 +76,18 @@ glibc {
> minval: 1
> security_level: SXID_IGNORE
> }
> + tcache_max {
> + type: SIZE_T
> + security_level: SXID_ERASE
> + }
> + tcache_count {
> + type: SIZE_T
> + security_level: SXID_ERASE
> + }
> + tcache_unsorted_limit {
> + type: SIZE_T
> + security_level: SXID_ERASE
> + }
OK.
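(For anyone wanting to experiment with these: they should be reachable the
same way as the existing malloc tunables, as colon-separated name=value pairs
in GLIBC_TUNABLES. For example, running a program with
GLIBC_TUNABLES=glibc.malloc.tcache_count=0 ought to effectively disable the
per-thread cache at run time, since nothing is ever stashed in a bin whose
fill limit is zero.)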
> }
> tune {
> hwcap_mask {
> diff -x .git -x po -rup glibc.pristine/malloc/Makefile glibc.djmalloc-tcache/malloc/Makefile
> --- glibc.pristine/malloc/Makefile 2017-06-23 20:49:10.199148493 -0400
> +++ glibc.djmalloc-tcache/malloc/Makefile 2017-06-26 17:32:41.153888968 -0400
> @@ -189,6 +189,11 @@ tst-malloc-usable-static-ENV = $(tst-mal
> tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
> tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
>
> +ifeq ($(experimental-malloc),yes)
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=1
> +else
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=0
> +endif
OK.
> # Uncomment this for test releases. For public releases it is too expensive.
> #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
>
> diff -x .git -x po -rup glibc.pristine/malloc/arena.c glibc.djmalloc-tcache/malloc/arena.c
> --- glibc.pristine/malloc/arena.c 2017-06-20 12:54:22.645484573 -0400
> +++ glibc.djmalloc-tcache/malloc/arena.c 2017-06-26 17:49:33.701608305 -0400
> @@ -236,6 +236,11 @@ TUNABLE_CALLBACK_FNDECL (set_perturb_byt
> TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
> TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
> TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
> +#if USE_TCACHE
> +TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
> +#endif
OK.
> #else
> /* Initialization routine. */
> #include <string.h>
> @@ -322,6 +327,12 @@ ptmalloc_init (void)
> TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max));
> TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max));
> TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test));
> +#if USE_TCACHE
> + TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));
> + TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count));
> + TUNABLE_GET (tcache_unsorted_limit, size_t,
> + TUNABLE_CALLBACK (set_tcache_unsorted_limit));
OK.
> +#endif
> __libc_lock_unlock (main_arena.mutex);
> #else
> const char *s = NULL;
> diff -x .git -x po -rup glibc.pristine/malloc/malloc.c glibc.djmalloc-tcache/malloc/malloc.c
> --- glibc.pristine/malloc/malloc.c 2017-05-03 16:26:47.559839224 -0400
> +++ glibc.djmalloc-tcache/malloc/malloc.c 2017-06-28 21:02:22.059151815 -0400
> @@ -296,6 +296,30 @@ __malloc_assert (const char *assertion,
> }
> #endif
>
> +#if USE_TCACHE
> +/* We want 64 entries. This is an arbitrary limit, which tunables can reduce. */
> +# define TCACHE_MAX_BINS 64
> +# define MAX_TCACHE_SIZE tidx2usize (TCACHE_MAX_BINS-1)
OK.
> +
> +/* Only used to pre-fill the tunables. */
> +# define tidx2usize(idx) (((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
> +
> +/* When "x" is from chunksize(). */
> +# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
> +/* When "x" is a user-provided size. */
> +# define usize2tidx(x) csize2tidx (request2size (x))
> +
> +/* With rounding and alignment, the bins are...
> + idx 0 bytes 0..24 (64-bit) or 0..12 (32-bit)
> + idx 1 bytes 25..40 or 13..20
> + idx 2 bytes 41..56 or 21..28
> + etc. */
> +
> +/* This is another arbitrary limit, which tunables can change. Each
> + tcache bin will hold at most this number of chunks. */
> +# define TCACHE_FILL_COUNT 7
> +#endif
OK.
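As a cross-check of the bin table above: on a typical 64-bit target (SIZE_SZ 8,
MALLOC_ALIGNMENT 16, MINSIZE 32), tidx2usize (0) = 0*16 + 32 - 8 = 24 and
tidx2usize (1) = 40, which matches the 0..24 and 25..40 rows; the usual 32-bit
values (4, 8, 16) give 12 and 20. The largest bin, tidx2usize (63), works out
to 63*16 + 24 = 1032 bytes on 64-bit and 63*8 + 12 = 516 bytes on 32-bit, which
is where the tcache_max defaults quoted in the manual changes below come from.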
> +
>
> /*
> REALLOC_ZERO_BYTES_FREES should be set if a call to
> @@ -1712,6 +1736,17 @@ struct malloc_par
>
> /* First address handed out by MORECORE/sbrk. */
> char *sbrk_base;
> +
> +#if USE_TCACHE
> + /* Maximum number of buckets to use. */
> + size_t tcache_bins;
> + size_t tcache_max_bytes;
> + /* Maximum number of chunks in each bucket. */
> + size_t tcache_count;
> + /* Maximum number of chunks to remove from the unsorted list, which
> + aren't used to prefill the cache. */
> + size_t tcache_unsorted_limit;
OK.
> +#endif
> };
>
> /* There are several instances of this struct ("arenas") in this
> @@ -1750,6 +1785,13 @@ static struct malloc_par mp_ =
> .trim_threshold = DEFAULT_TRIM_THRESHOLD,
> #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
> .arena_test = NARENAS_FROM_NCORES (1)
> +#if USE_TCACHE
> + ,
> + .tcache_count = TCACHE_FILL_COUNT,
> + .tcache_bins = TCACHE_MAX_BINS,
> + .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
> + .tcache_unsorted_limit = 0 /* No limit. */
> +#endif
OK.
> };
>
> /* Maximum size of memory handled in fastbins. */
> @@ -2875,6 +2917,124 @@ mremap_chunk (mchunkptr p, size_t new_si
>
> /*------------------------ Public wrappers. --------------------------------*/
>
> +#if USE_TCACHE
> +
> +/* We overlay this structure on the user-data portion of a chunk when
> + the chunk is stored in the per-thread cache. */
> +typedef struct tcache_entry
> +{
> + struct tcache_entry *next;
> +} tcache_entry;
> +
> +/* There is one of these for each thread, which contains the
> + per-thread cache (hence "tcache_perthread_struct"). Keeping
> + overall size low is mildly important. Note that COUNTS and ENTRIES
> + are redundant (we could have just counted the linked list each
> + time), this is for performance reasons. */
> +typedef struct tcache_perthread_struct
> +{
> + char counts[TCACHE_MAX_BINS];
> + tcache_entry *entries[TCACHE_MAX_BINS];
> +} tcache_perthread_struct;
> +
> +static __thread char tcache_shutting_down = 0;
> +static __thread tcache_perthread_struct *tcache = NULL;
> +
OK.
> +/* Caller must ensure that we know tc_idx is valid and there's room
> + for more chunks. */
> +static void
> +tcache_put (mchunkptr chunk, size_t tc_idx)
> +{
> + tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
> + assert (tc_idx < TCACHE_MAX_BINS);
> + e->next = tcache->entries[tc_idx];
> + tcache->entries[tc_idx] = e;
> + ++(tcache->counts[tc_idx]);
> +}
OK.
> +
> +/* Caller must ensure that we know tc_idx is valid and there's
> + available chunks to remove. */
> +static void *
> +tcache_get (size_t tc_idx)
> +{
> + tcache_entry *e = tcache->entries[tc_idx];
> + assert (tc_idx < TCACHE_MAX_BINS);
> + assert (tcache->entries[tc_idx] > 0);
> + tcache->entries[tc_idx] = e->next;
> + --(tcache->counts[tc_idx]);
> + return (void *) e;
> +}
OK.
> +
> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
> +tcache_thread_freeres (void)
> +{
> + int i;
> + tcache_perthread_struct *tcache_tmp = tcache;
> +
> + if (!tcache)
> + return;
> +
> + tcache = NULL;
> +
> + for (i = 0; i < TCACHE_MAX_BINS; ++i)
> + {
> + while (tcache_tmp->entries[i])
> + {
> + tcache_entry *e = tcache_tmp->entries[i];
> + tcache_tmp->entries[i] = e->next;
> + __libc_free (e);
OK.
> + }
> + }
> +
> + __libc_free (tcache_tmp);
> +
> + tcache_shutting_down = 1;
> +}
> +text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
OK.
> +
> +static void
> +tcache_init(void)
> +{
> + mstate ar_ptr;
> + void *victim = 0;
> + const size_t bytes = sizeof (tcache_perthread_struct);
> +
> + if (tcache_shutting_down)
> + return;
> +
> + arena_get (ar_ptr, bytes);
> + victim = _int_malloc (ar_ptr, bytes);
> + if (!victim && ar_ptr != NULL)
> + {
> + ar_ptr = arena_get_retry (ar_ptr, bytes);
> + victim = _int_malloc (ar_ptr, bytes);
> + }
> +
> +
> + if (ar_ptr != NULL)
> + __libc_lock_unlock (ar_ptr->mutex);
> +
> + /* In a low memory situation, we may not be able to allocate memory
> + - in which case, we just keep trying later. However, we
> + typically do this very early, so either there is sufficient
> + memory, or there isn't enough memory to do non-trivial
> + allocations anyway. */
> + if (victim)
> + {
> + tcache = (tcache_perthread_struct *) victim;
> + memset (tcache, 0, sizeof (tcache_perthread_struct));
> + }
> +
> +}
OK.
> +
> +#define MAYBE_INIT_TCACHE() \
> + if (__glibc_unlikely (tcache == NULL)) \
> + tcache_init();
> +
> +#else
> +#define MAYBE_INIT_TCACHE()
> +#endif
OK.
> +
> void *
> __libc_malloc (size_t bytes)
> {
> @@ -2885,6 +3045,21 @@ __libc_malloc (size_t bytes)
> = atomic_forced_read (__malloc_hook);
> if (__builtin_expect (hook != NULL, 0))
> return (*hook)(bytes, RETURN_ADDRESS (0));
> +#if USE_TCACHE
> + /* int_free also calls request2size, be careful to not pad twice. */
> + size_t tbytes = request2size (bytes);
> + size_t tc_idx = csize2tidx (tbytes);
> +
> + MAYBE_INIT_TCACHE ();
> +
> + if (tc_idx < mp_.tcache_bins
> + && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
Right, in 'tc_idx < mp_.tcache_bins && tc_idx < TCACHE_MAX_BINS' the second
check is always implied by the first (mp_.tcache_bins never exceeds
TCACHE_MAX_BINS), so the condition should reduce to 'tc_idx < mp_.tcache_bins',
but I bet the compiler generates a warning when you remove the static check?
In which case this is not the right fix. You need to remove the redundant
check, i.e. 'tc_idx < TCACHE_MAX_BINS', and then use DIAG_PUSH_NEEDS_COMMENT,
DIAG_IGNORE_NEEDS_COMMENT and DIAG_POP_NEEDS_COMMENT to document that you
are ignoring the compiler warning and at which version of gcc. That way
future developers can eventually fix the compiler or remove the warning
suppression.
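For reference, the pattern being suggested looks roughly like this in
__libc_malloc (a sketch only: the macros come from <libc-diag.h>, but the GCC
version and the warning option named below are placeholders, since the thread
does not say which diagnostic actually fires):

  DIAG_PUSH_NEEDS_COMMENT;
  /* Placeholder rationale for this sketch: GCC cannot see that
     mp_.tcache_bins never exceeds TCACHE_MAX_BINS, so it may warn that
     the entries[] subscript can be out of range; the version and the
     -Warray-bounds option below are guesses.  */
  DIAG_IGNORE_NEEDS_COMMENT (7, "-Warray-bounds");
  if (tc_idx < mp_.tcache_bins
      && tcache != NULL
      && tcache->entries[tc_idx] != NULL)
    return tcache_get (tc_idx);
  DIAG_POP_NEEDS_COMMENT;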
> + && tcache
> + && tcache->entries[tc_idx] != NULL)
OK. On first call to malloc we do not fill the cache but move on
to _int_malloc() to do work.
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
>
> arena_get (ar_ptr, bytes);
>
> @@ -2944,6 +3119,8 @@ __libc_free (void *mem)
> return;
> }
>
> + MAYBE_INIT_TCACHE ();
OK. Initialize in __libc_free before calling _int_free().
> +
> ar_ptr = arena_for_chunk (p);
> _int_free (ar_ptr, p, 0);
> }
> @@ -2981,7 +3158,10 @@ __libc_realloc (void *oldmem, size_t byt
> if (chunk_is_mmapped (oldp))
> ar_ptr = NULL;
> else
> - ar_ptr = arena_for_chunk (oldp);
> + {
> + MAYBE_INIT_TCACHE ();
> + ar_ptr = arena_for_chunk (oldp);
> + }
OK.
>
> /* Little security check which won't hurt performance: the allocator
> never wrapps around at the end of the address space. Therefore
> @@ -3206,6 +3386,8 @@ __libc_calloc (size_t n, size_t elem_siz
>
> sz = bytes;
>
> + MAYBE_INIT_TCACHE ();
OK.
> +
> arena_get (av, sz);
> if (av)
> {
> @@ -3336,6 +3518,10 @@ _int_malloc (mstate av, size_t bytes)
> mchunkptr fwd; /* misc temp for linking */
> mchunkptr bck; /* misc temp for linking */
>
> +#if USE_TCACHE
> + size_t tcache_unsorted_count; /* count of unsorted chunks processed */
OK.
> +#endif
> +
> const char *errstr = NULL;
>
> /*
> @@ -3365,19 +3551,22 @@ _int_malloc (mstate av, size_t bytes)
> can try it without checking, which saves some time on this fast path.
> */
>
> +#define REMOVE_FB(fb, victim, pp) \
> + do \
> + { \
> + victim = pp; \
> + if (victim == NULL) \
> + break; \
> + } \
> + while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
> + != victim); \
> +
OK.
> if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
> {
> idx = fastbin_index (nb);
> mfastbinptr *fb = &fastbin (av, idx);
> mchunkptr pp = *fb;
> - do
> - {
> - victim = pp;
> - if (victim == NULL)
> - break;
> - }
> - while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
> - != victim);
> + REMOVE_FB (fb, victim, pp);
OK.
> if (victim != 0)
> {
> if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
> @@ -3388,6 +3577,26 @@ _int_malloc (mstate av, size_t bytes)
> return NULL;
> }
> check_remalloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> + /* While we're here, if we see other chunks of the same size,
> + stash them in the tcache. */
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
OK (Interesting the compiler doesn't warn here).
> + {
> + mchunkptr tc_victim;
> +
> + /* While bin not empty and tcache not full, copy chunks over. */
> + while (tcache->counts[tc_idx] < mp_.tcache_count
> + && (pp = *fb) != NULL)
> + {
> + REMOVE_FB (fb, tc_victim, pp);
> + if (tc_victim != 0)
> + {
> + tcache_put (tc_victim, tc_idx);
> + }
> + }
OK. We fill tcache from fastbins.
> + }
> +#endif
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> @@ -3426,6 +3635,32 @@ _int_malloc (mstate av, size_t bytes)
> if (av != &main_arena)
> set_non_main_arena (victim);
> check_malloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> + /* While we're here, if we see other chunks of the same size,
> + stash them in the tcache. */
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + {
> + mchunkptr tc_victim;
> +
> + /* While bin not empty and tcache not full, copy chunks over. */
> + while (tcache->counts[tc_idx] < mp_.tcache_count
> + && (tc_victim = last (bin)) != bin)
> + {
> + if (tc_victim != 0)
OK.
Though I find it disconcerting to see 0 and NULL used interchangeably.
It makes it harder to follow the code. This is a pointer and we should
probably switch them all to NULL. However, I appreciate you're following
the existing code sequences and how they operate on the same structures.
It is also a pain to remember that bin 0 doesn't exist and is just
a hack of fd/bck pointer overlay on two bin array entries. We need a better
design for this some day. I had to relearn why '(tc_victim = last (bin)) != bin'
even works to determine you've searched the whole list.
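(For the archives: the regular bins are circular doubly-linked lists whose head
is the bin itself, so last (bin), i.e. bin->bk, equals bin exactly when the bin
is empty; that is why the loop condition above doubles as the emptiness test.)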
> + {
> + bck = tc_victim->bk;
> + set_inuse_bit_at_offset (tc_victim, nb);
> + if (av != &main_arena)
> + set_non_main_arena (tc_victim);
> + bin->bk = bck;
> + bck->fd = bin;
> +
> + tcache_put (tc_victim, tc_idx);
> + }
OK.
> + }
> + }
> +#endif
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> @@ -3464,6 +3699,16 @@ _int_malloc (mstate av, size_t bytes)
> otherwise need to expand memory to service a "small" request.
> */
>
> +#if USE_TCACHE
> + INTERNAL_SIZE_T tcache_nb = 0;
> + size_t tc_idx = csize2tidx (nb);
> + if (tcache && tc_idx < mp_.tcache_bins)
> + tcache_nb = nb;
> + int return_cached = 0;
> +
> + tcache_unsorted_count = 0;
OK.
> +#endif
> +
> for (;; )
> {
> int iters = 0;
> @@ -3524,10 +3769,26 @@ _int_malloc (mstate av, size_t bytes)
> set_inuse_bit_at_offset (victim, size);
> if (av != &main_arena)
> set_non_main_arena (victim);
OK, we are in the larger 'for (;; )' loop in _int_malloc analyzing all of
the chunks we got in free().
> +#if USE_TCACHE
> + /* Fill cache first, return to user only if cache fills.
> + We may return one of these chunks later. */
> + if (tcache_nb
> + && tcache->counts[tc_idx] < mp_.tcache_count)
> + {
> + tcache_put (victim, tc_idx);
> + return_cached = 1;
> + continue;
> + }
> + else
> + {
> +#endif
> check_malloced_chunk (av, victim, nb);
> void *p = chunk2mem (victim);
> alloc_perturb (p, bytes);
> return p;
> +#if USE_TCACHE
> + }
> +#endif
OK.
> }
>
> /* place chunk in bin */
> @@ -3594,11 +3855,31 @@ _int_malloc (mstate av, size_t bytes)
> fwd->bk = victim;
> bck->fd = victim;
>
> +#if USE_TCACHE
> + /* If we've processed as many chunks as we're allowed while
> + filling the cache, return one of the cached ones. */
> + ++tcache_unsorted_count;
> + if (return_cached
> + && mp_.tcache_unsorted_limit > 0
> + && tcache_unsorted_count > mp_.tcache_unsorted_limit)
> + {
> + return tcache_get (tc_idx);
OK.
> + }
> +#endif
> +
> #define MAX_ITERS 10000
> if (++iters >= MAX_ITERS)
> break;
> }
>
> +#if USE_TCACHE
> + /* If all the small chunks we found ended up cached, return one now. */
> + if (return_cached)
> + {
> + return tcache_get (tc_idx);
> + }
> +#endif
OK.
> +
> /*
> If a large request, scan through the chunks of current bin in
> sorted order to find smallest that fits. Use the skip list for this.
> @@ -3884,6 +4165,20 @@ _int_free (mstate av, mchunkptr p, int h
>
> check_inuse_chunk(av, p);
>
> +#if USE_TCACHE
> + {
> + size_t tc_idx = csize2tidx (size);
> +
> + if (tcache
> + && tc_idx < mp_.tcache_bins
> + && tcache->counts[tc_idx] < mp_.tcache_count)
> + {
> + tcache_put (p, tc_idx);
> + return;
> + }
OK. Put back into the tcache.
> + }
> +#endif
> +
> /*
> If eligible, place chunk on a fastbin so it can be found
> and used quickly in malloc.
> @@ -4845,6 +5140,38 @@ do_set_arena_max (size_t value)
> return 1;
> }
>
> +#if USE_TCACHE
> +static inline int
> +__always_inline
> +do_set_tcache_max (size_t value)
> +{
> + if (value >= 0 && value <= MAX_TCACHE_SIZE)
> + {
> + LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
> + mp_.tcache_max_bytes = value;
> + mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
> + }
> + return 1;
> +}
OK.
> +
> +static inline int
> +__always_inline
> +do_set_tcache_count (size_t value)
> +{
> + LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
> + mp_.tcache_count = value;
> + return 1;
> +}
OK.
> +
> +static inline int
> +__always_inline
> +do_set_tcache_unsorted_limit (size_t value)
> +{
> + LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
> + mp_.tcache_unsorted_limit = value;
> + return 1;
> +}
OK.
> +#endif
>
> int
> __libc_mallopt (int param_number, int value)
> diff -x .git -x po -rup glibc.pristine/manual/probes.texi glibc.djmalloc-tcache/manual/probes.texi
> --- glibc.pristine/manual/probes.texi 2017-01-10 17:02:06.305343587 -0500
> +++ glibc.djmalloc-tcache/manual/probes.texi 2017-06-26 17:32:35.785884967 -0400
> @@ -231,6 +231,25 @@ dynamic brk/mmap thresholds. Argument @
> the adjusted mmap and trim thresholds, respectively.
> @end deftp
>
> +@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_max}
> +tunable is set. Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_count}
> +tunable is set. Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the
> +@code{glibc.malloc.tcache_unsorted_limit} tunable is set. Argument
> +@var{$arg1} is the requested value, and @var{$arg2} is the previous
> +value of this tunable.
> +@end deftp
OK.
> +
> @node Mathematical Function Probes
> @section Mathematical Function Probes
>
> diff -x .git -x po -rup glibc.pristine/manual/tunables.texi glibc.djmalloc-tcache/manual/tunables.texi
> --- glibc.pristine/manual/tunables.texi 2017-06-23 20:49:10.201148494 -0400
> +++ glibc.djmalloc-tcache/manual/tunables.texi 2017-06-26 17:35:07.952998383 -0400
> @@ -193,6 +193,39 @@ systems the limit is twice the number of
> is 8 times the number of cores online.
> @end deftp
>
> +@deftp Tunable glibc.malloc.tcache_max
> +The maximum size of a request (in bytes) which may be met via the
> +per-thread cache. The default (and maximum) value is 1032 bytes on
> +64-bit systems and 516 bytes on 32-bit systems.
> +@end deftp
OK.
> +
> +@deftp Tunable glibc.malloc.tcache_count
> +The maximum number of chunks of each size to cache.
You have a lot of implementation-dependent details in this definition.
I suggest this trimmed-down version:
@deftp Tunable glibc.malloc.tcache_count
The maximum number of chunks of each size to cache. The default is 7.
There is no upper limit, other than available system memory.
The approximate maximum overhead of the per-thread cache is thus equal
to the number of bins times the chunk count in each bin times the size
of each chunk.
With defaults, this comes to approximately 236 KB on 64-bit systems and
118 KB on 32-bit systems.
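(The numbers check out: with 64 bins, at most 7 chunks per bin, and bin i
holding chunks of up to 16*i + 24 usable bytes on 64-bit, the worst case is
7 * (16*(0 + 1 + ... + 63) + 64*24) = 7 * 33792 = 236544 bytes, about 236 KB;
the 32-bit variant, 8*i + 12 bytes, gives 7 * 16896 = 118272 bytes, about
118 KB.)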
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_unsorted_limit
> +When the user requests memory and the request cannot be met via the
> +per-thread cache, the arenas are used to meet the request. At this
> +time, additional chunks will be moved from existing arena lists to
> +pre-fill the corresponding cache. While copies from the fastbins,
> +smallbins, and regular bins are bounded and predictable due to the bin
> +sizes, copies from the unsorted bin are not bounded, and incur
> +additional time penalties as they need to be sorted as they're
> +scanned. To make scanning the unsorted list more predictable and
> +bounded, the user may set this tunable to limit the number of blocks
s/blocks/chunks/g
> +that are scanned from the unsorted list while searching for chunks to
> +pre-fill the per-thread cache with. The default, or when set to zero,
> +is no limit.
OK.
> +
> @node Hardware Capability Tunables
> @section Hardware Capability Tunables
> @cindex hardware capability tunables
>
On Tue, 4 Jul 2017, Carlos O'Donell wrote:
> DJ,
>
> OK to commit with:
>
> * Two doc changes (see review below).
Also regarding documentation: there's a new configure option, so it needs
to be added to install.texi, and the INSTALL file regenerated.
On 07/05/2017 09:19 AM, Joseph Myers wrote:
> On Tue, 4 Jul 2017, Carlos O'Donell wrote:
>
>> DJ,
>>
>> OK to commit with:
>>
>> * Two doc changes (see review below).
>
> Also regarding documentation: there's a new configure option, so it needs
> to be added to install.texi, and the INSTALL file regenerated.
Yes! Good catch.
On Wed, 5 Jul 2017, Carlos O'Donell wrote:
> On 07/05/2017 09:19 AM, Joseph Myers wrote:
> > On Tue, 4 Jul 2017, Carlos O'Donell wrote:
> >
> >> DJ,
> >>
> >> OK to commit with:
> >>
> >> * Two doc changes (see review below).
> >
> > Also regarding documentation: there's a new configure option, so it needs
> > to be added to install.texi, and the INSTALL file regenerated.
>
> Yes! Good catch.
And this also seems like it should have a NEWS entry.
On 07/05/2017 12:19 PM, Joseph Myers wrote:
> On Wed, 5 Jul 2017, Carlos O'Donell wrote:
>
>> On 07/05/2017 09:19 AM, Joseph Myers wrote:
>>> On Tue, 4 Jul 2017, Carlos O'Donell wrote:
>>>
>>>> DJ,
>>>>
>>>> OK to commit with:
>>>>
>>>> * Two doc changes (see review below).
>>>
>>> Also regarding documentation: there's a new configure option, so it needs
>>> to be added to install.texi, and the INSTALL file regenerated.
>>
>> Yes! Good catch.
>
> And this also seems like it should have a NEWS entry.
It should, and I expect DJ to add one.
@@ -78,6 +78,8 @@ multi-arch = @multi_arch@
mach-interface-list = @mach_interface_list@
+experimental-malloc = @experimental_malloc@
+
nss-crypt = @libc_cv_nss_crypt@
static-nss-crypt = @libc_cv_static_nss_crypt@
@@ -674,6 +674,7 @@ build_obsolete_nsl
link_obsolete_rpc
libc_cv_static_nss_crypt
libc_cv_nss_crypt
+experimental_malloc
enable_werror
all_warnings
force_install
@@ -779,6 +780,7 @@ enable_kernel
enable_all_warnings
enable_werror
enable_multi_arch
+enable_experimental_malloc
enable_nss_crypt
enable_obsolete_rpc
enable_obsolete_nsl
@@ -1450,6 +1452,8 @@ Optional Features:
--disable-werror do not build with -Werror
--enable-multi-arch enable single DSO with optimizations for multiple
architectures
+ --disable-experimental-malloc
+ disable experimental malloc features
--enable-nss-crypt enable libcrypt to use nss
--enable-obsolete-rpc build and install the obsolete RPC code for
link-time usage
@@ -3522,6 +3526,15 @@ else
fi
+# Check whether --enable-experimental-malloc was given.
+if test "${enable_experimental_malloc+set}" = set; then :
+ enableval=$enable_experimental_malloc; experimental_malloc=$enableval
+else
+ experimental_malloc=yes
+fi
+
+
+
# Check whether --enable-nss-crypt was given.
if test "${enable_nss_crypt+set}" = set; then :
enableval=$enable_nss_crypt; nss_crypt=$enableval
@@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
[multi_arch=$enableval],
[multi_arch=default])
+AC_ARG_ENABLE([experimental-malloc],
+ AC_HELP_STRING([--disable-experimental-malloc],
+ [disable experimental malloc features]),
+ [experimental_malloc=$enableval],
+ [experimental_malloc=yes])
+AC_SUBST(experimental_malloc)
+
AC_ARG_ENABLE([nss-crypt],
AC_HELP_STRING([--enable-nss-crypt],
[enable libcrypt to use nss]),
@@ -76,6 +76,18 @@ glibc {
minval: 1
security_level: SXID_IGNORE
}
+ tcache_max {
+ type: SIZE_T
+ security_level: SXID_ERASE
+ }
+ tcache_count {
+ type: SIZE_T
+ security_level: SXID_ERASE
+ }
+ tcache_unsorted_limit {
+ type: SIZE_T
+ security_level: SXID_ERASE
+ }
}
tune {
hwcap_mask {
@@ -189,6 +189,11 @@ tst-malloc-usable-static-ENV = $(tst-mal
tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
+ifeq ($(experimental-malloc),yes)
+CPPFLAGS-malloc.c += -DUSE_TCACHE=1
+else
+CPPFLAGS-malloc.c += -DUSE_TCACHE=0
+endif
# Uncomment this for test releases. For public releases it is too expensive.
#CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
@@ -236,6 +236,11 @@ TUNABLE_CALLBACK_FNDECL (set_perturb_byt
TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
+#if USE_TCACHE
+TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
+TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
+TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
+#endif
#else
/* Initialization routine. */
#include <string.h>
@@ -322,6 +327,12 @@ ptmalloc_init (void)
TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max));
TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max));
TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test));
+#if USE_TCACHE
+ TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));
+ TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count));
+ TUNABLE_GET (tcache_unsorted_limit, size_t,
+ TUNABLE_CALLBACK (set_tcache_unsorted_limit));
+#endif
__libc_lock_unlock (main_arena.mutex);
#else
const char *s = NULL;
@@ -296,6 +296,30 @@ __malloc_assert (const char *assertion,
}
#endif
+#if USE_TCACHE
+/* We want 64 entries. This is an arbitrary limit, which tunables can reduce. */
+# define TCACHE_MAX_BINS 64
+# define MAX_TCACHE_SIZE tidx2usize (TCACHE_MAX_BINS-1)
+
+/* Only used to pre-fill the tunables. */
+# define tidx2usize(idx) (((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
+
+/* When "x" is from chunksize(). */
+# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
+/* When "x" is a user-provided size. */
+# define usize2tidx(x) csize2tidx (request2size (x))
+
+/* With rounding and alignment, the bins are...
+ idx 0 bytes 0..24 (64-bit) or 0..12 (32-bit)
+ idx 1 bytes 25..40 or 13..20
+ idx 2 bytes 41..56 or 21..28
+ etc. */
+
+/* This is another arbitrary limit, which tunables can change. Each
+ tcache bin will hold at most this number of chunks. */
+# define TCACHE_FILL_COUNT 7
+#endif
+
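Not part of the patch, but to sanity-check the index math above: the throwaway
program below reproduces it with assumed constants (64-bit: MALLOC_ALIGNMENT 16,
MINSIZE 32, SIZE_SZ 8; 32-bit: 8, 16, 4), not values pulled from malloc.c.
Index 63 works out to 1032 usable bytes on 64-bit and 516 on 32-bit, which is
where the documented tcache_max defaults come from.

/* Throwaway sketch of the tcache index math; constants are assumed,
   not taken from malloc.c.  */
#include <stdio.h>

static size_t
usize (size_t idx, size_t align, size_t minsize, size_t size_sz)
{
  /* Mirrors tidx2usize: largest usable request that maps to IDX.  */
  return idx * align + minsize - size_sz;
}

static size_t
tidx (size_t csize, size_t align, size_t minsize)
{
  /* Mirrors csize2tidx: bin index for a chunk of size CSIZE.  */
  return (csize - minsize + align - 1) / align;
}

int
main (void)
{
  /* 64-bit: alignment 16, MINSIZE 32, SIZE_SZ 8.  */
  printf ("64-bit: idx 0 up to %zu bytes, idx 63 up to %zu bytes\n",
          usize (0, 16, 32, 8), usize (63, 16, 32, 8));   /* 24, 1032 */
  /* 32-bit: alignment 8, MINSIZE 16, SIZE_SZ 4.  */
  printf ("32-bit: idx 0 up to %zu bytes, idx 63 up to %zu bytes\n",
          usize (0, 8, 16, 4), usize (63, 8, 16, 4));     /* 12, 516 */
  /* A 100-byte request becomes a 112-byte chunk on 64-bit -> bin 5.  */
  printf ("64-bit: 112-byte chunk -> bin %zu\n", tidx (112, 16, 32));
  return 0;
}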
/*
REALLOC_ZERO_BYTES_FREES should be set if a call to
@@ -1712,6 +1736,17 @@ struct malloc_par
/* First address handed out by MORECORE/sbrk. */
char *sbrk_base;
+
+#if USE_TCACHE
+ /* Maximum number of buckets to use. */
+ size_t tcache_bins;
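+  /* Maximum size (in bytes) of a request that may be met from the cache.  */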
+ size_t tcache_max_bytes;
+ /* Maximum number of chunks in each bucket. */
+ size_t tcache_count;
+  /* Maximum number of chunks to process from the unsorted list while
+     pre-filling the cache; 0 means no limit.  */
+ size_t tcache_unsorted_limit;
+#endif
};
/* There are several instances of this struct ("arenas") in this
@@ -1750,6 +1785,13 @@ static struct malloc_par mp_ =
.trim_threshold = DEFAULT_TRIM_THRESHOLD,
#define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
.arena_test = NARENAS_FROM_NCORES (1)
+#if USE_TCACHE
+ ,
+ .tcache_count = TCACHE_FILL_COUNT,
+ .tcache_bins = TCACHE_MAX_BINS,
+ .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
+ .tcache_unsorted_limit = 0 /* No limit. */
+#endif
};
/* Maximum size of memory handled in fastbins. */
@@ -2875,6 +2917,124 @@ mremap_chunk (mchunkptr p, size_t new_si
/*------------------------ Public wrappers. --------------------------------*/
+#if USE_TCACHE
+
+/* We overlay this structure on the user-data portion of a chunk when
+ the chunk is stored in the per-thread cache. */
+typedef struct tcache_entry
+{
+ struct tcache_entry *next;
+} tcache_entry;
+
+/* There is one of these for each thread, which contains the
+   per-thread cache (hence "tcache_perthread_struct").  Keeping
+   overall size low is mildly important.  Note that COUNTS is
+   redundant with ENTRIES (we could just count the linked list each
+   time); it is kept separately for performance.  */
+typedef struct tcache_perthread_struct
+{
+ char counts[TCACHE_MAX_BINS];
+ tcache_entry *entries[TCACHE_MAX_BINS];
+} tcache_perthread_struct;
+
+static __thread char tcache_shutting_down = 0;
+static __thread tcache_perthread_struct *tcache = NULL;
+
+/* Caller must ensure that tc_idx is valid and that there's room
+   for more chunks.  */
+static void
+tcache_put (mchunkptr chunk, size_t tc_idx)
+{
+ tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
+ assert (tc_idx < TCACHE_MAX_BINS);
+ e->next = tcache->entries[tc_idx];
+ tcache->entries[tc_idx] = e;
+ ++(tcache->counts[tc_idx]);
+}
+
+/* Caller must ensure that tc_idx is valid and that there are
+   chunks available to remove.  */
+static void *
+tcache_get (size_t tc_idx)
+{
+ tcache_entry *e = tcache->entries[tc_idx];
+ assert (tc_idx < TCACHE_MAX_BINS);
+  assert (tcache->entries[tc_idx] != NULL);
+ tcache->entries[tc_idx] = e->next;
+ --(tcache->counts[tc_idx]);
+ return (void *) e;
+}
+
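Not part of the patch either, just an illustration of the overlay trick: the
first word of a cached chunk's user data doubles as the "next" link, so the
cache needs no storage of its own beyond the per-bin head pointer and count.
The same idea as a standalone toy, using ordinary malloc'd blocks instead of
malloc's own chunks:

#include <assert.h>
#include <stdlib.h>

/* Same shape as tcache_entry: the link lives inside the cached block.  */
struct entry { struct entry *next; };

static struct entry *head;  /* one bin's list head */
static int count;           /* that bin's chunk count */

static void
cache_put (void *block)
{
  struct entry *e = block;  /* reuse the block's first word as the link */
  e->next = head;
  head = e;
  ++count;
}

static void *
cache_get (void)
{
  struct entry *e = head;
  head = e->next;
  --count;
  return e;
}

int
main (void)
{
  void *a = malloc (32);
  void *b = malloc (32);
  if (a == NULL || b == NULL)
    return 1;
  cache_put (a);
  cache_put (b);
  assert (cache_get () == b);  /* LIFO: last put, first taken */
  assert (cache_get () == a);
  assert (count == 0);
  free (a);
  free (b);
  return 0;
}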
+static void __attribute__ ((section ("__libc_thread_freeres_fn")))
+tcache_thread_freeres (void)
+{
+ int i;
+ tcache_perthread_struct *tcache_tmp = tcache;
+
+ if (!tcache)
+ return;
+
+ tcache = NULL;
+
+ for (i = 0; i < TCACHE_MAX_BINS; ++i)
+ {
+ while (tcache_tmp->entries[i])
+ {
+ tcache_entry *e = tcache_tmp->entries[i];
+ tcache_tmp->entries[i] = e->next;
+ __libc_free (e);
+ }
+ }
+
+ __libc_free (tcache_tmp);
+
+ tcache_shutting_down = 1;
+}
+text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
+
+static void
+tcache_init(void)
+{
+ mstate ar_ptr;
+ void *victim = 0;
+ const size_t bytes = sizeof (tcache_perthread_struct);
+
+ if (tcache_shutting_down)
+ return;
+
+ arena_get (ar_ptr, bytes);
+ victim = _int_malloc (ar_ptr, bytes);
+ if (!victim && ar_ptr != NULL)
+ {
+ ar_ptr = arena_get_retry (ar_ptr, bytes);
+ victim = _int_malloc (ar_ptr, bytes);
+ }
+
+
+ if (ar_ptr != NULL)
+ __libc_lock_unlock (ar_ptr->mutex);
+
+ /* In a low memory situation, we may not be able to allocate memory
+ - in which case, we just keep trying later. However, we
+ typically do this very early, so either there is sufficient
+ memory, or there isn't enough memory to do non-trivial
+ allocations anyway. */
+ if (victim)
+ {
+ tcache = (tcache_perthread_struct *) victim;
+ memset (tcache, 0, sizeof (tcache_perthread_struct));
+ }
+
+}
+
+#define MAYBE_INIT_TCACHE() \
+ if (__glibc_unlikely (tcache == NULL)) \
+ tcache_init();
+
+#else
+#define MAYBE_INIT_TCACHE()
+#endif
+
void *
__libc_malloc (size_t bytes)
{
@@ -2885,6 +3045,21 @@ __libc_malloc (size_t bytes)
= atomic_forced_read (__malloc_hook);
if (__builtin_expect (hook != NULL, 0))
return (*hook)(bytes, RETURN_ADDRESS (0));
+#if USE_TCACHE
+  /* _int_free also calls request2size, so be careful not to pad twice.  */
+ size_t tbytes = request2size (bytes);
+ size_t tc_idx = csize2tidx (tbytes);
+
+ MAYBE_INIT_TCACHE ();
+
+ if (tc_idx < mp_.tcache_bins
+ && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
+ && tcache
+ && tcache->entries[tc_idx] != NULL)
+ {
+ return tcache_get (tc_idx);
+ }
+#endif
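
Again only an illustration, not patch content: the visible effect of this fast
path is that a free followed by a same-sized malloc on the same thread is
normally served straight from the cache, without taking an arena lock.  A quick
way to watch for it (address reuse here is an observation, not a guarantee):

#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  void *a = malloc (100);
  free (a);
  /* With the tcache enabled, the block just freed usually comes
     straight back for a same-sized request on the same thread.  */
  void *b = malloc (100);
  printf ("a=%p b=%p%s\n", a, b, a == b ? " (reused)" : "");
  free (b);
  return 0;
}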
arena_get (ar_ptr, bytes);
@@ -2944,6 +3119,8 @@ __libc_free (void *mem)
return;
}
+ MAYBE_INIT_TCACHE ();
+
ar_ptr = arena_for_chunk (p);
_int_free (ar_ptr, p, 0);
}
@@ -2981,7 +3158,10 @@ __libc_realloc (void *oldmem, size_t byt
if (chunk_is_mmapped (oldp))
ar_ptr = NULL;
else
- ar_ptr = arena_for_chunk (oldp);
+ {
+ MAYBE_INIT_TCACHE ();
+ ar_ptr = arena_for_chunk (oldp);
+ }
/* Little security check which won't hurt performance: the allocator
never wrapps around at the end of the address space. Therefore
@@ -3206,6 +3386,8 @@ __libc_calloc (size_t n, size_t elem_siz
sz = bytes;
+ MAYBE_INIT_TCACHE ();
+
arena_get (av, sz);
if (av)
{
@@ -3336,6 +3518,10 @@ _int_malloc (mstate av, size_t bytes)
mchunkptr fwd; /* misc temp for linking */
mchunkptr bck; /* misc temp for linking */
+#if USE_TCACHE
+ size_t tcache_unsorted_count; /* count of unsorted chunks processed */
+#endif
+
const char *errstr = NULL;
/*
@@ -3365,19 +3551,22 @@ _int_malloc (mstate av, size_t bytes)
can try it without checking, which saves some time on this fast path.
*/
+#define REMOVE_FB(fb, victim, pp) \
+ do \
+ { \
+ victim = pp; \
+ if (victim == NULL) \
+ break; \
+ } \
+ while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
+ != victim); \
+
if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
{
idx = fastbin_index (nb);
mfastbinptr *fb = &fastbin (av, idx);
mchunkptr pp = *fb;
- do
- {
- victim = pp;
- if (victim == NULL)
- break;
- }
- while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
- != victim);
+ REMOVE_FB (fb, victim, pp);
if (victim != 0)
{
if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
@@ -3388,6 +3577,26 @@ _int_malloc (mstate av, size_t bytes)
return NULL;
}
check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+ /* While we're here, if we see other chunks of the same size,
+ stash them in the tcache. */
+ size_t tc_idx = csize2tidx (nb);
+ if (tcache && tc_idx < mp_.tcache_bins)
+ {
+ mchunkptr tc_victim;
+
+ /* While bin not empty and tcache not full, copy chunks over. */
+ while (tcache->counts[tc_idx] < mp_.tcache_count
+ && (pp = *fb) != NULL)
+ {
+ REMOVE_FB (fb, tc_victim, pp);
+ if (tc_victim != 0)
+ {
+ tcache_put (tc_victim, tc_idx);
+ }
+ }
+ }
+#endif
void *p = chunk2mem (victim);
alloc_perturb (p, bytes);
return p;
@@ -3426,6 +3635,32 @@ _int_malloc (mstate av, size_t bytes)
if (av != &main_arena)
set_non_main_arena (victim);
check_malloced_chunk (av, victim, nb);
+#if USE_TCACHE
+ /* While we're here, if we see other chunks of the same size,
+ stash them in the tcache. */
+ size_t tc_idx = csize2tidx (nb);
+ if (tcache && tc_idx < mp_.tcache_bins)
+ {
+ mchunkptr tc_victim;
+
+ /* While bin not empty and tcache not full, copy chunks over. */
+ while (tcache->counts[tc_idx] < mp_.tcache_count
+ && (tc_victim = last (bin)) != bin)
+ {
+ if (tc_victim != 0)
+ {
+ bck = tc_victim->bk;
+ set_inuse_bit_at_offset (tc_victim, nb);
+ if (av != &main_arena)
+ set_non_main_arena (tc_victim);
+ bin->bk = bck;
+ bck->fd = bin;
+
+ tcache_put (tc_victim, tc_idx);
+ }
+ }
+ }
+#endif
void *p = chunk2mem (victim);
alloc_perturb (p, bytes);
return p;
@@ -3464,6 +3699,16 @@ _int_malloc (mstate av, size_t bytes)
otherwise need to expand memory to service a "small" request.
*/
+#if USE_TCACHE
+ INTERNAL_SIZE_T tcache_nb = 0;
+ size_t tc_idx = csize2tidx (nb);
+ if (tcache && tc_idx < mp_.tcache_bins)
+ tcache_nb = nb;
+ int return_cached = 0;
+
+ tcache_unsorted_count = 0;
+#endif
+
for (;; )
{
int iters = 0;
@@ -3524,10 +3769,26 @@ _int_malloc (mstate av, size_t bytes)
set_inuse_bit_at_offset (victim, size);
if (av != &main_arena)
set_non_main_arena (victim);
+#if USE_TCACHE
+ /* Fill cache first, return to user only if cache fills.
+ We may return one of these chunks later. */
+ if (tcache_nb
+ && tcache->counts[tc_idx] < mp_.tcache_count)
+ {
+ tcache_put (victim, tc_idx);
+ return_cached = 1;
+ continue;
+ }
+ else
+ {
+#endif
check_malloced_chunk (av, victim, nb);
void *p = chunk2mem (victim);
alloc_perturb (p, bytes);
return p;
+#if USE_TCACHE
+ }
+#endif
}
/* place chunk in bin */
@@ -3594,11 +3855,31 @@ _int_malloc (mstate av, size_t bytes)
fwd->bk = victim;
bck->fd = victim;
+#if USE_TCACHE
+ /* If we've processed as many chunks as we're allowed while
+ filling the cache, return one of the cached ones. */
+ ++tcache_unsorted_count;
+ if (return_cached
+ && mp_.tcache_unsorted_limit > 0
+ && tcache_unsorted_count > mp_.tcache_unsorted_limit)
+ {
+ return tcache_get (tc_idx);
+ }
+#endif
+
#define MAX_ITERS 10000
if (++iters >= MAX_ITERS)
break;
}
+#if USE_TCACHE
+ /* If all the small chunks we found ended up cached, return one now. */
+ if (return_cached)
+ {
+ return tcache_get (tc_idx);
+ }
+#endif
+
/*
If a large request, scan through the chunks of current bin in
sorted order to find smallest that fits. Use the skip list for this.
@@ -3884,6 +4165,20 @@ _int_free (mstate av, mchunkptr p, int h
check_inuse_chunk(av, p);
+#if USE_TCACHE
+ {
+ size_t tc_idx = csize2tidx (size);
+
+ if (tcache
+ && tc_idx < mp_.tcache_bins
+ && tcache->counts[tc_idx] < mp_.tcache_count)
+ {
+ tcache_put (p, tc_idx);
+ return;
+ }
+ }
+#endif
+
/*
If eligible, place chunk on a fastbin so it can be found
and used quickly in malloc.
@@ -4845,6 +5140,38 @@ do_set_arena_max (size_t value)
return 1;
}
+#if USE_TCACHE
+static inline int
+__always_inline
+do_set_tcache_max (size_t value)
+{
+  if (value <= MAX_TCACHE_SIZE)
+ {
+ LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
+ mp_.tcache_max_bytes = value;
+ mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
+ }
+ return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_count (size_t value)
+{
+ LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
+ mp_.tcache_count = value;
+ return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_unsorted_limit (size_t value)
+{
+ LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
+ mp_.tcache_unsorted_limit = value;
+ return 1;
+}
+#endif
int
__libc_mallopt (int param_number, int value)
@@ -231,6 +231,25 @@ dynamic brk/mmap thresholds. Argument @
the adjusted mmap and trim thresholds, respectively.
@end deftp
+@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the @code{glibc.malloc.tcache_max}
+tunable is set. Argument @var{$arg1} is the requested value, and
+@var{$arg2} is the previous value of this tunable.
+@end deftp
+
+@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the @code{glibc.malloc.tcache_count}
+tunable is set. Argument @var{$arg1} is the requested value, and
+@var{$arg2} is the previous value of this tunable.
+@end deftp
+
+@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the
+@code{glibc.malloc.tcache_unsorted_limit} tunable is set. Argument
+@var{$arg1} is the requested value, and @var{$arg2} is the previous
+value of this tunable.
+@end deftp
+
@node Mathematical Function Probes
@section Mathematical Function Probes
@@ -193,6 +193,39 @@ systems the limit is twice the number of
is 8 times the number of cores online.
@end deftp
+@deftp Tunable glibc.malloc.tcache_max
+The maximum size of a request (in bytes) which may be met via the
+per-thread cache. The default (and maximum) value is 1032 bytes on
+64-bit systems and 516 bytes on 32-bit systems.
+@end deftp
+
+@deftp Tunable glibc.malloc.tcache_count
+The maximum number of chunks of each size to cache. The default is 7.
+There is no upper limit, other than available system memory. Note
+that chunks are rounded up to malloc's guaranteed alignment; this
+count is per rounded size, not per user-provided size.
+
+The approximate maximum overhead of the per-thread cache (incurred
+once per thread) is thus the number of bins in use (at most 64, as
+determined by @code{glibc.malloc.tcache_max}) times
+@code{glibc.malloc.tcache_count} times the size of each bin.  With
+the defaults, this is about 236 KB on 64-bit systems and 118 KB on
+32-bit systems.
+@end deftp
+
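For reference (not patch content), the 236 KB / 118 KB figures come from
summing the largest usable size of each of the 64 bins and multiplying by the
default count of 7; the per-bin sizes below are the usual 64-bit and 32-bit
values, assumed here rather than read from the library:

#include <stdio.h>

int
main (void)
{
  size_t total64 = 0, total32 = 0;
  for (size_t idx = 0; idx < 64; ++idx)
    {
      /* Largest usable size in bin IDX, times the default count of 7.  */
      total64 += 7 * (idx * 16 + 24);  /* 64-bit: 16-byte steps from 24 */
      total32 += 7 * (idx * 8 + 12);   /* 32-bit: 8-byte steps from 12 */
    }
  printf ("64-bit: %zu bytes\n", total64);  /* 236544 */
  printf ("32-bit: %zu bytes\n", total32);  /* 118272 */
  return 0;
}

That prints 236544 and 118272 bytes, which is roughly where the figures above
come from.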
+@deftp Tunable glibc.malloc.tcache_unsorted_limit
+When the user requests memory and the request cannot be met via the
+per-thread cache, the arenas are used to meet the request. At this
+time, additional chunks will be moved from existing arena lists to
+pre-fill the corresponding cache. While copies from the fastbins,
+smallbins, and regular bins are bounded and predictable due to the bin
+sizes, copies from the unsorted bin are not bounded, and incur
+additional time penalties as they need to be sorted as they're
+scanned. To make scanning the unsorted list more predictable and
+bounded, the user may set this tunable to limit the number of blocks
+that are scanned from the unsorted list while searching for chunks to
+pre-fill the per-thread cache with. The default, or when set to zero,
+is no limit.
+@end deftp
+
@node Hardware Capability Tunables
@section Hardware Capability Tunables
@cindex hardware capability tunables