[updated] malloc per-thread cache ready for review

Message ID xnpodpxiv9.fsf@greed.delorie.com
State Committed

Commit Message

DJ Delorie June 27, 2017, 8:42 p.m. UTC
  > ChangeLog?

Included.

> > +    tcache_unsorted_limit {
> > +      type: SIZE_T
> > +      security_level: SXID_IGNORE
> > +    }
> 
> Why does this need to be inherited by children of setxid processes?  I
> guess the fact that the remaining malloc tunables are is one probable
> reason, but those also need to be reviewed and brought down to
> SXID_ERASE if necessary.

IIRC Florian asked for that setting; I don't know specifically what the
difference is but if you two want to fight it out, I'll get some popcorn ;-)

> This needs to use the new TUNABLE_CALLBACK_FNDECL.

Fixed.  My git repo has been munging my merges; it kept reverting that
trunk DL_* patch :-(

> This needs to use the new TUNABLE_GET macro.

Likewise.

> > +/* We overlay this structure on the user-data portion of a chunk when
> > +   the chunk is stored in the per-thread cache.  */
> > +typedef struct tcache_entry {
> 
> Incorrect formatting.

You weren't specific here, so I had to guess which formatting you were
referring to, as the style I used matches code elsewhere in malloc...

> > +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
> 
> __attribute__ should be on its own line.

Likewise, I was copying the formatting used elsewhere in malloc.  What
is the correct formatting for that line?  Do I need to fix the other
existing instances also?

> > +  for (i = 0; i < TCACHE_MAX_BINS; ++i) {
> > +    while (tcache_tmp->entries[i])
> > +      {
> > +	tcache_entry *e = tcache_tmp->entries[i];
> > +	tcache_tmp->entries[i] = e->next;
> > +	__libc_free (e);
> > +      }
> 
> Incorrect formatting.

Not sure what formatting you're referring to here, unless it's the tabs
being messed up by the "> >"...

> Avoid implicit bool conversion for victim.

Fixed.

> Indent 1 space.

Fixed.

> Implicit boolean conversion.

Fixed.

> Unnecessary braces.

Removed, although there are pre-existing cases of that throughout the
code... at least, when the "if" clause is multi-line.

> > +  if (value >= 0 && value <= MAX_TCACHE_SIZE)
> 
> This validation should be done in tunables by setting the minval and
> maxval values in dl-tunables.list.

I don't see how I can put MAX_TCACHE_SIZE in dl-tunables.list.  Should I
hard-code the default value, and also include the test here?  Or is
there no other use of min/max in dl-tunables.list?

Revised patch attached.

2017-06-27  DJ Delorie  <dj@delorie.com>

	* config.make.in: Enable experimental malloc option.
	* configure.ac: Likewise.
	* configure: Regenerate.
	* malloc/Makefile: Likewise.
	* malloc/malloc.c: Add per-thread cache (tcache).
	(tcache_put): New.
	(tcache_get): New.
	(tcache_thread_freeres): New.
	(tcache_init): New.
	(__libc_malloc): Use cached chunks if available.
	(__libc_free): Initialize tcache if needed.
	(__libc_realloc): Likewise.
	(__libc_calloc): Likewise.
	(_int_malloc): Prefill tcache when appropriate.
	(_int_free): Likewise.
	(do_set_tcache_max): New.
	(do_set_tcache_count): New.
	(do_set_tcache_unsorted_limit): New.
	* manual/probes.texi: Document new probes.
	* malloc/arena.c: Add new tcache tunables.
	* elf/dl-tunables.list: Likewise.
	* manual/tunables.texi: Document them.
  

Comments

Siddhesh Poyarekar June 28, 2017, 7:28 a.m. UTC | #1
On Wednesday 28 June 2017 02:12 AM, DJ Delorie wrote:
>> Why does this need to be inherited by children of setxid processes?  I
>> guess the fact that the remaining malloc tunables are is one probable
>> reason, but those also need to be reviewed and brought down to
>> SXID_ERASE if necessary.
> 
> IIRC Florian asked for that setting; I don't know specifically what the
> difference is but if you two want to fight it out, I'll get some popcorn ;-)

haha, interesting, because I remember Florian and I had this discussion
about the utility of SXID_IGNORE six months ago when I designed the
scope control for the envvars.  Let's have Florian explain this.

>>> +/* We overlay this structure on the user-data portion of a chunk when
>>> +   the chunk is stored in the per-thread cache.  */
>>> +typedef struct tcache_entry {
>>
>> Incorrect formatting.
> 
> You weren't specific here, so I had to guess which formatting you were
> referring to, as the style I used matches code elsewhere in malloc...

The braces should be on the next line:

typedef struct tcache_entry
{
...
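
Applied to the full definition in the patch, that would presumably be:

typedef struct tcache_entry
{
  /* Same single member as in the patch; only the brace placement changes.  */
  struct tcache_entry *next;
} tcache_entry;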


>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>
>> __attribute__ should be on its own line.
> 
> Likewise, I was copying the formatting used elsewhere in malloc.  What
> is the correct formatting for that line?  Do I need to fix the other
> existing instances also?

Sure, but in a separate (obvious) patch.  The malloc code was in a
completely different format when it was ported in from dlmalloc and
since then it has been a bit of a mishmash that needs a cleanup.
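
As for the formatting question itself, the comment just meant putting the
attribute on a line by itself, i.e. something like:

/* Declaration split so that the attribute sits on its own line.  */
static void
__attribute__ ((section ("__libc_thread_freeres_fn")))
tcache_thread_freeres (void)

with the body of the function unchanged.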

>>> +  for (i = 0; i < TCACHE_MAX_BINS; ++i) {
>>> +    while (tcache_tmp->entries[i])
>>> +      {
>>> +	tcache_entry *e = tcache_tmp->entries[i];
>>> +	tcache_tmp->entries[i] = e->next;
>>> +	__libc_free (e);
>>> +      }
>>
>> Incorrect formatting.
> 
> Not sure what formatting you're referring to here, unless it's the tabs
> being messed up by the "> >"...

The opening brace for the for loop should be on the next line, and
everything else indented accordingly.
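
With the code from the patch, that would be something like:

  for (i = 0; i < TCACHE_MAX_BINS; ++i)
    {
      /* Same loop body as in the patch, just re-indented in GNU style.  */
      while (tcache_tmp->entries[i])
        {
          tcache_entry *e = tcache_tmp->entries[i];
          tcache_tmp->entries[i] = e->next;
          __libc_free (e);
        }
    }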

>> This validation should be done in tunables by setting the minval and
>> maxval values in dl-tunables.list.
> 
> I don't see how I can put MAX_TCACHE_SIZE in dl-tunables.list.  Should I
> hard-code the default value, and also include the test here?  Or is
> there no other other use of min/max in dl-tunables.list?

If you hardcode the value in dl-tunables.list, you won't need the test
since __init_tunables will do that validation for you.  However, I
noticed that it is a function of a bunch of other macros and unless
those are constant across architectures it won't go into tunables as is.
I guess you can let it be as it is and I'll fix it up later.
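
For what it's worth, hardcoding it would mean something roughly like this
in dl-tunables.list, where 1032 is the 64-bit MAX_TCACHE_SIZE from the
manual entry in the patch and would be wrong on 32-bit, which is exactly
the cross-architecture problem:

    tcache_max {
      type: SIZE_T
      minval: 0
      maxval: 1032
    }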

I think this is OK with the couple of formatting fixes above and with
the SXID_IGNORE removed, which we can then discuss and add back if
necessary as an add-on change.

Thanks,
Siddhesh

Florian Weimer June 28, 2017, 7:45 a.m. UTC | #2
On 06/28/2017 09:28 AM, Siddhesh Poyarekar wrote:
> On Wednesday 28 June 2017 02:12 AM, DJ Delorie wrote:
>>> Why does this need to be inherited by children of setxid processes?  I
>>> guess the fact that the remaining malloc tunables are is one probable
>>> reason, but those also need to be reviewed and brought down to
>>> SXID_ERASE if necessary.
>>
>> IIRC Florian asked for that setting; I don't know specifically what the
>> difference is but if you two want to fight it out, I'll get some popcorn ;-)
> 
> haha, interesting, because I remember Florian and I had this discussion
> about the utility of SXID_IGNORE six months ago when I designed the
> scope control for the envvars.  Lets have Florian explain this.

I don't recall discussing this.  I think we should use SXID_ERASE for
new tunables, and revisit that decision only when necessary.

Thanks,
Florian
  

Patch

diff -x .git -x po -r -U3 -p glibc.pristine/config.make.in glibc.djmalloc-tcache/config.make.in
--- glibc.pristine/config.make.in	2017-06-20 12:54:22.636484566 -0400
+++ glibc.djmalloc-tcache/config.make.in	2017-06-26 17:32:41.164888977 -0400
@@ -78,6 +78,8 @@  multi-arch = @multi_arch@
 
 mach-interface-list = @mach_interface_list@
 
+experimental-malloc = @experimental_malloc@
+
 nss-crypt = @libc_cv_nss_crypt@
 static-nss-crypt = @libc_cv_static_nss_crypt@
 
diff -x .git -x po -r -U3 -p glibc.pristine/configure glibc.djmalloc-tcache/configure
--- glibc.pristine/configure	2017-06-20 12:54:22.637484566 -0400
+++ glibc.djmalloc-tcache/configure	2017-06-26 17:32:41.163888976 -0400
@@ -674,6 +674,7 @@  build_obsolete_nsl
 link_obsolete_rpc
 libc_cv_static_nss_crypt
 libc_cv_nss_crypt
+experimental_malloc
 enable_werror
 all_warnings
 force_install
@@ -779,6 +780,7 @@  enable_kernel
 enable_all_warnings
 enable_werror
 enable_multi_arch
+enable_experimental_malloc
 enable_nss_crypt
 enable_obsolete_rpc
 enable_obsolete_nsl
@@ -1450,6 +1452,8 @@  Optional Features:
   --disable-werror        do not build with -Werror
   --enable-multi-arch     enable single DSO with optimizations for multiple
                           architectures
+  --disable-experimental-malloc
+                          disable experimental malloc features
   --enable-nss-crypt      enable libcrypt to use nss
   --enable-obsolete-rpc   build and install the obsolete RPC code for
                           link-time usage
@@ -3522,6 +3526,15 @@  else
 fi
 
 
+# Check whether --enable-experimental-malloc was given.
+if test "${enable_experimental_malloc+set}" = set; then :
+  enableval=$enable_experimental_malloc; experimental_malloc=$enableval
+else
+  experimental_malloc=yes
+fi
+
+
+
 # Check whether --enable-nss-crypt was given.
 if test "${enable_nss_crypt+set}" = set; then :
   enableval=$enable_nss_crypt; nss_crypt=$enableval
diff -x .git -x po -r -U3 -p glibc.pristine/configure.ac glibc.djmalloc-tcache/configure.ac
--- glibc.pristine/configure.ac	2017-06-20 12:54:22.638484567 -0400
+++ glibc.djmalloc-tcache/configure.ac	2017-06-26 17:32:41.155888970 -0400
@@ -313,6 +313,13 @@  AC_ARG_ENABLE([multi-arch],
 	      [multi_arch=$enableval],
 	      [multi_arch=default])
 
+AC_ARG_ENABLE([experimental-malloc],
+	      AC_HELP_STRING([--disable-experimental-malloc],
+			     [disable experimental malloc features]),
+	      [experimental_malloc=$enableval],
+	      [experimental_malloc=yes])
+AC_SUBST(experimental_malloc)
+
 AC_ARG_ENABLE([nss-crypt],
 	      AC_HELP_STRING([--enable-nss-crypt],
 			     [enable libcrypt to use nss]),
diff -x .git -x po -r -U3 -p glibc.pristine/elf/dl-tunables.list glibc.djmalloc-tcache/elf/dl-tunables.list
--- glibc.pristine/elf/dl-tunables.list	2017-06-23 20:49:10.174148477 -0400
+++ glibc.djmalloc-tcache/elf/dl-tunables.list	2017-06-26 17:32:41.154888970 -0400
@@ -76,6 +76,18 @@  glibc {
       minval: 1
       security_level: SXID_IGNORE
     }
+    tcache_max {
+      type: SIZE_T
+      security_level: SXID_IGNORE
+    }
+    tcache_count {
+      type: SIZE_T
+      security_level: SXID_IGNORE
+    }
+    tcache_unsorted_limit {
+      type: SIZE_T
+      security_level: SXID_IGNORE
+    }
   }
   tune {
     hwcap_mask {
diff -x .git -x po -r -U3 -p glibc.pristine/malloc/Makefile glibc.djmalloc-tcache/malloc/Makefile
--- glibc.pristine/malloc/Makefile	2017-06-23 20:49:10.199148493 -0400
+++ glibc.djmalloc-tcache/malloc/Makefile	2017-06-26 17:32:41.153888968 -0400
@@ -189,6 +189,11 @@  tst-malloc-usable-static-ENV = $(tst-mal
 tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
 tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
 
+ifeq ($(experimental-malloc),yes)
+CPPFLAGS-malloc.c += -DUSE_TCACHE=1
+else
+CPPFLAGS-malloc.c += -DUSE_TCACHE=0
+endif
 # Uncomment this for test releases.  For public releases it is too expensive.
 #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
 
diff -x .git -x po -r -U3 -p glibc.pristine/malloc/arena.c glibc.djmalloc-tcache/malloc/arena.c
--- glibc.pristine/malloc/arena.c	2017-06-20 12:54:22.645484573 -0400
+++ glibc.djmalloc-tcache/malloc/arena.c	2017-06-26 17:49:33.701608305 -0400
@@ -236,6 +236,11 @@  TUNABLE_CALLBACK_FNDECL (set_perturb_byt
 TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
 TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
 TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
+#if USE_TCACHE
+TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
+TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
+TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
+#endif
 #else
 /* Initialization routine. */
 #include <string.h>
@@ -322,6 +327,12 @@  ptmalloc_init (void)
   TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max));
   TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max));
   TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test));
+#if USE_TCACHE
+  TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));
+  TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count));
+  TUNABLE_GET (tcache_unsorted_limit, size_t,
+	       TUNABLE_CALLBACK (set_tcache_unsorted_limit));
+#endif
   __libc_lock_unlock (main_arena.mutex);
 #else
   const char *s = NULL;
diff -x .git -x po -r -U3 -p glibc.pristine/malloc/malloc.c glibc.djmalloc-tcache/malloc/malloc.c
--- glibc.pristine/malloc/malloc.c	2017-05-03 16:26:47.559839224 -0400
+++ glibc.djmalloc-tcache/malloc/malloc.c	2017-06-26 17:32:41.151888967 -0400
@@ -296,6 +296,30 @@  __malloc_assert (const char *assertion,
 }
 #endif
 
+#if USE_TCACHE
+/* We want 64 entries.  This is an arbitrary limit, which tunables can reduce.  */
+# define TCACHE_MAX_BINS		64
+# define MAX_TCACHE_SIZE	tidx2usize (TCACHE_MAX_BINS-1)
+
+/* Only used to pre-fill the tunables.  */
+# define tidx2usize(idx)	(((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
+
+/* When "x" is from chunksize().  */
+# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
+/* When "x" is a user-provided size.  */
+# define usize2tidx(x) csize2tidx (request2size (x))
+
+/* With rounding and alignment, the bins are...
+   idx 0   bytes 0..24 (64-bit) or 0..12 (32-bit)
+   idx 1   bytes 25..40 or 13..20
+   idx 2   bytes 41..56 or 21..28
+   etc.  */
+
+/* This is another arbitrary limit, which tunables can change.  Each
+   tcache bin will hold at most this number of chunks.  */
+# define TCACHE_FILL_COUNT 7
+#endif
+
 
 /*
   REALLOC_ZERO_BYTES_FREES should be set if a call to
@@ -1712,6 +1736,17 @@  struct malloc_par
 
   /* First address handed out by MORECORE/sbrk.  */
   char *sbrk_base;
+
+#if USE_TCACHE
+  /* Maximum number of buckets to use.  */
+  size_t tcache_bins;
+  size_t tcache_max_bytes;
+  /* Maximum number of chunks in each bucket.  */
+  size_t tcache_count;
+  /* Maximum number of chunks to remove from the unsorted list, which
+     aren't used to prefill the cache.  */
+  size_t tcache_unsorted_limit;
+#endif
 };
 
 /* There are several instances of this struct ("arenas") in this
@@ -1750,6 +1785,13 @@  static struct malloc_par mp_ =
   .trim_threshold = DEFAULT_TRIM_THRESHOLD,
 #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
   .arena_test = NARENAS_FROM_NCORES (1)
+#if USE_TCACHE
+  ,
+  .tcache_count = TCACHE_FILL_COUNT,
+  .tcache_bins = TCACHE_MAX_BINS,
+  .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
+  .tcache_unsorted_limit = 0 /* No limit.  */
+#endif
 };
 
 /* Maximum size of memory handled in fastbins.  */
@@ -2875,6 +2917,121 @@  mremap_chunk (mchunkptr p, size_t new_si
 
 /*------------------------ Public wrappers. --------------------------------*/
 
+#if USE_TCACHE
+
+/* We overlay this structure on the user-data portion of a chunk when
+   the chunk is stored in the per-thread cache.  */
+typedef struct tcache_entry {
+  struct tcache_entry *next;
+} tcache_entry;
+
+/* There is one of these for each thread, which contains the
+   per-thread cache (hence "tcache_perthread_struct").  Keeping
+   overall size low is mildly important.  Note that COUNTS and ENTRIES
+   are redundant (we could have just counted the linked list each
+   time), this is for performance reasons.  */
+typedef struct tcache_perthread_struct {
+  char counts[TCACHE_MAX_BINS];
+  tcache_entry *entries[TCACHE_MAX_BINS];
+} tcache_perthread_struct;
+
+static __thread char tcache_shutting_down = 0;
+static __thread tcache_perthread_struct *tcache = NULL;
+
+/* Caller must ensure that we know tc_idx is valid and there's room
+   for more chunks.  */
+static void
+tcache_put (mchunkptr chunk, size_t tc_idx)
+{
+  tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
+  assert (tc_idx < TCACHE_MAX_BINS);
+  e->next = tcache->entries[tc_idx];
+  tcache->entries[tc_idx] = e;
+  ++(tcache->counts[tc_idx]);
+}
+
+/* Caller must ensure that we know tc_idx is valid and there's
+   available chunks to remove.  */
+static void *
+tcache_get (size_t tc_idx)
+{
+  tcache_entry *e = tcache->entries[tc_idx];
+  assert (tc_idx < TCACHE_MAX_BINS);
+  assert (tcache->entries[tc_idx] > 0);
+  tcache->entries[tc_idx] = e->next;
+  --(tcache->counts[tc_idx]);
+  return (void *) e;
+}
+
+static void __attribute__ ((section ("__libc_thread_freeres_fn")))
+tcache_thread_freeres (void)
+{
+  int i;
+  tcache_perthread_struct *tcache_tmp = tcache;
+
+  if (!tcache)
+    return;
+
+  tcache = NULL;
+
+  for (i = 0; i < TCACHE_MAX_BINS; ++i) {
+    while (tcache_tmp->entries[i])
+      {
+	tcache_entry *e = tcache_tmp->entries[i];
+	tcache_tmp->entries[i] = e->next;
+	__libc_free (e);
+      }
+  }
+
+  __libc_free (tcache_tmp);
+
+  tcache_shutting_down = 1;
+}
+text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
+
+static void
+tcache_init(void)
+{
+  mstate ar_ptr;
+  void *victim = 0;
+  const size_t bytes = sizeof (tcache_perthread_struct);
+
+  if (tcache_shutting_down)
+    return;
+
+  arena_get (ar_ptr, bytes);
+  victim = _int_malloc (ar_ptr, bytes);
+  if (!victim && ar_ptr != NULL)
+    {
+      ar_ptr = arena_get_retry (ar_ptr, bytes);
+      victim = _int_malloc (ar_ptr, bytes);
+    }
+
+
+  if (ar_ptr != NULL)
+    __libc_lock_unlock (ar_ptr->mutex);
+
+  /* In a low memory situation, we may not be able to allocate memory
+     - in which case, we just keep trying later.  However, we
+     typically do this very early, so either there is sufficient
+     memory, or there isn't enough memory to do non-trivial
+     allocations anyway.  */
+  if (victim)
+    {
+      tcache = (tcache_perthread_struct *) victim;
+      memset (tcache, 0, sizeof (tcache_perthread_struct));
+    }
+
+}
+
+#define MAYBE_INIT_TCACHE() \
+  if (__glibc_unlikely (tcache == NULL)) \
+    tcache_init();
+
+#else
+#define MAYBE_INIT_TCACHE()
+#endif
+
 void *
 __libc_malloc (size_t bytes)
 {
@@ -2885,6 +3042,21 @@  __libc_malloc (size_t bytes)
     = atomic_forced_read (__malloc_hook);
   if (__builtin_expect (hook != NULL, 0))
     return (*hook)(bytes, RETURN_ADDRESS (0));
+#if USE_TCACHE
+  /* int_free also calls request2size, be careful to not pad twice.  */
+  size_t tbytes = request2size (bytes);
+  size_t tc_idx = csize2tidx (tbytes);
+
+  MAYBE_INIT_TCACHE ();
+
+  if (tc_idx < mp_.tcache_bins
+      && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
+      && tcache
+      && tcache->entries[tc_idx] != NULL)
+    {
+      return tcache_get (tc_idx);
+    }
+#endif
 
   arena_get (ar_ptr, bytes);
 
@@ -2944,6 +3116,8 @@  __libc_free (void *mem)
       return;
     }
 
+  MAYBE_INIT_TCACHE ();
+
   ar_ptr = arena_for_chunk (p);
   _int_free (ar_ptr, p, 0);
 }
@@ -2981,7 +3155,10 @@  __libc_realloc (void *oldmem, size_t byt
   if (chunk_is_mmapped (oldp))
     ar_ptr = NULL;
   else
-    ar_ptr = arena_for_chunk (oldp);
+    {
+      MAYBE_INIT_TCACHE ();
+      ar_ptr = arena_for_chunk (oldp);
+    }
 
   /* Little security check which won't hurt performance: the allocator
      never wrapps around at the end of the address space.  Therefore
@@ -3206,6 +3383,8 @@  __libc_calloc (size_t n, size_t elem_siz
 
   sz = bytes;
 
+  MAYBE_INIT_TCACHE ();
+
   arena_get (av, sz);
   if (av)
     {
@@ -3336,6 +3515,10 @@  _int_malloc (mstate av, size_t bytes)
   mchunkptr fwd;                    /* misc temp for linking */
   mchunkptr bck;                    /* misc temp for linking */
 
+#if USE_TCACHE
+  size_t tcache_unsorted_count;	    /* count of unsorted chunks processed */
+#endif
+
   const char *errstr = NULL;
 
   /*
@@ -3365,19 +3548,22 @@  _int_malloc (mstate av, size_t bytes)
      can try it without checking, which saves some time on this fast path.
    */
 
+#define REMOVE_FB(fb, victim, pp)			\
+  do							\
+    {							\
+      victim = pp;					\
+      if (victim == NULL)				\
+	break;						\
+    }							\
+  while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
+	 != victim);					\
+
   if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
     {
       idx = fastbin_index (nb);
       mfastbinptr *fb = &fastbin (av, idx);
       mchunkptr pp = *fb;
-      do
-        {
-          victim = pp;
-          if (victim == NULL)
-            break;
-        }
-      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
-             != victim);
+      REMOVE_FB (fb, victim, pp);
       if (victim != 0)
         {
           if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
@@ -3388,6 +3574,26 @@  _int_malloc (mstate av, size_t bytes)
               return NULL;
             }
           check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	  /* While we're here, if we see other chunks of the same size,
+	     stash them in the tcache.  */
+	  size_t tc_idx = csize2tidx (nb);
+	  if (tcache && tc_idx < mp_.tcache_bins)
+	    {
+	      mchunkptr tc_victim;
+
+	      /* While bin not empty and tcache not full, copy chunks over.  */
+	      while (tcache->counts[tc_idx] < mp_.tcache_count
+		     && (pp = *fb) != NULL)
+		{
+		  REMOVE_FB (fb, tc_victim, pp);
+		  if (tc_victim != 0)
+		    {
+		      tcache_put (tc_victim, tc_idx);
+	            }
+		}
+	    }
+#endif
           void *p = chunk2mem (victim);
           alloc_perturb (p, bytes);
           return p;
@@ -3426,6 +3632,32 @@  _int_malloc (mstate av, size_t bytes)
               if (av != &main_arena)
 		set_non_main_arena (victim);
               check_malloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	  /* While we're here, if we see other chunks of the same size,
+	     stash them in the tcache.  */
+	  size_t tc_idx = csize2tidx (nb);
+	  if (tcache && tc_idx < mp_.tcache_bins)
+	    {
+	      mchunkptr tc_victim;
+
+	      /* While bin not empty and tcache not full, copy chunks over.  */
+	      while (tcache->counts[tc_idx] < mp_.tcache_count
+		     && (tc_victim = last (bin)) != bin)
+		{
+		  if (tc_victim != 0)
+		    {
+		      bck = tc_victim->bk;
+		      set_inuse_bit_at_offset (tc_victim, nb);
+		      if (av != &main_arena)
+			set_non_main_arena (tc_victim);
+		      bin->bk = bck;
+		      bck->fd = bin;
+
+		      tcache_put (tc_victim, tc_idx);
+	            }
+		}
+	    }
+#endif
               void *p = chunk2mem (victim);
               alloc_perturb (p, bytes);
               return p;
@@ -3464,6 +3696,16 @@  _int_malloc (mstate av, size_t bytes)
      otherwise need to expand memory to service a "small" request.
    */
 
+#if USE_TCACHE
+  INTERNAL_SIZE_T tcache_nb = 0;
+  size_t tc_idx = csize2tidx (nb);
+  if (tcache && tc_idx < mp_.tcache_bins)
+    tcache_nb = nb;
+  int return_cached = 0;
+
+  tcache_unsorted_count = 0;
+#endif
+
   for (;; )
     {
       int iters = 0;
@@ -3524,10 +3766,26 @@  _int_malloc (mstate av, size_t bytes)
               set_inuse_bit_at_offset (victim, size);
               if (av != &main_arena)
 		set_non_main_arena (victim);
+#if USE_TCACHE
+	      /* Fill cache first, return to user only if cache fills.
+		 We may return one of these chunks later.  */
+	      if (tcache_nb
+		  && tcache->counts[tc_idx] < mp_.tcache_count)
+		{
+		  tcache_put (victim, tc_idx);
+		  return_cached = 1;
+		  continue;
+		}
+	      else
+		{
+#endif
               check_malloced_chunk (av, victim, nb);
               void *p = chunk2mem (victim);
               alloc_perturb (p, bytes);
               return p;
+#if USE_TCACHE
+		}
+#endif
             }
 
           /* place chunk in bin */
@@ -3594,11 +3852,31 @@  _int_malloc (mstate av, size_t bytes)
           fwd->bk = victim;
           bck->fd = victim;
 
+#if USE_TCACHE
+      /* If we've processed as many chunks as we're allowed while
+	 filling the cache, return one of the cached ones.  */
+      ++tcache_unsorted_count;
+      if (return_cached
+	  && mp_.tcache_unsorted_limit > 0
+	  && tcache_unsorted_count > mp_.tcache_unsorted_limit)
+	{
+	  return tcache_get (tc_idx);
+	}
+#endif
+
 #define MAX_ITERS       10000
           if (++iters >= MAX_ITERS)
             break;
         }
 
+#if USE_TCACHE
+      /* If all the small chunks we found ended up cached, return one now.  */
+      if (return_cached)
+	{
+	  return tcache_get (tc_idx);
+	}
+#endif
+
       /*
          If a large request, scan through the chunks of current bin in
          sorted order to find smallest that fits.  Use the skip list for this.
@@ -3884,6 +4162,20 @@  _int_free (mstate av, mchunkptr p, int h
 
   check_inuse_chunk(av, p);
 
+#if USE_TCACHE
+  {
+    size_t tc_idx = csize2tidx (size);
+
+    if (tcache
+	&& tc_idx < mp_.tcache_bins
+	&& tcache->counts[tc_idx] < mp_.tcache_count)
+      {
+	tcache_put (p, tc_idx);
+	return;
+      }
+  }
+#endif
+
   /*
     If eligible, place chunk on a fastbin so it can be found
     and used quickly in malloc.
@@ -4845,6 +5137,38 @@  do_set_arena_max (size_t value)
   return 1;
 }
 
+#if USE_TCACHE
+static inline int
+__always_inline
+do_set_tcache_max (size_t value)
+{
+  if (value >= 0 && value <= MAX_TCACHE_SIZE)
+    {
+      LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
+      mp_.tcache_max_bytes = value;
+      mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
+    }
+  return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_count (size_t value)
+{
+  LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
+  mp_.tcache_count = value;
+  return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_unsorted_limit (size_t value)
+{
+  LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
+  mp_.tcache_unsorted_limit = value;
+  return 1;
+}
+#endif
 
 int
 __libc_mallopt (int param_number, int value)
diff -x .git -x po -r -U3 -p glibc.pristine/manual/probes.texi glibc.djmalloc-tcache/manual/probes.texi
--- glibc.pristine/manual/probes.texi	2017-01-10 17:02:06.305343587 -0500
+++ glibc.djmalloc-tcache/manual/probes.texi	2017-06-26 17:32:35.785884967 -0400
@@ -231,6 +231,25 @@  dynamic brk/mmap thresholds.  Argument @
 the adjusted mmap and trim thresholds, respectively.
 @end deftp
 
+@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the @code{glibc.malloc.tcache_max}
+tunable is set.  Argument @var{$arg1} is the requested value, and
+@var{$arg2} is the previous value of this tunable.
+@end deftp
+
+@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the @code{glibc.malloc.tcache_count}
+tunable is set.  Argument @var{$arg1} is the requested value, and
+@var{$arg2} is the previous value of this tunable.
+@end deftp
+
+@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the
+@code{glibc.malloc.tcache_unsorted_limit} tunable is set.  Argument
+@var{$arg1} is the requested value, and @var{$arg2} is the previous
+value of this tunable.
+@end deftp
+
 @node Mathematical Function Probes
 @section Mathematical Function Probes
 
diff -x .git -x po -r -U3 -p glibc.pristine/manual/tunables.texi glibc.djmalloc-tcache/manual/tunables.texi
--- glibc.pristine/manual/tunables.texi	2017-06-23 20:49:10.201148494 -0400
+++ glibc.djmalloc-tcache/manual/tunables.texi	2017-06-26 17:35:07.952998383 -0400
@@ -193,6 +193,39 @@  systems the limit is twice the number of
 is 8 times the number of cores online.
 @end deftp
 
+@deftp Tunable glibc.malloc.tcache_max
+The maximum size of a request (in bytes) which may be met via the
+per-thread cache.  The default (and maximum) value is 1032 bytes on
+64-bit systems and 516 bytes on 32-bit systems.
+@end deftp
+
+@deftp Tunable glibc.malloc.tcache_count
+The maximum number of chunks of each size to cache.  The default is 7.
+There is no upper limit, other than available system memory.  Note
+that chunks are rounded up to malloc's guaranteed alignment - this
+count is per rounded size, not per user-provided size.
+
+The approximate maximum overhead of the per-thread cache (for each
+thread, of course) is thus @code{glibc.malloc.tcache_max} (in bins,
+max 64 bins) times @code{glibc.malloc.tcache_count} times the size for
+each bin.  With defaults, this is about 236 KB on 64-bit systems and
+118 KB on 32-bit systems.
+@end deftp
+
+@deftp Tunable glibc.malloc.tcache_unsorted_limit
+When the user requests memory and the request cannot be met via the
+per-thread cache, the arenas are used to meet the request.  At this
+time, additional chunks will be moved from existing arena lists to
+pre-fill the corresponding cache.  While copies from the fastbins,
+smallbins, and regular bins are bounded and predictable due to the bin
+sizes, copies from the unsorted bin are not bounded, and incur
+additional time penalties as they need to be sorted as they're
+scanned.  To make scanning the unsorted list more predictable and
+bounded, the user may set this tunable to limit the number of blocks
+that are scanned from the unsorted list while searching for chunks to
+pre-fill the per-thread cache with.  The default, or when set to zero,
+is no limit.
+
 @node Hardware Capability Tunables
 @section Hardware Capability Tunables
 @cindex hardware capability tunables