From patchwork Thu May 11 21:31:16 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: DJ Delorie <dj@redhat.com>
X-Patchwork-Id: 20421
Received: (qmail 61611 invoked by alias); 11 May 2017 21:31:35 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 61351 invoked by uid 89); 11 May 2017 21:31:28 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-26.1 required=5.0 tests=BAYES_00, GIT_PATCH_0,
	GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_ASCII_DIVIDERS,
	RP_MATCHES_RCVD,
	SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=201701,
	amounted, pulled
X-HELO: mx1.redhat.com
DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 6A257C04B94B
Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com;
	dmarc=none (p=none dis=none) header.from=redhat.com
Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com;
	spf=pass smtp.mailfrom=dj@redhat.com
DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com 6A257C04B94B
From: DJ Delorie <dj@redhat.com>
To: libc-alpha@sourceware.org
Cc: "Carlos O'Donell" <carlos@redhat.com>,
	Florian Weimer <fweimer@redhat.com>
Subject: Re: [updated patch] malloc per-thread cache ready for review
In-Reply-To: <9b6d2a24-aa72-cc71-5ede-581dcbd3504d@redhat.com>
	(carlos@redhat.com)
Date: Thu, 11 May 2017 17:31:16 -0400
Message-ID: <xn60h7m66z.fsf@greed.delorie.com>
MIME-Version: 1.0

Inline comments and updated patch attached...

Florian Weimer <fweimer@redhat.com> writes:
> These type names remind me of Turbo Vision.  I don't think we use the 
> CamelCase convention in glibc. :)

DeCamelCased

(I did a lot of my early programming in TurboC)

> I would like to see some discussion of the per-thread overhead the 
> thread cache introduces, maybe along the documentation of the tunables, 
> both the static overhead (due to new per-thread state) and some bounds 
> on the additional per-thread allocation retention.

I put some math in the tunables docs, and also see
https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html where this
was discussed upstream.

"Carlos O'Donell" <carlos@redhat.com> writes:
> - Adding legacy environment variables. Not OK.

Removed.

> - Missing manual documentation for new tcache tunables. Not OK.
> - Missing manual documentation for new probes. Not OK.

Added.

>   The malloc tunables should all be 'security_level: SXID_IGNORE'.

Done.

>   In _int_malloc we have some code duplication where tcache and the normal
>   code both do the same thing e.g. remove a block. Can we avoid some of
>   this duplication with a first refactoring patch?

I pulled out the common parts, it amounted to only a few lines but
they're complex enough to warrant it.

> Must be -DUSE_TCACHE=1, please follow -Wundef rules:

Done.

>> +#if USE_TCACHE
>> +/* We want 64 entries.  This is an arbitrary limit, which tunables can reduce.  */
>> +# define MAX_TCACHE_SIZE	(MALLOC_ALIGNMENT * 63)
>
> Why use 63? Is it because 0 is reserved for the empty zero byte set? If so then just 
> say "We want 63 non-zero sized entries."?

I changed it to 64 and tweaked the macros so that all 64 bins are used
(and re-audited them to make sure they were all consistent), which means
the largest cacheable size is just over a power of two now (1032 and 516
bytes).  Note sure how having this new power-of-two size included will
affect the benchmarks, though - previously, the limit was just below
those powers-of-two.

>> +# define TCACHE_IDX		((MAX_TCACHE_SIZE / MALLOC_ALIGNMENT) + 1)
>
> Since this is the maximum index can we call this TCACHE_MAX_IDX? 

It's not a max, it's a size.  I renamed it to TACHE_MAX_BINS.

>> +# define size2tidx_(bytes)	(((bytes) + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
>
> Why the trailing underscore?

Explanation elided, as I ended up removing it anyway.

>> +/* Rounds up, so...
>> +   idx 0   bytes 0
>> +   idx 1   bytes 1..8
>> +   idx 2   bytes 9..16
>> +   etc.  */
>
> This doesn't look correct for a 64-bit x86_64 system (maybe 32-bit?).
>
> Could you please adjust the comment for 64-bit x86_64 where MALLOC_ALIGNMENT
> should be 16 bytes, and indicate the comment is for that given machine.

Tweaked.

>> +/* This is another arbitrary limit, which tunables can change.  */
>> +# define TCACHE_FILL_COUNT 7
>
> Comment should describe in more detail what a developer should need to
> know about this.

Done.  The tunables docs have more info too.

>> +  /* Maximum number of chunks to remove from the unsorted list, which
>> +     don't match.  */
>
> Which don't match what?

Don't match the cache line we're trying to fill.  Edited.

>> +typedef struct TCacheEntry {
>> +  struct TCacheEntry *next;
>> +} TCacheEntry;
>
> Needs explaining in detail that these apparently incomplete types are layered
> on top of a chunk.

Done.

>> +/* There is one of these for each thread, which contains the
>> +   per-thread cache (hence "TCache").  Keeping overall size low is
>> +   mildly important.  Note that COUNTS and ENTRIES are redundant, this
>> +   is for performance reasons.  */
>
> Why do you say they are redundant? Because you could in theory have a thread
> local pointer to TCacheEntry and walk that list when a malloc request arrives?
> If so, please expand the comment to cover that.

Yes.  Done.

>> +static void
>> +tcache_put (mchunkptr chunk, size_t tc_idx)
>> +{
>> +  TCacheEntry *e = (TCacheEntry *) chunk2mem (chunk);
>
> Can we assert the safety of tc_idx? assert (tc_idx < TCACHE_IDX); ?

I noted in a comment that it's the caller's responsibility to be safe,
and the code is only ever used inside suitable conditionals for index
and count, but I added asserts anyway.

> Please add a comment here that in a low-memory condition we will have
> victim == NULL and tcache == NULL and keep calling tcache_init
> at every opportunity to eventually get enough memory for the tcache.

Done.  In such a condition, there won't be enough memory to do much of
anything, though...

>> +#if USE_TCACHE
>> +  INTERNAL_SIZE_T tcache_nb = 0;
>> +  size_t tc_idx = csize2tidx (nb);
>> +  if (tcache && tc_idx < mp_.tcache_max)
>> +    tcache_nb = nb;
>> +  int return_cached = 0;
>> +
>> +  tcache_unsorted_count = 0;
>
> We selectively apply this limit to just unsorted, but why not all the
> other places where we iterate filling the cache? Because in the other
> cases we have perfect size matches we know we want?

I noted in the tunables doc that scanning the other lists is bounded,
since the other lists always have known sizes.  The unsorted list may be
unbounded, and the non-matching chunks we find have to be processed and
sorted, which makes that expensive.  Hence, I added a user-definable
limit here just for that list.

======================================================================

diff --git a/config.make.in b/config.make.in
index 5836b32..0290d83 100644
--- a/config.make.in
+++ b/config.make.in
@@ -77,6 +77,8 @@ multi-arch = @multi_arch@
 
 mach-interface-list = @mach_interface_list@
 
+experimental-malloc = @experimental_malloc@
+
 nss-crypt = @libc_cv_nss_crypt@
 static-nss-crypt = @libc_cv_static_nss_crypt@
 
diff --git a/configure b/configure
index eecd0ac..b80fd4d 100755
--- a/configure
+++ b/configure
@@ -672,6 +672,7 @@ build_nscd
 link_obsolete_rpc
 libc_cv_static_nss_crypt
 libc_cv_nss_crypt
+experimental_malloc
 enable_werror
 all_warnings
 force_install
@@ -777,6 +778,7 @@ enable_kernel
 enable_all_warnings
 enable_werror
 enable_multi_arch
+enable_experimental_malloc
 enable_nss_crypt
 enable_obsolete_rpc
 enable_systemtap
@@ -1447,6 +1449,8 @@ Optional Features:
   --disable-werror        do not build with -Werror
   --enable-multi-arch     enable single DSO with optimizations for multiple
                           architectures
+  --disable-experimental-malloc
+                          disable experimental malloc features
   --enable-nss-crypt      enable libcrypt to use nss
   --enable-obsolete-rpc   build and install the obsolete RPC code for
                           link-time usage
@@ -3517,6 +3521,15 @@ else
 fi
 
 
+# Check whether --enable-experimental-malloc was given.
+if test "${enable_experimental_malloc+set}" = set; then :
+  enableval=$enable_experimental_malloc; experimental_malloc=$enableval
+else
+  experimental_malloc=yes
+fi
+
+
+
 # Check whether --enable-nss-crypt was given.
 if test "${enable_nss_crypt+set}" = set; then :
   enableval=$enable_nss_crypt; nss_crypt=$enableval
diff --git a/configure.ac b/configure.ac
index 4a77411..b929012 100644
--- a/configure.ac
+++ b/configure.ac
@@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
 	      [multi_arch=$enableval],
 	      [multi_arch=default])
 
+AC_ARG_ENABLE([experimental-malloc],
+	      AC_HELP_STRING([--disable-experimental-malloc],
+			     [disable experimental malloc features]),
+	      [experimental_malloc=$enableval],
+	      [experimental_malloc=yes])
+AC_SUBST(experimental_malloc)
+
 AC_ARG_ENABLE([nss-crypt],
 	      AC_HELP_STRING([--enable-nss-crypt],
 			     [enable libcrypt to use nss]),
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index cb9e8f1..af2b46f 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -76,5 +76,17 @@ glibc {
       minval: 1
       security_level: SXID_IGNORE
     }
+    tcache_max {
+      type: SIZE_T
+      security_level: SXID_IGNORE
+    }
+    tcache_count {
+      type: SIZE_T
+      security_level: SXID_IGNORE
+    }
+    tcache_unsorted_limit {
+      type: SIZE_T
+      security_level: SXID_IGNORE
+    }
   }
 }
diff --git a/malloc/Makefile b/malloc/Makefile
index e93b83b..e6ca11c 100644
--- a/malloc/Makefile
+++ b/malloc/Makefile
@@ -168,6 +168,11 @@ tst-malloc-usable-static-ENV = $(tst-malloc-usable-ENV)
 tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
 tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
 
+ifeq ($(experimental-malloc),yes)
+CPPFLAGS-malloc.c += -DUSE_TCACHE=1
+else
+CPPFLAGS-malloc.c += -DUSE_TCACHE=0
+endif
 # Uncomment this for test releases.  For public releases it is too expensive.
 #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
 
diff --git a/malloc/arena.c b/malloc/arena.c
index d49e4a2..dacc481 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -236,6 +236,11 @@ DL_TUNABLE_CALLBACK_FNDECL (set_perturb_byte, int32_t)
 DL_TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
 DL_TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
 DL_TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
+#if USE_TCACHE
+DL_TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
+DL_TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
+DL_TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
+#endif
 #else
 /* Initialization routine. */
 #include <string.h>
@@ -322,6 +327,12 @@ ptmalloc_init (void)
   TUNABLE_SET_VAL_WITH_CALLBACK (mmap_max, NULL, set_mmaps_max);
   TUNABLE_SET_VAL_WITH_CALLBACK (arena_max, NULL, set_arena_max);
   TUNABLE_SET_VAL_WITH_CALLBACK (arena_test, NULL, set_arena_test);
+#if USE_TCACHE
+  TUNABLE_SET_VAL_WITH_CALLBACK (tcache_max, NULL, set_tcache_max);
+  TUNABLE_SET_VAL_WITH_CALLBACK (tcache_count, NULL, set_tcache_count);
+  TUNABLE_SET_VAL_WITH_CALLBACK (tcache_unsorted_limit, NULL,
+				 set_tcache_unsorted_limit);
+#endif
   __libc_lock_unlock (main_arena.mutex);
 #else
   const char *s = NULL;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index e29105c..d904db8 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -297,6 +297,30 @@ __malloc_assert (const char *assertion, const char *file, unsigned int line,
 }
 #endif
 
+#if USE_TCACHE
+/* We want 64 entries.  This is an arbitrary limit, which tunables can reduce.  */
+# define TCACHE_MAX_BINS		64
+# define MAX_TCACHE_SIZE	tidx2usize (TCACHE_MAX_BINS-1)
+
+/* Only used to pre-fill the tunables.  */
+# define tidx2usize(idx)	(((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
+
+/* When "x" is from chunksize().  */
+# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
+/* When "x" is a user-provided size.  */
+# define usize2tidx(x) csize2tidx (request2size (x))
+
+/* With rounding and alignment, the bins are...
+   idx 0   bytes 0..24 (64-bit) or 0..12 (32-bit)
+   idx 1   bytes 25..40 or 13..20
+   idx 2   bytes 41..56 or 21..28
+   etc.  */
+
+/* This is another arbitrary limit, which tunables can change.  Each
+   tcache bin will hold at most this number of chunks.  */
+# define TCACHE_FILL_COUNT 7
+#endif
+
 
 /*
   REALLOC_ZERO_BYTES_FREES should be set if a call to
@@ -1711,6 +1735,17 @@ struct malloc_par
 
   /* First address handed out by MORECORE/sbrk.  */
   char *sbrk_base;
+
+#if USE_TCACHE
+  /* Maximum number of buckets to use.  */
+  size_t tcache_bins;
+  size_t tcache_max_bytes;
+  /* Maximum number of chunks in each bucket.  */
+  size_t tcache_count;
+  /* Maximum number of chunks to remove from the unsorted list, which
+     aren't used to prefill the cache.  */
+  size_t tcache_unsorted_limit;
+#endif
 };
 
 /* There are several instances of this struct ("arenas") in this
@@ -1749,6 +1784,13 @@ static struct malloc_par mp_ =
   .trim_threshold = DEFAULT_TRIM_THRESHOLD,
 #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
   .arena_test = NARENAS_FROM_NCORES (1)
+#if USE_TCACHE
+  ,
+  .tcache_count = TCACHE_FILL_COUNT,
+  .tcache_bins = TCACHE_MAX_BINS,
+  .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
+  .tcache_unsorted_limit = 0 /* No limit.  */
+#endif
 };
 
 /* Maximum size of memory handled in fastbins.  */
@@ -2874,6 +2916,121 @@ mremap_chunk (mchunkptr p, size_t new_size)
 
 /*------------------------ Public wrappers. --------------------------------*/
 
+#if USE_TCACHE
+
+/* We overlay this structure on the user-data portion of a chunk when
+   the chunk is stored in the per-thread cache.  */
+typedef struct tcache_entry {
+  struct tcache_entry *next;
+} tcache_entry;
+
+/* There is one of these for each thread, which contains the
+   per-thread cache (hence "tcache_perthread_struct").  Keeping
+   overall size low is mildly important.  Note that COUNTS and ENTRIES
+   are redundant (we could have just counted the linked list each
+   time), this is for performance reasons.  */
+typedef struct tcache_perthread_struct {
+  char counts[TCACHE_MAX_BINS];
+  tcache_entry *entries[TCACHE_MAX_BINS];
+} tcache_perthread_struct;
+
+static __thread char tcache_shutting_down = 0;
+static __thread tcache_perthread_struct *tcache = NULL;
+
+/* Caller must ensure that we know tc_idx is valid and there's room
+   for more chunks.  */
+static void
+tcache_put (mchunkptr chunk, size_t tc_idx)
+{
+  tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
+  assert (tc_idx < TCACHE_MAX_BINS);
+  e->next = tcache->entries[tc_idx];
+  tcache->entries[tc_idx] = e;
+  ++(tcache->counts[tc_idx]);
+}
+
+/* Caller must ensure that we know tc_idx is valid and there's
+   available chunks to remove.  */
+static void *
+tcache_get (size_t tc_idx)
+{
+  tcache_entry *e = tcache->entries[tc_idx];
+  assert (tc_idx < TCACHE_MAX_BINS);
+  assert (tcache->entries[tc_idx] > 0);
+  tcache->entries[tc_idx] = e->next;
+  --(tcache->counts[tc_idx]);
+  return (void *) e;
+}
+
+static void __attribute__ ((section ("__libc_thread_freeres_fn")))
+tcache_thread_freeres (void)
+{
+  int i;
+  tcache_perthread_struct *tcache_tmp = tcache;
+
+  if (!tcache)
+    return;
+
+  tcache = NULL;
+
+  for (i = 0; i < TCACHE_MAX_BINS; ++i) {
+    while (tcache_tmp->entries[i])
+      {
+	tcache_entry *e = tcache_tmp->entries[i];
+	tcache_tmp->entries[i] = e->next;
+	__libc_free (e);
+      }
+  }
+
+  __libc_free (tcache_tmp);
+
+  tcache_shutting_down = 1;
+}
+text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
+
+static void
+tcache_init(void)
+{
+  mstate ar_ptr;
+  void *victim = 0;
+  const size_t bytes = sizeof (tcache_perthread_struct);
+
+  if (tcache_shutting_down)
+    return;
+
+  arena_get (ar_ptr, bytes);
+  victim = _int_malloc (ar_ptr, bytes);
+  if (!victim && ar_ptr != NULL)
+    {
+      ar_ptr = arena_get_retry (ar_ptr, bytes);
+      victim = _int_malloc (ar_ptr, bytes);
+    }
+
+
+  if (ar_ptr != NULL)
+    __libc_lock_unlock (ar_ptr->mutex);
+
+  /* In a low memory situation, we may not be able to allocate memory
+     - in which case, we just keep trying later.  However, we
+     typically do this very early, so either there is sufficient
+     memory, or there isn't enough memory to do non-trivial
+     allocations anyway.  */
+  if (victim)
+    {
+      tcache = (tcache_perthread_struct *) victim;
+      memset (tcache, 0, sizeof (tcache_perthread_struct));
+    }
+
+}
+
+#define MAYBE_INIT_TCACHE() \
+  if (__glibc_unlikely (tcache == NULL)) \
+    tcache_init();
+
+#else
+#define MAYBE_INIT_TCACHE()
+#endif
+
 void *
 __libc_malloc (size_t bytes)
 {
@@ -2884,6 +3041,21 @@ __libc_malloc (size_t bytes)
     = atomic_forced_read (__malloc_hook);
   if (__builtin_expect (hook != NULL, 0))
     return (*hook)(bytes, RETURN_ADDRESS (0));
+#if USE_TCACHE
+  /* int_free also calls request2size, be careful to not pad twice.  */
+  size_t tbytes = request2size (bytes);
+  size_t tc_idx = csize2tidx (tbytes);
+
+  MAYBE_INIT_TCACHE ();
+
+  if (tc_idx < mp_.tcache_bins
+      && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
+      && tcache
+      && tcache->entries[tc_idx] != NULL)
+    {
+      return tcache_get (tc_idx);
+    }
+#endif
 
   arena_get (ar_ptr, bytes);
 
@@ -2943,6 +3115,8 @@ __libc_free (void *mem)
       return;
     }
 
+  MAYBE_INIT_TCACHE ();
+
   ar_ptr = arena_for_chunk (p);
   _int_free (ar_ptr, p, 0);
 }
@@ -2980,7 +3154,10 @@ __libc_realloc (void *oldmem, size_t bytes)
   if (chunk_is_mmapped (oldp))
     ar_ptr = NULL;
   else
-    ar_ptr = arena_for_chunk (oldp);
+    {
+      MAYBE_INIT_TCACHE ();
+      ar_ptr = arena_for_chunk (oldp);
+    }
 
   /* Little security check which won't hurt performance: the allocator
      never wrapps around at the end of the address space.  Therefore
@@ -3205,6 +3382,8 @@ __libc_calloc (size_t n, size_t elem_size)
 
   sz = bytes;
 
+  MAYBE_INIT_TCACHE ();
+
   arena_get (av, sz);
   if (av)
     {
@@ -3335,6 +3514,10 @@ _int_malloc (mstate av, size_t bytes)
   mchunkptr fwd;                    /* misc temp for linking */
   mchunkptr bck;                    /* misc temp for linking */
 
+#if USE_TCACHE
+  size_t tcache_unsorted_count;	    /* count of unsorted chunks processed */
+#endif
+
   const char *errstr = NULL;
 
   /*
@@ -3364,19 +3547,22 @@ _int_malloc (mstate av, size_t bytes)
      can try it without checking, which saves some time on this fast path.
    */
 
+#define REMOVE_FB(fb, victim, pp)			\
+  do							\
+    {							\
+      victim = pp;					\
+      if (victim == NULL)				\
+	break;						\
+    }							\
+  while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
+	 != victim);					\
+
   if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
     {
       idx = fastbin_index (nb);
       mfastbinptr *fb = &fastbin (av, idx);
       mchunkptr pp = *fb;
-      do
-        {
-          victim = pp;
-          if (victim == NULL)
-            break;
-        }
-      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
-             != victim);
+      REMOVE_FB (fb, victim, pp);
       if (victim != 0)
         {
           if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
@@ -3387,6 +3573,26 @@ _int_malloc (mstate av, size_t bytes)
               return NULL;
             }
           check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	  /* While we're here, if we see other chunks of the same size,
+	     stash them in the tcache.  */
+	  size_t tc_idx = csize2tidx (nb);
+	  if (tcache && tc_idx < mp_.tcache_bins)
+	    {
+	      mchunkptr tc_victim;
+
+	      /* While bin not empty and tcache not full, copy chunks over.  */
+	      while (tcache->counts[tc_idx] < mp_.tcache_count
+		     && (pp = *fb) != NULL)
+		{
+		  REMOVE_FB (fb, tc_victim, pp);
+		  if (tc_victim != 0)
+		    {
+		      tcache_put (tc_victim, tc_idx);
+	            }
+		}
+	    }
+#endif
           void *p = chunk2mem (victim);
           alloc_perturb (p, bytes);
           return p;
@@ -3425,6 +3631,32 @@ _int_malloc (mstate av, size_t bytes)
               if (av != &main_arena)
 		set_non_main_arena (victim);
               check_malloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	  /* While we're here, if we see other chunks of the same size,
+	     stash them in the tcache.  */
+	  size_t tc_idx = csize2tidx (nb);
+	  if (tcache && tc_idx < mp_.tcache_bins)
+	    {
+	      mchunkptr tc_victim;
+
+	      /* While bin not empty and tcache not full, copy chunks over.  */
+	      while (tcache->counts[tc_idx] < mp_.tcache_count
+		     && (tc_victim = last (bin)) != bin)
+		{
+		  if (tc_victim != 0)
+		    {
+		      bck = tc_victim->bk;
+		      set_inuse_bit_at_offset (tc_victim, nb);
+		      if (av != &main_arena)
+			set_non_main_arena (tc_victim);
+		      bin->bk = bck;
+		      bck->fd = bin;
+
+		      tcache_put (tc_victim, tc_idx);
+	            }
+		}
+	    }
+#endif
               void *p = chunk2mem (victim);
               alloc_perturb (p, bytes);
               return p;
@@ -3463,6 +3695,16 @@ _int_malloc (mstate av, size_t bytes)
      otherwise need to expand memory to service a "small" request.
    */
 
+#if USE_TCACHE
+  INTERNAL_SIZE_T tcache_nb = 0;
+  size_t tc_idx = csize2tidx (nb);
+  if (tcache && tc_idx < mp_.tcache_bins)
+    tcache_nb = nb;
+  int return_cached = 0;
+
+  tcache_unsorted_count = 0;
+#endif
+
   for (;; )
     {
       int iters = 0;
@@ -3523,10 +3765,26 @@ _int_malloc (mstate av, size_t bytes)
               set_inuse_bit_at_offset (victim, size);
               if (av != &main_arena)
 		set_non_main_arena (victim);
+#if USE_TCACHE
+	      /* Fill cache first, return to user only if cache fills.
+		 We may return one of these chunks later.  */
+	      if (tcache_nb
+		  && tcache->counts[tc_idx] < mp_.tcache_count)
+		{
+		  tcache_put (victim, tc_idx);
+		  return_cached = 1;
+		  continue;
+		}
+	      else
+		{
+#endif
               check_malloced_chunk (av, victim, nb);
               void *p = chunk2mem (victim);
               alloc_perturb (p, bytes);
               return p;
+#if USE_TCACHE
+		}
+#endif
             }
 
           /* place chunk in bin */
@@ -3593,11 +3851,31 @@ _int_malloc (mstate av, size_t bytes)
           fwd->bk = victim;
           bck->fd = victim;
 
+#if USE_TCACHE
+      /* If we've processed as many chunks as we're allowed while
+	 filling the cache, return one of the cached ones.  */
+      ++tcache_unsorted_count;
+      if (return_cached
+	  && mp_.tcache_unsorted_limit > 0
+	  && tcache_unsorted_count > mp_.tcache_unsorted_limit)
+	{
+	  return tcache_get (tc_idx);
+	}
+#endif
+
 #define MAX_ITERS       10000
           if (++iters >= MAX_ITERS)
             break;
         }
 
+#if USE_TCACHE
+      /* If all the small chunks we found ended up cached, return one now.  */
+      if (return_cached)
+	{
+	  return tcache_get (tc_idx);
+	}
+#endif
+
       /*
          If a large request, scan through the chunks of current bin in
          sorted order to find smallest that fits.  Use the skip list for this.
@@ -3883,6 +4161,20 @@ _int_free (mstate av, mchunkptr p, int have_lock)
 
   check_inuse_chunk(av, p);
 
+#if USE_TCACHE
+  {
+    size_t tc_idx = csize2tidx (size);
+
+    if (tcache
+	&& tc_idx < mp_.tcache_bins
+	&& tcache->counts[tc_idx] < mp_.tcache_count)
+      {
+	tcache_put (p, tc_idx);
+	return;
+      }
+  }
+#endif
+
   /*
     If eligible, place chunk on a fastbin so it can be found
     and used quickly in malloc.
@@ -4844,6 +5136,38 @@ do_set_arena_max (size_t value)
   return 1;
 }
 
+#if USE_TCACHE
+static inline int
+__always_inline
+do_set_tcache_max (size_t value)
+{
+  if (value >= 0 && value <= MAX_TCACHE_SIZE)
+    {
+      LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
+      mp_.tcache_max_bytes = value;
+      mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
+    }
+  return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_count (size_t value)
+{
+  LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
+  mp_.tcache_count = value;
+  return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_unsorted_limit (size_t value)
+{
+  LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
+  mp_.tcache_unsorted_limit = value;
+  return 1;
+}
+#endif
 
 int
 __libc_mallopt (int param_number, int value)
diff --git a/manual/probes.texi b/manual/probes.texi
index eb91c62..96acaed 100644
--- a/manual/probes.texi
+++ b/manual/probes.texi
@@ -231,6 +231,25 @@ dynamic brk/mmap thresholds.  Argument @var{$arg1} and @var{$arg2} are
 the adjusted mmap and trim thresholds, respectively.
 @end deftp
 
+@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the @code{glibc.malloc.tcache_max}
+tunable is set.  Argument @var{$arg1} is the requested value, and
+@var{$arg2} is the previous value of this tunable.
+@end deftp
+
+@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the @code{glibc.malloc.tcache_count}
+tunable is set.  Argument @var{$arg1} is the requested value, and
+@var{$arg2} is the previous value of this tunable.
+@end deftp
+
+@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
+This probe is triggered when the
+@code{glibc.malloc.tcache_unsorted_limit} tunable is set.  Argument
+@var{$arg1} is the requested value, and @var{$arg2} is the previous
+value of this tunable.
+@end deftp
+
 @node Mathematical Function Probes
 @section Mathematical Function Probes
 
diff --git a/manual/tunables.texi b/manual/tunables.texi
index ac8c38f..b651a1d 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -190,3 +190,37 @@ number of arenas is determined by the number of CPU cores online.  For 32-bit
 systems the limit is twice the number of cores online and on 64-bit systems, it
 is 8 times the number of cores online.
 @end deftp
+
+@deftp Tunable glibc.malloc.tcache_max
+The maximum size of a request (in bytes) which may be met via the
+per-thread cache.  The default (and maximum) value is 1032 bytes on
+64-bit systems and 516 bytes on 32-bit systems.
+@end deftp
+
+@deftp Tunable glibc.malloc.tcache_count
+The maximum number of chunks of each size to cache.  The default is 7.
+There is no upper limit, other than available system memory.  Note
+that chunks are rounded up to malloc's guaranteed alignment - this
+count is per rounded size, not per user-provided size.
+
+The approximate maximum overhead of the per-thread cache (for each
+thread, of course) is thus @code{glibc.malloc.tcache_max} (in bins,
+max 64 bins) times @code{glibc.malloc.tcache_count} times the size for
+each bin.  With defaults, this is about 236 KB on 64-bit systems and
+118 KB on 32-bit systems.
+@end deftp
+
+@deftp Tunable glibc.malloc.tcache_unsorted_limit
+When the user requests memory and the request cannot be met via the
+per-thread cache, the arenas are used to meet the request.  At this
+time, additional chunks will be moved from existing arena lists to
+pre-fill the corresponding cache.  While copies from the fastbins,
+smallbins, and regular bins are bounded and predictable due to the bin
+sizes, copies from the unsorted bin are not bounded, and incur
+additional time penalties as they need to be sorted as they're
+scanned.  To make scanning the unsorted list more predictable and
+bounded, the user may set this tunable to limit the number of blocks
+that are scanned from the unsorted list while searching for chunks to
+pre-fill the per-thread cache with.  The default, or when set to zero,
+is no limit.
+@end deftp