[v5,1/1] memalign: Support scanning for aligned chunks.

Message ID xnr0t8xk7e.fsf@greed.delorie.com
State Committed
Series [v5,1/1] memalign: Support scanning for aligned chunks.

Checks

Context                Check    Description
dj/TryBot-apply_patch  success  Patch applied to master at the time it was sent
dj/TryBot-32bit        success  Build for i686

Commit Message

DJ Delorie March 29, 2023, 4:20 a.m. UTC
  Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> Hi DJ (I think I got it right now),

Yup!

> patch looks good, some comments below.

v5 attached with changes as noted.

>> +/* Iterates through the tcache linked list.  */
>> +static __always_inline void *
>
> Why not use 'tcache_entry *' as return type here?
>
>> +tcache_next (tcache_entry *e)

IIRC I copied tcache_get(), which returns that.
Fixed.

>> +	while (te != NULL && ((intptr_t)te & (alignment - 1)) != 0)
>
> Maybe use '!PTR_IS_ALIGNED (te, alignment)' here?

Yup.
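
For readers comparing the two forms: with a power-of-two alignment, the
mask test in the original loop and an "is aligned" predicate express the
same check.  A minimal standalone sketch, assuming the alignment is a
power of two; ptr_is_aligned is a hypothetical name used here for
illustration and is not the PTR_IS_ALIGNED macro itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not glibc's macro): true if P is a multiple of
   ALIGNMENT.  ALIGNMENT is assumed to be a non-zero power of two.  */
static bool
ptr_is_aligned (const void *p, size_t alignment)
{
  assert (alignment != 0 && (alignment & (alignment - 1)) == 0);
  /* The low bits selected by the mask are all zero exactly when the
     address is a multiple of ALIGNMENT.  */
  return ((uintptr_t) p & (alignment - 1)) == 0;
}
```

The loop condition then reads as "keep walking while the entry is not
aligned", which is the intent of the suggested change.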

>> +	  {
>> +	    tep = & (te->next);
>> +	    te = tcache_next (te);
>> +	  }
>> +	if (te != NULL)
>> +	  {
>> +	    void *victim = tcache_get_n (tc_idx, tep);
>> +	    return tag_new_usable (victim);
>> +	  }
>> +      }
>> +  }
>> +#endif
>> +
>>    if (SINGLE_THREAD_P)
>>      {
>>        p = _int_memalign (&main_arena, alignment, bytes);
>> @@ -3857,7 +3914,7 @@ _int_malloc (mstate av, size_t bytes)
>>  	      /* While we're here, if we see other chunks of the same size,
>>  		 stash them in the tcache.  */
>>  	      size_t tc_idx = csize2tidx (nb);
>> -	      if (tcache && tc_idx < mp_.tcache_bins)
>> +	      if (tcache != NULL && tc_idx < mp_.tcache_bins)
>>  		{
>>  		  mchunkptr tc_victim;
>>  
>
> I think the style change should be on a different patch.

Perhaps but IIRC I needed those to get the warnings down to zero so I'd
prefer to leave them in.  Too much effort to split them out.

>> +/* Returns 0 if the chunk is not and does not contain the requested
>> +   aligned sub-chunk, else returns the amount of "waste" from
>> +   trimming.  BYTES is the *user* byte size, not the chunk byte
>> +   size.  */
>> +static int
>
> Shouldn't it return a size_t here?

Fixed.
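
The return-value convention described in that comment (0 means "not
usable", non-zero is the waste, and a perfect fit is reported as 1) can
be sketched on plain addresses.  This is an illustration under stated
assumptions, not the patch's code: waste_for_memalign and SKETCH_MINSIZE
are hypothetical names, the minimum chunk size is fixed at 32 here, and
the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for glibc's MINSIZE; the real value depends on
   the target.  */
#define SKETCH_MINSIZE 32

/* Returns 0 if a block of SIZE usable bytes at address MEM cannot hold
   BYTES bytes at ALIGNMENT (a power of two), else a non-zero waste.  */
static size_t
waste_for_memalign (uintptr_t mem, size_t size, size_t alignment,
		    size_t bytes)
{
  /* Round MEM up to the next multiple of ALIGNMENT.  */
  uintptr_t aligned = (mem + alignment - 1) & ~(uintptr_t) (alignment - 1);
  size_t front_extra = aligned - mem;

  /* A leading fragment smaller than the minimum chunk cannot be split
     off, so the block is not usable.  */
  if (front_extra > 0 && front_extra < SKETCH_MINSIZE)
    return 0;

  /* Perfect fit: report 1 so that 0 keeps meaning "not usable".  */
  if (size == bytes && front_extra == 0)
    return 1;

  /* The request fits after the leading fragment: report total waste.  */
  if (size > bytes + front_extra)
    return size - bytes;

  return 0;
}
```

Because zero doubles as the "reject" value, an unsigned return type is
enough; callers only compare the result against zero or against another
candidate's waste.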

>> +chunk_ok_for_memalign (mchunkptr p, size_t alignment, size_t bytes)
>>  _int_memalign (mstate av, size_t alignment, size_t bytes)
>>  {
>> @@ -4945,8 +5039,7 @@ _int_memalign (mstate av, size_t alignment, size_t bytes)
>>    mchunkptr remainder;            /* spare room at end to split off */
>>    unsigned long remainder_size;   /* its size */
>>    INTERNAL_SIZE_T size;
>> -
>> -
>
> Spurious extra new lines?

The original had three blank lines there for some reason.  I wouldn't
have bothered if I didn't have to add a new decl there anyway.

>> diff --git a/malloc/tst-memalign-2.c b/malloc/tst-memalign-2.c
>> new file mode 100644
>> index 0000000000..ed3660959a
>> --- /dev/null
>> +++ b/malloc/tst-memalign-2.c
>> @@ -0,0 +1,136 @@
>> +/* Test for memalign chunk reuse
>
> Missing period.

Fixed.

>> +  for (i = 0; i < TN; ++ i)
>> +    {
>> +      tcache_allocs[i].ptr1 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
>> +      free (tcache_allocs[i].ptr1);
>> +      /* This should return the same chunk as was just free'd.  */
>> +      tcache_allocs[i].ptr2 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
>> +      free (tcache_allocs[i].ptr2);
>
> Should we also check for non NULL and return alignment as sanity checks here?

Done.
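
A rough sketch of the kind of sanity check discussed here, written
against the public memalign interface only.  The helper name
memalign_roundtrip_ok is hypothetical, and the sketch deliberately does
not require the two pointers to be equal, since that depends on tcache
behaviour, which is what the real test verifies.

```c
#include <assert.h>
#include <malloc.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate, free, and allocate again, checking
   that both results are non-NULL and aligned.  ALIGNMENT is assumed to
   be a power of two.  */
static bool
memalign_roundtrip_ok (size_t alignment, size_t size)
{
  void *p1 = memalign (alignment, size);
  bool ok1 = p1 != NULL && ((uintptr_t) p1 & (alignment - 1)) == 0;
  free (p1);

  void *p2 = memalign (alignment, size);
  bool ok2 = p2 != NULL && ((uintptr_t) p2 & (alignment - 1)) == 0;
  free (p2);

  return ok1 && ok2;
}
```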

>> +
>> +      TEST_VERIFY (tcache_allocs[i].ptr1 == tcache_allocs[i].ptr2);
>> +    }
>> +
>> +  /* Test for non-head tcache hits.  */
>> +  for (i = 0; i < 10; ++ i)
>
> Maybe use array_length (ptr) here.

Done.


From e32abda27e5c0aa82f4b736fdca35d56bf665cce Mon Sep 17 00:00:00 2001
From: DJ Delorie via Libc-alpha <libc-alpha@sourceware.org>
Date: Wed, 29 Mar 2023 00:18:40 -0400
Subject: memalign: Support scanning for aligned chunks.

This patch adds a chunk scanning algorithm to the _int_memalign code
path that reduces heap fragmentation by reusing already aligned chunks
instead of always looking for chunks of larger sizes and splitting
them.  The tcache macros are extended to allow removing a chunk from
the middle of the list.

The goal is to fix the pathological use cases where heaps grow
continuously in workloads that are heavy users of memalign.

Note that tst-memalign-2 checks for tcache operation, which
malloc-check bypasses.
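
As an illustration of "removing a chunk from the middle of the list",
the scan can be sketched on an ordinary singly-linked list by carrying a
pointer to the link that refers to the current node.  This is a
simplified sketch, not the patch's code: struct node and
take_first_aligned are hypothetical names, the alignment is assumed to
be a power of two, and the pointer protection that the tcache applies to
its next pointers is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node; the real tcache entry also encodes its next
   pointer, which this sketch does not model.  */
struct node
{
  struct node *next;
};

/* Unlinks and returns the first node reachable from *HEAD whose address
   is a multiple of ALIGNMENT (a power of two), or NULL if none is.  */
static struct node *
take_first_aligned (struct node **head, size_t alignment)
{
  /* LINK always points at the pointer that refers to N, so unlinking
     works the same way at the head and in the middle.  */
  struct node **link = head;
  struct node *n = *link;

  while (n != NULL && ((uintptr_t) n & (alignment - 1)) != 0)
    {
      link = &n->next;
      n = n->next;
    }

  if (n != NULL)
    {
      *link = n->next;
      n->next = NULL;
    }
  return n;
}
```

Taking the head is just the special case where the first node already
satisfies the alignment, which mirrors how the head-only accessor can be
expressed in terms of the more general one.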
  

Comments

Adhemerval Zanella Netto March 29, 2023, 7:41 p.m. UTC | #1
On 29/03/23 01:20, DJ Delorie wrote:
> From e32abda27e5c0aa82f4b736fdca35d56bf665cce Mon Sep 17 00:00:00 2001
> From: DJ Delorie via Libc-alpha <libc-alpha@sourceware.org>
> Date: Wed, 29 Mar 2023 00:18:40 -0400
> Subject: memalign: Support scanning for aligned chunks.
> 
> This patch adds a chunk scanning algorithm to the _int_memalign code
> path that reduces heap fragmentation by reusing already aligned chunks
> instead of always looking for chunks of larger sizes and splitting
> them.  The tcache macros are extended to allow removing a chunk from
> the middle of the list.
> 
> The goal is to fix the pathological use cases where heaps grow
> continuously in workloads that are heavy users of memalign.
> 
> Note that tst-memalign-2 checks for tcache operation, which
> malloc-check bypasses.

LGTM, thanks.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> 
> diff --git a/malloc/Makefile b/malloc/Makefile
> index dfb51d344c..79178c4905 100644
> --- a/malloc/Makefile
> +++ b/malloc/Makefile
> @@ -43,6 +43,7 @@ tests := mallocbug tst-malloc tst-valloc tst-calloc tst-obstack \
>  	 tst-tcfree1 tst-tcfree2 tst-tcfree3 \
>  	 tst-safe-linking \
>  	 tst-mallocalign1 \
> +	 tst-memalign-2
>  
>  tests-static := \
>  	 tst-interpose-static-nothread \
> @@ -72,7 +73,7 @@ test-srcs = tst-mtrace
>  # with MALLOC_CHECK_=3 because they expect a specific failure.
>  tests-exclude-malloc-check = tst-malloc-check tst-malloc-usable \
>  	tst-mxfast tst-safe-linking \
> -	tst-compathooks-off tst-compathooks-on
> +	tst-compathooks-off tst-compathooks-on tst-memalign-2
>  
>  # Run all tests with MALLOC_CHECK_=3
>  tests-malloc-check = $(filter-out $(tests-exclude-malloc-check) \
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 76c50e3f58..8ebc4372bc 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3162,19 +3162,44 @@ tcache_put (mchunkptr chunk, size_t tc_idx)
>  }
>  
>  /* Caller must ensure that we know tc_idx is valid and there's
> -   available chunks to remove.  */
> +   available chunks to remove.  Removes chunk from the middle of the
> +   list.  */
>  static __always_inline void *
> -tcache_get (size_t tc_idx)
> +tcache_get_n (size_t tc_idx, tcache_entry **ep)
>  {
> -  tcache_entry *e = tcache->entries[tc_idx];
> +  tcache_entry *e;
> +  if (ep == &(tcache->entries[tc_idx]))
> +    e = *ep;
> +  else
> +    e = REVEAL_PTR (*ep);
> +
>    if (__glibc_unlikely (!aligned_OK (e)))
>      malloc_printerr ("malloc(): unaligned tcache chunk detected");
> -  tcache->entries[tc_idx] = REVEAL_PTR (e->next);
> +
> +  if (ep == &(tcache->entries[tc_idx]))
> +      *ep = REVEAL_PTR (e->next);
> +  else
> +    *ep = PROTECT_PTR (ep, REVEAL_PTR (e->next));
> +
>    --(tcache->counts[tc_idx]);
>    e->key = 0;
>    return (void *) e;
>  }
>  
> +/* Like the above, but removes from the head of the list.  */
> +static __always_inline void *
> +tcache_get (size_t tc_idx)
> +{
> +  return tcache_get_n (tc_idx, & tcache->entries[tc_idx]);
> +}
> +
> +/* Iterates through the tcache linked list.  */
> +static __always_inline tcache_entry *
> +tcache_next (tcache_entry *e)
> +{
> +  return (tcache_entry *) REVEAL_PTR (e->next);
> +}
> +
>  static void
>  tcache_thread_shutdown (void)
>  {
> @@ -3283,7 +3308,7 @@ __libc_malloc (size_t bytes)
>  
>    DIAG_PUSH_NEEDS_COMMENT;
>    if (tc_idx < mp_.tcache_bins
> -      && tcache
> +      && tcache != NULL
>        && tcache->counts[tc_idx] > 0)
>      {
>        victim = tcache_get (tc_idx);
> @@ -3542,6 +3567,38 @@ _mid_memalign (size_t alignment, size_t bytes, void *address)
>        alignment = a;
>      }
>  
> +#if USE_TCACHE
> +  {
> +    size_t tbytes;
> +    tbytes = checked_request2size (bytes);
> +    if (tbytes == 0)
> +      {
> +	__set_errno (ENOMEM);
> +	return NULL;
> +      }
> +    size_t tc_idx = csize2tidx (tbytes);
> +
> +    if (tc_idx < mp_.tcache_bins
> +	&& tcache != NULL
> +	&& tcache->counts[tc_idx] > 0)
> +      {
> +	/* The tcache itself isn't encoded, but the chain is.  */
> +	tcache_entry **tep = & tcache->entries[tc_idx];
> +	tcache_entry *te = *tep;
> +	while (te != NULL && !PTR_IS_ALIGNED (te, alignment))
> +	  {
> +	    tep = & (te->next);
> +	    te = tcache_next (te);
> +	  }
> +	if (te != NULL)
> +	  {
> +	    void *victim = tcache_get_n (tc_idx, tep);
> +	    return tag_new_usable (victim);
> +	  }
> +      }
> +  }
> +#endif
> +
>    if (SINGLE_THREAD_P)
>      {
>        p = _int_memalign (&main_arena, alignment, bytes);
> @@ -3847,7 +3904,7 @@ _int_malloc (mstate av, size_t bytes)
>  	      /* While we're here, if we see other chunks of the same size,
>  		 stash them in the tcache.  */
>  	      size_t tc_idx = csize2tidx (nb);
> -	      if (tcache && tc_idx < mp_.tcache_bins)
> +	      if (tcache != NULL && tc_idx < mp_.tcache_bins)
>  		{
>  		  mchunkptr tc_victim;
>  
> @@ -3905,7 +3962,7 @@ _int_malloc (mstate av, size_t bytes)
>  	  /* While we're here, if we see other chunks of the same size,
>  	     stash them in the tcache.  */
>  	  size_t tc_idx = csize2tidx (nb);
> -	  if (tcache && tc_idx < mp_.tcache_bins)
> +	  if (tcache != NULL && tc_idx < mp_.tcache_bins)
>  	    {
>  	      mchunkptr tc_victim;
>  
> @@ -3967,7 +4024,7 @@ _int_malloc (mstate av, size_t bytes)
>  #if USE_TCACHE
>    INTERNAL_SIZE_T tcache_nb = 0;
>    size_t tc_idx = csize2tidx (nb);
> -  if (tcache && tc_idx < mp_.tcache_bins)
> +  if (tcache != NULL && tc_idx < mp_.tcache_bins)
>      tcache_nb = nb;
>    int return_cached = 0;
>  
> @@ -4047,7 +4104,7 @@ _int_malloc (mstate av, size_t bytes)
>  #if USE_TCACHE
>  	      /* Fill cache first, return to user only if cache fills.
>  		 We may return one of these chunks later.  */
> -	      if (tcache_nb
> +	      if (tcache_nb > 0
>  		  && tcache->counts[tc_idx] < mp_.tcache_count)
>  		{
>  		  tcache_put (victim, tc_idx);
> @@ -4921,6 +4978,43 @@ _int_realloc (mstate av, mchunkptr oldp, INTERNAL_SIZE_T oldsize,
>     ------------------------------ memalign ------------------------------
>   */
>  
> +/* Returns 0 if the chunk is not and does not contain the requested
> +   aligned sub-chunk, else returns the amount of "waste" from
> +   trimming.  BYTES is the *user* byte size, not the chunk byte
> +   size.  */
> +static size_t
> +chunk_ok_for_memalign (mchunkptr p, size_t alignment, size_t bytes)
> +{
> +  void *m = chunk2mem (p);
> +  INTERNAL_SIZE_T size = memsize (p);
> +  void *aligned_m = m;
> +
> +  if (__glibc_unlikely (misaligned_chunk (p)))
> +    malloc_printerr ("_int_memalign(): unaligned chunk detected");
> +
> +  aligned_m = PTR_ALIGN_UP (m, alignment);
> +
> +  INTERNAL_SIZE_T front_extra = (intptr_t) aligned_m - (intptr_t) m;
> +
> +  /* We can't trim off the front as it's too small.  */
> +  if (front_extra > 0 && front_extra < MINSIZE)
> +    return 0;
> +
> +  /* If it's a perfect fit, it's an exception to the return value rule
> +     (we would return zero waste, which looks like "not usable"), so
> +     handle it here by returning a small non-zero value instead.  */
> +  if (size == bytes && front_extra == 0)
> +    return 1;
> +
> +  /* If the block we need fits in the chunk, calculate total waste.  */
> +  if (size > bytes + front_extra)
> +    return size - bytes;
> +
> +  /* Can't use this chunk.  */ 
> +  return 0;
> +}
> +
> +/* BYTES is user requested bytes, not requested chunksize bytes.  */
>  static void *
>  _int_memalign (mstate av, size_t alignment, size_t bytes)
>  {
> @@ -4934,8 +5028,7 @@ _int_memalign (mstate av, size_t alignment, size_t bytes)
>    mchunkptr remainder;            /* spare room at end to split off */
>    unsigned long remainder_size;   /* its size */
>    INTERNAL_SIZE_T size;
> -
> -
> +  mchunkptr victim;
>  
>    nb = checked_request2size (bytes);
>    if (nb == 0)
> @@ -4944,29 +5037,142 @@ _int_memalign (mstate av, size_t alignment, size_t bytes)
>        return NULL;
>      }
>  
> -  /*
> -     Strategy: find a spot within that chunk that meets the alignment
> +  /* We can't check tcache here because we hold the arena lock, which
> +     tcache doesn't expect.  We expect it has been checked
> +     earlier.  */
> +
> +  /* Strategy: search the bins looking for an existing block that
> +     meets our needs.  We scan a range of bins from "exact size" to
> +     "just under 2x", spanning the small/large barrier if needed.  If
> +     we don't find anything in those bins, the common malloc code will
> +     scan starting at 2x.  */
> +
> +  /* This will be set if we found a candidate chunk.  */
> +  victim = NULL;
> +
> +  /* Fast bins are singly-linked, hard to remove a chunk from the middle
> +     and unlikely to meet our alignment requirements.  We have not done
> +     any experimentation with searching for aligned fastbins.  */
> +
> +  int first_bin_index;
> +  int first_largebin_index;
> +  int last_bin_index;
> +
> +  if (in_smallbin_range (nb))
> +    first_bin_index = smallbin_index (nb);
> +  else
> +    first_bin_index = largebin_index (nb);
> +
> +  if (in_smallbin_range (nb * 2))
> +    last_bin_index = smallbin_index (nb * 2);
> +  else
> +    last_bin_index = largebin_index (nb * 2);
> +
> +  first_largebin_index = largebin_index (MIN_LARGE_SIZE);
> +
> +  int victim_index;                 /* its bin index */
> +
> +  for (victim_index = first_bin_index;
> +       victim_index < last_bin_index;
> +       victim_index ++)
> +    {
> +      victim = NULL;
> +
> +      if (victim_index < first_largebin_index)
> +    {
> +      /* Check small bins.  Small bin chunks are doubly-linked despite
> +	 being the same size.  */
> +
> +      mchunkptr fwd;                    /* misc temp for linking */
> +      mchunkptr bck;                    /* misc temp for linking */
> +
> +      bck = bin_at (av, victim_index);
> +      fwd = bck->fd;
> +      while (fwd != bck)
> +	{
> +	  if (chunk_ok_for_memalign (fwd, alignment, bytes) > 0)
> +	    {
> +	      victim = fwd;
> +
> +	      /* Unlink it */
> +	      victim->fd->bk = victim->bk;
> +	      victim->bk->fd = victim->fd;
> +	      break;
> +	    }
> +
> +	  fwd = fwd->fd;
> +	}
> +    }
> +  else
> +    {
> +      /* Check large bins.  */
> +      mchunkptr fwd;                    /* misc temp for linking */
> +      mchunkptr bck;                    /* misc temp for linking */
> +      mchunkptr best = NULL;
> +      size_t best_size = 0;
> +
> +      bck = bin_at (av, victim_index);
> +      fwd = bck->fd;
> +
> +      while (fwd != bck)
> +	{
> +	  int extra;
> +
> +	  if (chunksize (fwd) < nb)
> +	      break;
> +	  extra = chunk_ok_for_memalign (fwd, alignment, bytes);
> +	  if (extra > 0
> +	      && (extra <= best_size || best == NULL))
> +	    {
> +	      best = fwd;
> +	      best_size = extra;
> +	    }
> +
> +	  fwd = fwd->fd;
> +	}
> +      victim = best;
> +
> +      if (victim != NULL)
> +	{
> +	  unlink_chunk (av, victim);
> +	  break;
> +	}
> +    }
> +
> +      if (victim != NULL)
> +	break;
> +    }
> +
> +  /* Strategy: find a spot within that chunk that meets the alignment
>       request, and then possibly free the leading and trailing space.
> -   */
> +     This strategy is incredibly costly and can lead to external
> +     fragmentation if header and footer chunks are unused.  */
>  
> -  /* Call malloc with worst case padding to hit alignment. */
> +  if (victim != NULL)
> +    {
> +      p = victim;
> +      m = chunk2mem (p);
> +      set_inuse (p);
> +    }
> +  else
> +    {
> +      /* Call malloc with worst case padding to hit alignment. */
>  
> -  m = (char *) (_int_malloc (av, nb + alignment + MINSIZE));
> +      m = (char *) (_int_malloc (av, nb + alignment + MINSIZE));
>  
> -  if (m == 0)
> -    return 0;           /* propagate failure */
> +      if (m == 0)
> +	return 0;           /* propagate failure */
>  
> -  p = mem2chunk (m);
> +      p = mem2chunk (m);
> +    }
>  
>    if ((((unsigned long) (m)) % alignment) != 0)   /* misaligned */
> -
> -    { /*
> -                Find an aligned spot inside chunk.  Since we need to give back
> -                leading space in a chunk of at least MINSIZE, if the first
> -                calculation places us at a spot with less than MINSIZE leader,
> -                we can move to the next aligned spot -- we've allocated enough
> -                total room so that this is always possible.
> -                 */
> +    {
> +      /* Find an aligned spot inside chunk.  Since we need to give back
> +         leading space in a chunk of at least MINSIZE, if the first
> +         calculation places us at a spot with less than MINSIZE leader,
> +         we can move to the next aligned spot -- we've allocated enough
> +         total room so that this is always possible.  */
>        brk = (char *) mem2chunk (((unsigned long) (m + alignment - 1)) &
>                                  - ((signed long) alignment));
>        if ((unsigned long) (brk - (char *) (p)) < MINSIZE)
> diff --git a/malloc/tst-memalign-2.c b/malloc/tst-memalign-2.c
> new file mode 100644
> index 0000000000..4996578e9f
> --- /dev/null
> +++ b/malloc/tst-memalign-2.c
> @@ -0,0 +1,155 @@
> +/* Test for memalign chunk reuse.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <errno.h>
> +#include <malloc.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <array_length.h>
> +#include <libc-pointer-arith.h>
> +#include <support/check.h>
> +
> +typedef struct TestCase {
> +  size_t size;
> +  size_t alignment;
> +  void *ptr1;
> +  void *ptr2;
> +} TestCase;
> +
> +static TestCase tcache_allocs[] = {
> +  { 24, 8, NULL, NULL },
> +  { 24, 16, NULL, NULL },
> +  { 128, 32, NULL, NULL }
> +};
> +#define TN array_length (tcache_allocs)
> +
> +static TestCase large_allocs[] = {
> +  { 23450, 64, NULL, NULL },
> +  { 23450, 64, NULL, NULL },
> +  { 23550, 64, NULL, NULL },
> +  { 23550, 64, NULL, NULL },
> +  { 23650, 64, NULL, NULL },
> +  { 23650, 64, NULL, NULL },
> +  { 33650, 64, NULL, NULL },
> +  { 33650, 64, NULL, NULL }
> +};
> +#define LN array_length (large_allocs)
> +
> +void *p;
> +
> +/* Sanity checks, ancillary to the actual test.  */
> +#define CHECK(p,a) \
> +  if (p == NULL || !PTR_IS_ALIGNED (p, a)) \
> +    FAIL_EXIT1 ("NULL or misaligned memory detected.\n");
> +
> +static int
> +do_test (void)
> +{
> +  int i, j;
> +  int count;
> +  void *ptr[10];
> +  void *p;
> +
> +  /* TCache test.  */
> +
> +  for (i = 0; i < TN; ++ i)
> +    {
> +      tcache_allocs[i].ptr1 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
> +      CHECK (tcache_allocs[i].ptr1, tcache_allocs[i].alignment);
> +      free (tcache_allocs[i].ptr1);
> +      /* This should return the same chunk as was just free'd.  */
> +      tcache_allocs[i].ptr2 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
> +      CHECK (tcache_allocs[i].ptr2, tcache_allocs[i].alignment);
> +      free (tcache_allocs[i].ptr2);
> +
> +      TEST_VERIFY (tcache_allocs[i].ptr1 == tcache_allocs[i].ptr2);
> +    }
> +
> +  /* Test for non-head tcache hits.  */
> +  for (i = 0; i < array_length (ptr); ++ i)
> +    {
> +      if (i == 4)
> +	{
> +	  ptr[i] = memalign (64, 256);
> +	  CHECK (ptr[i], 64);
> +	}
> +      else
> +	{
> +	  ptr[i] = malloc (256);
> +	  CHECK (ptr[i], 4);
> +	}
> +    }
> +  for (i = 0; i < array_length (ptr); ++ i)
> +    free (ptr[i]);
> +
> +  p = memalign (64, 256);
> +  CHECK (p, 64);
> +
> +  count = 0;
> +  for (i = 0; i < 10; ++ i)
> +    if (ptr[i] == p)
> +      ++ count;
> +  free (p);
> +  TEST_VERIFY (count > 0);
> +
> +  /* Large bins test.  */
> +
> +  for (i = 0; i < LN; ++ i)
> +    {
> +      large_allocs[i].ptr1 = memalign (large_allocs[i].alignment, large_allocs[i].size);
> +      CHECK (large_allocs[i].ptr1, large_allocs[i].alignment);
> +      /* Keep chunks from combining by fragmenting the heap.  */
> +      p = malloc (512);
> +      CHECK (p, 4);
> +    }
> +
> +  for (i = 0; i < LN; ++ i)
> +    free (large_allocs[i].ptr1);
> +
> +  /* Force the unsorted bins to be scanned and moved to small/large
> +     bins.  */
> +  p = malloc (60000);
> +
> +  for (i = 0; i < LN; ++ i)
> +    {
> +      large_allocs[i].ptr2 = memalign (large_allocs[i].alignment, large_allocs[i].size);
> +      CHECK (large_allocs[i].ptr2, large_allocs[i].alignment);
> +    }
> +
> +  count = 0;
> +  for (i = 0; i < LN; ++ i)
> +    {
> +      int ok = 0;
> +      for (j = 0; j < LN; ++ j)
> +	if (large_allocs[i].ptr1 == large_allocs[j].ptr2)
> +	  ok = 1;
> +      if (ok == 1)
> +	count ++;
> +    }
> +
> +  /* The allocation algorithm is complicated outside of the memalign
> +     logic, so just make sure it's working for most of the
> +     allocations.  This avoids possible boundary conditions with
> +     empty/full heaps.  */
> +  TEST_VERIFY (count > LN / 2);
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
>
  
DJ Delorie March 29, 2023, 8:36 p.m. UTC | #2
Thanks!  Pushed.
  
Cristian Rodríguez March 30, 2023, 10:04 a.m. UTC | #3
On Wed, Mar 29, 2023 at 5:36 PM DJ Delorie via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
>
> Thanks!  Pushed.

This crashes previously working Rust code.

for example "ripgrep" (command rg)

#rg FIND_PA
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))

Will see if anything else died ;)

[1]    2123 IOT instruction (core dumped)  rg FIND_PA
  
Adhemerval Zanella Netto March 30, 2023, 10:50 a.m. UTC | #4
On 30/03/23 07:04, Cristian Rodríguez wrote:
> On Wed, Mar 29, 2023 at 5:36 PM DJ Delorie via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>>
>> Thanks!  Pushed.
> 
> This crashes previously working Rust code.
> 
> for example "ripgrep" (command rg)
> 
> #rg FIND_PA
> Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
> 
> Will see if anything else died ;)
> 
> [1]    2123 IOT instruction (core dumped)  rg FIND_PA

Do you have any testcase that triggers it?
  
Adhemerval Zanella Netto March 31, 2023, 3:39 p.m. UTC | #5
On 29/03/23 01:20, DJ Delorie wrote:
> From e32abda27e5c0aa82f4b736fdca35d56bf665cce Mon Sep 17 00:00:00 2001
> From: DJ Delorie via Libc-alpha <libc-alpha@sourceware.org>
> Date: Wed, 29 Mar 2023 00:18:40 -0400
> Subject: memalign: Support scanning for aligned chunks.
> 
> This patch adds a chunk scanning algorithm to the _int_memalign code
> path that reduces heap fragmentation by reusing already aligned chunks
> instead of always looking for chunks of larger sizes and splitting
> them.  The tcache macros are extended to allow removing a chunk from
> the middle of the list.
> 
> The goal is to fix the pathological use cases where heaps grow
> continuously in workloads that are heavy users of memalign.
> 
> Note that tst-memalign-2 checks for tcache operation, which
> malloc-check bypasses.

So it seems this patch does trigger a regression.  I am seeing a failure on a
speccpu2017 benchmark (cam4_s):

****************************************
Contents of cam4_s_base.gcc-64.err
****************************************
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Program received signal SIGABRT: Process abort signal.
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))
Fatal glibc error: malloc.c:3617 (_mid_memalign): assertion failed: !p || chunk_is_mmapped (mem2chunk (p)) || ar_ptr == arena_for_chunk (mem2chunk (p))

I have not yet isolated the malloc call patterns, but I would like to give you
a heads up that this does seem to be an issue with at least one reproducer.
  
Stefan Liebler April 5, 2023, 2:07 p.m. UTC | #6
On 29.03.23 22:36, DJ Delorie via Libc-alpha wrote:
> 
> Thanks!  Pushed.
> 
On s390 (31bit), I see the test fail:
FAIL: malloc/tst-memalign-2-mcheck

After adding those printfs ...:
diff --git a/malloc/tst-memalign-2.c b/malloc/tst-memalign-2.c
index 4996578e9f..adfebf8384 100644
--- a/malloc/tst-memalign-2.c
+++ b/malloc/tst-memalign-2.c
@@ -72,10 +72,12 @@ do_test (void)
     {
       tcache_allocs[i].ptr1 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
       CHECK (tcache_allocs[i].ptr1, tcache_allocs[i].alignment);
+      printf ("%d# ptr1=%p\n", i, tcache_allocs[i].ptr1);
       free (tcache_allocs[i].ptr1);
       /* This should return the same chunk as was just free'd.  */
       tcache_allocs[i].ptr2 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
       CHECK (tcache_allocs[i].ptr2, tcache_allocs[i].alignment);
+      printf ("%d# ptr2=%p\n", i, tcache_allocs[i].ptr2);
       free (tcache_allocs[i].ptr2);

       TEST_VERIFY (tcache_allocs[i].ptr1 == tcache_allocs[i].ptr2);


... I've got this output:
0# ptr1=0x55bc71b8
0# ptr2=0x55bc71b8
1# ptr1=0x55bc7210
1# ptr2=0x55bc7260
error: tst-memalign-2.c:83: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
2# ptr1=0x55bc72e0
2# ptr2=0x55bc72e0
error: 1 test failures


malloc/tst-memalign-2 (without mcheck) is passing.
PASS: malloc/tst-memalign-2
original exit status 0
0# ptr1=0x55c0e190
0# ptr2=0x55c0e190
1# ptr1=0x55c0e190
1# ptr2=0x55c0e190
2# ptr1=0x55c0e1c0
2# ptr2=0x55c0e1c0

Can you please help?

Thanks,
Stefan
  
DJ Delorie April 5, 2023, 5:58 p.m. UTC | #7
Stefan Liebler via Libc-alpha <libc-alpha@sourceware.org> writes:
> On s390 (31bit), I see the test fail:
> FAIL: malloc/tst-memalign-2-mcheck

Please try
https://sourceware.org/pipermail/libc-alpha/2023-April/146959.html

I fixed that test there, hopefully you're seeing the same thing I saw ;-)
  
Stefan Liebler April 11, 2023, 11:40 a.m. UTC | #8
On 05.04.23 19:58, DJ Delorie wrote:
> Stefan Liebler via Libc-alpha <libc-alpha@sourceware.org> writes:
>> On s390 (31bit), I see the test fail:
>> FAIL: malloc/tst-memalign-2-mcheck
> 
> Please try
> https://sourceware.org/pipermail/libc-alpha/2023-April/146959.html
> 
> I fixed that test there, hopefully you're seeing the same thing I saw ;-)
> 
Hi DJ,

I've applied your patch
"[patch v2] malloc: set NON_MAIN_ARENA flag for reclaimed memalign chunk
(BZ #30101)"
on top of
"hurd: Don't leak __hurd_reply_port0"
commit cd019ddd892e182277fadd6aedccc57fa3923c8d

Now I get these failures:
on s390x (64bit) and x86_64:
- malloc/tst-memalign-2-mcheck.out
error: tst-memalign-2.c:114: not true: count > 0
error: 1 test failures

- malloc/tst-memalign-3-mcheck.out
error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
error: 3 test failures


on s390 (31bit):
- malloc/tst-memalign-2-mcheck.out
error: tst-memalign-2.c:86: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
error: 1 test failures

- malloc/tst-memalign-3-mcheck.out
error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
tcache_allocs[i].ptr2
error: tst-memalign-3.c:117: not true: count > 0
error: 3 test failures


Do you also see the failures on x86_64?
  
Stefan Liebler April 12, 2023, 11:23 a.m. UTC | #9
On 11.04.23 13:40, Stefan Liebler via Libc-alpha wrote:
> On 05.04.23 19:58, DJ Delorie wrote:
>> Stefan Liebler via Libc-alpha <libc-alpha@sourceware.org> writes:
>>> On s390 (31bit), I see the test fail:
>>> FAIL: malloc/tst-memalign-2-mcheck
>>
>> Please try
>> https://sourceware.org/pipermail/libc-alpha/2023-April/146959.html
>>
>> I fixed that test there, hopefully you're seeing the same thing I saw ;-)
>>
> Hi DJ,
> 
> I've applied your patch
> "[patch v2] malloc: set NON_MAIN_ARENA flag for reclaimed memalign chunk
> (BZ #30101)"
> on top of
> "hurd: Don't leak __hurd_reply_port0"
> commit cd019ddd892e182277fadd6aedccc57fa3923c8d
> 
> Now I get these failures:
> on s390x (64bit) and x86_64:
> - malloc/tst-memalign-2-mcheck.out
> error: tst-memalign-2.c:114: not true: count > 0
> error: 1 test failures
> 
> - malloc/tst-memalign-3-mcheck.out
> error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
> tcache_allocs[i].ptr2
> error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
> tcache_allocs[i].ptr2
> error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
> tcache_allocs[i].ptr2
> error: 3 test failures
> 
> 
> on s390 (31bit):
> - malloc/tst-memalign-2-mcheck.out
> error: tst-memalign-2.c:86: not true: tcache_allocs[i].ptr1 ==
> tcache_allocs[i].ptr2
> error: 1 test failures
> 
> - malloc/tst-memalign-3-mcheck.out
> error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
> tcache_allocs[i].ptr2
> error: tst-memalign-3.c:89: not true: tcache_allocs[i].ptr1 ==
> tcache_allocs[i].ptr2
> error: tst-memalign-3.c:117: not true: count > 0
> error: 3 test failures
> 
> 
> Do you also see these failures on x86_64?

Just for information: I also see the same failures as described above with
"[patch v3] malloc: set NON_MAIN_ARENA flag for reclaimed memalign chunk
(BZ #30101)"
https://sourceware.org/pipermail/libc-alpha/2023-April/147181.html
  

Patch

diff --git a/malloc/Makefile b/malloc/Makefile
index dfb51d344c..79178c4905 100644
--- a/malloc/Makefile
+++ b/malloc/Makefile
@@ -43,6 +43,7 @@  tests := mallocbug tst-malloc tst-valloc tst-calloc tst-obstack \
 	 tst-tcfree1 tst-tcfree2 tst-tcfree3 \
 	 tst-safe-linking \
 	 tst-mallocalign1 \
+	 tst-memalign-2
 
 tests-static := \
 	 tst-interpose-static-nothread \
@@ -72,7 +73,7 @@  test-srcs = tst-mtrace
 # with MALLOC_CHECK_=3 because they expect a specific failure.
 tests-exclude-malloc-check = tst-malloc-check tst-malloc-usable \
 	tst-mxfast tst-safe-linking \
-	tst-compathooks-off tst-compathooks-on
+	tst-compathooks-off tst-compathooks-on tst-memalign-2
 
 # Run all tests with MALLOC_CHECK_=3
 tests-malloc-check = $(filter-out $(tests-exclude-malloc-check) \
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 76c50e3f58..8ebc4372bc 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3162,19 +3162,44 @@  tcache_put (mchunkptr chunk, size_t tc_idx)
 }
 
 /* Caller must ensure that we know tc_idx is valid and there's
-   available chunks to remove.  */
+   available chunks to remove.  Removes chunk from the middle of the
+   list.  */
 static __always_inline void *
-tcache_get (size_t tc_idx)
+tcache_get_n (size_t tc_idx, tcache_entry **ep)
 {
-  tcache_entry *e = tcache->entries[tc_idx];
+  tcache_entry *e;
+  if (ep == &(tcache->entries[tc_idx]))
+    e = *ep;
+  else
+    e = REVEAL_PTR (*ep);
+
   if (__glibc_unlikely (!aligned_OK (e)))
     malloc_printerr ("malloc(): unaligned tcache chunk detected");
-  tcache->entries[tc_idx] = REVEAL_PTR (e->next);
+
+  if (ep == &(tcache->entries[tc_idx]))
+    *ep = REVEAL_PTR (e->next);
+  else
+    *ep = PROTECT_PTR (ep, REVEAL_PTR (e->next));
+
   --(tcache->counts[tc_idx]);
   e->key = 0;
   return (void *) e;
 }
 
+/* Like the above, but removes from the head of the list.  */
+static __always_inline void *
+tcache_get (size_t tc_idx)
+{
+  return tcache_get_n (tc_idx, & tcache->entries[tc_idx]);
+}
+
+/* Iterates through the tcache linked list.  */
+static __always_inline tcache_entry *
+tcache_next (tcache_entry *e)
+{
+  return (tcache_entry *) REVEAL_PTR (e->next);
+}
+
 static void
 tcache_thread_shutdown (void)
 {
@@ -3283,7 +3308,7 @@  __libc_malloc (size_t bytes)
 
   DIAG_PUSH_NEEDS_COMMENT;
   if (tc_idx < mp_.tcache_bins
-      && tcache
+      && tcache != NULL
       && tcache->counts[tc_idx] > 0)
     {
       victim = tcache_get (tc_idx);
@@ -3542,6 +3567,38 @@  _mid_memalign (size_t alignment, size_t bytes, void *address)
       alignment = a;
     }
 
+#if USE_TCACHE
+  {
+    size_t tbytes;
+    tbytes = checked_request2size (bytes);
+    if (tbytes == 0)
+      {
+	__set_errno (ENOMEM);
+	return NULL;
+      }
+    size_t tc_idx = csize2tidx (tbytes);
+
+    if (tc_idx < mp_.tcache_bins
+	&& tcache != NULL
+	&& tcache->counts[tc_idx] > 0)
+      {
+	/* The tcache itself isn't encoded, but the chain is.  */
+	tcache_entry **tep = & tcache->entries[tc_idx];
+	tcache_entry *te = *tep;
+	while (te != NULL && !PTR_IS_ALIGNED (te, alignment))
+	  {
+	    tep = & (te->next);
+	    te = tcache_next (te);
+	  }
+	if (te != NULL)
+	  {
+	    void *victim = tcache_get_n (tc_idx, tep);
+	    return tag_new_usable (victim);
+	  }
+      }
+  }
+#endif
+
   if (SINGLE_THREAD_P)
     {
       p = _int_memalign (&main_arena, alignment, bytes);
@@ -3847,7 +3904,7 @@  _int_malloc (mstate av, size_t bytes)
 	      /* While we're here, if we see other chunks of the same size,
 		 stash them in the tcache.  */
 	      size_t tc_idx = csize2tidx (nb);
-	      if (tcache && tc_idx < mp_.tcache_bins)
+	      if (tcache != NULL && tc_idx < mp_.tcache_bins)
 		{
 		  mchunkptr tc_victim;
 
@@ -3905,7 +3962,7 @@  _int_malloc (mstate av, size_t bytes)
 	  /* While we're here, if we see other chunks of the same size,
 	     stash them in the tcache.  */
 	  size_t tc_idx = csize2tidx (nb);
-	  if (tcache && tc_idx < mp_.tcache_bins)
+	  if (tcache != NULL && tc_idx < mp_.tcache_bins)
 	    {
 	      mchunkptr tc_victim;
 
@@ -3967,7 +4024,7 @@  _int_malloc (mstate av, size_t bytes)
 #if USE_TCACHE
   INTERNAL_SIZE_T tcache_nb = 0;
   size_t tc_idx = csize2tidx (nb);
-  if (tcache && tc_idx < mp_.tcache_bins)
+  if (tcache != NULL && tc_idx < mp_.tcache_bins)
     tcache_nb = nb;
   int return_cached = 0;
 
@@ -4047,7 +4104,7 @@  _int_malloc (mstate av, size_t bytes)
 #if USE_TCACHE
 	      /* Fill cache first, return to user only if cache fills.
 		 We may return one of these chunks later.  */
-	      if (tcache_nb
+	      if (tcache_nb > 0
 		  && tcache->counts[tc_idx] < mp_.tcache_count)
 		{
 		  tcache_put (victim, tc_idx);
@@ -4921,6 +4978,43 @@  _int_realloc (mstate av, mchunkptr oldp, INTERNAL_SIZE_T oldsize,
    ------------------------------ memalign ------------------------------
  */
 
+/* Returns 0 if the chunk is not and does not contain the requested
+   aligned sub-chunk, else returns the amount of "waste" from
+   trimming.  BYTES is the *user* byte size, not the chunk byte
+   size.  */
+static size_t
+chunk_ok_for_memalign (mchunkptr p, size_t alignment, size_t bytes)
+{
+  void *m = chunk2mem (p);
+  INTERNAL_SIZE_T size = memsize (p);
+  void *aligned_m = m;
+
+  if (__glibc_unlikely (misaligned_chunk (p)))
+    malloc_printerr ("_int_memalign(): unaligned chunk detected");
+
+  aligned_m = PTR_ALIGN_UP (m, alignment);
+
+  INTERNAL_SIZE_T front_extra = (intptr_t) aligned_m - (intptr_t) m;
+
+  /* We can't trim off the front as it's too small.  */
+  if (front_extra > 0 && front_extra < MINSIZE)
+    return 0;
+
+  /* If it's a perfect fit, it's an exception to the return value rule
+     (we would return zero waste, which looks like "not usable"), so
+     handle it here by returning a small non-zero value instead.  */
+  if (size == bytes && front_extra == 0)
+    return 1;
+
+  /* If the block we need fits in the chunk, calculate total waste.  */
+  if (size > bytes + front_extra)
+    return size - bytes;
+
+  /* Can't use this chunk.  */
+  return 0;
+}
+
+/* BYTES is user requested bytes, not requested chunksize bytes.  */
 static void *
 _int_memalign (mstate av, size_t alignment, size_t bytes)
 {
@@ -4934,8 +5028,7 @@  _int_memalign (mstate av, size_t alignment, size_t bytes)
   mchunkptr remainder;            /* spare room at end to split off */
   unsigned long remainder_size;   /* its size */
   INTERNAL_SIZE_T size;
-
-
+  mchunkptr victim;
 
   nb = checked_request2size (bytes);
   if (nb == 0)
@@ -4944,29 +5037,142 @@  _int_memalign (mstate av, size_t alignment, size_t bytes)
       return NULL;
     }
 
-  /*
-     Strategy: find a spot within that chunk that meets the alignment
+  /* We can't check tcache here because we hold the arena lock, which
+     tcache doesn't expect.  We expect it has been checked
+     earlier.  */
+
+  /* Strategy: search the bins looking for an existing block that
+     meets our needs.  We scan a range of bins from "exact size" to
+     "just under 2x", spanning the small/large barrier if needed.  If
+     we don't find anything in those bins, the common malloc code will
+     scan starting at 2x.  */
+
+  /* This will be set if we found a candidate chunk.  */
+  victim = NULL;
+
+  /* Fast bins are singly-linked, hard to remove a chunk from the middle
+     and unlikely to meet our alignment requirements.  We have not done
+     any experimentation with searching for aligned fastbins.  */
+
+  int first_bin_index;
+  int first_largebin_index;
+  int last_bin_index;
+
+  if (in_smallbin_range (nb))
+    first_bin_index = smallbin_index (nb);
+  else
+    first_bin_index = largebin_index (nb);
+
+  if (in_smallbin_range (nb * 2))
+    last_bin_index = smallbin_index (nb * 2);
+  else
+    last_bin_index = largebin_index (nb * 2);
+
+  first_largebin_index = largebin_index (MIN_LARGE_SIZE);
+
+  int victim_index;                 /* its bin index */
+
+  for (victim_index = first_bin_index;
+       victim_index < last_bin_index;
+       victim_index ++)
+    {
+      victim = NULL;
+
+      if (victim_index < first_largebin_index)
+    {
+      /* Check small bins.  Small bin chunks are doubly-linked despite
+	 being the same size.  */
+
+      mchunkptr fwd;                    /* misc temp for linking */
+      mchunkptr bck;                    /* misc temp for linking */
+
+      bck = bin_at (av, victim_index);
+      fwd = bck->fd;
+      while (fwd != bck)
+	{
+	  if (chunk_ok_for_memalign (fwd, alignment, bytes) > 0)
+	    {
+	      victim = fwd;
+
+	      /* Unlink it */
+	      victim->fd->bk = victim->bk;
+	      victim->bk->fd = victim->fd;
+	      break;
+	    }
+
+	  fwd = fwd->fd;
+	}
+    }
+  else
+    {
+      /* Check large bins.  */
+      mchunkptr fwd;                    /* misc temp for linking */
+      mchunkptr bck;                    /* misc temp for linking */
+      mchunkptr best = NULL;
+      size_t best_size = 0;
+
+      bck = bin_at (av, victim_index);
+      fwd = bck->fd;
+
+      while (fwd != bck)
+	{
+	  int extra;
+
+	  if (chunksize (fwd) < nb)
+	      break;
+	  extra = chunk_ok_for_memalign (fwd, alignment, bytes);
+	  if (extra > 0
+	      && (extra <= best_size || best == NULL))
+	    {
+	      best = fwd;
+	      best_size = extra;
+	    }
+
+	  fwd = fwd->fd;
+	}
+      victim = best;
+
+      if (victim != NULL)
+	{
+	  unlink_chunk (av, victim);
+	  break;
+	}
+    }
+
+      if (victim != NULL)
+	break;
+    }
+
+  /* Strategy: find a spot within that chunk that meets the alignment
      request, and then possibly free the leading and trailing space.
-   */
+     This strategy is incredibly costly and can lead to external
+     fragmentation if header and footer chunks are unused.  */
 
-  /* Call malloc with worst case padding to hit alignment. */
+  if (victim != NULL)
+    {
+      p = victim;
+      m = chunk2mem (p);
+      set_inuse (p);
+    }
+  else
+    {
+      /* Call malloc with worst case padding to hit alignment. */
 
-  m = (char *) (_int_malloc (av, nb + alignment + MINSIZE));
+      m = (char *) (_int_malloc (av, nb + alignment + MINSIZE));
 
-  if (m == 0)
-    return 0;           /* propagate failure */
+      if (m == 0)
+	return 0;           /* propagate failure */
 
-  p = mem2chunk (m);
+      p = mem2chunk (m);
+    }
 
   if ((((unsigned long) (m)) % alignment) != 0)   /* misaligned */
-
-    { /*
-                Find an aligned spot inside chunk.  Since we need to give back
-                leading space in a chunk of at least MINSIZE, if the first
-                calculation places us at a spot with less than MINSIZE leader,
-                we can move to the next aligned spot -- we've allocated enough
-                total room so that this is always possible.
-                 */
+    {
+      /* Find an aligned spot inside chunk.  Since we need to give back
+         leading space in a chunk of at least MINSIZE, if the first
+         calculation places us at a spot with less than MINSIZE leader,
+         we can move to the next aligned spot -- we've allocated enough
+         total room so that this is always possible.  */
       brk = (char *) mem2chunk (((unsigned long) (m + alignment - 1)) &
                                 - ((signed long) alignment));
       if ((unsigned long) (brk - (char *) (p)) < MINSIZE)
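
As a reading aid for chunk_ok_for_memalign in the malloc.c changes above, here is a standalone model of its return-value convention. It is only a sketch: `model_waste` and `MODEL_MINSIZE` are made-up names, addresses are passed as plain integers instead of an mchunkptr, and the value 32 for the minimum chunk size is an assumption rather than something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for glibc's MINSIZE; the real value depends on
   the target.  */
#define MODEL_MINSIZE 32

/* Model of chunk_ok_for_memalign: MEM is the user address of a free
   chunk, SIZE its usable size, BYTES the user request.  Returns 0 if
   the chunk cannot be used, otherwise a non-zero "waste" score.  */
static size_t
model_waste (uintptr_t mem, size_t size, size_t alignment, size_t bytes)
{
  /* ALIGNMENT must be a power of two, as for memalign.  */
  assert (alignment != 0 && (alignment & (alignment - 1)) == 0);

  /* Round MEM up to the next multiple of ALIGNMENT.  */
  uintptr_t aligned = (mem + alignment - 1) & ~(uintptr_t) (alignment - 1);
  size_t front_extra = aligned - mem;

  /* A leading gap smaller than a minimal chunk cannot be split off.  */
  if (front_extra > 0 && front_extra < MODEL_MINSIZE)
    return 0;

  /* Perfect fit: report 1 so that it is not mistaken for "unusable".  */
  if (size == bytes && front_extra == 0)
    return 1;

  /* Same comparison as the patch: the request plus the gap must fit.  */
  if (size > bytes + front_extra)
    return size - bytes;

  /* Can't use this chunk.  */
  return 0;
}
```

Under these assumptions, a 256-byte block at 0x1010 asked for 100 bytes at 64-byte alignment scores 156, while the same block at 0x1030 is rejected because its 16-byte leader is too small to split off.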
diff --git a/malloc/tst-memalign-2.c b/malloc/tst-memalign-2.c
new file mode 100644
index 0000000000..4996578e9f
--- /dev/null
+++ b/malloc/tst-memalign-2.c
@@ -0,0 +1,155 @@ 
+/* Test for memalign chunk reuse.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <malloc.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <array_length.h>
+#include <libc-pointer-arith.h>
+#include <support/check.h>
+
+typedef struct TestCase {
+  size_t size;
+  size_t alignment;
+  void *ptr1;
+  void *ptr2;
+} TestCase;
+
+static TestCase tcache_allocs[] = {
+  { 24, 8, NULL, NULL },
+  { 24, 16, NULL, NULL },
+  { 128, 32, NULL, NULL }
+};
+#define TN array_length (tcache_allocs)
+
+static TestCase large_allocs[] = {
+  { 23450, 64, NULL, NULL },
+  { 23450, 64, NULL, NULL },
+  { 23550, 64, NULL, NULL },
+  { 23550, 64, NULL, NULL },
+  { 23650, 64, NULL, NULL },
+  { 23650, 64, NULL, NULL },
+  { 33650, 64, NULL, NULL },
+  { 33650, 64, NULL, NULL }
+};
+#define LN array_length (large_allocs)
+
+void *p;
+
+/* Sanity checks, ancillary to the actual test.  */
+#define CHECK(p,a) \
+  if (p == NULL || !PTR_IS_ALIGNED (p, a)) \
+    FAIL_EXIT1 ("NULL or misaligned memory detected.\n");
+
+static int
+do_test (void)
+{
+  int i, j;
+  int count;
+  void *ptr[10];
+  void *p;
+
+  /* TCache test.  */
+
+  for (i = 0; i < TN; ++ i)
+    {
+      tcache_allocs[i].ptr1 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
+      CHECK (tcache_allocs[i].ptr1, tcache_allocs[i].alignment);
+      free (tcache_allocs[i].ptr1);
+      /* This should return the same chunk as was just free'd.  */
+      tcache_allocs[i].ptr2 = memalign (tcache_allocs[i].alignment, tcache_allocs[i].size);
+      CHECK (tcache_allocs[i].ptr2, tcache_allocs[i].alignment);
+      free (tcache_allocs[i].ptr2);
+
+      TEST_VERIFY (tcache_allocs[i].ptr1 == tcache_allocs[i].ptr2);
+    }
+
+  /* Test for non-head tcache hits.  */
+  for (i = 0; i < array_length (ptr); ++ i)
+    {
+      if (i == 4)
+	{
+	  ptr[i] = memalign (64, 256);
+	  CHECK (ptr[i], 64);
+	}
+      else
+	{
+	  ptr[i] = malloc (256);
+	  CHECK (ptr[i], 4);
+	}
+    }
+  for (i = 0; i < array_length (ptr); ++ i)
+    free (ptr[i]);
+
+  p = memalign (64, 256);
+  CHECK (p, 64);
+
+  count = 0;
+  for (i = 0; i < 10; ++ i)
+    if (ptr[i] == p)
+      ++ count;
+  free (p);
+  TEST_VERIFY (count > 0);
+
+  /* Large bins test.  */
+
+  for (i = 0; i < LN; ++ i)
+    {
+      large_allocs[i].ptr1 = memalign (large_allocs[i].alignment, large_allocs[i].size);
+      CHECK (large_allocs[i].ptr1, large_allocs[i].alignment);
+      /* Keep chunks from combining by fragmenting the heap.  */
+      p = malloc (512);
+      CHECK (p, 4);
+    }
+
+  for (i = 0; i < LN; ++ i)
+    free (large_allocs[i].ptr1);
+
+  /* Force the unsorted bins to be scanned and moved to small/large
+     bins.  */
+  p = malloc (60000);
+
+  for (i = 0; i < LN; ++ i)
+    {
+      large_allocs[i].ptr2 = memalign (large_allocs[i].alignment, large_allocs[i].size);
+      CHECK (large_allocs[i].ptr2, large_allocs[i].alignment);
+    }
+
+  count = 0;
+  for (i = 0; i < LN; ++ i)
+    {
+      int ok = 0;
+      for (j = 0; j < LN; ++ j)
+	if (large_allocs[i].ptr1 == large_allocs[j].ptr2)
+	  ok = 1;
+      if (ok == 1)
+	count ++;
+    }
+
+  /* The allocation algorithm is complicated outside of the memalign
+     logic, so just make sure it's working for most of the
+     allocations.  This avoids possible boundary conditions with
+     empty/full heaps.  */
+  TEST_VERIFY (count > LN / 2);
+
+  return 0;
+}
+
+#include <support/test-driver.c>
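
The "non-head tcache hits" part of the test depends on the scan in _mid_memalign being able to take an entry out of the middle of a singly linked bin. The walk can be modelled outside glibc roughly as below. This is a simplified sketch: `struct node` and `take_first_aligned` are illustrative names, and the PROTECT_PTR/REVEAL_PTR encoding, the per-bin counts, and the key field handled by tcache_get_n are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative node type; glibc's tcache_entry additionally stores an
   encoded NEXT pointer and a KEY field.  */
struct node
{
  struct node *next;
};

/* Unlink and return the first node in *HEAD whose address is a multiple
   of ALIGNMENT (a power of two), or NULL if no such node exists.  */
static struct node *
take_first_aligned (struct node **head, size_t alignment)
{
  assert (alignment != 0 && (alignment & (alignment - 1)) == 0);

  /* LINK always points at the pointer that refers to the current node:
     first the list head, later some node's NEXT field.  */
  struct node **link = head;
  while (*link != NULL && ((uintptr_t) *link & (alignment - 1)) != 0)
    link = &(*link)->next;

  struct node *found = *link;
  if (found != NULL)
    {
      /* Storing through LINK works for both the head and the middle.  */
      *link = found->next;
      found->next = NULL;
    }
  return found;
}
```

Keeping a pointer to the link rather than to the previous node is what lets a single store unlink either a head entry or a middle entry. In the patch, tcache_get_n has to distinguish the two cases only because the head pointer is stored unencoded while the chain pointers are encoded.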