[RFC] Calling realloc vs. mallo/memcpy/free and _IO_vasprintf.

Message ID 5657C135.1090807@redhat.com
State Dropped
Headers

Commit Message

Carlos O'Donell Nov. 27, 2015, 2:34 a.m. UTC
  While answering a question on libc-help about asprintf I happened
to look at _IO_vasprintf and noticed some code which tries to
decide if it should do a realloc vs. malloc/memcpy/free.

In this particular case we should always be shrinking the buffer.
For example _IO_vfprintf may have expanded the stream buffer but
now no longer needs the extra space. If we are growing the buffer
then something is very very wrong since that means _IO_vpfrintf
will have failed due to lack of memory, and we handled that case
earlier in _IO_vasprintf.

So again, if we are only shrinking the buffer, then realloc has
to be the most optimal operation in all cases. It would seem to
me that an allocation shrink like this would almost never result
in an allocation move, but rather malloc would just split the
allocation and reuse the remainder for other malloc calls.

So why the micro-optimization?

Is my assumption of always shrinking the buffer wrong?

If my assumption is right, and my logic is right, wouldn't the
following patch be a natural cleanup?

---

Cheers,
Carlos.
  

Comments

Siddhesh Poyarekar Nov. 27, 2015, 3:14 a.m. UTC | #1
On Thu, Nov 26, 2015 at 09:34:29PM -0500, Carlos O'Donell wrote:
> While answering a question on libc-help about asprintf I happened
> to look at _IO_vasprintf and noticed some code which tries to
> decide if it should do a realloc vs. malloc/memcpy/free.
> 
> In this particular case we should always be shrinking the buffer.
> For example _IO_vfprintf may have expanded the stream buffer but
> now no longer needs the extra space. If we are growing the buffer
> then something is very very wrong since that means _IO_vpfrintf
> will have failed due to lack of memory, and we handled that case
> earlier in _IO_vasprintf.
> 
> So again, if we are only shrinking the buffer, then realloc has
> to be the most optimal operation in all cases. It would seem to
> me that an allocation shrink like this would almost never result
> in an allocation move, but rather malloc would just split the
> allocation and reuse the remainder for other malloc calls.
> 
> So why the micro-optimization?
> 
> Is my assumption of always shrinking the buffer wrong?
> 
> If my assumption is right, and my logic is right, wouldn't the
> following patch be a natural cleanup?

Why would malloc+memcpy+free be faster than a realloc, which does the
same thing internally, but without adding the extra PLT and avoiding
memcpy for smaller copies?  I'd say just drop that bit and stick to
realloc.

Siddhesh
  
Carlos O'Donell Nov. 27, 2015, 3:28 a.m. UTC | #2
On 11/26/2015 10:14 PM, Siddhesh Poyarekar wrote:
> On Thu, Nov 26, 2015 at 09:34:29PM -0500, Carlos O'Donell wrote:
>> While answering a question on libc-help about asprintf I happened
>> to look at _IO_vasprintf and noticed some code which tries to
>> decide if it should do a realloc vs. malloc/memcpy/free.
>>
>> In this particular case we should always be shrinking the buffer.
>> For example _IO_vfprintf may have expanded the stream buffer but
>> now no longer needs the extra space. If we are growing the buffer
>> then something is very very wrong since that means _IO_vpfrintf
>> will have failed due to lack of memory, and we handled that case
>> earlier in _IO_vasprintf.
>>
>> So again, if we are only shrinking the buffer, then realloc has
>> to be the most optimal operation in all cases. It would seem to
>> me that an allocation shrink like this would almost never result
>> in an allocation move, but rather malloc would just split the
>> allocation and reuse the remainder for other malloc calls.
>>
>> So why the micro-optimization?
>>
>> Is my assumption of always shrinking the buffer wrong?
>>
>> If my assumption is right, and my logic is right, wouldn't the
>> following patch be a natural cleanup?
> 
> Why would malloc+memcpy+free be faster than a realloc, which does the
> same thing internally, but without adding the extra PLT and avoiding
> memcpy for smaller copies?  I'd say just drop that bit and stick to
> realloc.

Agreed. It appears to me that _int_realloc always splits the allocated
region, and frees the remainder if >= MINSIZE (non-mmaped case), which
is a faster operation. In the mmaped case we call mremap which will
shrink the mapping, which is a syscall, and in that case *might* be
slower tahn malloc/memcpy/free due to the cost of entering the kernel
and the work needing to be done. However, I would argue that
malloc/memcpy/free in the mean performance case is likely to be a
lot more work than just realloc.

Cheers,
Carlos.
  
Carlos O'Donell Nov. 27, 2015, 5:38 a.m. UTC | #3
On 11/26/2015 10:14 PM, Siddhesh Poyarekar wrote:
> On Thu, Nov 26, 2015 at 09:34:29PM -0500, Carlos O'Donell wrote:
>> While answering a question on libc-help about asprintf I happened
>> to look at _IO_vasprintf and noticed some code which tries to
>> decide if it should do a realloc vs. malloc/memcpy/free.
>>
>> In this particular case we should always be shrinking the buffer.
>> For example _IO_vfprintf may have expanded the stream buffer but
>> now no longer needs the extra space. If we are growing the buffer
>> then something is very very wrong since that means _IO_vpfrintf
>> will have failed due to lack of memory, and we handled that case
>> earlier in _IO_vasprintf.
>>
>> So again, if we are only shrinking the buffer, then realloc has
>> to be the most optimal operation in all cases. It would seem to
>> me that an allocation shrink like this would almost never result
>> in an allocation move, but rather malloc would just split the
>> allocation and reuse the remainder for other malloc calls.
>>
>> So why the micro-optimization?
>>
>> Is my assumption of always shrinking the buffer wrong?
>>
>> If my assumption is right, and my logic is right, wouldn't the
>> following patch be a natural cleanup?
> 
> Why would malloc+memcpy+free be faster than a realloc, which does the
> same thing internally, but without adding the extra PLT and avoiding
> memcpy for smaller copies?  I'd say just drop that bit and stick to
> realloc.
 
Just to verify my assertion about the buffer always being shrunk, I added:

  /* The amount the stream has allocated is always >= what we needed.  */
  assert ((sf._sbf._f._IO_write_end - sf._sbf._f._IO_write_base)
          >= needed);

And it triggers in the testsuite:

(gdb) p sf._sbf._f._IO_write_end
$2 = 0x9c5054 ""
(gdb) p sf._sbf._f._IO_write_base
$3 = 0x9c4ff0 "fma_downward (0x3.", '0' <repeats 12 times>, "1ffcp+10132, -0x2.", '0' <repeats 15 times>, "4p+4432, 0x1.80e000000003fffep+14516)"
(gdb) p needed
$4 = 101
(gdb) p sf._sbf._f._IO_write_end - sf._sbf._f._IO_write_base
$5 = 100
(gdb) 

It's a case where the initial working buffer is exactly equal to what's
needed minus the NULL, and because we have to return a null-terminated
string we must grow the buffer by 1 byte.

I still don't see how realloc is going to be slower since it's just
one call, and there *might* be enough space in this chunk or the next
to grow by one byte and that avoids the memcpy.

Cheers,
Carlos.
  
Siddhesh Poyarekar Nov. 27, 2015, 5:43 a.m. UTC | #4
On Fri, Nov 27, 2015 at 12:38:45AM -0500, Carlos O'Donell wrote:
> I still don't see how realloc is going to be slower since it's just
> one call, and there *might* be enough space in this chunk or the next
> to grow by one byte and that avoids the memcpy.

Agreed, in fact even in the mremap case you mentioned, the
malloc+memcpy+free ought to be slower because you end up calling two
syscalls (mmap followed by munmap) instead of just mremap in case of
realloc.  I don't think there is a situation where realloc calls
mremap and the malloc+memcpy+free combination allocates on the heap.

Siddhesh
  
Carlos O'Donell Nov. 27, 2015, 6:09 a.m. UTC | #5
On 11/27/2015 12:43 AM, Siddhesh Poyarekar wrote:
> On Fri, Nov 27, 2015 at 12:38:45AM -0500, Carlos O'Donell wrote:
>> I still don't see how realloc is going to be slower since it's just
>> one call, and there *might* be enough space in this chunk or the next
>> to grow by one byte and that avoids the memcpy.
> 
> Agreed, in fact even in the mremap case you mentioned, the
> malloc+memcpy+free ought to be slower because you end up calling two
> syscalls (mmap followed by munmap) instead of just mremap in case of
> realloc.  I don't think there is a situation where realloc calls
> mremap and the malloc+memcpy+free combination allocates on the heap.

The *only* interesting case is: What if we don't realloc a too-large buffer?

What if we kept the too-large buffer, null-terminated the string, and
returned that to the caller? You are trading off size of allocated
buffer versus realloc cost.

My reading of vfprintf.c shows that we malloc rather conservatively
to grow the working buffer as we memcpy (xsputn) the results into place.
Therefore the working buffer shouldn't be that much larger than what's
needed, and that means we might just get away with avoiding the realloc
in the case the buffer is more than 1 byte (for NULL) larger than required?

So in 2000 (15 years ago), Ulrich added this code, it previously did
an unconditional realloc as we suggest today.

The comment was:

commit 4100c5a0e62c461adf2757a163dafbbc02ca3fc4
Author: Ulrich Drepper <drepper@redhat.com>
Date:   Tue Apr 11 17:53:41 2000 +0000

    (_IO_vasprintf): Try to avoid memory fragmentation by allocating
    new memory at the end instead of reallocating.

I agree that realloc has the potential to fragment the arena, but we have
no tools to measure this. If we are going to micro-optimize to reduce
heap fragmentation we are going to need a lot more instrumentation in
malloc.

Thoughts?

Cheers,
Carlos.
  
Rich Felker Nov. 27, 2015, 3:45 p.m. UTC | #6
On Thu, Nov 26, 2015 at 09:34:29PM -0500, Carlos O'Donell wrote:
> While answering a question on libc-help about asprintf I happened
> to look at _IO_vasprintf and noticed some code which tries to
> decide if it should do a realloc vs. malloc/memcpy/free.
> 
> In this particular case we should always be shrinking the buffer.
> For example _IO_vfprintf may have expanded the stream buffer but
> now no longer needs the extra space. If we are growing the buffer
> then something is very very wrong since that means _IO_vpfrintf
> will have failed due to lack of memory, and we handled that case
> earlier in _IO_vasprintf.
> 
> So again, if we are only shrinking the buffer, then realloc has
> to be the most optimal operation in all cases. It would seem to
> me that an allocation shrink like this would almost never result
> in an allocation move, but rather malloc would just split the
> allocation and reuse the remainder for other malloc calls.
> 
> So why the micro-optimization?
> 
> Is my assumption of always shrinking the buffer wrong?

I suspect the aim is not optimization but rather to prevent
fragmentation from leaving a small allocation behind in an area carved
out of large free space. Whether the code is effective at doing this,
I'm not sure.

Rich
  
Carlos O'Donell Nov. 27, 2015, 4:07 p.m. UTC | #7
On 11/27/2015 10:45 AM, Rich Felker wrote:
> On Thu, Nov 26, 2015 at 09:34:29PM -0500, Carlos O'Donell wrote:
>> While answering a question on libc-help about asprintf I happened
>> to look at _IO_vasprintf and noticed some code which tries to
>> decide if it should do a realloc vs. malloc/memcpy/free.
>>
>> In this particular case we should always be shrinking the buffer.
>> For example _IO_vfprintf may have expanded the stream buffer but
>> now no longer needs the extra space. If we are growing the buffer
>> then something is very very wrong since that means _IO_vpfrintf
>> will have failed due to lack of memory, and we handled that case
>> earlier in _IO_vasprintf.
>>
>> So again, if we are only shrinking the buffer, then realloc has
>> to be the most optimal operation in all cases. It would seem to
>> me that an allocation shrink like this would almost never result
>> in an allocation move, but rather malloc would just split the
>> allocation and reuse the remainder for other malloc calls.
>>
>> So why the micro-optimization?
>>
>> Is my assumption of always shrinking the buffer wrong?
> 
> I suspect the aim is not optimization but rather to prevent
> fragmentation from leaving a small allocation behind in an area carved
> out of large free space. Whether the code is effective at doing this,
> I'm not sure.

As I mention here in the final patch:
https://www.sourceware.org/ml/libc-alpha/2015-11/msg00607.html

My preference is for realloc to be doing this kind of work by making
a decision to split or move based on some kind of fragmentation
estimate (which it carries as part of the arena information perhaps).

I don't know that the caller is really the best place to put this
heuristic.

Thoughts?

Cheers,
Carlos.
  
Rich Felker Nov. 28, 2015, 6:09 p.m. UTC | #8
On Fri, Nov 27, 2015 at 11:07:12AM -0500, Carlos O'Donell wrote:
> On 11/27/2015 10:45 AM, Rich Felker wrote:
> > On Thu, Nov 26, 2015 at 09:34:29PM -0500, Carlos O'Donell wrote:
> >> While answering a question on libc-help about asprintf I happened
> >> to look at _IO_vasprintf and noticed some code which tries to
> >> decide if it should do a realloc vs. malloc/memcpy/free.
> >>
> >> In this particular case we should always be shrinking the buffer.
> >> For example _IO_vfprintf may have expanded the stream buffer but
> >> now no longer needs the extra space. If we are growing the buffer
> >> then something is very very wrong since that means _IO_vpfrintf
> >> will have failed due to lack of memory, and we handled that case
> >> earlier in _IO_vasprintf.
> >>
> >> So again, if we are only shrinking the buffer, then realloc has
> >> to be the most optimal operation in all cases. It would seem to
> >> me that an allocation shrink like this would almost never result
> >> in an allocation move, but rather malloc would just split the
> >> allocation and reuse the remainder for other malloc calls.
> >>
> >> So why the micro-optimization?
> >>
> >> Is my assumption of always shrinking the buffer wrong?
> > 
> > I suspect the aim is not optimization but rather to prevent
> > fragmentation from leaving a small allocation behind in an area carved
> > out of large free space. Whether the code is effective at doing this,
> > I'm not sure.
> 
> As I mention here in the final patch:
> https://www.sourceware.org/ml/libc-alpha/2015-11/msg00607.html
> 
> My preference is for realloc to be doing this kind of work by making
> a decision to split or move based on some kind of fragmentation
> estimate (which it carries as part of the arena information perhaps).
> 
> I don't know that the caller is really the best place to put this
> heuristic.
> 
> Thoughts?

I'm not clear on which is best. The caller has more knowledge that the
allocator implementation fundamentally cannot have, like knowing
whether it will pass the object to realloc again to enlarge it again,
etc. and there's also the issue that even if glibc's realloc is smart
about this, replacement allocator implementations (which glibc allows)
may not be. Personally, I try to avoid using realloc for any purposes
except enlarging existing allocations either to their now-known final
size or as part of an exponential growth strategy that bounds the
number of memcpy operations that might be needed.

Rich
  

Patch

diff --git a/libio/vasprintf.c b/libio/vasprintf.c
index 61cdfdd..fff8c54 100644
--- a/libio/vasprintf.c
+++ b/libio/vasprintf.c
@@ -62,24 +62,13 @@  _IO_vasprintf (char **result_ptr, const char *format, _IO_va_list args)
       free (sf._sbf._f._IO_buf_base);
       return ret;
     }
-  /* Only use realloc if the size we need is of the same (binary)
-     order of magnitude then the memory we allocated.  */
   needed = sf._sbf._f._IO_write_ptr - sf._sbf._f._IO_write_base + 1;
   allocated = sf._sbf._f._IO_write_end - sf._sbf._f._IO_write_base;
-  if ((allocated >> 1) <= needed)
-    *result_ptr = (char *) realloc (sf._sbf._f._IO_buf_base, needed);
-  else
-    {
-      *result_ptr = (char *) malloc (needed);
-      if (*result_ptr != NULL)
-       {
-         memcpy (*result_ptr, sf._sbf._f._IO_buf_base, needed - 1);
-         free (sf._sbf._f._IO_buf_base);
-       }
-      else
-       /* We have no choice, use the buffer we already have.  */
-       *result_ptr = (char *) realloc (sf._sbf._f._IO_buf_base, needed);
-    }
+  /* We are either shrinking the buffer or doing nothing, otherwise
+     _IO_vfprintf would have failed itself being unable to grow the
+     buffer to the needed size.  Therefore it's always fastest here to
+     call realloc.  */
+  *result_ptr = (char *) realloc (sf._sbf._f._IO_buf_base, needed);
   if (*result_ptr == NULL)
     *result_ptr = sf._sbf._f._IO_buf_base;
   (*result_ptr)[needed - 1] = '\0';