[4/5] Fix deadlock in _int_free consistency check

Message ID DB6PR0801MB20537B594B5629491605CD74834B0@DB6PR0801MB2053.eurprd08.prod.outlook.com
State New, archived
Headers

Commit Message

Wilco Dijkstra Oct. 12, 2017, 9:35 a.m. UTC
  This patch fixes a deadlock in the fastbin consistency check.
If we fail the fast check due to concurrent modifications to
the next chunk or system_mem, we should not lock if we already
have the arena lock.  Simplify the check to make it obviously
correct.

Passes GLIBC tests, OK for commit?

ChangeLog:
2017-10-11  Wilco Dijkstra  <wdijkstr@arm.com>

	* malloc/malloc.c (_int_free): Fix deadlock bug.
--
  

Comments

Andreas Schwab Oct. 12, 2017, 10:01 a.m. UTC | #1
On Okt 12 2017, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index c00df205c6004ee5b5d0aee9ffd5130b3c8f9e9f..f4f44400d120188c4d0bece996380e04b35c8fac 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -4168,15 +4168,14 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>  			     >= av->system_mem, 0))
>        {
>  	/* We might not have a lock at this point and concurrent modifications
> -	   of system_mem might have let to a false positive.  Redo the test
> -	   after getting the lock.  */
> -	if (!have_lock
> -	    || ({ __libc_lock_lock (av->mutex);
> -		  chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
> -		  || chunksize (chunk_at_offset (p, size)) >= av->system_mem;
> -	        }))
> +	   of system_mem might result in a false positive.  Redo the test after
> +	   getting the lock.  */
> +	if (!have_lock)
> +	  __libc_lock_lock (av->mutex);
> +	if (chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
> +	    || chunksize (chunk_at_offset (p, size)) >= av->system_mem)

There is no need to redo the tests if we had the lock.

Andreas.
  
Florian Weimer Oct. 12, 2017, 10:06 a.m. UTC | #2
On 10/12/2017 11:35 AM, Wilco Dijkstra wrote:
> This patch fixes a deadlock in the fastbin consistency check.
> If we fail the fast check due to concurrent modifications to
> the next chunk or system_mem, we should not lock if we already
> have the arena lock.  Simplify the check to make it obviously
> correct.

I don't think the subject line is correct.  What is the deadlock?  I 
don't see it.

Thanks,
Florian
  
Andreas Schwab Oct. 12, 2017, 10:16 a.m. UTC | #3
On Okt 12 2017, Florian Weimer <fweimer@redhat.com> wrote:

> On 10/12/2017 11:35 AM, Wilco Dijkstra wrote:
>> This patch fixes a deadlock in the fastbin consistency check.
>> If we fail the fast check due to concurrent modifications to
>> the next chunk or system_mem, we should not lock if we already
>> have the arena lock.  Simplify the check to make it obviously
>> correct.
>
> I don't think the subject line is correct.  What is the deadlock?  I don't
> see it.

The problem is that commit 24cffce736 inverted the condition on
have_lock, which is wrong.

Andreas.
  
Wilco Dijkstra Oct. 12, 2017, 10:18 a.m. UTC | #4
Florian Weimer wrote:
    
> I don't think the subject line is correct.  What is the deadlock?  I 
> don't see it.

> -	if (!have_lock
> -	    || ({ __libc_lock_lock (av->mutex);

It's right there. Have_lock means you've just done __libc_lock_lock (av->mutex),
so doing it again (same thread) implies deadlock.

Wilco
  
Wilco Dijkstra Oct. 12, 2017, 10:41 a.m. UTC | #5
Andreas Schwab wrote:

> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index c00df205c6004ee5b5d0aee9ffd5130b3c8f9e9f..f4f44400d120188c4d0bece996380e04b35c8fac 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -4168,15 +4168,14 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>                             >= av->system_mem, 0))
>        {
>        /* We might not have a lock at this point and concurrent modifications
> -        of system_mem might have let to a false positive.  Redo the test
> -        after getting the lock.  */
> -     if (!have_lock
> -         || ({ __libc_lock_lock (av->mutex);
> -               chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
> -               || chunksize (chunk_at_offset (p, size)) >= av->system_mem;
> -             }))
> +        of system_mem might result in a false positive.  Redo the test after
> +        getting the lock.  */
> +     if (!have_lock)
> +       __libc_lock_lock (av->mutex);
> +     if (chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
> +         || chunksize (chunk_at_offset (p, size)) >= av->system_mem)

> There is no need to redo the tests if we had the lock.

Well I guess an alternative is to do:

if (have_lock)
   print error
else
{
  lock
  repeat test and print error
  unlock
}

I also wonder whether we should actually unlock again before printing the error -
or do we assume/hope/know no memory allocation is ever required in the error case?

Wilco
  
Andreas Schwab Oct. 12, 2017, 10:47 a.m. UTC | #6
On Okt 12 2017, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Andreas Schwab wrote:
>
>> diff --git a/malloc/malloc.c b/malloc/malloc.c
>> index c00df205c6004ee5b5d0aee9ffd5130b3c8f9e9f..f4f44400d120188c4d0bece996380e04b35c8fac 100644
>> --- a/malloc/malloc.c
>> +++ b/malloc/malloc.c
>> @@ -4168,15 +4168,14 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>>                             >= av->system_mem, 0))
>>        {
>>        /* We might not have a lock at this point and concurrent modifications
>> -        of system_mem might have let to a false positive.  Redo the test
>> -        after getting the lock.  */
>> -     if (!have_lock
>> -         || ({ __libc_lock_lock (av->mutex);
>> -               chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
>> -               || chunksize (chunk_at_offset (p, size)) >= av->system_mem;
>> -             }))
>> +        of system_mem might result in a false positive.  Redo the test after
>> +        getting the lock.  */
>> +     if (!have_lock)
>> +       __libc_lock_lock (av->mutex);
>> +     if (chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
>> +         || chunksize (chunk_at_offset (p, size)) >= av->system_mem)
>
>> There is no need to redo the tests if we had the lock.
>
> Well I guess an alternative is to do:
>
> if (have_lock)
>    print error
> else
> {
>   lock
>   repeat test and print error
>   unlock
> }

No, you can just test have_lock again, and skip the redo if set.  Still
much clearer than the original layout.

Andreas.
  
Florian Weimer Oct. 12, 2017, 10:48 a.m. UTC | #7
On 10/12/2017 12:16 PM, Andreas Schwab wrote:
> On Okt 12 2017, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> On 10/12/2017 11:35 AM, Wilco Dijkstra wrote:
>>> This patch fixes a deadlock in the fastbin consistency check.
>>> If we fail the fast check due to concurrent modifications to
>>> the next chunk or system_mem, we should not lock if we already
>>> have the arena lock.  Simplify the check to make it obviously
>>> correct.
>>
>> I don't think the subject line is correct.  What is the deadlock?  I don't
>> see it.
> 
> The problem is that commit 24cffce736 inverted the condition on
> have_lock, which is wrong.

Oh, right, sorry about that.

Thanks,
Florian
  
Florian Weimer Oct. 12, 2017, 10:53 a.m. UTC | #8
On 10/12/2017 12:18 PM, Wilco Dijkstra wrote:
> Florian Weimer wrote:
>      
>> I don't think the subject line is correct.  What is the deadlock?  I
>> don't see it.
> 
>> -	if (!have_lock
>> -	    || ({ __libc_lock_lock (av->mutex);
> 
> It's right there. Have_lock means you've just done __libc_lock_lock (av->mutex),
> so doing it again (same thread) implies deadlock.

Hmm.

So if we enter this code path with have_lock, we don't have to re-do the 
check, but malloc_printerr will be called in the end anyway, so this is 
not the interesting case.

In practice, without heap corruption, the lock will be acquired here and 
re-checking is needed, so I think your cleanup is okay after all.  The 
logic is indeed much clearer.

Thanks,
Florian
  
Wilco Dijkstra Oct. 12, 2017, 9:35 p.m. UTC | #9
DJ Delorie wrote:
> I think the bug can be fixed by only changing the sense of the have_lock
> condition:

> -     if (!have_lock
> +     if (have_lock

Yes it could but then it's still impossible to follow the logic... The only reason
I spotted the bug was because I refactored the code. Then the bug suddenly
became very obvious. I think locking as a side effect in conditions is something
we should avoid.

So my variant or this one are reasonable ways to do it:

 if (error detected)
   {
     if (have_lock)
       /* we had the lock during the test above, so the test is valid,
          and the error we detect is valid.  */
       print error message
     else
       /* we didn't have the lock, so aquire it and repeat the test.  If
          the error is still present, fail.  */
       get lock, repeat test, maybe print error message
   }

I suppose we could also print the error once at the end:

if (error detected)
{
  bool fail = true;
  if (!have_lock)
  {
     get_lock
     fail = repeat test
     unlock
  }
  if (fail)
    print error
}

Wilco
  
Florian Weimer Oct. 17, 2017, 10:03 a.m. UTC | #10
On 10/12/2017 11:35 PM, Wilco Dijkstra wrote:
> I suppose we could also print the error once at the end:
> 
> if (error detected)
> {
>    bool fail = true;
>    if (!have_lock)
>    {
>       get_lock
>       fail = repeat test
>       unlock
>    }
>    if (fail)
>      print error
> }

Yes, that's also a good option.

Anyway, should I make the one-character change removing the ! in the 
meantime?  This is a real bug, and we should fix that even if the code 
still isn't as pretty as it could be.

Thanks,
Florian
  
Wilco Dijkstra Oct. 17, 2017, 10:49 a.m. UTC | #11
Florian Weimer wrote:
>
> Anyway, should I make the one-character change removing the ! in the 
> meantime?  This is a real bug, and we should fix that even if the code 
> still isn't as pretty as it could be.

I could easily commit the current patch as is if you're thinking of backporting it.

Once we agree on what the best way of writing this sequence, I can
provide an updated patch.

Wilco
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index c00df205c6004ee5b5d0aee9ffd5130b3c8f9e9f..f4f44400d120188c4d0bece996380e04b35c8fac 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -4168,15 +4168,14 @@  _int_free (mstate av, mchunkptr p, int have_lock)
 			     >= av->system_mem, 0))
       {
 	/* We might not have a lock at this point and concurrent modifications
-	   of system_mem might have let to a false positive.  Redo the test
-	   after getting the lock.  */
-	if (!have_lock
-	    || ({ __libc_lock_lock (av->mutex);
-		  chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
-		  || chunksize (chunk_at_offset (p, size)) >= av->system_mem;
-	        }))
+	   of system_mem might result in a false positive.  Redo the test after
+	   getting the lock.  */
+	if (!have_lock)
+	  __libc_lock_lock (av->mutex);
+	if (chunksize_nomask (chunk_at_offset (p, size)) <= 2 * SIZE_SZ
+	    || chunksize (chunk_at_offset (p, size)) >= av->system_mem)
 	  malloc_printerr ("free(): invalid next size (fast)");
-	if (! have_lock)
+	if (!have_lock)
 	  __libc_lock_unlock (av->mutex);
       }