[v2] malloc: Fix list_lock/arena lock deadlock [BZ #19182]

Message ID 56642691.7060806@redhat.com
State Superseded

Commit Message

Florian Weimer Dec. 6, 2015, 12:14 p.m. UTC
  On 11/03/2015 12:35 PM, Florian Weimer wrote:
> This is just a minimal change.  The fork handler lock acquisition has to
> go away anyway if we make fork async-signal-safe (bug 4737).

The attached version is slightly more elaborate.  It tries to preserve
the original lock order, but has a backup path in case the preferred
lock ordering is not possible.  It covers reused_arena as well, which
has a similar ordering violation.

Torvald, we need this patch for backporting, so a review of the
concurrency aspect would be extremely welcome. :)

Florian
  

Comments

Torvald Riegel Dec. 11, 2015, 4:49 p.m. UTC | #1
Disclaimer: I haven't yet cross-checked whether your analysis of which
lock protects which data is complete, nor do I know the malloc
synchronization details.  If you want, I can do the cross-check, but it
probably will take more time.  The comments below are just based on what
I saw in the immediate neighborhood of your changes.

On Sun, 2015-12-06 at 13:14 +0100, Florian Weimer wrote:
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 3dab7bb..834907c 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -69,9 +69,11 @@ extern int sanity_check_heap_info_alignment[(sizeof (heap_info)
>  static __thread mstate thread_arena attribute_tls_model_ie;
>  
>  /* Arena free list.  list_lock protects the free_list variable below,
> -   and the next_free and attached_threads members of the mstate
> -   objects.  No other (malloc) locks must be taken while list_lock is
> -   active, otherwise deadlocks may occur.  */
> +   the next_free and attached_threads members of struct malloc_state
> +   objects, and concurrent writes to the next member of these objects.
> +   (Read access to the next member is synchronized via a barrier.)
> +   When list_lock is acquired, no arena lock must be acquired, but it
> +   is permitted to acquire arena locks after list_lock.  */

What about main_arena?  In _int_new_arena, it looks like main_arena is
protected by list_lock too.

I would provide more detail on the synchronization of read-only
traversals of the list.  list_lock seems to be used to enforce mutual
exclusion between modification operations to the list (including
necessary reading/traversal of the list, so you can't go in without a
lock, figure out what you need to change, and then grab list_lock).  For
traversal-only / write-op synchronization, atomic operations are used.
This is the reason why write ops do not need to use acquire fences when
traversing the list.  The acquire fences are missing in reused_arena,
and ptmalloc_[un]lock_all[2], it seems.

(Remember that if you want to use a synchronizes-with edge, there must
be a *pair* of release and acquire MOs -- just a release MO fence (or
atomic_write_barrier) is not sufficient.)
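
To make the pairing concrete, here is a self-contained sketch (C11
atomics, stand-in names, not the malloc code): modifications are
serialized by a lock and publish with a release store, and the
lock-free read-only traversal uses acquire loads for every link it
follows.

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node
{
  int payload;
  struct node *_Atomic next;
};

static struct node *_Atomic node_list_head;
static pthread_mutex_t node_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: serialized by the lock; the release store is what pairs with
   the readers' acquire loads and makes the payload visible to them.  */
void
node_list_push (int payload)
{
  struct node *n = malloc (sizeof (*n));
  if (n == NULL)
    abort ();
  n->payload = payload;
  pthread_mutex_lock (&node_list_lock);
  atomic_store_explicit (&n->next,
                         atomic_load_explicit (&node_list_head,
                                               memory_order_relaxed),
                         memory_order_relaxed);
  atomic_store_explicit (&node_list_head, n, memory_order_release);
  pthread_mutex_unlock (&node_list_lock);
}

/* Read-only traversal: no lock, but every load of a link is an acquire
   load, so the node contents written before the release store are
   visible.  */
int
node_list_sum (void)
{
  int sum = 0;
  for (struct node *n = atomic_load_explicit (&node_list_head,
                                              memory_order_acquire);
       n != NULL;
       n = atomic_load_explicit (&n->next, memory_order_acquire))
    sum += n->payload;
  return sum;
}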

I would also say "atomic ops" instead of barriers in the docs because we
have to use atomics there instead of plain accesses (although we don't
currently), and once we do we'll either use atomic ops or some kind
of atomic_thread_fence, so it would be fences, not barriers; this avoids
ambiguity with the higher-level sync construct called barriers.
 
>  static mutex_t list_lock = _LIBC_LOCK_INITIALIZER;
>  static size_t narenas = 1;
> @@ -790,7 +792,16 @@ _int_new_arena (size_t size)
>    mutex_init (&a->mutex);
>    (void) mutex_lock (&a->mutex);
>  
> -  (void) mutex_lock (&list_lock);
> +  /* Put the arena in the global list.  If we cannot acquire
> +     list_lock, we have to temporarily release the arena lock, to
> +     avoid deadlocks.  NB: mutex_trylock returns 0 if the lock was
> +     acquired.  */
> +  bool reacquire_arena_lock = mutex_trylock (&list_lock);
> +  if (reacquire_arena_lock)
> +    {
> +      mutex_unlock (&a->mutex);
> +      (void) mutex_lock (&list_lock);
> +    }
>  
>    detach_arena (replaced_arena);
>  
> @@ -801,6 +812,9 @@ _int_new_arena (size_t size)
>  
>    (void) mutex_unlock (&list_lock);
>  
> +  if (reacquire_arena_lock)
> +    (void) mutex_lock (&a->mutex);
> +

Why can't we simply acquire a->mutex later?  Is there a way that it can
be visible to other threads?  If so, how and why would it be okay to
release the lock temporarily then?  If not, we should add a brief
comment stating that (and thus explaining later acquisition of
a->mutex).

>    return a;
>  }
>  
> @@ -884,11 +898,22 @@ reused_arena (mstate avoid_arena)
>  
>  out:
>    {
> +    /* Update the arena thread attachment counters.  If we cannot
> +       acquire list_lock, we have to temporarily release the arena
> +       lock, to avoid deadlocks.  NB: mutex_trylock returns 0 if the
> +       lock was acquired.  */

Why can you temporarily release the lock?  Will the memory not be reused
for a different arena or something else?  The comment should explain
that as well or point to an existing comment.

>      mstate replaced_arena = thread_arena;
> -    (void) mutex_lock (&list_lock);
> +    bool reacquire_arena_lock = mutex_trylock (&list_lock);
> +    if (reacquire_arena_lock)
> +      {
> +	(void) mutex_unlock (&result->mutex);
> +	(void) mutex_lock (&list_lock);
> +      }
>      detach_arena (replaced_arena);
>      ++result->attached_threads;
>      (void) mutex_unlock (&list_lock);
> +    if (reacquire_arena_lock)
> +      (void) mutex_lock (&result->mutex);
>    }
>  
>    LIBC_PROBE (memory_arena_reuse, 2, result, avoid_arena);

HTH for now.
  
Florian Weimer Dec. 14, 2015, 6:57 p.m. UTC | #2
On 12/11/2015 05:49 PM, Torvald Riegel wrote:
> Disclaimer: I haven't yet cross-checked whether your analysis of which
> lock protects which data is complete, nor do I know the malloc
> synchronization details.  If you want, I can do the cross-check, but it
> probably will take more time.  The comments below are just based on what
> I saw in the immediate neighborhood of your changes.

Understood.

> On Sun, 2015-12-06 at 13:14 +0100, Florian Weimer wrote:
>> diff --git a/malloc/arena.c b/malloc/arena.c
>> index 3dab7bb..834907c 100644
>> --- a/malloc/arena.c
>> +++ b/malloc/arena.c
>> @@ -69,9 +69,11 @@ extern int sanity_check_heap_info_alignment[(sizeof (heap_info)
>>  static __thread mstate thread_arena attribute_tls_model_ie;
>>  
>>  /* Arena free list.  list_lock protects the free_list variable below,
>> -   and the next_free and attached_threads members of the mstate
>> -   objects.  No other (malloc) locks must be taken while list_lock is
>> -   active, otherwise deadlocks may occur.  */
>> +   the next_free and attached_threads members of struct malloc_state
>> +   objects, and concurrent writes to the next member of these objects.
>> +   (Read access to the next member is synchronized via a barrier.)
>> +   When list_lock is acquired, no arena lock must be acquired, but it
>> +   is permitted to acquire arena locks after list_lock.  */
> 
> What about main_arena?  In _int_new_arena, it looks like main_arena is
> protected by list_lock too.

main_arena is essentially the first entry in the list of arenas,
pre-allocated in the data segment (so it's not created by
_int_new_arena).  I think it also uses sbrk instead of mmap for allocation.

> I would provide more detail on the synchronization of read-only
> traversals of the list.  list_lock seems to be used to enforce mutual
> exclusion between modification operations to the list (including
> necessary reading/traversal of the list, so you can't go in without a
> lock, figure out what you need to change, and then grab list_lock).  For
> traversal-only / write-op synchronization, atomic operations are used.
> This is the reason why write ops do not need to use acquire fences when
> traversing the list.  The acquire fences are missing in reused_arena,
> and ptmalloc_[un]lock_all[2], it seems.

Yeah.  reused_arena also has a data race on the local static variable
(and this one could actually be harmful).

> (Remember that if you want to use a synchronizes-with edge, there must
> be a *pair* of release and acquire MOs -- just a release MO fence (or
> atomic_write_barrier) is not sufficient.)

This is legacy code.  We need to backport this, and the system compiler
(GCC 4.4) does not support memory ordering in the new way. :-(

> I would also say "atomic ops" instead of barriers in the docs because we
> have to use atomics there instead of plain accesses (although we don't
> currently),

I can add a FIXME, but I don't want to move the documentation ahead of
the code because that would just be confusing.

> and once we do we'll either use atomic ops or some kind
> of atomic_thread_fence, so it would be fences, not barriers; this avoids
> ambiguity with the higher-level sync construct called barriers.

Good point, I'll just spell out the macro to avoid the confusion.

>>  static mutex_t list_lock = _LIBC_LOCK_INITIALIZER;
>>  static size_t narenas = 1;
>> @@ -790,7 +792,16 @@ _int_new_arena (size_t size)
>>    mutex_init (&a->mutex);
>>    (void) mutex_lock (&a->mutex);
>>  
>> -  (void) mutex_lock (&list_lock);
>> +  /* Put the arena in the global list.  If we cannot acquire
>> +     list_lock, we have to temporarily release the arena lock, to
>> +     avoid deadlocks.  NB: mutex_trylock returns 0 if the lock was
>> +     acquired.  */
>> +  bool reacquire_arena_lock = mutex_trylock (&list_lock);
>> +  if (reacquire_arena_lock)
>> +    {
>> +      mutex_unlock (&a->mutex);
>> +      (void) mutex_lock (&list_lock);
>> +    }
>>  
>>    detach_arena (replaced_arena);
>>  
>> @@ -801,6 +812,9 @@ _int_new_arena (size_t size)
>>  
>>    (void) mutex_unlock (&list_lock);
>>  
>> +  if (reacquire_arena_lock)
>> +    (void) mutex_lock (&a->mutex);
>> +
> 
> Why can't we simply acquire a->mutex later?

Once the arena is on the global arena list, reused_arena can pick it (and
will prefer it slightly) if its mutex is not acquired by another
thread.

> Is there a way that it can be visible to other threads?

Yes, through the list chained around main_arena.next.

> If so, how and why would it be okay to
> release the lock temporarily then?

It just alters arena selection preferences, and those do not affect the
correctness of the malloc implementation (only its performance).

Later acquisition is a valid choice which prefers simplicity over
potentially better arena selection heuristics, although this will only
matter for the final arena ever created because otherwise
_int_new_arena will just be called instead.
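
To illustrate, the simpler ordering would look roughly like this (a
stand-alone sketch with placeholder names, not the actual v3 patch):
the arena is linked into the list while holding only the list lock, and
its own mutex is acquired only afterwards, so the list lock is never
waited for while an arena lock is held.

#include <pthread.h>
#include <stdlib.h>

struct fake_arena
{
  pthread_mutex_t mutex;
  struct fake_arena *next;
  long attached_threads;
};

static struct fake_arena *fake_arena_list;
static pthread_mutex_t fake_list_lock = PTHREAD_MUTEX_INITIALIZER;

struct fake_arena *
fake_new_arena (void)
{
  struct fake_arena *a = calloc (1, sizeof (*a));
  if (a == NULL)
    return NULL;
  pthread_mutex_init (&a->mutex, NULL);
  a->attached_threads = 1;

  /* Publish the arena while holding only the list lock.  Another
     thread may pick it up as soon as it is on the list; that only
     changes arena selection, not correctness.  */
  pthread_mutex_lock (&fake_list_lock);
  a->next = fake_arena_list;
  fake_arena_list = a;
  pthread_mutex_unlock (&fake_list_lock);

  /* Acquire the arena lock only now, after the list lock has been
     released, so the original lock order is preserved without a
     trylock fallback.  */
  pthread_mutex_lock (&a->mutex);
  return a;
}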

>> @@ -884,11 +898,22 @@ reused_arena (mstate avoid_arena)
>>  
>>  out:
>>    {
>> +    /* Update the arena thread attachment counters.  If we cannot
>> +       acquire list_lock, we have to temporarily release the arena
>> +       lock, to avoid deadlocks.  NB: mutex_trylock returns 0 if the
>> +       lock was acquired.  */
> 
> Why can you temporarily release the lock?  Will the memory not be reused
> for a different arena or something else?  The comment should explain
> that as well or point to an existing comment.

Sharing arenas between different threads is not a problem because they
are locked upon use.  If we release the lock temporarily, some other
thread concurrently selecting its malloc arena is more likely to choose
it (same thing as above).

Thanks for your comments.  I'll post a new patch, but this will likely
have to wait until tomorrow.

Florian
  
Torvald Riegel Dec. 17, 2015, 7:26 p.m. UTC | #3
On Mon, 2015-12-14 at 19:57 +0100, Florian Weimer wrote:
> On 12/11/2015 05:49 PM, Torvald Riegel wrote:
> > I would provide more detail on the synchronization of read-only
> > traversals of the list.  list_lock seems to be used to enforce mutual
> > exclusion between modification operations to the list (including
> > necessary reading/traversal of the list, so you can't go in without a
> > lock, figure out what you need to change, and then grab list_lock).  For
> > traversal-only / write-op synchronization, atomic operations are used.
> > This is the reason why write ops do not need to use acquire fences when
> > traversing the list.  The acquire fences are missing in reused_arena,
> > and ptmalloc_[un]lock_all[2], it seems.
> 
> Yeah.  reused_arena also has a data race on the local static variable
> (and this one could actually be harmful).

The lack of acquire fences is harmful.  Or do you mean that this piece
of code might be more likely to cause code-generation-related problems
triggered by data races?

> > (Remember that if you want to use a synchronizes-with edge, there must
> > be a *pair* of release and acquire MOs -- just a release MO fence (or
> > atomic_write_barrier) is not sufficient.)
> 
> This is legacy code.  We need to backport this, and the system compiler
> (GCC 4.4) does not support memory ordering in the new way. :-(

We do have atomic_read_barrier, which seems to have been intended to be
equivalent to an acquire MO fence.  We also have atomic_forced_read that
can be used instead of atomic_load_relaxed.
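
For the backport, a read-only traversal written against those legacy
macros could then look roughly like this (just a sketch; the fallback
definitions below are illustrative stand-ins rather than the glibc
ones, and the list type is made up):

#include <stddef.h>

#ifndef atomic_read_barrier
# define atomic_read_barrier() __sync_synchronize ()  /* stand-in */
#endif
#ifndef atomic_forced_read
# define atomic_forced_read(x) \
  ({ __typeof (x) __x; __asm ("" : "=r" (__x) : "0" (x)); __x; })
#endif

struct arena_node
{
  struct arena_node *next;
};

static struct arena_node *arena_node_list;

/* Lock-free traversal: load each link exactly once (atomic_forced_read)
   and issue a read barrier before dereferencing it, pairing with the
   writer's atomic_write_barrier issued before publishing the node.  */
static int
count_arena_nodes (void)
{
  int n = 0;
  struct arena_node *p = atomic_forced_read (arena_node_list);
  while (p != NULL)
    {
      atomic_read_barrier ();
      ++n;
      p = atomic_forced_read (p->next);
    }
  return n;
}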

> > I would also say "atomic ops" instead of barriers in the docs because we
> > have to use atomics there instead of plain accesses (although we don't
> > currently),
> 
> I can add a FIXME, but I don't want to move the documentation ahead of
> the code because that would just be confusing.

I think a FIXME would be good; we know there is a bug, so we should just
document that.

> >>  static mutex_t list_lock = _LIBC_LOCK_INITIALIZER;
> >>  static size_t narenas = 1;
> >> @@ -790,7 +792,16 @@ _int_new_arena (size_t size)
> >>    mutex_init (&a->mutex);
> >>    (void) mutex_lock (&a->mutex);
> >>  
> >> -  (void) mutex_lock (&list_lock);
> >> +  /* Put the arena in the global list.  If we cannot acquire
> >> +     list_lock, we have to temporarily release the arena lock, to
> >> +     avoid deadlocks.  NB: mutex_trylock returns 0 if the lock was
> >> +     acquired.  */
> >> +  bool reacquire_arena_lock = mutex_trylock (&list_lock);
> >> +  if (reacquire_arena_lock)
> >> +    {
> >> +      mutex_unlock (&a->mutex);
> >> +      (void) mutex_lock (&list_lock);
> >> +    }
> >>  
> >>    detach_arena (replaced_arena);
> >>  
> >> @@ -801,6 +812,9 @@ _int_new_arena (size_t size)
> >>  
> >>    (void) mutex_unlock (&list_lock);
> >>  
> >> +  if (reacquire_arena_lock)
> >> +    (void) mutex_lock (&a->mutex);
> >> +
> > 
> > Why can't we simply acquire a->mutex later?
> 
> Once the arena is on the global arena list, reused_arena can pick it (and
> will prefer it slightly) if its mutex is not acquired by another
> thread.

But this change can have both executions (i.e., one that is equivalent to
acquiring later).  So either something is wrong there, or we can pick
the weaker option all the time.

I saw that you changed that bit in v3, but I want to point this out
nonetheless in case something else might be odd here.

> >> @@ -884,11 +898,22 @@ reused_arena (mstate avoid_arena)
> >>  
> >>  out:
> >>    {
> >> +    /* Update the arena thread attachment counters.  If we cannot
> >> +       acquire list_lock, we have to temporarily release the arena
> >> +       lock, to avoid deadlocks.  NB: mutex_trylock returns 0 if the
> >> +       lock was acquired.  */
> > 
> > Why can you temporarily release the lock?  Will the memory not be reused
> > for a different arena or something else?  The comment should explain
> > that as well or point to an existing comment.
> 
> Sharing arenas between different threads is not a problem because they
> are locked upon use.  If we release the lock temporarily, some other
> thread concurrently selecting its malloc arena is more likely to choose
> it (same thing as above).

I'm not sure whether my question came across, so let me ask again:  When
the lock is released temporarily, we still hold a pointer to the arena.
Until we reacquire the lock, what can happen to the arena?  Can the
arena and its memory be deleted or similar, and the memory be reused
for something else?

When a lock is temporarily released, a few things can be the case.
First, the first and second critical sections could be logically
unrelated.  Second, the program can re-confirm in the second critical
section that the world-view from the first critical section is still
correct, so that, informally, it could have done the work of the first
section in the second.  Otherwise, why is breaking atomicity okay?  Is
one of the critical sections actually not necessary, or is there a bug?
Documenting the intent at this level would be useful for readers of
this code, I believe.
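
To make the second case concrete, a generic shape of that pattern is
(made-up names, nothing to do with the malloc code):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int state_generation;  /* bumped on every state change */

/* First critical section: observe the state.  Second critical section:
   re-confirm that the observation still holds before acting on it, so
   the two sections together behave as if they had been one.  */
static bool
apply_if_unchanged (void (*apply) (void))
{
  pthread_mutex_lock (&state_lock);
  unsigned int seen = state_generation;
  pthread_mutex_unlock (&state_lock);

  /* Atomicity is broken here; other threads may change the state.  */

  pthread_mutex_lock (&state_lock);
  bool still_valid = (state_generation == seen);
  if (still_valid)
    {
      apply ();
      ++state_generation;
    }
  pthread_mutex_unlock (&state_lock);
  return still_valid;
}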
  

Patch

2015-12-06  Florian Weimer  <fweimer@redhat.com>

	[BZ #19182]
	* malloc/arena.c (list_lock): Document lock ordering requirements.
	(_int_new_arena): Release and re-acquire arena lock if list_lock
	cannot be acquired immediately.
	(reused_arena): Likewise.

diff --git a/malloc/arena.c b/malloc/arena.c
index 3dab7bb..834907c 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -69,9 +69,11 @@  extern int sanity_check_heap_info_alignment[(sizeof (heap_info)
 static __thread mstate thread_arena attribute_tls_model_ie;
 
 /* Arena free list.  list_lock protects the free_list variable below,
-   and the next_free and attached_threads members of the mstate
-   objects.  No other (malloc) locks must be taken while list_lock is
-   active, otherwise deadlocks may occur.  */
+   the next_free and attached_threads members of struct malloc_state
+   objects, and concurrent writes to the next member of these objects.
+   (Read access to the next member is synchronized via a barrier.)
+   When list_lock is acquired, no arena lock must be acquired, but it
+   is permitted to acquire arena locks after list_lock.  */
 
 static mutex_t list_lock = _LIBC_LOCK_INITIALIZER;
 static size_t narenas = 1;
@@ -790,7 +792,16 @@  _int_new_arena (size_t size)
   mutex_init (&a->mutex);
   (void) mutex_lock (&a->mutex);
 
-  (void) mutex_lock (&list_lock);
+  /* Put the arena in the global list.  If we cannot acquire
+     list_lock, we have to temporarily release the arena lock, to
+     avoid deadlocks.  NB: mutex_trylock returns 0 if the lock was
+     acquired.  */
+  bool reacquire_arena_lock = mutex_trylock (&list_lock);
+  if (reacquire_arena_lock)
+    {
+      mutex_unlock (&a->mutex);
+      (void) mutex_lock (&list_lock);
+    }
 
   detach_arena (replaced_arena);
 
@@ -801,6 +812,9 @@  _int_new_arena (size_t size)
 
   (void) mutex_unlock (&list_lock);
 
+  if (reacquire_arena_lock)
+    (void) mutex_lock (&a->mutex);
+
   return a;
 }
 
@@ -884,11 +898,22 @@  reused_arena (mstate avoid_arena)
 
 out:
   {
+    /* Update the arena thread attachment counters.  If we cannot
+       acquire list_lock, we have to temporarily release the arena
+       lock, to avoid deadlocks.  NB: mutex_trylock returns 0 if the
+       lock was acquired.  */
     mstate replaced_arena = thread_arena;
-    (void) mutex_lock (&list_lock);
+    bool reacquire_arena_lock = mutex_trylock (&list_lock);
+    if (reacquire_arena_lock)
+      {
+	(void) mutex_unlock (&result->mutex);
+	(void) mutex_lock (&list_lock);
+      }
     detach_arena (replaced_arena);
     ++result->attached_threads;
     (void) mutex_unlock (&list_lock);
+    if (reacquire_arena_lock)
+      (void) mutex_lock (&result->mutex);
   }
 
   LIBC_PROBE (memory_arena_reuse, 2, result, avoid_arena);