malloc: Re-add protection for recursive calls to __malloc_fork_lock_parent
Commit Message
I've just understood what you meant by "this could have happened
before". I do agree it was already broken. It was just too difficult
to reproduce it here.
With commit ID 8a727af9, I get:
$ x=0; while ./testrun.sh ./malloc/tst-mallocfork ; do x=$((x + 1)); \
done; echo $? $x
Timed out: killed the child process
0 10
After applying this patch, it runs for hours without failing. But it
clearly doesn't fix it.
However, if you believe the test case is invalid, let's remove it.
I wonder if it requires a new bug report as 8a727af9 has been backported
to glibc 2.23.
--- 8< ---
This partially reverts commit 8a727af925be63aa6ea0f5f90e16751fd541626b
adding back save_arena, at_fork_recursive_cntr and part of the code
which updated them.
This doesn't fix the issue reported on [BZ #838] but keeps the current
workaround in order to minimize its occurrence.
It also adds source code comments describing the issue.
Tested on ppc64le and x86_64.
2016-05-10 Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
* malloc/arena.c (save_arena, ATFORK_ARENA_PTR)
(atfork_recursive_cntr): Re-add.
(__malloc_fork_lock_parent, __malloc_fork_unlock_parent)
(__malloc_fork_unlock_child): Re-add code to update save_arena
and atfork_recursive_cntr.
---
malloc/arena.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
Comments
On 05/11/2016 12:33 AM, Tulio Magno Quites Machado Filho wrote:
> I've just understood what you meant by "this could have happened
> before". I do agree it was already broken. It was just too difficult
> to reproduce it here.
>
> With commit ID 8a727af9, I get:
> $ x=0; while ./testrun.sh ./malloc/tst-mallocfork ; do x=$((x + 1)); \
> done; echo $? $x
> Timed out: killed the child process
> 0 10
>
> After applying this patch, it runs for hours without failing. But it
> clearly doesn't fix it.
> However, if you believe the test case is invalid, let's remove it.
>
> I wonder if it requires a new bug report as 8a727af9 has been backported
> to glibc 2.23.
We already have a bug for this, I think:
https://sourceware.org/bugzilla/show_bug.cgi?id=19703
I've just attached a more reliable test case to this bug. For me, it is
quite reliable with even quite old glibcs—the test case indicates
delivery of a few signals and then goes into deadlock.
I believe the new test is completely valid because sigusr1_handler calls
only async-signal-safe functions (the old one should really call _exit
instead of exit in the signal handler).
I tried this test with current master and your patch applied on top, and
I still get deadlocks. Can you give this test a try as well?
A true fix for bug 19703 depends on bug 19702 (Provide a flag indicating
whether a thread is in a signal handler), which is not easy to address
because we need an async-signal-safe sigaction/signal function and need
to atomically change the signal handler and its associated flags.
I think I can provide a partial fix for single-threaded programs without
bug 19702. It's going to be rather small but quite ugly.
Florian
Florian Weimer <fweimer@redhat.com> writes:
> On 05/11/2016 12:33 AM, Tulio Magno Quites Machado Filho wrote:
>>
>> I wonder if it requires a new bug report as 8a727af9 has been backported
>> to glibc 2.23.
>
> We already have a bug for this, I think:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=19703
Thanks for pointing that.
> I've just attached a more reliable test case to this bug. For me, it is
> quite reliable with even quite old glibcs—the test case indicates
> delivery of a few signals and then goes into deadlock.
>
> I believe the new test is completely valid because sigusr1_handler calls
> only async-signal-safe functions (the old one should really call _exit
> instead of exit in the signal handler).
>
> I tried this test with current master and your patch applied on top, and
> I still get deadlocks. Can you give this test a try as well?
It did get into deadlock soon here.
> A true fix for bug 19703 depends on bug 19702 (Provide a flag indicating
> whether a thread is in a signal handler), which is not easy to address
> because we need an async-signal-safe sigaction/signal function and need
> to atomically change the signal handler and its associated flags.
Thanks for pointing that too!
@@ -129,6 +129,19 @@ int __malloc_initialized = -1;
/* atfork support. */
+static void *save_arena;
+
+/* Magic value for the thread-specific arena pointer to avoid is in use.
+ This is used to prevent a thread from locking a mutex already locked by
+ itself, e.g. when receiving a signal before completing the execution of
+ __malloc_fork_unlock_parent.
+ More information on https://sourceware.org/bugzilla/show_bug.cgi?id=838. */
+
+# define ATFORK_ARENA_PTR ((void *) -1)
+
+/* Counter for number of times the list is locked by the same thread. */
+static unsigned int atfork_recursive_cntr;
+
/* The following three functions are called around fork from a
multi-threaded process. We do not use the general fork handler
mechanism to make sure that our handlers are the last ones being
@@ -145,8 +158,25 @@ __malloc_fork_lock_parent (void)
/* We do not acquire free_list_lock here because we completely
reconstruct free_list in __malloc_fork_unlock_child. */
- (void) mutex_lock (&list_lock);
+ if (mutex_trylock (&list_lock))
+ {
+ if (thread_arena == ATFORK_ARENA_PTR)
+ /* This is the same thread which already locks the global list.
+ Just bump the counter. */
+ goto out;
+ /* This thread has to wait its turn. */
+ (void) mutex_lock (&list_lock);
+ }
+ /* Only the current thread may perform malloc/free calls now.
+ save_arena will be reattached to the current thread, in
+ __malloc_fork_unlock_parent, so save_arena->attached_threads is not
+ updated. */
+ /* FIXME: The section between locking list_lock and the following write
+ to thread_arena is not signal-safe. The following code tries to
+ minimize the critical section but doesn't fix it. */
+ save_arena = thread_arena;
+ thread_arena = ATFORK_ARENA_PTR;
for (mstate ar_ptr = &main_arena;; )
{
(void) mutex_lock (&ar_ptr->mutex);
@@ -154,6 +184,8 @@ __malloc_fork_lock_parent (void)
if (ar_ptr == &main_arena)
break;
}
+out:
+ ++atfork_recursive_cntr;
}
void
@@ -163,6 +195,9 @@ __malloc_fork_unlock_parent (void)
if (__malloc_initialized < 1)
return;
+ if (--atfork_recursive_cntr != 0)
+ return;
+
for (mstate ar_ptr = &main_arena;; )
{
(void) mutex_unlock (&ar_ptr->mutex);
@@ -170,6 +205,13 @@ __malloc_fork_unlock_parent (void)
if (ar_ptr == &main_arena)
break;
}
+ /* Replace ATFORK_ARENA_PTR with save_arena.
+ save_arena->attached_threads was not changed in
+ __malloc_fork_lock_parent and is still correct. */
+ /* FIXME: The following mutex_unlock(&list_lock) call and the write to
+ thread_arena is not signal safe. The following code tries to
+ minimize the critical section but doesn't fix it. */
+ thread_arena = save_arena;
(void) mutex_unlock (&list_lock);
}
@@ -180,6 +222,8 @@ __malloc_fork_unlock_child (void)
if (__malloc_initialized < 1)
return;
+ thread_arena = save_arena;
+
/* Push all arenas to the free list, except thread_arena, which is
attached to the current thread. */
mutex_init (&free_list_lock);
@@ -202,6 +246,7 @@ __malloc_fork_unlock_child (void)
}
mutex_init (&list_lock);
+ atfork_recursive_cntr = 0;
}
/* Initialization routine. */