Message ID | 20191105152728.GA5666@redhat.com |
---|---|
State | Not applicable |
Headers |
Date: Tue, 5 Nov 2019 16:27:28 +0100
From: Oleg Nesterov <oleg@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Florian Weimer <fweimer@redhat.com>, Shawn Landden <shawn@git.icu>, libc-alpha@sourceware.org, linux-api@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>, Arnd Bergmann <arnd@arndb.de>, Deepa Dinamani <deepa.kernel@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, Catalin Marinas <catalin.marinas@arm.com>, Keith Packard <keithp@keithp.com>, Peter Zijlstra <peterz@infradead.org>
Subject: handle_exit_race && PF_EXITING
Message-ID: <20191105152728.GA5666@redhat.com>
In-Reply-To: <alpine.DEB.2.21.1911051053470.17054@nanos.tec.linutronix.de>
References: <20191104002909.25783-1-shawn@git.icu> <87woceslfs.fsf@oldenburg2.str.redhat.com> <alpine.DEB.2.21.1911051053470.17054@nanos.tec.linutronix.de> |
Commit Message
Oleg Nesterov
Nov. 5, 2019, 3:27 p.m. UTC
On 11/05, Thomas Gleixner wrote:
>
> Out of curiosity, what's the race issue vs. robust list which you are
> trying to solve?

Off-topic, but this reminds me...

	#include <sched.h>
	#include <assert.h>
	#include <unistd.h>
	#include <syscall.h>

	#define FUTEX_LOCK_PI 6

	int main(void)
	{
		struct sched_param sp = {};

		sp.sched_priority = 2;
		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

		int lock = vfork();
		if (!lock) {
			sp.sched_priority = 1;
			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
			_exit(0);
		}

		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
		return 0;
	}

This creates an unkillable RT process spinning in futex_lock_pi() on
a single-CPU machine (or you can use taskset).

Probably the patch below makes sense anyway, but of course it doesn't
solve the real problem: futex_lock_pi() should not spin in this case.

It seems to me I even sent a fix long ago, but I can't recall what
exactly it did. Probably the PF_EXITING check in attach_to_pi_owner()
must simply die; I'll try to recall...

Oleg.
Comments
On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> On 11/05, Thomas Gleixner wrote:
> >
> > Out of curiosity, what's the race issue vs. robust list which you are
> > trying to solve?
>
> Off-topic, but this reminds me...
>
> 	#include <sched.h>
> 	#include <assert.h>
> 	#include <unistd.h>
> 	#include <syscall.h>
>
> 	#define FUTEX_LOCK_PI 6
>
> 	int main(void)
> 	{
> 		struct sched_param sp = {};
>
> 		sp.sched_priority = 2;
> 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
>
> 		int lock = vfork();
> 		if (!lock) {
> 			sp.sched_priority = 1;
> 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> 			_exit(0);
> 		}
>
> 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
> 		return 0;
> 	}
>
> this creates the unkillable RT process spinning in futex_lock_pi() on
> a single CPU machine (or you can use taskset).

Uuurgh.

> Probably the patch below makes sense anyway, but of course it doesn't
> solve the real problem: futex_lock_pi() should not spin in this case.

Obviously not.

> It seems to me I even sent the fix a long ago, but I can't recall what
> exactly it did. Probably the PF_EXITING check in attach_to_pi_owner()
> must simply die, I'll try to recall...

We can't do that. The task might have released the futex already, so the
waiter would operate on inconsistent state :(

The problem with that exit race is that there is no form of serialization
which might be used to address it. We can't abuse any of the task locks
for this. I was working on a patch set some time ago to fix another, more
esoteric futex exit issue. Let me resurrect that.

Vs. the signal pending check: that makes sense on its own, but we should
not restrict it to fatal signals. Futexes are interruptible by definition.

Thanks,

	tglx
On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> > On 11/05, Thomas Gleixner wrote:
> > >
> > > Out of curiosity, what's the race issue vs. robust list which you are
> > > trying to solve?
> >
> > Off-topic, but this reminds me...
> >
> > 	#include <sched.h>
> > 	#include <assert.h>
> > 	#include <unistd.h>
> > 	#include <syscall.h>
> >
> > 	#define FUTEX_LOCK_PI 6
> >
> > 	int main(void)
> > 	{
> > 		struct sched_param sp = {};
> >
> > 		sp.sched_priority = 2;
> > 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> >
> > 		int lock = vfork();
> > 		if (!lock) {
> > 			sp.sched_priority = 1;
> > 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > 			_exit(0);
> > 		}
> >
> > 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
> > 		return 0;
> > 	}
> >
> > this creates the unkillable RT process spinning in futex_lock_pi() on
> > a single CPU machine (or you can use taskset).
>
> Uuurgh.

But staring more at it: that's a scheduler bug.

    parent			child

    set FIFO prio 2

    fork()	->		set FIFO prio 1
				sched_setscheduler(...)
				return from syscall	<= BUG

				_exit()

When the child lowers its priority from 2 to 1, then the parent _must_
preempt the child, simply because the parent is now the top priority task
on that CPU. The child should never reach exit before the parent blocks
on the futex.

Peter?

What's even more disturbing is that even with that bug happening, the
child is able to set PF_EXITING, but not PF_EXITPIDONE. That doesn't make
sense.

Thanks,

	tglx
On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> > On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> > > On 11/05, Thomas Gleixner wrote:
> > > >
> > > > Out of curiosity, what's the race issue vs. robust list which you are
> > > > trying to solve?
> > >
> > > Off-topic, but this reminds me...
> > >
> > > 	#include <sched.h>
> > > 	#include <assert.h>
> > > 	#include <unistd.h>
> > > 	#include <syscall.h>
> > >
> > > 	#define FUTEX_LOCK_PI 6
> > >
> > > 	int main(void)
> > > 	{
> > > 		struct sched_param sp = {};
> > >
> > > 		sp.sched_priority = 2;
> > > 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > >
> > > 		int lock = vfork();
> > > 		if (!lock) {
> > > 			sp.sched_priority = 1;
> > > 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > > 			_exit(0);
> > > 		}
> > >
> > > 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
> > > 		return 0;
> > > 	}
> > >
> > > this creates the unkillable RT process spinning in futex_lock_pi() on
> > > a single CPU machine (or you can use taskset).
> >
> > Uuurgh.
>
> But staring more at it. That's a scheduler bug.
>
>     parent			child
>
>     set FIFO prio 2
>
>     fork()	->		set FIFO prio 1
>				sched_setscheduler(...)
>				return from syscall	<= BUG
>
>				_exit()
>
> When the child lowers its priority from 2 to 1, then the parent _must_
> preempt the child simply because the parent is now the top priority task
> on that CPU. Child should never reach exit before the parent blocks on
> the futex.

I'm a moron. It's vfork(), not fork(), so the behaviour is expected.

Staring more at the trace, which shows me where this goes down the drain.

Thanks,

	tglx
On Tue, 5 Nov 2019, Thomas Gleixner wrote:
>
> I'm a moron. It's vfork() not fork() so the behaviour is expected.
>
> Staring more at the trace which shows me where this goes down the drain.

    parent				child

    set FIFO prio 2

    vfork()			->	set FIFO prio 1
      implies wait_for_child()		sched_setscheduler(...)
					exit()
					 do_exit()
					  tsk->flags |= PF_EXITING;
					  ....
					  mm_release()
					    exit_futex(); (NOOP in this case)
					    complete()	--> wakes parent

    sys_futex()
    loop infinite because PF_EXITING
    is set, but PF_EXITPIDONE not

So the obvious question is why PF_EXITPIDONE is set way after the futex
exit cleanup has run. But moving it right after exit_futex() would not
solve the exit race completely, because the code after setting PF_EXITING
is preemptible. So the same crap could happen just by preemption:

    task holds futex
    ...
    do_exit()
      tsk->flags |= PF_EXITING;

    preemption (unrelated wakeup of some other higher prio task, e.g. timer)

    switch_to(other_task)

    return to user
    sys_futex()
    loop infinite as above

And just for the fun of it, the futex exit cleanup could trigger the
wakeup itself before PF_EXITPIDONE is set.

There is some other issue which I need to look up again. That's a
slightly different problem, but related to futex exit race conditions.

The way we can deal with that is:

    do_exit()
      tsk->flags |= PF_EXITING;
      ...
      mutex_lock(&tsk->futex_exit_mutex);
      futex_exit();
      tsk->flags |= PF_EXITPIDONE;
      mutex_unlock(&tsk->futex_exit_mutex);

and on the futex lock_pi side:

    if (!(tsk->flags & PF_EXITING))
	    return 0;			<- All good

    if (tsk->flags & PF_EXITPIDONE)
	    return -EOWNERDEAD;		<- Locker can take over

    mutex_lock(&tsk->futex_exit_mutex);
    if (tsk->flags & PF_EXITPIDONE) {
	    mutex_unlock(&tsk->futex_exit_mutex);
	    return -EOWNERDEAD;		<- Locker can take over
    }

    queue_futex();
    mutex_unlock(&tsk->futex_exit_mutex);

Not that I think it's pretty, but it plugs all holes AFAICT.

Thanks,

	tglx
--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -2842,10 +2842,12 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 *   exit to complete.
 			 * - The user space value changed.
 			 */
-			queue_unlock(hb);
-			put_futex_key(&q.key);
-			cond_resched();
-			goto retry;
+			if (!fatal_signal_pending(current)) {
+				queue_unlock(hb);
+				put_futex_key(&q.key);
+				cond_resched();
+				goto retry;
+			}
 		default:
 			goto out_unlock_put_key;
 		}