handle_exit_race && PF_EXITING

Message ID	20191105152728.GA5666@redhat.com
State	Not applicable
Headers	Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org Date: Tue, 5 Nov 2019 16:27:28 +0100 From: Oleg Nesterov <oleg@redhat.com> To: Thomas Gleixner <tglx@linutronix.de> Cc: Florian Weimer <fweimer@redhat.com>, Shawn Landden <shawn@git.icu>, libc-alpha@sourceware.org, linux-api@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>, Arnd Bergmann <arnd@arndb.de>, Deepa Dinamani <deepa.kernel@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, Catalin Marinas <catalin.marinas@arm.com>, Keith Packard <keithp@keithp.com>, Peter Zijlstra <peterz@infradead.org> Subject: handle_exit_race && PF_EXITING Message-ID: <20191105152728.GA5666@redhat.com> References: <20191104002909.25783-1-shawn@git.icu> <87woceslfs.fsf@oldenburg2.str.redhat.com> <alpine.DEB.2.21.1911051053470.17054@nanos.tec.linutronix.de> MIME-Version: 1.0 In-Reply-To: <alpine.DEB.2.21.1911051053470.17054@nanos.tec.linutronix.de> User-Agent: Mutt/1.5.24 (2015-08-30) Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline

Message ID

20191105152728.GA5666@redhat.com

State

Not applicable

Headers

Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Date: Tue, 5 Nov 2019 16:27:28 +0100
From: Oleg Nesterov <oleg@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Florian Weimer <fweimer@redhat.com>, Shawn Landden <shawn@git.icu>,
	libc-alpha@sourceware.org, linux-api@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	Deepa Dinamani <deepa.kernel@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Keith Packard <keithp@keithp.com>, Peter Zijlstra <peterz@infradead.org>
Subject: handle_exit_race && PF_EXITING
Message-ID: <20191105152728.GA5666@redhat.com>
References: <20191104002909.25783-1-shawn@git.icu>
	<87woceslfs.fsf@oldenburg2.str.redhat.com>
	<alpine.DEB.2.21.1911051053470.17054@nanos.tec.linutronix.de>
MIME-Version: 1.0
In-Reply-To: <alpine.DEB.2.21.1911051053470.17054@nanos.tec.linutronix.de>
User-Agent: Mutt/1.5.24 (2015-08-30)
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

Commit Message

Oleg Nesterov Nov. 5, 2019, 3:27 p.m. UTC

  On 11/05, Thomas Gleixner wrote:
>
> Out of curiosity, what's the race issue vs. robust list which you are
> trying to solve?

Off-topic, but this reminds me...

	#include <sched.h>
	#include <assert.h>
	#include <unistd.h>
	#include <syscall.h>

	#define FUTEX_LOCK_PI		6

	int main(void)
	{
		struct sched_param sp = {};

		sp.sched_priority = 2;
		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

		int lock = vfork();
		if (!lock) {
			sp.sched_priority = 1;
			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
			_exit(0);
		}

		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
		return 0;
	}

this creates the unkillable RT process spinning in futex_lock_pi() on
a single CPU machine (or you can use taskset).

Probably the patch below makes sense anyway, but of course it doesn't
solve the real problem: futex_lock_pi() should not spin in this case.

It seems to me I even sent the fix a long ago, but I can't recall what
exactly it did. Probably the PF_EXITING check in attach_to_pi_owner()
must simply die, I'll try to recall...

Oleg.

Comments

Thomas Gleixner Nov. 5, 2019, 5:28 p.m. UTC | #1

On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> On 11/05, Thomas Gleixner wrote:
> >
> > Out of curiosity, what's the race issue vs. robust list which you are
> > trying to solve?
> 
> Off-topic, but this reminds me...
> 
> 	#include <sched.h>
> 	#include <assert.h>
> 	#include <unistd.h>
> 	#include <syscall.h>
> 
> 	#define FUTEX_LOCK_PI		6
> 
> 	int main(void)
> 	{
> 		struct sched_param sp = {};
> 
> 		sp.sched_priority = 2;
> 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> 
> 		int lock = vfork();
> 		if (!lock) {
> 			sp.sched_priority = 1;
> 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> 			_exit(0);
> 		}
> 
> 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
> 		return 0;
> 	}
> 
> this creates the unkillable RT process spinning in futex_lock_pi() on
> a single CPU machine (or you can use taskset).

Uuurgh.

> Probably the patch below makes sense anyway, but of course it doesn't
> solve the real problem: futex_lock_pi() should not spin in this case.

Obviously not.

> It seems to me I even sent the fix a long ago, but I can't recall what
> exactly it did. Probably the PF_EXITING check in attach_to_pi_owner()
> must simply die, I'll try to recall...

We can't do that. The task might have released the futex already, so the
waiter would operate on inconsistent state :(

The problem with that exit race is that there is no form of serialization
which might be used to address that. We can't abuse any of the task locks
for this.

I was working on a patch set some time ago to fix another more esoteric
futex exit issue. Let me resurrect that.

Vs. the signal pending check. That makes sense on its own, but we should
not restrict it to fatal signals. Futexes are interruptible by definition.

Thanks,

	tglx

Thomas Gleixner Nov. 5, 2019, 5:59 p.m. UTC | #2

On Tue, 5 Nov 2019, Thomas Gleixner wrote:

> On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> > On 11/05, Thomas Gleixner wrote:
> > >
> > > Out of curiosity, what's the race issue vs. robust list which you are
> > > trying to solve?
> > 
> > Off-topic, but this reminds me...
> > 
> > 	#include <sched.h>
> > 	#include <assert.h>
> > 	#include <unistd.h>
> > 	#include <syscall.h>
> > 
> > 	#define FUTEX_LOCK_PI		6
> > 
> > 	int main(void)
> > 	{
> > 		struct sched_param sp = {};
> > 
> > 		sp.sched_priority = 2;
> > 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > 
> > 		int lock = vfork();
> > 		if (!lock) {
> > 			sp.sched_priority = 1;
> > 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > 			_exit(0);
> > 		}
> > 
> > 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
> > 		return 0;
> > 	}
> > 
> > this creates the unkillable RT process spinning in futex_lock_pi() on
> > a single CPU machine (or you can use taskset).
> 
> Uuurgh.

But staring more at it. That's a scheduler bug.

parent	    	    	child

 set FIFO prio 2

 fork()	         ->	set FIFO prio 1
 		 	 sched_setscheduler(...)
			   return from syscall		<= BUG

 		 	_exit()

When the child lowers its priority from 2 to 1, then the parent _must_
preempt the child simply because the parent is now the top priority task on
that CPU. Child should never reach exit before the parent blocks on the
futex.

Peter?

What's even more disturbing is that even with that bug happening the child
is able to set PF_EXITING, but not PF_EXITPIDONE. That doesn't make sense.

Thanks,

	tglx

Thomas Gleixner Nov. 5, 2019, 6:56 p.m. UTC | #3

On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> > On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> > > On 11/05, Thomas Gleixner wrote:
> > > >
> > > > Out of curiosity, what's the race issue vs. robust list which you are
> > > > trying to solve?
> > > 
> > > Off-topic, but this reminds me...
> > > 
> > > 	#include <sched.h>
> > > 	#include <assert.h>
> > > 	#include <unistd.h>
> > > 	#include <syscall.h>
> > > 
> > > 	#define FUTEX_LOCK_PI		6
> > > 
> > > 	int main(void)
> > > 	{
> > > 		struct sched_param sp = {};
> > > 
> > > 		sp.sched_priority = 2;
> > > 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > > 
> > > 		int lock = vfork();
> > > 		if (!lock) {
> > > 			sp.sched_priority = 1;
> > > 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > > 			_exit(0);
> > > 		}
> > > 
> > > 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
> > > 		return 0;
> > > 	}
> > > 
> > > this creates the unkillable RT process spinning in futex_lock_pi() on
> > > a single CPU machine (or you can use taskset).
> > 
> > Uuurgh.
> 
> But staring more at it. That's a scheduler bug.
> 
> parent	    	    	child
> 
>  set FIFO prio 2
> 
>  fork()	         ->	set FIFO prio 1
>  		 	 sched_setscheduler(...)
> 			   return from syscall		<= BUG
> 
>  		 	_exit()
> 
> When the child lowers its priority from 2 to 1, then the parent _must_
> preempt the child simply because the parent is now the top priority task on
> that CPU. Child should never reach exit before the parent blocks on the
> futex.

I'm a moron. It's vfork() not fork() so the behaviour is expected.

Staring more at the trace which shows me where this goes down the drain.

Thanks,

	tglx

Thomas Gleixner Nov. 5, 2019, 7:19 p.m. UTC | #4

On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> 
> I'm a moron. It's vfork() not fork() so the behaviour is expected.
> 
> Staring more at the trace which shows me where this goes down the drain.

 parent	    	    	child

  set FIFO prio 2

  vfork()			->	set FIFO prio 1
   implies wait_for_child()	 	sched_setscheduler(...)
 			   		exit()
					do_exit()
					tsk->flags |= PF_EXITING;
 					....
					mm_release()
					  exit_futex(); (NOOP in this case)
					  complete() --> wakes parent
 sys_futex()
    loop infinite because
    	 PF_EXITING is set,
	 but PF_EXITPIDONE not

So the obvious question is why PF_EXITPIDONE is set way after the futex
exit cleanup has run, but moving this right after exit_futex() would not
solve the exit race completely because the code after setting PF_EXITING is
preemptible. So the same crap could happen just by preemption:

  task holds futex
  ...
  do_exit()
    tsk->flags |= PF_EXITING;

preemption (unrelated wakeup of some other higher prio task, e.g. timer)

  switch_to(other_task)

  return to user
  sys_futex()
	loop infinite as above

And just for the fun of it the futex exit cleanup could trigger the wakeup
itself before PF_EXITPIDONE is set.

There is some other issue which I need to lookup again. That's a slightly
different problem but related to futex exit race conditions.

The way we can deal with that is:

    do_exit()
    tsk->flags |= PF_EXITING;
    ...
    mutex_lock(&tsk->futex_exit_mutex);
    futex_exit();
    tsk->flags |= PF_EXITPIDONE;
    mutex_unlock(&tsk->futex_exit_mutex);

and on the futex lock_pi side:

    if (!(tsk->flags & PF_EXITING))
    	return 0;		<- All good

    if (tsk->flags & PF_EXITPIDONE)
        return -EOWNERDEAD;	<- Locker can take over

    mutex_lock(&tsk->futex_exit_mutex);
    if (tsk->flags & PF_EXITPIDONE) {
        mutex_unlock(&tsk->futex_exit_mutex);
        return -EOWNERDEAD;	<- Locker can take over
    }

    queue_futex();
    mutex_unlock(&tsk->futex_exit_mutex);

Not that I think it's pretty, but it plugs all holes AFAICT.

Thanks,

	tglx

diff mbox

Patch

--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -2842,10 +2842,12 @@  static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 *   exit to complete.
 			 * - The user space value changed.
 			 */
-			queue_unlock(hb);
-			put_futex_key(&q.key);
-			cond_resched();
-			goto retry;
+			if (!fatal_signal_pending(current)) {
+				queue_unlock(hb);
+				put_futex_key(&q.key);
+				cond_resched();
+				goto retry;
+			}
 		default:
 			goto out_unlock_put_key;
 		}