
[16/20] futex: Implement sys_futex_waitv()

Message ID	20210915141525.621568509@infradead.org
State	Not applicable
Headers	DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 17AD43857416
	Message-ID: <20210915141525.621568509@infradead.org>
	User-Agent: quilt/0.66
	Date: Wed, 15 Sep 2021 16:07:26 +0200
	From: Peter Zijlstra <peterz@infradead.org>
	To: andrealmeid@collabora.com, tglx@linutronix.de, mingo@redhat.com, dvhart@infradead.org, rostedt@goodmis.org, bigeasy@linutronix.de
	Subject: [PATCH 16/20] futex: Implement sys_futex_waitv()
	References: <20210915140710.596174479@infradead.org>
	MIME-Version: 1.0
	Content-Type: text/plain; charset=UTF-8
	Precedence: list
	Cc: dave@stgolabs.net, libc-alpha@sourceware.org, peterz@infradead.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, mtk.manpages@gmail.com, kernel@collabora.com, krisman@collabora.com
	Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
	Sender: "Libc-alpha" <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
Series	futex: splitup and waitv syscall
	[00/20] futex: splitup and waitv syscall
	[01/20] futex: Move to kernel/futex/
	[02/20] futex: Split out syscalls
	[03/20] futex: Rename {,__}{,un}queue_me()
	[04/20] futex: Rename futex_wait_queue_me()
	[05/20] futex: Rename: queue_{,un}lock()
	[06/20] futex: Rename __unqueue_futex()
	[07/20] futex: Rename hash_futex()
	[08/20] futex: Rename: {get,cmpxchg}_futex_value_locked()
	[09/20] futex: Split out PI futex
	[10/20] futex: Rename: hb_waiter_{inc,dec,pending}()
	[11/20] futex: Rename: match_futex()
	[12/20] futex: Rename mark_wake_futex()
	[13/20] futex: Split out requeue
	[14/20] futex: Split out wait/wake
	[15/20] futex: Simplify double_lock_hb()
	[16/20] futex: Implement sys_futex_waitv()
	[17/20] futex,x86: Wire up sys_futex_waitv()
	[18/20] futex,arm: Wire up sys_futex_waitv()
	[19/20] selftests: futex: Add sys_futex_waitv() test
	[20/20] selftests: futex: Test sys_futex_waitv() timeout

Checks

Context	Check	Description
dj/TryBot-apply_patch	fail	Patch failed to apply to master at the time it was sent

Commit Message

Peter Zijlstra Sept. 15, 2021, 2:07 p.m. UTC

  From: André Almeida <andrealmeid@collabora.com>

Add support to wait on multiple futexes. This is the interface
implemented by this syscall:

futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
	    unsigned int flags, struct timespec *timo)

struct futex_waitv {
	__u64 val;
	__u64 uaddr;
	__u32 flags;
	__u32 __reserved;
};

Given an array of struct futex_waitv, wait on each uaddr. The thread
wakes if a futex_wake() is performed at any uaddr. The syscall returns
immediately if any waiter has *uaddr != val. *timo is an optional
absolute timeout value for the operation. This syscall supports only
64-bit timeout structs. The flags argument of the syscall should be
used solely for specifying the timeout clock as realtime, if needed.
Flags for shared futexes, sizes, etc. should be used on the individual
flags of each waiter.
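
The split between syscall-level flags and per-waiter flags can be modelled
in userspace. The following is only a sketch of the validation rules
described in this patch, not the kernel code itself: the SK_ names and both
helper functions are illustrative, and the numeric values are copied from
include/uapi/linux/futex.h (FUTEX_32 is the value introduced by this patch).

```c
#include <errno.h>
#include <stdint.h>

/* Values mirrored from include/uapi/linux/futex.h; the SK_ prefix is illustrative. */
#define SK_FUTEX_32              2u	/* added by this patch */
#define SK_FUTEX_PRIVATE_FLAG    128u
#define SK_FUTEX_CLOCK_REALTIME  256u

/* Flags accepted on each waiter, and on the syscall itself. */
#define SK_FUTEXV_WAITER_MASK (SK_FUTEX_32 | SK_FUTEX_PRIVATE_FLAG)
#define SK_FUTEXV_MASK        (SK_FUTEX_CLOCK_REALTIME)

/* The syscall-level flags may only select the timeout clock. */
static int sk_check_syscall_flags(uint32_t flags)
{
	return (flags & ~SK_FUTEXV_MASK) ? -EINVAL : 0;
}

/* Each waiter carries its own size/shared flags; the padding must be zero. */
static int sk_check_waiter(uint32_t flags, uint32_t reserved)
{
	if ((flags & ~SK_FUTEXV_WAITER_MASK) || reserved)
		return -EINVAL;
	return 0;
}
```

Under this model, passing FUTEX_PRIVATE_FLAG as a syscall flag, or
FUTEX_CLOCK_REALTIME as a waiter flag, is rejected with -EINVAL.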

__reserved is used for explicit padding and should be 0, but it might be
used for future extensions. If userspace uses 32-bit pointers, it
should explicitly cast them when assigning to waitv::uaddr.
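
As an illustration of the cast and the zeroed padding, here is a minimal
userspace sketch. The struct is a local mirror of the layout above (with
__reserved renamed to reserved, since double-underscore names are reserved
for the implementation in userspace C), and waitv_fill() is a hypothetical
helper, not part of any API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct futex_waitv from the commit message. */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;	/* __reserved in the patch; must be 0 */
};

/* Hypothetical helper: fill one waiter entry for a 32-bit futex word. */
static void waitv_fill(struct futex_waitv_sketch *w, const uint32_t *futex_word,
		       uint32_t expected, uint32_t flags)
{
	assert(w && futex_word);

	/* Zero everything first so the reserved padding is 0. */
	memset(w, 0, sizeof(*w));
	w->val = expected;
	/* Going through uintptr_t keeps the cast well defined on 32-bit ABIs. */
	w->uaddr = (uint64_t)(uintptr_t)futex_word;
	w->flags = flags;
}
```

Going through uintptr_t first zero-extends a 32-bit pointer into the 64-bit
field instead of relying on an implementation-defined pointer-to-integer
conversion of a different width.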

Returns the array index of one of the awakened futexes. No information
is given about how many were awakened, nor about which one is reported
(whether it was the first awakened, whether it has the smallest index...).

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210913175249.81074-3-andrealmeid@collabora.com
---
 MAINTAINERS                       |    1 
 include/linux/syscalls.h          |    6 +
 include/uapi/asm-generic/unistd.h |    5 
 include/uapi/linux/futex.h        |   25 ++++
 kernel/futex/futex.h              |   15 ++
 kernel/futex/syscalls.c           |  107 ++++++++++++++++++++
 kernel/futex/waitwake.c           |  201 ++++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                   |    3 
 8 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 kernel/futex/syscalls.c

Comments

André Almeida Sept. 15, 2021, 3:20 p.m. UTC | #1

On 15/09/21 at 11:07, Peter Zijlstra wrote:
> From: André Almeida <andrealmeid@collabora.com>
> 
> Add support to wait on multiple futexes. This is the interface
> implemented by this syscall:
> 
> futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
> 	    unsigned int flags, struct timespec *timo)
> 
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -880,8 +880,11 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_s
>  #define __NR_process_mrelease 448
>  __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
>  
> +#define __NR_futex_waitv 449
> +__SC_COMP(__NR_futex_waitv, sys_futex_waitv)
> +

Oops, this should be __SYSCALL(), and not __SC_COMP(), my bad.

Peter Zijlstra Sept. 15, 2021, 3:37 p.m. UTC | #2

On Wed, Sep 15, 2021 at 04:07:26PM +0200, Peter Zijlstra wrote:
> +SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
> +		unsigned int, nr_futexes, unsigned int, flags,
> +		struct __kernel_timespec __user *, timo)

So I utterly detest timespec.. it makes no sense what so ever.

Can't we just, for new syscalls, simply use a s64 nsec argument and call
it a day?

Thomas, Arnd ?

André Almeida Sept. 15, 2021, 4:29 p.m. UTC | #3

On 15/09/21 at 11:07, Peter Zijlstra wrote:
> From: André Almeida <andrealmeid@collabora.com>
> 
> Add support to wait on multiple futexes. This is the interface
> implemented by this syscall:
> 
> futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
> 	    unsigned int flags, struct timespec *timo)
> 
> +/**
> + * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
> + * @vs:		The futex list to wait on
> + * @count:	The size of the list
> + * @awaken:	Index of the last awoken futex, if any. Used to notify the
> + *		caller that it can return this index to userspace (return parameter)
> + *
> + * Prepare multiple futexes in a single step and enqueue them. This may fail if
> + * the futex list is invalid or if any futex was already awoken. On success the
> + * task is ready to interruptible sleep.
> + *
> + * Return:
> + *  -  1 - One of the futexes was awaken by another thread
> + *  -  0 - Success
> + *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
> + */
> +static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *awaken)
> +{
> +	struct futex_hash_bucket *hb;
> +	bool retry = false;
> +	int ret, i;
> +	u32 uval;
> +
> +	/*
> +	 * Enqueuing multiple futexes is tricky, because we need to enqueue
> +	 * each futex in the list before dealing with the next one to avoid
> +	 * deadlocking on the hash bucket. But, before enqueuing, we need to
> +	 * make sure that current->state is TASK_INTERRUPTIBLE, so we don't
> +	 * absorb any awake events, which cannot be done before the
> +	 * get_futex_key of the next key, because it calls get_user_pages,
> +	 * which can sleep. Thus, we fetch the list of futexes keys in two
> +	 * steps, by first pinning all the memory keys in the futex key, and
> +	 * only then we read each key and queue the corresponding futex.
> +	 *
> +	 * Private futexes don't need to recalculate hash in retry, so skip
> +	 * get_futex_key() when retrying.
> +	 */
> +retry:
> +	for (i = 0; i < count; i++) {
> +		if ((vs[i].w.flags & FUTEX_PRIVATE_FLAG) && retry)
> +			continue;
> +
> +		ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
> +				    !(vs[i].w.flags & FUTEX_PRIVATE_FLAG),
> +				    &vs[i].q.key, FUTEX_READ);
> +
> +		if (unlikely(ret))
> +			return ret;
> +	}
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +
> +	for (i = 0; i < count; i++) {
> +		u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
> +		struct futex_q *q = &vs[i].q;
> +		u32 val = (u32)vs[i].w.val;
> +
> +		hb = futex_q_lock(q);
> +		ret = futex_get_value_locked(&uval, uaddr);
> +
> +		if (!ret && uval == val) {
> +			/*
> +			 * The bucket lock can't be held while dealing with the
> +			 * next futex. Queue each futex at this moment so hb can
> +			 * be unlocked.
> +			 */
> +			futex_queue(q, hb);
> +			continue;
> +		}
> +
> +		futex_q_unlock(hb);
> +		__set_current_state(TASK_RUNNING);
> +
> +		/*
> +		 * Even if something went wrong, if we find out that a futex
> +		 * was awaken, we don't return error and return this index to
> +		 * userspace
> +		 */
> +		*awaken = unqueue_multiple(vs, i);
> +		if (*awaken >= 0)
> +			return 1;
> +
> +		if (uval != val)
> +			return -EWOULDBLOCK;
> +
> +		if (ret) {
> +			/*
> +			 * If we need to handle a page fault, we need to do so
> +			 * without any lock and any enqueued futex (otherwise
> +			 * we could lose some wakeup). So we do it here, after
> +			 * undoing all the work done so far. In success, we
> +			 * retry all the work.
> +			 */
> +			if (get_user(uval, uaddr))
> +				return -EFAULT;
> +
> +			retry = true;
> +			goto retry;
> +		}

My bad again: the last two if's should be in the reverse order. If ret
!= 0, the user copy didn't succeed and the value wasn't copied to uval,
thus the comparison (uval != val) should happen only if ret == 0.
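
The intended ordering can be modelled outside the kernel. This is only a
sketch of the decision logic, assuming ret is the result of the locked user
read and uval is written only when ret == 0; the enum and function names are
illustrative and do not appear in the patch.

```c
#include <errno.h>
#include <stdint.h>

/* What the setup loop should do after reading one futex word. */
enum setup_action {
	SETUP_QUEUE,		/* value matches: enqueue and continue */
	SETUP_FAULT_RETRY,	/* read faulted: fault the page in, then retry */
	SETUP_WOULDBLOCK,	/* value changed: return -EWOULDBLOCK */
};

static enum setup_action classify_read(int ret, uint32_t uval, uint32_t val)
{
	/* Check the read result first: uval is not valid when ret != 0. */
	if (ret)
		return SETUP_FAULT_RETRY;

	/* Only now is it safe to compare the value that was read. */
	if (uval != val)
		return SETUP_WOULDBLOCK;

	return SETUP_QUEUE;
}
```

With the checks the other way around, a faulting read that left a stale
uval behind could be reported as -EWOULDBLOCK instead of being retried.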


> +	}
> +
> +	return 0;
> +}

Paul Eggert Sept. 15, 2021, 5:34 p.m. UTC | #4

On 9/15/21 8:37 AM, Peter Zijlstra wrote:
> I utterly detest timespec.. it makes no sense what so ever.
> 
> Can't we just, for new syscalls, simply use a s64 nsec argument and call
> it a day?

This would stop working in the year 2262. Not a good idea.

Any improvements on struct timespec should be a strict superset, not a 
subset. For example, you could advocate a signed 128-bit argument 
counting in units of attoseconds (10⁻¹⁸ s), the highest power-of-1000 
resolution that does not lose info when converting from struct timespec. 
This could use __int128 on platforms that have it, a two-integer struct 
otherwise.

I'm not sure this is a hill I'd want to die on. That being said, it 
would be cool to keep up with the people in the building near mine who 
are researching attosecond imaging (tricky because the uncertainty 
principle means attosecond laser pulses must have very broad spectra). 
And extending struct timespec on the low end is clearly the way to go, 
since its high end already goes back well before the Big Bang.

I hope you don't mind my going off the deep end a bit here. Still, the 
point is that if we're going to improve on struct timespec then it 
really should be an improvement.

Arnd Bergmann Sept. 15, 2021, 6:47 p.m. UTC | #5

On Wed, Sep 15, 2021 at 5:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Sep 15, 2021 at 04:07:26PM +0200, Peter Zijlstra wrote:
> > +SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
> > +             unsigned int, nr_futexes, unsigned int, flags,
> > +             struct __kernel_timespec __user *, timo)
>
> So I utterly detest timespec.. it makes no sense what so ever.
>
> Can't we just, for new syscalls, simply use a s64 nsec argument and call
> it a day?
>
> Thomas, Arnd ?

Do you mean passing the nanoseconds by value instead of a pointer?
I think that would be worse, since that means having incompatible calling
conventions between 32-bit and 64-bit architectures, and even
between 32-bit architectures that have different requirements for 64-bit
function arguments.

If we pass it by reference, there is much less to gain from changing the
timespec to plain nanoseconds. I wouldn't object to that, but I don't
see it helping much either. It would work for relative timeouts, but the
general trend seems to be to specify timeouts as absolute times,
and that would force each caller to read the time using clock_gettime()
and then convert it to nanoseconds before adding the timeout.
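
To make the comparison concrete, here is a rough sketch of the two ways a
caller could turn a relative timeout into an absolute deadline. The helper
names are made up for illustration; `now` stands for a value previously read
with clock_gettime(), and rel_ns is assumed to be non-negative.

```c
#include <stdint.h>
#include <time.h>

#define SK_NSEC_PER_SEC 1000000000LL

/* Plain-nanoseconds form: convert the clock reading, then add. */
static int64_t ts_to_ns(struct timespec ts)
{
	return (int64_t)ts.tv_sec * SK_NSEC_PER_SEC + ts.tv_nsec;
}

static int64_t deadline_ns(struct timespec now, int64_t rel_ns)
{
	return ts_to_ns(now) + rel_ns;
}

/* timespec form: add to both fields, then normalise tv_nsec. */
static struct timespec ts_add_ns(struct timespec ts, int64_t rel_ns)
{
	ts.tv_sec += rel_ns / SK_NSEC_PER_SEC;
	ts.tv_nsec += rel_ns % SK_NSEC_PER_SEC;
	if (ts.tv_nsec >= SK_NSEC_PER_SEC) {
		ts.tv_sec += 1;
		ts.tv_nsec -= SK_NSEC_PER_SEC;
	}
	return ts;
}
```

Either way the caller has to read the clock first; the nanosecond form
trades the carry handling for a multiplication on every call.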

Specifying the timeout in terms of 32-bit relative milliseconds, the
way that epoll() does, would be really simple, but that still feels odd.

        Arnd

Thomas Gleixner Sept. 16, 2021, 2:49 p.m. UTC | #6

On Wed, Sep 15 2021 at 10:34, Paul Eggert wrote:

> On 9/15/21 8:37 AM, Peter Zijlstra wrote:
>> I utterly detest timespec.. it makes no sense what so ever.
>> 
>> Can't we just, for new syscalls, simply use a s64 nsec argument and call
>> it a day?
>
> This would stop working in the year 2262. Not a good idea.

Make it u64 and it stops in 2552, i.e. 584 years from now which is
plenty. Lots of the kernel internal timekeeping will stop working at
that point, so that interface is the least of my worries. And TBH, my
worries about the Y2552 problem are extremely close to zero.
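
For reference, the range of a nanosecond counter since the 1970 epoch can be
estimated with a few lines of arithmetic. This is a back-of-the-envelope
sketch that assumes a mean Gregorian year of 365.2425 days; last_year() is an
illustrative name.

```c
#include <stdint.h>

#define SK_NSEC_PER_SEC 1000000000ULL
#define SK_SEC_PER_YEAR 31556952ULL	/* 365.2425 days * 86400 s */

/* Calendar year in which a nanosecond count since 1970 reaches max_ns. */
static unsigned int last_year(uint64_t max_ns)
{
	uint64_t years = max_ns / SK_NSEC_PER_SEC / SK_SEC_PER_YEAR;

	return 1970u + (unsigned int)years;
}
```

Calling last_year(INT64_MAX) gives 2262 for the signed case, and
last_year(UINT64_MAX) gives 2554, roughly 584 years after the epoch and
within a few years of the figure quoted in this thread.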

> Any improvements on struct timespec should be a strict superset, not a 
> subset. For example, you could advocate a signed 128-bit argument 
> counting in units of attoseconds (10⁻¹⁸ s), the highest power-of-1000 
> resolution that does not lose info when converting from struct
> timespec.

Which requires a 128bit division on every syscall for no value at all.
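
To show where that division comes from, here is a sketch of converting a
signed 128-bit attosecond count to seconds and nanoseconds. It assumes the
__int128 extension of GCC and Clang on 64-bit targets; the type and function
names are illustrative.

```c
#include <stdint.h>

typedef __int128 sk_atto_t;	/* GCC/Clang extension, 64-bit targets */

#define SK_ATTO_PER_NSEC ((sk_atto_t)1000000000)
#define SK_ATTO_PER_SEC  ((sk_atto_t)1000000000 * 1000000000)

struct sk_ts {
	int64_t tv_sec;
	long tv_nsec;
};

/* Lossless: every timespec value has an exact attosecond representation. */
static sk_atto_t ts_to_atto(struct sk_ts ts)
{
	return (sk_atto_t)ts.tv_sec * SK_ATTO_PER_SEC +
	       (sk_atto_t)ts.tv_nsec * SK_ATTO_PER_NSEC;
}

/* The reverse direction needs a 128-bit division and remainder. */
static struct sk_ts atto_to_ts(sk_atto_t a)
{
	sk_atto_t sec = a / SK_ATTO_PER_SEC;
	sk_atto_t rem = a % SK_ATTO_PER_SEC;
	struct sk_ts ts;

	/* C division truncates toward zero; keep tv_nsec in [0, 1e9). */
	if (rem < 0) {
		rem += SK_ATTO_PER_SEC;
		sec -= 1;
	}
	ts.tv_sec = (int64_t)sec;
	ts.tv_nsec = (long)(rem / SK_ATTO_PER_NSEC);
	return ts;
}
```

On common 64-bit targets the compiler usually lowers the 128-bit division
to a runtime library call rather than a single instruction, which is the
per-syscall cost being objected to here.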

Thanks,

        tglx

André Almeida Sept. 16, 2021, 6:54 p.m. UTC | #7

On 16/09/21 at 11:49, Thomas Gleixner wrote:
> On Wed, Sep 15 2021 at 10:34, Paul Eggert wrote:
> 
>> On 9/15/21 8:37 AM, Peter Zijlstra wrote:
>>> I utterly detest timespec.. it makes no sense what so ever.
>>>
>>> Can't we just, for new syscalls, simply use a s64 nsec argument and call
>>> it a day?
>>
>> This would stop working in the year 2262. Not a good idea.
> 
> Make it u64 and it stops in 2552, i.e. 584 years from now which is
> plenty. Lots of the kernel internal timekeeping will stop working at
> that point, so that interface is the least of my worries. And TBH, my
> worries about the Y2552 problem are extremely close to zero.
> 

What do we win by using u64 instead of timespec?

Or what's so bad about timespec?


Patch

--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7713,6 +7713,7 @@  M:	Ingo Molnar <mingo@redhat.com>
 R:	Peter Zijlstra <peterz@infradead.org>
 R:	Darren Hart <dvhart@infradead.org>
 R:	Davidlohr Bueso <dave@stgolabs.net>
+R:	André Almeida <andrealmeid@collabora.com>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -58,6 +58,7 @@  struct mq_attr;
 struct compat_stat;
 struct old_timeval32;
 struct robust_list_head;
+struct futex_waitv;
 struct getcpu_cache;
 struct old_linux_dirent;
 struct perf_event_attr;
@@ -623,6 +624,11 @@  asmlinkage long sys_get_robust_list(int
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+/* kernel/futex/syscalls.c */
+asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
+				unsigned int nr_futexes, unsigned int flags,
+				struct __kernel_timespec __user *timo);
+
 /* kernel/hrtimer.c */
 asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
 			      struct __kernel_timespec __user *rmtp);
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -880,8 +880,11 @@  __SYSCALL(__NR_memfd_secret, sys_memfd_s
 #define __NR_process_mrelease 448
 __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 
+#define __NR_futex_waitv 449
+__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
+
 #undef __NR_syscalls
-#define __NR_syscalls 449
+#define __NR_syscalls 450
 
 /*
  * 32 bit systems traditionally used different
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -44,6 +44,31 @@ 
 					 FUTEX_PRIVATE_FLAG)
 
 /*
+ * Flags to specify the bit length of the futex word for futex2 syscalls.
+ * Currently, only 32 is supported.
+ */
+#define FUTEX_32		2
+
+/*
+ * Max numbers of elements in a futex_waitv array
+ */
+#define FUTEX_WAITV_MAX		128
+
+/**
+ * struct futex_waitv - A waiter for vectorized wait
+ * @val:	Expected value at uaddr
+ * @uaddr:	User address to wait on
+ * @flags:	Flags for this waiter
+ * @__reserved:	Reserved member to preserve data alignment. Should be 0.
+ */
+struct futex_waitv {
+	__u64 val;
+	__u64 uaddr;
+	__u32 flags;
+	__u32 __reserved;
+};
+
+/*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
  */
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -268,6 +268,21 @@  extern int futex_requeue(u32 __user *uad
 extern int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		      ktime_t *abs_time, u32 bitset);
 
+/**
+ * struct futex_vector - Auxiliary struct for futex_waitv()
+ * @w: Userspace provided data
+ * @q: Kernel side data
+ *
+ * Struct used to build an array with all data needed for futex_waitv()
+ */
+struct futex_vector {
+	struct futex_waitv w;
+	struct futex_q q;
+};
+
+extern int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
+			       struct hrtimer_sleeper *to);
+
 extern int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset);
 
 extern int futex_wake_op(u32 __user *uaddr1, unsigned int flags,
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -199,6 +199,113 @@  SYSCALL_DEFINE6(futex, u32 __user *, uad
 	return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
 }
 
+/* Mask of available flags for each futex in futex_waitv list */
+#define FUTEXV_WAITER_MASK (FUTEX_32 | FUTEX_PRIVATE_FLAG)
+
+/* Mask of available flags for sys_futex_waitv flag */
+#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
+
+/**
+ * futex_parse_waitv - Parse a waitv array from userspace
+ * @futexv:	Kernel side list of waiters to be filled
+ * @uwaitv:     Userspace list to be parsed
+ * @nr_futexes: Length of futexv
+ *
+ * Return: Error code on failure, 0 on success
+ */
+static int futex_parse_waitv(struct futex_vector *futexv,
+			     struct futex_waitv __user *uwaitv,
+			     unsigned int nr_futexes)
+{
+	struct futex_waitv aux;
+	unsigned int i;
+
+	for (i = 0; i < nr_futexes; i++) {
+		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
+			return -EFAULT;
+
+		if ((aux.flags & ~FUTEXV_WAITER_MASK) || aux.__reserved)
+			return -EINVAL;
+
+		futexv[i].w.flags = aux.flags;
+		futexv[i].w.val = aux.val;
+		futexv[i].w.uaddr = aux.uaddr;
+		futexv[i].q = futex_q_init;
+	}
+
+	return 0;
+}
+
+/**
+ * sys_futex_waitv - Wait on a list of futexes
+ * @waiters:    List of futexes to wait on
+ * @nr_futexes: Length of futexv
+ * @flags:      Flag for timeout (monotonic/realtime)
+ * @timo:	Optional absolute timeout.
+ *
+ * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
+ * if a futex_wake() is performed at any uaddr. The syscall returns immediately
+ * if any waiter has *uaddr != val. *timo is an optional timeout value for the
+ * operation. Each waiter has individual flags. The `flags` argument for the
+ * syscall should be used solely for specifying the timeout as realtime, if
+ * needed. Flags for shared futexes, sizes, etc. should be used on the
+ * individual flags of each waiter.
+ *
+ * Returns the array index of one of the awaken futexes. There's no given
+ * information of how many were awakened, or any particular attribute of it (if
+ * it's the first awakened, if it is of the smaller index...).
+ */
+
+SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
+		unsigned int, nr_futexes, unsigned int, flags,
+		struct __kernel_timespec __user *, timo)
+{
+	struct hrtimer_sleeper to;
+	struct futex_vector *futexv;
+	struct timespec64 ts;
+	ktime_t time;
+	int ret;
+
+	if (flags & ~FUTEXV_MASK)
+		return -EINVAL;
+
+	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
+		return -EINVAL;
+
+	if (timo) {
+		int flag_clkid = (flags & FUTEX_CLOCK_REALTIME) ? FLAGS_CLOCKRT : 0;
+
+		if (get_timespec64(&ts, timo))
+			return -EFAULT;
+
+		/*
+		 * Since there's no opcode for futex_waitv, use
+		 * FUTEX_WAIT_BITSET that uses absolute timeout as well
+		 */
+		ret = futex_init_timeout(FUTEX_WAIT_BITSET, flags, &ts, &time);
+		if (ret)
+			return ret;
+
+		futex_setup_timer(&time, &to, flag_clkid, 0);
+	}
+
+	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
+	if (!futexv)
+		return -ENOMEM;
+
+	ret = futex_parse_waitv(futexv, waiters, nr_futexes);
+	if (!ret)
+		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
+
+	if (timo) {
+		hrtimer_cancel(&to.timer);
+		destroy_hrtimer_on_stack(&to.timer);
+	}
+
+	kfree(futexv);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE2(set_robust_list,
 		struct compat_robust_list_head __user *, head,
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -358,6 +358,207 @@  void futex_wait_queue(struct futex_hash_
 }
 
 /**
+ * unqueue_multiple - Remove various futexes from their hash bucket
+ * @v:	   The list of futexes to unqueue
+ * @count: Number of futexes in the list
+ *
+ * Helper to unqueue a list of futexes. This can't fail.
+ *
+ * Return:
+ *  - >=0 - Index of the last futex that was awoken;
+ *  - -1  - No futex was awoken
+ */
+static int unqueue_multiple(struct futex_vector *v, int count)
+{
+	int ret = -1, i;
+
+	for (i = 0; i < count; i++) {
+		if (!futex_unqueue(&v[i].q))
+			ret = i;
+	}
+
+	return ret;
+}
+
+/**
+ * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
+ * @vs:		The futex list to wait on
+ * @count:	The size of the list
+ * @awaken:	Index of the last awoken futex, if any. Used to notify the
+ *		caller that it can return this index to userspace (return parameter)
+ *
+ * Prepare multiple futexes in a single step and enqueue them. This may fail if
+ * the futex list is invalid or if any futex was already awoken. On success the
+ * task is ready to interruptible sleep.
+ *
+ * Return:
+ *  -  1 - One of the futexes was awaken by another thread
+ *  -  0 - Success
+ *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
+ */
+static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *awaken)
+{
+	struct futex_hash_bucket *hb;
+	bool retry = false;
+	int ret, i;
+	u32 uval;
+
+	/*
+	 * Enqueuing multiple futexes is tricky, because we need to enqueue
+	 * each futex in the list before dealing with the next one to avoid
+	 * deadlocking on the hash bucket. But, before enqueuing, we need to
+	 * make sure that current->state is TASK_INTERRUPTIBLE, so we don't
+	 * absorb any awake events, which cannot be done before the
+	 * get_futex_key of the next key, because it calls get_user_pages,
+	 * which can sleep. Thus, we fetch the list of futexes keys in two
+	 * steps, by first pinning all the memory keys in the futex key, and
+	 * only then we read each key and queue the corresponding futex.
+	 *
+	 * Private futexes don't need to recalculate hash in retry, so skip
+	 * get_futex_key() when retrying.
+	 */
+retry:
+	for (i = 0; i < count; i++) {
+		if ((vs[i].w.flags & FUTEX_PRIVATE_FLAG) && retry)
+			continue;
+
+		ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
+				    !(vs[i].w.flags & FUTEX_PRIVATE_FLAG),
+				    &vs[i].q.key, FUTEX_READ);
+
+		if (unlikely(ret))
+			return ret;
+	}
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	for (i = 0; i < count; i++) {
+		u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
+		struct futex_q *q = &vs[i].q;
+		u32 val = (u32)vs[i].w.val;
+
+		hb = futex_q_lock(q);
+		ret = futex_get_value_locked(&uval, uaddr);
+
+		if (!ret && uval == val) {
+			/*
+			 * The bucket lock can't be held while dealing with the
+			 * next futex. Queue each futex at this moment so hb can
+			 * be unlocked.
+			 */
+			futex_queue(q, hb);
+			continue;
+		}
+
+		futex_q_unlock(hb);
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * Even if something went wrong, if we find out that a futex
+		 * was awaken, we don't return error and return this index to
+		 * userspace
+		 */
+		*awaken = unqueue_multiple(vs, i);
+		if (*awaken >= 0)
+			return 1;
+
+		if (ret) {
+			/*
+			 * If we need to handle a page fault, we need to do so
+			 * without any lock and any enqueued futex (otherwise
+			 * we could lose some wakeup). So we do it here, after
+			 * undoing all the work done so far. In success, we
+			 * retry all the work.
+			 */
+			if (get_user(uval, uaddr))
+				return -EFAULT;
+
+			retry = true;
+			goto retry;
+		}
+
+		if (uval != val)
+			return -EWOULDBLOCK;
+	}
+
+	return 0;
+}
+
+/**
+ * futex_sleep_multiple - Check sleeping conditions and sleep
+ * @vs:    List of futexes to wait for
+ * @count: Length of vs
+ * @to:    Timeout
+ *
+ * Sleep if and only if the timeout hasn't expired and no futex on the list has
+ * been awaken.
+ */
+static void futex_sleep_multiple(struct futex_vector *vs, unsigned int count,
+				 struct hrtimer_sleeper *to)
+{
+	if (to && !to->task)
+		return;
+
+	for (; count; count--, vs++) {
+		if (!READ_ONCE(vs->q.lock_ptr))
+			return;
+	}
+
+	freezable_schedule();
+}
+
+/**
+ * futex_wait_multiple - Prepare to wait on and enqueue several futexes
+ * @vs:		The list of futexes to wait on
+ * @count:	The number of objects
+ * @to:		Timeout before giving up and returning to userspace
+ *
+ * Entry point for the futex_waitv() syscall. This function
+ * sleeps on a group of futexes and returns on the first futex that is
+ * woken, or after the timeout has elapsed.
+ *
+ * Return:
+ *  - >=0 - Hint to the futex that was awoken
+ *  - <0  - On error
+ */
+int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
+			struct hrtimer_sleeper *to)
+{
+	int ret, hint = 0;
+
+	if (to)
+		hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
+
+	while (1) {
+		ret = futex_wait_multiple_setup(vs, count, &hint);
+		if (ret) {
+			if (ret > 0) {
+				/* A futex was awaken during setup */
+				ret = hint;
+			}
+			return ret;
+		}
+
+		futex_sleep_multiple(vs, count, to);
+
+		__set_current_state(TASK_RUNNING);
+
+		ret = unqueue_multiple(vs, count);
+		if (ret >= 0)
+			return ret;
+
+		if (to && !to->task)
+			return -ETIMEDOUT;
+		else if (signal_pending(current))
+			return -ERESTARTSYS;
+		/*
+		 * The final case is a spurious wakeup, for
+		 * which just retry.
+		 */
+	}
+}
+
+/**
  * futex_wait_setup() - Prepare to wait on a futex
  * @uaddr:	the futex userspace address
  * @val:	the expected value
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -151,6 +151,9 @@  COND_SYSCALL_COMPAT(set_robust_list);
 COND_SYSCALL(get_robust_list);
 COND_SYSCALL_COMPAT(get_robust_list);
 
+/* kernel/futex/syscalls.c */
+COND_SYSCALL(futex_waitv);
+
 /* kernel/hrtimer.c */
 
 /* kernel/itimer.c */