Kernel prctl feature for syscall interception and emulation

Message ID 873616v6g9.fsf@collabora.com
State Not applicable

Commit Message

Gabriel Krisman Bertazi Nov. 18, 2020, 6:57 p.m. UTC
  Hi,

I'm proposing a kernel patch for a feature I'm calling Syscall User
Dispatch (SUD).  It is a mechanism to efficiently redirect system calls
of only part of a binary back to userspace to be emulated by a
compatibility layer.  The patchset is close to being accepted, but
Florian suggested the feature might pose some constraints on glibc, and
requested I raise the discussion here.

The problem I am trying to solve is that modern Windows games running
over Wine are issuing Windows system calls directly from the Windows
code, without going through the "WinAPI", which doesn't give Wine a
chance to emulate the library calls and implement the behavior.  As a
result, Windows syscalls reach the Linux kernel, and the kernel has
no context to differentiate them from native syscalls coming from the
Wine side, since it cannot trust the ABI, not even syscall numbers to be
something sane.  Historically, Windows applications were very respectful
of the WinAPI, not bypassing it, but we are seeing modern applications
like games doing it more often for reasons, I believe, of DRM.

It is worth mentioning that, by design, Wine and the Windows application
run on the same process space, so we really cannot just filter specific
threads or the entire application. We need some kind of filter executed
on each system call.

Now, the obvious way to solve this problem would be cBPF filtering
memory regions, through Seccomp.  The main problem with that approach is
the performance of executing a large cBPF filter.  The goal is to run
games, and we observed the Seccomp filter become a bottleneck, since we
have many, many memory areas that need to be checked by cBPF.  In
addition, seccomp, as a security mechanism, doesn't support some filter
update operations, like removing them.  Other approaches were
explored, like making a new mode out of seccomp, but the kernel
community preferred to make it a separate, self-contained mechanism.
Other solutions, like (live) patching the Windows application, are out
of the question, as they would trip DRM and anti-cheat protection
mechanisms.

The SUD interface I proposed to the kernel community is self-contained
and exposed as a prctl option.  It lets userspace define a switch
variable per-thread that, when set, will raise a SIGSYS for any syscall
attempted.  The idea is that Wine can just flip this switch efficiently
before delivering control to the Windows portions of the binary, and
flip it back off when it needs to execute native syscalls.  It is
important for us that the switch flip doesn't require a syscall, for
performance reasons.  The interface also lets userspace define a
"dispatcher region" from where any syscalls are always executed,
regardless of the selector variable.  This is important for the return
of the SIGSYS directly to a Windows segment, where we need to execute
the signal return trampoline with the selector blocked.  Ideally, Wine
would simply define this dispatcher region as the entire libc code
segment, and just use the selector to safe-guard against Linux libraries
issuing syscalls by themselves (they exist).
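
For readers who want something concrete, here is a minimal sketch of
how a thread might enable the feature and flip the selector around a
call into guest code.  It assumes the prctl signature and constants as
they appear in Linux's <linux/prctl.h>; the numeric fallbacks are only
there for older headers.  The names sud_selector, sud_enable,
call_guest and probe_selector are hypothetical and are not taken from
Wine or from the patchset.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for headers that predate the feature; the values mirror
 * the constants in Linux's <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON  1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Per-thread selector byte; the kernel reads it on syscall entry. */
static __thread volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Enable dispatch for the calling thread.  Syscalls issued from
 * [region_start, region_start + region_len) ignore the selector. */
int sud_enable(void *region_start, size_t region_len)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 (unsigned long)region_start, (unsigned long)region_len,
                 (unsigned long)&sud_selector);
}

typedef long (*guest_fn)(long);

/* Hypothetical boundary-crossing helper: block syscalls while guest
 * code runs, allow them again on return.  Both flips are plain stores. */
long call_guest(guest_fn fn, long arg)
{
    long ret;

    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    ret = fn(arg);
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    return ret;
}

/* Stand-in for guest code: reports the selector value it runs under. */
long probe_selector(long arg)
{
    (void)arg;
    return sud_selector;
}
```

With dispatch enabled, a syscall attempted while the selector holds
SYSCALL_DISPATCH_FILTER_BLOCK, from outside the dispatcher region,
would raise SIGSYS instead of executing.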

I think my questions to libc are: what are the constraints, if any, that
libc would face with this new interface?  I expected this to be
completely invisible to libc.  In addition, are there any problems you
foresee with the current interface?

Finally, I don't think it makes sense to bother you immediately with
the kernel implementation patches, but if you want to see them,
they are archived in the link below.  I can also share them directly on
this ML if you request it.

  https://lkml.org/lkml/2020/11/17/2347

Nevertheless, I think it is useful to share the final patch, which has
the in-tree documentation for the interface, which I inlined in this
message.

Thanks.

-- >8 --
Subject: [PATCH v7 7/7] docs: Document Syscall User Dispatch

Explain the interface, provide some background and security notes.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 .../admin-guide/syscall-user-dispatch.rst     | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
  

Comments

Rich Felker Nov. 19, 2020, 3:13 p.m. UTC | #1
On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
> Hi,
> 
> I'm proposing a kernel patch for a feature I'm calling Syscall User
> Dispatch (SUD).  It is a mechanism to efficiently redirect system calls
> of only part of a binary back to userspace to be emulated by a
> compatibility layer.  The patchset is close to being accepted, but
> Florian suggested the feature might pose some constraints on glibc, and
> requested I raise the discussion here.
> 
> The problem I am trying to solve is that modern Windows games running
> over Wine are issuing Windows system calls directly from the Windows
> code, without going through the "WinAPI", which doesn't give Wine a
> chance to emulate the library calls and implement the behavior.  As a
> result, Windows syscalls reache the Linux kernel, and the kernel has
> no context to differentiate them from native syscalls coming from the
> Wine side, since it cannot trust the ABI, not even syscall numbers to be
> something sane.  Historically, Windows applications were very respectful
> of the WinAPI, not bypassing it, but we are seeing modern applications
> like games doing it more often for reasons, I believe, of DRM.
> 
> It is worth mentioning that, by design, Wine and the Windows application
> run on the same process space, so we really cannot just filter specific
> threads or the entire application. We need some kind of filter executed
> on each system call.
> 
> Now, the obvious way to solve this problem would be cBPF filtering
> memory regions, through Seccomp.  The main problem with that approach is
> the performance of executing a large cBPF filter.  The goal is to run
> games, and we observed the Seccomp filter become a bottleneck, since we
> have many, many memory areas that need to be checked by cBPF.  In
> addition, seccomp, as a security mechanism, doesn't support some filter
> update operations, like removing them.  Another approaches were
> explored, like making a new mode out of seccomp, but the kernel
> community preferred to make it a separate, self-contained mechanism.
> Other solutions, like (live) patching the Windows application are out
> of question, as they would trip DRM and anti-cheat protection
> mechanisms.
> 
> The SUD interface I proposed to the kernel community is self-contained
> and exposed as a prctl option.  It lets userspace define a switch
> variable per-thread that, when set, will raise a SIGSYS for any syscall
> attempted.  The idea is that Wine can just flip this switch efficiently
> before delivering control to the Windows portions of the binary, and
> flip it back off when it needs to execute native syscalls.  It is
> important for us that the switch flip doesn't require a syscall, for
> performance reasons.  The interface also lets userspace define a
> "dispatcher region" from where any syscalls are always executed,
> regardless of the selector variable.  This is important for the return
> of the SIGSYS directly to a Windows segment, where we need to execute
> the signal return trampoline with the selector blocked.  Ideally, Wine
> would simply define this dispatcher region as the entire libc code
> segment, and just use the selector to safe-guard against Linux libraries
> issuing syscalls by themselves (they exist).
> 
> I think my questions to libc are: what are the constraints, if any, that
> libc would face with this new interface?  I expected this to be
> completely invisible to libc.  In addition, are there any problems you
> foresee with the current interface?
> 
> Finally, I don't think it makes sense to bother you immediately with
> the kernel implementation patches, but if you want to see the them,
> they are archived in the link below.  I can also share them directly on
> this ML if you request it.
> 
>   https://lkml.org/lkml/2020/11/17/2347
> 
> Nevertheless, I think it is useful the share the final patch, that has
> the in-tree documentation for the interface, which I inlined in this
> message.

SIGSYS (or signal handling in general) is not the right way to do
this. It has all the same problems that came up in seccomp filtering
with SIGSYS, and which were solved by user_notif mode (running the
interception in a separate thread rather than an async context
interrupting the syscall). In fact I wouldn't be surprised if what you
want can already be done with reasonable efficiency using seccomp
user_notif.

The default-intercept and excepting libc code segment is also bogus,
and will break stuff, including vdso syscall mechanism on i386 and any
code outside libc that makes its own syscalls from asm. If you need to
tag regions to control interception, it should be tagging the emulated
Windows guest code, which is bounded and you have full control over,
rather than the host code, which is unbounded and includes any
libraries that get linked indirectly by Wine. But I'm skeptical that
doing any new kernel-side logic for tagging is needed. Seccomp already
lets you filter on instruction pointer so you can install filters that
will trigger user_notif just for guest code, then let you execute the
emulation in the watcher thread and skip the actual syscall in the
watched thread.

Rich
  
Gabriel Krisman Bertazi Nov. 19, 2020, 4:15 p.m. UTC | #2
Rich Felker <dalias@libc.org> writes:

> On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:

[...]

>
> SIGSYS (or signal handling in general) is not the right way to do
> this. It has all the same problems that came up in seccomp filtering
> with SIGSYS, and which were solved by user_notif mode (running the
> interception in a separate thread rather than an async context
> interrupting the syscall. In fact I wouldn't be surprised if what you
> want can already be done with reasonable efficiency using seccomp
> user_notif.

Hi Rich,

User_notif was raised in the kernel discussion and we had experimented
with it, but the latency of user_notif is even worse than what we can do
right now with other seccomp actions.

Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
return to a userspace thunk, but the understanding among Wine developers
is that SIGSYS is enough for their emulation needs.

> The default-intercept and excepting libc code segment is also bogus,
> and will break stuff, including vdso syscall mechanism on i386 and any
> code outside libc that makes its own syscalls from asm. If you need to
> tag regions to control interception, it should be tagging the emulated
> Windows guest code, which is bounded and you have full control over,
> rather than the host code, which is unbounded and includes any
> libraries that get linked indirectly by Wine.

The vdso trampoline, for the architectures that have it, is solved by
the kernel implementation, which makes sure that region is allowed.

The Linux code is not bounded, but the dispatcher region's main goal is
to support trampolines outside of the vdso case. The correct userspace
implementation requires flipping the selector on any Windows/Linux code
boundary cross, exactly because other libraries can issue syscalls
directly.  The fact that libc is not the only one issuing syscalls is
the exact reason we need something more complex than a few seccomp
filters.

Flipping the selector on every boundary crossing is fine for
performance, since we don't go into the kernel.  But if we can avoid
checking it from kernelspace, that's an optimization, which is what I
meant by the dispatcher region allowing more parts of the glibc code.
That's just an optimization, not strictly necessary for correctness.

I still don't think anything is broken here.

> But I'm skeptical that doing any new kernel-side logic for tagging is
> needed. Seccomp already lets you filter on instruction pointer so you
> can install filters that will trigger user_notif just for guest code,
> then let you execute the emulation in the watcher thread and skip the
> actual syscall in the watched thread.

As I mentioned, we can check IP in seccomp and write filters.  But this
has two problems:

1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
no maps and a very limited instruction set.  We need to generate
boundary checks for each memory segment.  The filter becomes very large
very quickly and becomes an observable bottleneck.

2) Seccomp filters cannot be removed.  And we'd need to update them
frequently.
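
To illustrate point 1: a seccomp filter sees the instruction pointer as
a 64-bit field of struct seccomp_data, but cBPF only loads and compares
32-bit words, so every 64-bit bound has to be checked as a high half
and a low half.  The sketch below is plain C, not cBPF; it only mirrors
the shape of the work a filter has to do for one segment.  The function
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* a >= b for 64-bit values, using only 32-bit comparisons, the way a
 * cBPF program has to do it. */
bool ge64_by_halves(uint32_t a_hi, uint32_t a_lo,
                    uint32_t b_hi, uint32_t b_lo)
{
    if (a_hi != b_hi)
        return a_hi > b_hi;
    return a_lo >= b_lo;
}

/* True if ip lies in [start, end).  Each segment costs two 64-bit
 * bound checks, i.e. up to four 32-bit comparisons. */
bool ip_in_segment(uint64_t ip, uint64_t start, uint64_t end)
{
    uint32_t ip_hi = (uint32_t)(ip >> 32);
    uint32_t ip_lo = (uint32_t)ip;
    bool ge_start = ge64_by_halves(ip_hi, ip_lo,
                                   (uint32_t)(start >> 32), (uint32_t)start);
    bool ge_end = ge64_by_halves(ip_hi, ip_lo,
                                 (uint32_t)(end >> 32), (uint32_t)end);

    return ge_start && !ge_end;
}
```

Repeating this for every guest segment, with no loops or maps to fall
back on, is what makes the generated filter grow.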
  
Rich Felker Nov. 19, 2020, 4:28 p.m. UTC | #3
On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
> Rich Felker <dalias@libc.org> writes:
> 
> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
> 
> [...]
> 
> >
> > SIGSYS (or signal handling in general) is not the right way to do
> > this. It has all the same problems that came up in seccomp filtering
> > with SIGSYS, and which were solved by user_notif mode (running the
> > interception in a separate thread rather than an async context
> > interrupting the syscall. In fact I wouldn't be surprised if what you
> > want can already be done with reasonable efficiency using seccomp
> > user_notif.
> 
> Hi Rich,
> 
> User_notif was raised in the kernel discussion and we had experimented
> with it, but the latency of user_notif is even worse than what we can do
> right now with other seccomp actions.

Is there a compelling argument that the latency matters here? What
syscalls are windows binaries making like this? Is there a reason you
can't do something like intercepting the syscall with seccomp the
first time it happens, then rewriting the code not to use a direct
syscall on future invocations?

> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
> return to a userspace thunk, but the understanding among Wine developers
> is that SIGSYS is enough for their emulation needs.

It might work for Wine needs, if Wine can guarantee it will never be
running code with signals blocked and some other constraints, but then
you end up with a mechanism that's designed just for Wine and that
will have gratuitous reasons it's not usable elsewhere. That does not
seem appropriate for inclusion in kernel.

> > The default-intercept and excepting libc code segment is also bogus,
> > and will break stuff, including vdso syscall mechanism on i386 and any
> > code outside libc that makes its own syscalls from asm. If you need to
> > tag regions to control interception, it should be tagging the emulated
> > Windows guest code, which is bounded and you have full control over,
> > rather than the host code, which is unbounded and includes any
> > libraries that get linked indirectly by Wine.
> 
> The vdso trampoline, for the architectures that have it, is solved by
> the kernel implementation, who makes sure that region is allowed.

I guess that works but it's ugly and assumes particular policy goals
matching Wine's rather than being a general mechanism.

> The Linux code is not bounded, but the dispatcher region main goal is to
> support trampolines outside of the vdso case. The correct userspace
> implementation requires flipping the selector on any Windows/Linux code
> boundary cross, exactly because other libraries can issue syscalls
> directly.  The fact that libc is not the only one issuing syscalls is
> the exact reason we need something more complex than a few seccomp
> filters.

I don't think this is correct. Rather than listing all the host
library code ranges to allow, you just list all the guest Windows code
ranges to intercept. Wine knows them by virtue of being the loader for
them. This all seems really easy to do with seccomp with a very small
filter.

> > But I'm skeptical that doing any new kernel-side logic for tagging is
> > needed. Seccomp already lets you filter on instruction pointer so you
> > can install filters that will trigger user_notif just for guest code,
> > then let you execute the emulation in the watcher thread and skip the
> > actual syscall in the watched thread.
> 
> As I mentioned, we can check IP in seccomp and write filters.  But this
> has two problems:
> 
> 1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
> no maps and a very limited instruction set.  We need to generate
> boundary checks for each memory segment.  The filter becomes very large
> very quickly and becomes a observable bottleneck.

This sounds like you're doing something wrong. Range checking is O(log
n) and n cannot be large enough to make log n significant. If you do
it with a linear search rather than binary then of course it's slow.
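
A rough sketch of the lookup I have in mind, written as ordinary C over
a sorted table of disjoint guest ranges.  The struct and function names
are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
    uint64_t start;
    uint64_t end;
};

/* Binary search over ranges sorted by start and non-overlapping.
 * Returns true if addr falls inside any of the n ranges. */
bool addr_in_ranges(const struct range *r, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (addr < r[mid].start)
            hi = mid;           /* addr is below this range */
        else if (addr >= r[mid].end)
            lo = mid + 1;       /* addr is above this range */
        else
            return true;        /* start <= addr < end */
    }
    return false;
}
```

Classic BPF only has forward jumps, so in a seccomp filter the loop
would have to be unrolled into a tree of conditional jumps generated
from the table, but the number of comparisons per syscall stays
logarithmic in the number of ranges.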

> 2) Seccomp filters cannot be removed.  And we'd need to update them
> frequently.

What are the updating requirements?

I'm not sure if Windows code is properly PIC or not, but if it is,
then you just do your own address assignment in a single huge range
(first allocated with PROT_NONE, then MAP_FIXED over top of it) so
that a single static range check suffices.
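
Something along these lines, as a sketch only.  It assumes the guest
code can be placed at offsets of the loader's choosing inside one
reservation; guest_reserve, guest_map and addr_is_guest are
hypothetical names, and anonymous memory stands in for the real
file-backed mappings.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

static uintptr_t guest_base;
static size_t guest_size;

/* Reserve one contiguous span of address space with no access rights.
 * Returns 0 on success, -1 on failure. */
int guest_reserve(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    guest_base = (uintptr_t)p;
    guest_size = size;
    return 0;
}

/* Place a segment inside the reservation, replacing the PROT_NONE
 * pages.  offset must be page-aligned.  Returns MAP_FAILED on error. */
void *guest_map(size_t offset, size_t len, int prot)
{
    if (offset > guest_size || len > guest_size - offset)
        return MAP_FAILED;
    return mmap((void *)(guest_base + offset), len, prot,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}

/* The single static range check: unsigned wraparound makes addresses
 * below guest_base compare as out of range too. */
bool addr_is_guest(const void *addr)
{
    return (uintptr_t)addr - guest_base < guest_size;
}
```

Because MAP_FIXED only ever lands on pages already owned by the
reservation, it cannot clobber unrelated host mappings.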

Rich
  
Gabriel Krisman Bertazi Nov. 19, 2020, 5:32 p.m. UTC | #4
Rich Felker <dalias@libc.org> writes:

> On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
>> Rich Felker <dalias@libc.org> writes:
>> 
>> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
>> 
>> [...]
>> 
>> >
>> > SIGSYS (or signal handling in general) is not the right way to do
>> > this. It has all the same problems that came up in seccomp filtering
>> > with SIGSYS, and which were solved by user_notif mode (running the
>> > interception in a separate thread rather than an async context
>> > interrupting the syscall. In fact I wouldn't be surprised if what you
>> > want can already be done with reasonable efficiency using seccomp
>> > user_notif.
>> 
>> Hi Rich,
>> 
>> User_notif was raised in the kernel discussion and we had experimented
>> with it, but the latency of user_notif is even worse than what we can do
>> right now with other seccomp actions.
>
> Is there a compelling argument that the latency matters here? What
> syscalls are windows binaries making like this? Is there a reason you
> can't do something like intercepting the syscall with seccomp the
> first time it happens, then rewriting the code not to use a direct
> syscall on future invocations?

We can't do any code rewriting without tripping DRM protections and
anti-cheating mechanisms.

I should correct myself here.  While it is true that user_notif is
slower than other seccomp actions, this is not a problem in itself.  The
frequency of syscalls that need to be emulated is much smaller than
regular syscalls, and the performance problem actually appears due to
the filtering.  I should investigate user_notif more, but I don't oppose
SUD doing user_notif instead of SIGSYS.  I will raise that with Wine
developers and the kernel community.

>> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
>> return to a userspace thunk, but the understanding among Wine developers
>> is that SIGSYS is enough for their emulation needs.
>
> It might work for Wine needs, if Wine can guarantee it will never be
> running code with signals blocked and some other constraints, but then
> you end up with a mechanism that's designed just for Wine and that
> will have gratuitous reasons it's not usable elsewhere. That does not
> seem appropriate for inclusion in kernel.
>
>> > The default-intercept and excepting libc code segment is also bogus,
>> > and will break stuff, including vdso syscall mechanism on i386 and any
>> > code outside libc that makes its own syscalls from asm. If you need to
>> > tag regions to control interception, it should be tagging the emulated
>> > Windows guest code, which is bounded and you have full control over,
>> > rather than the host code, which is unbounded and includes any
>> > libraries that get linked indirectly by Wine.
>> 
>> The vdso trampoline, for the architectures that have it, is solved by
>> the kernel implementation, who makes sure that region is allowed.
>
> I guess that works but it's ugly and assumes particular policy goals
> matching Wine's rather than being a general mechanism.
>
>> The Linux code is not bounded, but the dispatcher region main goal is to
>> support trampolines outside of the vdso case. The correct userspace
>> implementation requires flipping the selector on any Windows/Linux code
>> boundary cross, exactly because other libraries can issue syscalls
>> directly.  The fact that libc is not the only one issuing syscalls is
>> the exact reason we need something more complex than a few seccomp
>> filters.
>
> I don't think this is correct. Rather than listing all the host
> library code ranges to allow, you just list all the guest Windows code
> ranges to intercept. Wine knows them by virtue of being the loader for
> them. This all seems really easy to do with seccomp with a very small
> filter.

The Windows code is not completely loaded at initialization time.  It
also has dynamic libraries loaded later.  Yes, Wine knows the memory
regions, but there is no guarantee there is a small number of segments
or that the full picture is known at any given moment.

>> > But I'm skeptical that doing any new kernel-side logic for tagging is
>> > needed. Seccomp already lets you filter on instruction pointer so you
>> > can install filters that will trigger user_notif just for guest code,
>> > then let you execute the emulation in the watcher thread and skip the
>> > actual syscall in the watched thread.
>> 
>> As I mentioned, we can check IP in seccomp and write filters.  But this
>> has two problems:
>> 
>> 1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
>> no maps and a very limited instruction set.  We need to generate
>> boundary checks for each memory segment.  The filter becomes very large
>> very quickly and becomes a observable bottleneck.
>
> This sounds like you're doing something wrong. Range checking is O(log
> n) and n cannot be large enough to make log n significant. If you do
> it with a linear search rather than binary then of course it's slow.

And SUD is O(1).  The filtering overhead is the big point here.  The
seccomp kselftests benchmark shows a 32% overhead introduced by seccomp
for a simple getpid syscall.  With a second filter (not a second
verification on the same filter), the overhead goes to 47%.  SUD shows
an overhead of 13.4% over the same syscall.

I understand two filters is very different than 1 filter with more vmas,
but since we cannot remove filters, we'd need to add more filters to
make it more strict.

>> 2) Seccomp filters cannot be removed.  And we'd need to update them
>> frequently.
>
> What are the updating requirements?

As far as I understand (I'm not a wine developer), they need to remove
and modify filters.  Given seccomp is a security feature, it would be a
hard sell to support these operations. We discussed this on the kernel
list.

> I'm not sure if Windows code is properly PIC or not, but if it is,
> then you just do your own address assignment in a single huge range
> (first allocated with PROT_NONE, then MAP_FIXED over top of it) so
> that a single static range check suffices.

I'm Cc'ing some wine developers who can assist with this point.
  
Rich Felker Nov. 19, 2020, 5:39 p.m. UTC | #5
On Thu, Nov 19, 2020 at 12:32:54PM -0500, Gabriel Krisman Bertazi wrote:
> Rich Felker <dalias@libc.org> writes:
> 
> > On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
> >> Rich Felker <dalias@libc.org> writes:
> >> 
> >> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
> >> 
> >> [...]
> >> 
> >> >
> >> > SIGSYS (or signal handling in general) is not the right way to do
> >> > this. It has all the same problems that came up in seccomp filtering
> >> > with SIGSYS, and which were solved by user_notif mode (running the
> >> > interception in a separate thread rather than an async context
> >> > interrupting the syscall. In fact I wouldn't be surprised if what you
> >> > want can already be done with reasonable efficiency using seccomp
> >> > user_notif.
> >> 
> >> Hi Rich,
> >> 
> >> User_notif was raised in the kernel discussion and we had experimented
> >> with it, but the latency of user_notif is even worse than what we can do
> >> right now with other seccomp actions.
> >
> > Is there a compelling argument that the latency matters here? What
> > syscalls are windows binaries making like this? Is there a reason you
> > can't do something like intercepting the syscall with seccomp the
> > first time it happens, then rewriting the code not to use a direct
> > syscall on future invocations?
> 
> We can't do any code rewriting without tripping DRM protections and
> anti-cheating mechanisms.

I think you could if you maintained separate versions of the code for
read vs exec access ala some oldschool hardening tricks, but maybe
that's not compatible with windows code (or with 64-bit mode?).
Actually it's rather impressive that a DRM/anti-cheat mess works on
Wine at all.

> I should correct myself here.  While it is true that user_notif is
> slower than other seccomp actions, this is not a problem in itself.  The
> frequency of syscalls that need to be emulated is much smaller than
> regular syscalls, and the performance problem actually appears due to
> the filtering.  I should investigate user_notif more, but I don't oppose
> SUD doing user_notif instead of SIGSYS.  I will raise that with Wine
> developers and the kernel community.

Thanks! Avoiding repetition of the SIGSYS pitfall would be a good
thing.

> >> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
> >> return to a userspace thunk, but the understanding among Wine developers
> >> is that SIGSYS is enough for their emulation needs.
> >
> > It might work for Wine needs, if Wine can guarantee it will never be
> > running code with signals blocked and some other constraints, but then
> > you end up with a mechanism that's designed just for Wine and that
> > will have gratuitous reasons it's not usable elsewhere. That does not
> > seem appropriate for inclusion in kernel.
> >
> >> > The default-intercept and excepting libc code segment is also bogus,
> >> > and will break stuff, including vdso syscall mechanism on i386 and any
> >> > code outside libc that makes its own syscalls from asm. If you need to
> >> > tag regions to control interception, it should be tagging the emulated
> >> > Windows guest code, which is bounded and you have full control over,
> >> > rather than the host code, which is unbounded and includes any
> >> > libraries that get linked indirectly by Wine.
> >> 
> >> The vdso trampoline, for the architectures that have it, is solved by
> >> the kernel implementation, who makes sure that region is allowed.
> >
> > I guess that works but it's ugly and assumes particular policy goals
> > matching Wine's rather than being a general mechanism.
> >
> >> The Linux code is not bounded, but the dispatcher region main goal is to
> >> support trampolines outside of the vdso case. The correct userspace
> >> implementation requires flipping the selector on any Windows/Linux code
> >> boundary cross, exactly because other libraries can issue syscalls
> >> directly.  The fact that libc is not the only one issuing syscalls is
> >> the exact reason we need something more complex than a few seccomp
> >> filters.
> >
> > I don't think this is correct. Rather than listing all the host
> > library code ranges to allow, you just list all the guest Windows code
> > ranges to intercept. Wine knows them by virtue of being the loader for
> > them. This all seems really easy to do with seccomp with a very small
> > filter.
> 
> The Windows code is not completely loaded at initialization time.  It
> also has dynamic libraries loaded later.  yes, wine knows the memory
> regions, but there is no guarantee there is a small number of segments
> or that the full picture is known at any given moment.

Yes, I didn't mean it was known statically at init time (although
maybe it can be; see below) just that all the code doing the loading
is under Wine's control (vs having system dynamic linker doing stuff
it can't reliably see, which is the case with host libraries).

> >> > But I'm skeptical that doing any new kernel-side logic for tagging is
> >> > needed. Seccomp already lets you filter on instruction pointer so you
> >> > can install filters that will trigger user_notif just for guest code,
> >> > then let you execute the emulation in the watcher thread and skip the
> >> > actual syscall in the watched thread.
> >> 
> >> As I mentioned, we can check IP in seccomp and write filters.  But this
> >> has two problems:
> >> 
> >> 1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
> >> no maps and a very limited instruction set.  We need to generate
> >> boundary checks for each memory segment.  The filter becomes very large
> >> very quickly and becomes a observable bottleneck.
> >
> > This sounds like you're doing something wrong. Range checking is O(log
> > n) and n cannot be large enough to make log n significant. If you do
> > it with a linear search rather than binary then of course it's slow.
> 
> And SUD is O(1).  The filtering overhead is the big point here.  The

OK, but for practical purposes O(log n) == O(1).

> >> 2) Seccomp filters cannot be removed.  And we'd need to update them
> >> frequently.
> >
> > What are the updating requirements?
> 
> As far as I understand (I'm not a wine developer), they need to remove
> and modify filters.  Given seccomp is a security feature, It would be a
> hard sell to support these operations. We discussed this on the kernel
> list.
> 
> > I'm not sure if Windows code is properly PIC or not, but if it is,
> > then you just do your own address assignment in a single huge range
> > (first allocated with PROT_NONE, then MAP_FIXED over top of it) so
> > that a single static range check suffices.
> 
> I'm Cc'ing some wine developers who can assist with this point.

Great!

Rich
  
David Laight Nov. 19, 2020, 5:57 p.m. UTC | #6
> > The Windows code is not completely loaded at initialization time.  It
> > also has dynamic libraries loaded later.  yes, wine knows the memory
> > regions, but there is no guarantee there is a small number of segments
> > or that the full picture is known at any given moment.
> 
> Yes, I didn't mean it was known statically at init time (although
> maybe it can be; see below) just that all the code doing the loading
> is under Wine's control (vs having system dynamic linker doing stuff
> it can't reliably see, which is the case with host libraries).

Since wine must itself make the mmap() system calls that make memory
executable can't it arrange for windows code and linux code to be
above/below some critical address?

IIRC 32bit windows has the user/kernel split at 2G, so all the
linux code could be shoe-horned into the top 1GB.

A similar boundary could be picked for 64bit code.

This would probably require flags to mmap() to map above/below
the specified address (is there a flag for the 2G boundary
these days - wine used to do very horrid things).
It might also need a special elf interpreter to load the
wine code itself high.

	David

  
Paul Gofman Nov. 19, 2020, 8:54 p.m. UTC | #7
On 11/19/20 20:57, David Laight wrote:
>>> The Windows code is not completely loaded at initialization time.  It
>>> also has dynamic libraries loaded later.  yes, wine knows the memory
>>> regions, but there is no guarantee there is a small number of segments
>>> or that the full picture is known at any given moment.
>> Yes, I didn't mean it was known statically at init time (although
>> maybe it can be; see below) just that all the code doing the loading
>> is under Wine's control (vs having system dynamic linker doing stuff
>> it can't reliably see, which is the case with host libraries).
> Since wine must itself make the mmap() system calls that make memory
> executable can't it arrange for windows code and linux code to be
> above/below some critical address?
>
> IIRC 32bit windows has the user/kernel split at 2G, so all the
> linux code could be shoe-horned into the top 1GB.
>
> A similar boundary could be picked for 64bit code.
>
> This would probably require flags to mmap() to map above/below
> the specified address (is there a flag for the 2G boundary
> these days - wine used to do very horrid things).
> It might also need a special elf interpreter to load the
> wine code itself high.
>
Wine does not control the loading of native libraries (which are subject
to ASLR and thus do not necessarily follow mmap's top-down order
exactly). Wine is also not free to choose where to load the Windows
libraries. Some Windows libraries are relocatable, some are not. Even
the relocatable ones are often assumed to be loaded at the base address
specified in the PE file, with that assumption made either by the
library itself or by surrounding DRM, sandboxing, hotpatching, or
interception code.

Also, it is very common for DRM to unpack the encrypted code into a newly
allocated segment (which gives no clue at the moment of allocation
whether it is going to be executable later), and then make it
executable. There are a lot of tricks around that, and such code sometimes
assumes very specific (and Windows implementation dependent) things, in
particular about the memory layout. Windows VirtualAlloc[Ex] provides a
way to request top-down or bottom-up allocation order, as well as a
specific allocation address. The latter is of course not guaranteed to
succeed, just like on Linux, but if specific (high) address ranges
always have some space available on Windows, then there are apps in the
wild which depend on that, as far as our experience goes.

If we were given an mmap flag for specifying a memory allocation boundary,
and also some process-wide dlopen() config option for applying
that boundary to every host shared library load, the address space
separation could probably work... until we hit a tricky case where the
app wants memory in a specifically high address range. I think we
can't do that cleanly, as both Windows and Linux currently have the same
128TB limit for user address space on x64, and we've got no spare space
to safely put native code without potential interference with Windows code.
  
Paul Gofman Nov. 19, 2020, 9:19 p.m. UTC | #8
On 11/19/20 23:54, Paul Gofman wrote:
> On 11/19/20 20:57, David Laight wrote:
>>>> The Windows code is not completely loaded at initialization time.  It
>>>> also has dynamic libraries loaded later.  yes, wine knows the memory
>>>> regions, but there is no guarantee there is a small number of segments
>>>> or that the full picture is known at any given moment.
>>> Yes, I didn't mean it was known statically at init time (although
>>> maybe it can be; see below) just that all the code doing the loading
>>> is under Wine's control (vs having system dynamic linker doing stuff
>>> it can't reliably see, which is the case with host libraries).
>> Since wine must itself make the mmap() system calls that make memory
>> executable can't it arrange for windows code and linux code to be
>> above/below some critical address?
>>
>> IIRC 32bit windows has the user/kernel split at 2G, so all the
>> linux code could be shoe-horned into the top 1GB.
>>
>> A similar boundary could be picked for 64bit code.
>>
>> This would probably require flags to mmap() to map above/below
>> the specified address (is there a flag for the 2G boundary
>> these days - wine used to do very horrid things).
>> It might also need a special elf interpreter to load the
>> wine code itself high.
>>
> Wine does not control the loading of native libraries (which are subject
> to ASLR and thus do not necessarily exactly follow mmap's top down
> order). Wine is also not free to choose where to load the Windows
> libraries. Some of Win libraries are relocatable, some are not. Even
> those relocatable are still often assumed to be loaded at the base
> address specified in PE, with assumption made either by library itself
> or DRM or sandboxing / hotpatching / interception code from around.
>
> Also, it is very common to DRMs to unpack the encrypted code to a newly
> allocated segment (which gives no clue at the moment of allocation
> whether it is going to be executable later), and then make it
> executable. There are a lot of tricks about that and such code sometimes
> assumes very specific (and Windows implementation dependent) things, in
> particular, about the memory layout. Windows VirtualAlloc[Ex] gives the
> way to request top down or bottom up allocation order, as well as
> specific allocation address. The latter is not guaranteed to succeed of
> course just like on Linux for obvious reasons, but if specific (high)
> address ranges always have some space available on Windows, then there
> are the apps in the wild which depend of that, as far as our practice goes.
>
> If we were given mmap flag for specifying memory allocation boundary,
> and also a sort of process-wide dlopen() config option for specifying
> that boundary for every host shared library load, the address space
> separation could probably work... until we hit a tricky case when the
> app wants to get a memory specifically high address range. I think we
> can't do that cleanly as both Windows and Linux currently have the same
> 128TB limit for user address space on x64 and we've got no spare space
> to safely put native code without potential interference with Windows code.
>
It may also be worth mentioning that the initial version of Gabriel's
patches introduced the emulation trigger by specifying a flag for a
memory region through mprotect(), so we could mark the regions from
which calls should be trapped. That would probably be the easiest
possible solution to use in Wine (as no memory allocated by Wine itself
is supposed to contain native host syscalls), but that idea was not
accepted, mainly because, as I understand it, such functionality does
not belong in VM management.
  

Patch

diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..e2fb36926f97
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process.  Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters.  Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace.  The application is in control of a flip
+switch, indicating the current personality of the process.  A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crossings, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes.  Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
+doesn't rely on any of the syscall ABI to make the filtering.  It uses
+only the syscall dispatcher address and the userspace key.
+
+Interface
+---------
+
+A process can set up this mechanism on supported kernels
+(CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread.  When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+<offset> and <offset+length> delimit a closed memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector.  This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for architectures that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+that provides a quick way to enable/disable syscall redirection
+thread-wide, without the need to invoke the kernel directly.  selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process.  It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value.  If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
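
A minimal sketch of how a thread might use the interface documented
above. The sud_* names are hypothetical helpers, and the numeric fallback
values are taken from mainline linux/prctl.h, which is an assumption
relative to this revision of the series. On kernels without the feature
the prctl is expected to fail and set errno. Passing 0 for both <offset>
and <length> (no always-allowed region) is also an assumption; a real user
would pass a region covering at least the signal return trampoline.

```c
/* Sketch only: enable Syscall User Dispatch for the calling thread. */
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed from mainline linux/prctl.h. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Per-thread selector byte that the kernel reads on each syscall. */
static __thread volatile char sud_selector = PR_SYS_DISPATCH_OFF;

/* Turn the mechanism on; syscalls from [offset, offset+length] bypass it. */
int sud_enable(unsigned long offset, unsigned long length)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH,
                 (unsigned long)PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)&sud_selector);
}

/* Turn the mechanism off; the remaining arguments must be zero. */
int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH,
                 (unsigned long)PR_SYS_DISPATCH_OFF, 0UL, 0UL, 0UL);
}

/* Crossing the compatibility layer boundary is a plain store, no syscall. */
static inline void sud_enter_guest(void) { sud_selector = PR_SYS_DISPATCH_ON; }
static inline void sud_leave_guest(void) { sud_selector = PR_SYS_DISPATCH_OFF; }
```

With the selector set by sud_enter_guest(), any syscall issued outside
the allowed region raises SIGSYS, so a SIGSYS handler must be installed
first, and it has to call sud_leave_guest() (or run inside the allowed
region) before it issues syscalls or returns through sigreturn.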