[resend] MIPS: Allow FPU emulator to use non-stack area.

Message ID 1412627010-4311-1-git-send-email-ddaney.cavm@gmail.com
State Not applicable

Commit Message

David Daney Oct. 6, 2014, 8:23 p.m. UTC
  From: David Daney <david.daney@cavium.com>

In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.

We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.

Background:

MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel.  Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions.  Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.

Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack.  It is de facto,
because it is not written anywhere that this must be the case, but it
is nevertheless a requirement.

Problem:

How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?

Since userspace desires to change the ABI, put some of the onus on the
userspace code.  Any userspace thread desiring a non-executable stack
must allocate a 4-byte aligned area at least 8 bytes long that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.

This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.

Signed-off-by: David Daney <david.daney@cavium.com>
---

First attempt to libc-alpha@ failed due to anti-spam technology,
reattempting to a reduced list of recipients.

This patch has only been compile tested, and lacks the userspace
component.  It is presented as an alternate approach to the recently
proposed MIPS non-executable stack patches posted here:

http://www.linux-mips.org/archives/linux-mips/2014-10/msg00024.html

 arch/mips/include/asm/thread_info.h |  2 ++
 arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
 arch/mips/kernel/process.c          |  1 +
 arch/mips/kernel/scall32-o32.S      |  1 +
 arch/mips/kernel/scall64-64.S       |  1 +
 arch/mips/kernel/scall64-n32.S      |  1 +
 arch/mips/kernel/scall64-o32.S      |  1 +
 arch/mips/kernel/syscall.c          |  8 ++++++++
 arch/mips/math-emu/dsemul.c         | 11 +++++++----
 9 files changed, 31 insertions(+), 10 deletions(-)
  

Comments

Rich Felker Oct. 6, 2014, 8:54 p.m. UTC | #1
On Mon, Oct 06, 2014 at 01:23:30PM -0700, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
> 
> In order for MIPS to be able to support a non-executable stack, we
> need to supply a method to specify a userspace area that can be used
> for executing emulated branch delay slot instructions.
> 
> We add a new system call, sys_set_fpuemul_xol_area so that userspace
> threads that are using the FPU can specify the location of the FPU
> emulation out of line execution area.
> 
> Background:
> 
> MIPS floating point support requires that any instruction that cannot
> be directly executed by the FPU, be emulated by the kernel.  Part of
> this emulation involves executing non-FPU instructions that fall in
> the delay slots of FP branch instructions.  Since the beginning of
> MIPS/Linux time, this has been done by placing the instructions on the
> userspace thread stack, and executing them there, as the instructions
> must be executed in the MM context of the thread receiving the
> emulation.
> 
> Because of this, the de facto MIPS Linux userspace ABI requires that
> the userspace thread have an executable stack.  It is de facto,
> because it is not written anywhere that this must be the case, but it
> is nevertheless a requirement.
> 
> Problem:
> 
> How do we get MIPS Linux to use a non-executable stack in the face of
> the FPU emulation problem?
> 
> Since userspace desires to change the ABI, put some of the onus on the
> userspace code.  Any userspace thread desiring a non-executable stack
> must allocate a 4-byte aligned area at least 8 bytes long that
> has read/write/execute permissions and pass the address of that area
> to the kernel with the new sys_set_fpuemul_xol_area system call.
> 
> This is similar to how we require userspace to notify the kernel of
> the value of the thread local pointer.

Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation. The kernel is perfectly capable of mapping
an appropriate page. The mapping should happen at exec time, and at
clone time with CLONE_VM unless the kernel is going to handle mutual
exclusion so that only one thread can be using the page at a time.
(Using one page for the whole process, and excluding simultaneous
execution of fpu emulation in multiple threads, may be the more
practical approach.)

As an alternative, if the space of possible instructions with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.

Rich
  
David Daney Oct. 6, 2014, 9:18 p.m. UTC | #2
On 10/06/2014 01:54 PM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 01:23:30PM -0700, David Daney wrote:
>> From: David Daney <david.daney@cavium.com>
>>
>> In order for MIPS to be able to support a non-executable stack, we
>> need to supply a method to specify a userspace area that can be used
>> for executing emulated branch delay slot instructions.
>>
>> We add a new system call, sys_set_fpuemul_xol_area so that userspace
>> threads that are using the FPU can specify the location of the FPU
>> emulation out of line execution area.
>>
>> Background:
>>
>> MIPS floating point support requires that any instruction that cannot
>> be directly executed by the FPU, be emulated by the kernel.  Part of
>> this emulation involves executing non-FPU instructions that fall in
>> the delay slots of FP branch instructions.  Since the beginning of
>> MIPS/Linux time, this has been done by placing the instructions on the
>> userspace thread stack, and executing them there, as the instructions
>> must be executed in the MM context of the thread receiving the
>> emulation.
>>
>> Because of this, the de facto MIPS Linux userspace ABI requires that
>> the userspace thread have an executable stack.  It is de facto,
>> because it is not written anywhere that this must be the case, but it
>> is nevertheless a requirement.
>>
>> Problem:
>>
>> How do we get MIPS Linux to use a non-executable stack in the face of
>> the FPU emulation problem?
>>
>> Since userspace desires to change the ABI, put some of the onus on the
>> userspace code.  Any userspace thread desiring a non-executable stack
>> must allocate a 4-byte aligned area at least 8 bytes long that
>> has read/write/execute permissions and pass the address of that area
>> to the kernel with the new sys_set_fpuemul_xol_area system call.
>>
>> This is similar to how we require userspace to notify the kernel of
>> the value of the thread local pointer.
>
> Userspace should play no part in this; requiring userspace to help
> make special accommodations for fpu emulation largely defeats the
> purpose of fpu emulation.

That is certainly one way of looking at it.  Really it is opinion, 
rather than fact though.

GLibc is full of code (see ld.so) that in earlier incarnations of 
Unix/Linux was in kernel space, and was moved to userspace.  Given that 
there is a partitioning of code between kernel space and userspace, I 
think it not totally unreasonable to consider doing some of this in 
userspace.

Even on systems with hardware FPU, the architecture specification allows 
for/requires emulation of certain cases (denormals, etc.)  So it is 
already a requirement that userspace cooperate by always having free 
space below $SP for use by the kernel.  So the current situation is that 
userspace is providing services for the kernel FPU emulator.

My suggestion is to change the nature of the way these services are 
provided by the userspace program.

> The kernel is perfectly capable of mapping
> an appropriate page. The mapping should happen at exec time,  and at
> clone time with CLONE_VM

Why?  This adds overhead for threads that don't use the FPU.  So this 
suggestion adds at least one page of memory overhead for each thread in 
the system (unless I misunderstand what you are saying).

> unless the kernel is going to handle mutual
> exclusion so that only one thread can be using the page at a time.
> (Using one page for the whole process, and excluding simultaneous
> execution of fpu emulation in multiple threads, may be the more
> practical approach.)
>
> As an alternative, if the space of possible instructions with a delay
> slot is sufficiently small, all such instructions could be mapped as
> immutable code in a shared mapping, each at a fixed offset in the
> mapping. I suspect this would be borderline-impractical (multiple
> megabytes?), but it is the cleanest solution otherwise.
>

Yes, there are 2^32 possible instructions.  Each one is 4 bytes, plus 
you need a way to exit after the instruction has executed, which would 
require another instruction.  So you would need 32GB of memory to hold 
all those instructions, larger than the 32-bit virtual address space.
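
The arithmetic behind that figure, as a small sketch.  The function
name xol_table_bytes() is made up for illustration; the inputs are the
ones stated above (2^32 encodings, 4-byte instructions, two
instructions per entry).

```c
#include <stdint.h>

/*
 * Bytes needed for a table with one entry per encoding of an
 * encoding_bits-wide instruction word, where each entry holds
 * insns_per_slot 4-byte instructions (the instruction itself plus
 * one more to exit).  encoding_bits is assumed to be at most 32.
 */
uint64_t xol_table_bytes(unsigned encoding_bits, unsigned insns_per_slot)
{
	const uint64_t insn_bytes = 4;

	return (UINT64_C(1) << encoding_bits) * insns_per_slot * insn_bytes;
}
```

With encoding_bits = 32 and insns_per_slot = 2 this is 2^32 * 8 = 2^35
bytes, i.e. 32 GiB, against a 32-bit virtual address space of 2^32
bytes (4 GiB).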

> Rich
>
  
Rich Felker Oct. 6, 2014, 9:31 p.m. UTC | #3
On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
> >Userspace should play no part in this; requiring userspace to help
> >make special accommodations for fpu emulation largely defeats the
> >purpose of fpu emulation.
> 
> That is certainly one way of looking at it.  Really it is opinion,
> rather than fact though.

It's an opinion, yes, but it has substantial reason behind it.

> GLibc is full of code (see ld.so) that in earlier incarnations of
> Unix/Linux was in kernel space, and was moved to userspace.  Given
> that there is a partitioning of code between kernel space and
> userspace, I think it not totally unreasonable to consider doing
> some of this in userspace.
> 
> Even on systems with hardware FPU, the architecture specification
> allows for/requires emulation of certain cases (denormals, etc.)  So
> it is already a requirement that userspace cooperate by always
> having free space below $SP for use by the kernel.  So the current
> situation is that userspace is providing services for the kernel FPU
> emulator.
> 
> My suggestion is to change the nature of the way these services are
> provided by the userspace program.

But this isn't set up by the userspace program. It's set up by the
kernel on program entry. Despite that, though, I think it's an
unnecessary (and undocumented!) constraint; the fact that it requires
the stack to be executable makes it even more harmful and
inappropriate.

> >The kernel is perfectly capable of mapping
> >an appropriate page. The mapping should happen at exec time,  and at
> >clone time with CLONE_VM
> 
> Why?  This adds overhead for threads that don't use the FPU.  So
> this suggestion adds at least one page of memory overhead for each
> thread in the system (unless I misunderstand what you are saying).

Yes, that's why I think the mutual-exclusion approach might be
preferred. But if you're going to use per-thread areas for this, they
MUST be allocated at thread-creation time, since that's the only time
you can handle error (by failing pthread_create). If you do it lazily,
it might fail and there's no way to recover. And there's no way to
know in advance whether a thread will invoke floating point code, so
you have to set it up for every thread.

> >unless the kernel is going to handle mutual
> >exclusion so that only one thread can be using the page at a time.
> >(Using one page for the whole process, and excluding simultaneous
> >execution of fpu emulation in multiple threads, may be the more
> >practical approach.)
> >
> >As an alternative, if the space of possible instructions with a delay
> >slot is sufficiently small, all such instructions could be mapped as
> >immutable code in a shared mapping, each at a fixed offset in the
> >mapping. I suspect this would be borderline-impractical (multiple
> >megabytes?), but it is the cleanest solution otherwise.
> >
> 
> Yes, there are 2^32 possible instructions.  Each one is 4 bytes,
> plus you need a way to exit after the instruction has executed,
> which would require another instruction.  So you would need 32GB of
> memory to hold all those instructions, larger than the 32-bit
> virtual address space.

There are not 2^32 instructions that have delay slots after them. Only
branch instructions have delay slots. The space of such instructions is
much smaller, probably on the order of 64-256 MB, not 32GB, but I
haven't looked at the instruction encoding tables to confirm this.

Rich
  
David Daney Oct. 6, 2014, 9:45 p.m. UTC | #4
On 10/06/2014 02:31 PM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
>>> Userspace should play no part in this; requiring userspace to help
>>> make special accommodations for fpu emulation largely defeats the
>>> purpose of fpu emulation.
>>
>> That is certainly one way of looking at it.  Really it is opinion,
>> rather than fact though.
>
> It's an opinion, yes, but it has substantial reason behind it.
>
>> GLibc is full of code (see ld.so) that in earlier incarnations of
>> Unix/Linux was in kernel space, and was moved to userspace.  Given
>> that there is a partitioning of code between kernel space and
>> userspace, I think it not totally unreasonable to consider doing
>> some of this in userspace.
>>
>> Even on systems with hardware FPU, the architecture specification
>> allows for/requires emulation of certain cases (denormals, etc.)  So
>> it is already a requirement that userspace cooperate by always
>> having free space below $SP for use by the kernel.  So the current
>> situation is that userspace is providing services for the kernel FPU
>> emulator.
>>
>> My suggestion is to change the nature of the way these services are
>> provided by the userspace program.
>
> But this isn't set up by the userspace program. It's set up by the
> kernel on program entry. Despite that, though, I think it's an
> unnecessary (and undocumented!) constraint; the fact that it requires
> the stack to be executable makes it even more harmful and
> inappropriate.
>

The management of the stack is absolutely done by userspace code.  Any 
time you do pthread_create(), userspace code does mmap() to allocate the 
stack area, it then sets permissions on the area, and then it passes the 
address of the area to clone().  Furthermore the userspace code has to 
be very careful in its use of the $sp register, so that it doesn't store 
data in places that will be used/clobbered by the kernel.

All of this is under the control of the userspace program and done with 
userspace code.

I appreciate the fact that libc authors might prefer *not* to write more 
code, but they could, especially if they wanted to add the feature of 
non-executable stacks to their library implementation.

David Daney
  
Rich Felker Oct. 6, 2014, 9:58 p.m. UTC | #5
On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
> On 10/06/2014 02:31 PM, Rich Felker wrote:
> >On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
> >>>Userspace should play no part in this; requiring userspace to help
> >>>make special accommodations for fpu emulation largely defeats the
> >>>purpose of fpu emulation.
> >>
> >>That is certainly one way of looking at it.  Really it is opinion,
> >>rather than fact though.
> >
> >It's an opinion, yes, but it has substantial reason behind it.
> >
> >>GLibc is full of code (see ld.so) that in earlier incarnations of
> >>Unix/Linux was in kernel space, and was moved to userspace.  Given
> >>that there is a partitioning of code between kernel space and
> >>userspace, I think it not totally unreasonable to consider doing
> >>some of this in userspace.
> >>
> >>Even on systems with hardware FPU, the architecture specification
> >>allows for/requires emulation of certain cases (denormals, etc.)  So
> >>it is already a requirement that userspace cooperate by always
> >>having free space below $SP for use by the kernel.  So the current
> >>situation is that userspace is providing services for the kernel FPU
> >>emulator.
> >>
> >>My suggestion is to change the nature of the way these services are
> >>provided by the userspace program.
> >
> >But this isn't set up by the userspace program. It's set up by the
> >kernel on program entry. Despite that, though, I think it's an
> >unnecessary (and undocumented!) constraint; the fact that it requires
> >the stack to be executable makes it even more harmful and
> >inappropriate.
> >
> 
> The management of the stack is absolutely done by userspace code.
> Any time you do pthread_create(), userspace code does mmap() to
> allocate the stack area, it then sets permissions on the area, and
> then it passes the address of the area to clone().

This is hardly management.

> Furthermore the
> userspace code has to be very careful in its use of the $sp
> register, so that it doesn't store data in places that will be
> used/clobbered by the kernel.

This is not "being careful". The stack pointer can never become
invalid unless you do wacky things in asm or invoke UB.

> All of this is under the control of the userspace program and done
> with userspace code.

For the most part it just happens by default. There is no particular
intentionality needed, and certainly no hideous MIPS-specific hacks
needed.

> I appreciate the fact that libc authors might prefer *not* to write
> more code, but they could, especially if they wanted to add the
> feature of non-executable stacks to their library implementation.

So your position is that:

1. A non-exec-stack system can only run new code produced to do extra
   stuff in userspace.

2. The startup code needs to do special work in userspace on MIPS to
   set up an executable area for fpu emulation.

3. Every call to clone/CLONE_VM needs to be accompanied by a call to
   mmap and this new syscall to set the address, and every call to
   SYS_exit needs to be accompanied by a call to munmap for the
   corresponding mapping.

This is a huge ill-designed mess.

Rich
  
David Daney Oct. 6, 2014, 10:17 p.m. UTC | #6
On 10/06/2014 02:58 PM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>> On 10/06/2014 02:31 PM, Rich Felker wrote:
>>> On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
>>>>> Userspace should play no part in this; requiring userspace to help
>>>>> make special accommodations for fpu emulation largely defeats the
>>>>> purpose of fpu emulation.
>>>>
>>>> That is certainly one way of looking at it.  Really it is opinion,
>>>> rather than fact though.
>>>
>>> It's an opinion, yes, but it has substantial reason behind it.
>>>
>>>> GLibc is full of code (see ld.so) that in earlier incarnations of
>>>> Unix/Linux was in kernel space, and was moved to userspace.  Given
>>>> that there is a partitioning of code between kernel space and
>>>> userspace, I think it not totally unreasonable to consider doing
>>>> some of this in userspace.
>>>>
>>>> Even on systems with hardware FPU, the architecture specification
>>>> allows for/requires emulation of certain cases (denormals, etc.)  So
>>>> it is already a requirement that userspace cooperate by always
>>>> having free space below $SP for use by the kernel.  So the current
>>>> situation is that userspace is providing services for the kernel FPU
>>>> emulator.
>>>>
>>>> My suggestion is to change the nature of the way these services are
>>>> provided by the userspace program.
>>>
>>> But this isn't set up by the userspace program. It's set up by the
>>> kernel on program entry. Despite that, though, I think it's an
>>> unnecessary (and undocumented!) constraint; the fact that it requires
>>> the stack to be executable makes it even more harmful and
>>> inappropriate.
>>>
>>
>> The management of the stack is absolutely done by userspace code.
>> Any time you do pthread_create(), userspace code does mmap() to
>> allocate the stack area, it then sets permissions on the area, and
>> then it passes the address of the area to clone().
>
> This is hardly management.
>
>> Furthermore the
>> userspace code has to be very careful in its use of the $sp
>> register, so that it doesn't store data in places that will be
>> used/clobbered by the kernel.
>
> This is not "being careful". The stack pointer can never become
> invalid unless you do wacky things in asm or invoke UB.
>
>> All of this is under the control of the userspace program and done
>> with userspace code.
>
> For the most part it just happens by default. There is no particular
> intentionality needed, and certainly no hideous MIPS-specific hacks
> needed.
>

Yes, it happens by default.  But it wasn't magic.  It took careful work 
by the ABI and toolchain designers to make it work.


>> I appreciate the fact that libc authors might prefer *not* to write
>> more code, but they could, especially if they wanted to add the
>> feature of non-executable stacks to their library implementation.
>
> So your position is that:

It is not really a position that I have.  Rather a proposal for one 
possible way to make non-executable stacks work on MIPS.

>
> 1. A non-exec-stack system can only run new code produced to do extra
>     stuff in userspace.

Any non-executable stack solution for MIPS will require changes to the 
toolchain/libc.  So it is merely a question of what form the change 
should take.


>
> 2. The startup code needs to do special work in userspace on MIPS to
>     set up an executable area for fpu emulation.

Yes. Similar to how startup code has to do special work to set up the 
TLS areas, and load shared libraries.

>
> 3. Every call to clone/CLONE_VM needs to be accompanied by a call to
>     mmap and this new syscall to set the address, and every call to
>     SYS_exit needs to be accompanied by a call to munmap for the
>     corresponding mapping.
>

No, we don't have to mmap() on each thread creation.  Many threads 
(perhaps 512) could be handled by a single page, so the normal case 
would be a single mmap() for the life of the program.
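
A sketch of where the "perhaps 512" figure can come from, assuming
8-byte areas (the minimum size in the commit message) packed into one
4096-byte page.  The page size is an assumption, since MIPS kernels can
be configured with larger pages, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define XOL_SLOT_BYTES 8u	/* minimum area size from the commit message */

/* Number of per-thread areas that fit in one page of page_size bytes. */
size_t xol_slots_per_page(size_t page_size)
{
	return page_size / XOL_SLOT_BYTES;
}

/*
 * Address of area number idx inside a page mapped at base.  The caller
 * must keep idx below xol_slots_per_page() for that page.
 */
uintptr_t xol_slot_addr(uintptr_t base, size_t idx)
{
	return base + idx * XOL_SLOT_BYTES;
}
```

Every address produced this way is 8-byte aligned, which also satisfies
the 4-byte alignment requirement.  Handing out and reclaiming slot
indices as threads start and exit is left out of the sketch.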


> This is a huge ill-designed mess.
>

Have you seen the alternatives?

Have you ever wondered why MIPS doesn't have non-executable stack support?

> Rich
>
  
Rich Felker Oct. 6, 2014, 11:08 p.m. UTC | #7
On Mon, Oct 06, 2014 at 03:17:03PM -0700, David Daney wrote:
> >>Furthermore the
> >>userspace code has to be very careful in its use of the $sp
> >>register, so that it doesn't store data in places that will be
> >>used/clobbered by the kernel.
> >
> >This is not "being careful". The stack pointer can never become
> >invalid unless you do wacky things in asm or invoke UB.
> >
> >>All of this is under the control of the userspace program and done
> >>with userspace code.
> >
> >For the most part it just happens by default. There is no particular
> >intentionality needed, and certainly no hideous MIPS-specific hacks
> >needed.
> 
> Yes, it happens by default.  But it wasn't magic.  It took careful
> work by the ABI and toolchain designers to make it work.

Here I disagree. All of these things are completely universal, not
MIPS-specific.

> >>I appreciate the fact that libc authors might prefer *not* to write
> >>more code, but they could, especially if they wanted to add the
> >>feature of non-executable stacks to their library implementation.
> >
> >So your position is that:
> 
> It is not really a position that I have.  Rather a proposal for one
> possible way to make non-executable stacks work on MIPS.
> 
> >
> >1. A non-exec-stack system can only run new code produced to do extra
> >    stuff in userspace.
> 
> Any non-executable stack solution for MIPS will require changes to
> the toolchain/libc.  So it is merely a question of what form the
> change should take.

I disagree with this, at least for the most part. If the kernel does
the fpu emulation correctly, there's no reason it shouldn't be
possible to run existing binaries on a hardened kernel that does not
even support executable stack.

> >2. The startup code needs to do special work in userspace on MIPS to
> >    set up an executable area for fpu emulation.
> 
> Yes. Similar to how startup code has to do special work to set up
> the TLS areas,

Yes. Actually the simple way to implement this in userspace would be
with a page-sized, page-aligned object in TLS and a special call to
mprotect and your new syscall. One thing I'm not clear on: should this
memory have permissions r-x or rwx? If it has rwx, that defeats a lot
of the purpose of non-executable-stack. Hopefully it's r-x and the
kernel bypasses the non-writability to write to it.

> and load shared libraries.

Dynamic linking is completely a separate matter. Not all programs are
even dynamic-linked.

> >3. Every call to clone/CLONE_VM needs to be accompanied by a call to
> >    mmap and this new syscall to set the address, and every call to
> >    SYS_exit needs to be accompanied by a call to munmap for the
> >    corresponding mapping.
> >
> 
> No, we don't have to mmap() on each thread creation.  Many threads
> (perhaps 512) could be handled by a single page, so the normal case
> would be a single mmap() for the life of the program.

That's nice from a standpoint of avoiding memory waste, but it's
problematic if [...]

> >This is a huge ill-designed mess.
> >
> 
> Have you seen the alternatives?

I proposed a couple and I think they're much less ugly. Could you
point me to the others?

But perhaps you could clarify one thing for me: why is any of this
even needed? A delay slot only exists for branch instructions, and I
can't see any reason the kernel can't just emulate the branch
instruction at the same time. This is a very restricted class of
instructions that should not require any complex emulation of memory
permissions, just manipulation of the resulting program counter value
after the floating point instruction finishes. Or am I missing
something?

> Have you ever wondered why MIPS doesn't have non-executable stack support?

I wasn't even aware that it didn't until your email.

Rich
  
Andy Lutomirski Oct. 6, 2014, 11:38 p.m. UTC | #8
On 10/06/2014 02:58 PM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>> On 10/06/2014 02:31 PM, Rich Felker wrote:
>>> On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
>>>>> Userspace should play no part in this; requiring userspace to help
>>>>> make special accommodations for fpu emulation largely defeats the
>>>>> purpose of fpu emulation.
>>>>
>>>> That is certainly one way of looking at it.  Really it is opinion,
>>>> rather than fact though.
>>>
>>> It's an opinion, yes, but it has substantial reason behind it.
>>>
>>>> GLibc is full of code (see ld.so) that in earlier incarnations of
>>>> Unix/Linux was in kernel space, and was moved to userspace.  Given
>>>> that there is a partitioning of code between kernel space and
>>>> userspace, I think it not totally unreasonable to consider doing
>>>> some of this in userspace.
>>>>
>>>> Even on systems with hardware FPU, the architecture specification
>>>> allows for/requires emulation of certain cases (denormals, etc.)  So
>>>> it is already a requirement that userspace cooperate by always
>>>> having free space below $SP for use by the kernel.  So the current
>>>> situation is that userspace is providing services for the kernel FPU
>>>> emulator.
>>>>
>>>> My suggestion is to change the nature of the way these services are
>>>> provided by the userspace program.
>>>
>>> But this isn't set up by the userspace program. It's set up by the
>>> kernel on program entry. Despite that, though, I think it's an
>>> unnecessary (and undocumented!) constraint; the fact that it requires
>>> the stack to be executable makes it even more harmful and
>>> inappropriate.
>>>
>>
>> The management of the stack is absolutely done by userspace code.
>> Any time you do pthread_create(), userspace code does mmap() to
>> allocate the stack area, it then sets permissions on the area, and
>> then it passes the address of the area to clone().
> 
> This is hardly management.
> 
>> Furthermore the
>> userspace code has to be very careful in its use of the $sp
>> register, so that it doesn't store data in places that will be
>> used/clobbered by the kernel.
> 
> This is not "being careful". The stack pointer can never become
> invalid unless you do wacky things in asm or invoke UB.

I disagree a bit here.  There are runtimes that aren't libc or even C at
all.  See, for example, Go.  (Ugh!)

What happens if a signal happens while executing from this magic
trampoline?  Allocation of another one?  Crash on return from the outer
trampoline invocation?

Also, if this ends up being solved with a hack of this type, please do
it right: have *two* aliases of the trampoline, one writable, and one
executable (unless the MIPS kernel can bypass write-protection).
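
One way the two-alias idea could look from userspace, as a rough
sketch only: the same pages are mapped twice, once read/write and once
read/execute, so no single mapping is both writable and executable.
The struct and function names are hypothetical, and backing the pages
with an unlinked temporary file under /tmp is an assumption made to
keep the example portable.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

struct xol_alias {
	unsigned char *w;	/* read/write view */
	const unsigned char *x;	/* read/execute view of the same pages */
	size_t len;
};

/*
 * Map the same unlinked temporary file twice, once read/write and once
 * read/execute.  Returns 0 on success, -1 on failure.  len should be a
 * multiple of the page size.
 */
int xol_alias_create(struct xol_alias *a, size_t len)
{
	char path[] = "/tmp/xol-XXXXXX";
	int fd = mkstemp(path);
	void *w, *x;

	if (fd < 0)
		return -1;
	unlink(path);		/* the name is no longer needed */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	x = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
	close(fd);		/* the mappings keep the pages alive */
	if (w == MAP_FAILED || x == MAP_FAILED) {
		if (w != MAP_FAILED)
			munmap(w, len);
		if (x != MAP_FAILED)
			munmap(x, len);
		return -1;
	}
	a->w = w;
	a->x = x;
	a->len = len;
	return 0;
}
```

Bytes stored through a.w become visible through a.x.  The sketch does
not execute anything from the executable view; doing so on real
hardware would presumably also need the instruction cache to be
synchronized after each write.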


> 
>> All of this is under the control of the userspace program and done
>> with userspace code.
> 
> For the most part it just happens by default. There is no particular
> intentionality needed, and certainly no hideous MIPS-specific hacks
> needed.
> 
>> I appreciate the fact that libc authors might prefer *not* to write
>> more code, but they could, especially if they wanted to add the
>> feature of non-executable stacks to their library implementation.
> 
> So your position is that:
> 
> 1. A non-exec-stack system can only run new code produced to do extra
>    stuff in userspace.
> 
> 2. The startup code needs to do special work in userspace on MIPS to
>    set up an executable area for fpu emulation.
> 
> 3. Every call to clone/CLONE_VM needs to be accompanied by a call to
>    mmap and this new syscall to set the address, and every call to
>    SYS_exit needs to be accompanied by a call to munmap for the
>    corresponding mapping.
> 
> This is a huge ill-designed mess.

Amen.

Can the kernel not just emulate the instructions directly?  Can it
single-step through them in place?

FWIW, I have considered playing trampoline games like this on x86.  It's
a giant bloody mess, and it will almost certainly never happen, even
though the performance win is dramatic.  No, you don't want to know why. [1]

[1] If you actually want to know, imagine returning from a page fault
with sysret.

--Andy
  
David Daney Oct. 6, 2014, 11:48 p.m. UTC | #9
On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
> On 10/06/2014 02:58 PM, Rich Felker wrote:
>> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
[...]
>> This is a huge ill-designed mess.
>
> Amen.
>
> Can the kernel not just emulate the instructions directly?

In theory it could, but since there can be implementation defined 
instructions, there is no way to achieve full instruction set coverage 
for all possible machines.

>  Can it single-step through them in place?

No.  If it could, we wouldn't be having this informative discussion.
  
Andy Lutomirski Oct. 6, 2014, 11:54 p.m. UTC | #10
On Mon, Oct 6, 2014 at 4:48 PM, David Daney <ddaney@caviumnetworks.com> wrote:
> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>>
>> On 10/06/2014 02:58 PM, Rich Felker wrote:
>>>
>>> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>
> [...]
>>>
>>> This is a huge ill-designed mess.
>>
>>
>> Amen.
>>
>> Can the kernel not just emulate the instructions directly?
>
>
> In theory it could, but since there can be implementation defined
> instructions, there is no way to achieve full instruction set coverage for
> all possible machines.

Can modern user code just avoid constructs that require this kind of
trampoline hack?  If so, can this be solved the same way that x86
added no-exec stacks?  (I.e. mark all the binaries as supporting
non-executable stacks and letting them crash if they screw it up.)

Knowing very little about MIPS, it sounds like this is the kernel
compensating for a dumb assembler.

--Andy
  
Rich Felker Oct. 7, 2014, 12:05 a.m. UTC | #11
On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
> >On 10/06/2014 02:58 PM, Rich Felker wrote:
> >>On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
> [...]
> >>This is a huge ill-designed mess.
> >
> >Amen.
> >
> >Can the kernel not just emulate the instructions directly?
> 
> In theory it could, but since there can be implementation defined
> instructions, there is no way to achieve full instruction set
> coverage for all possible machines.

Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.

Rich
  
Andrew Pinski Oct. 7, 2014, 12:11 a.m. UTC | #12
On Mon, Oct 6, 2014 at 5:05 PM, Rich Felker <dalias@libc.org> wrote:
> On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
>> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>> >On 10/06/2014 02:58 PM, Rich Felker wrote:
>> >>On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>> [...]
>> >>This is a huge ill-designed mess.
>> >
>> >Amen.
>> >
>> >Can the kernel not just emulate the instructions directly?
>>
>> In theory it could, but since there can be implementation defined
>> instructions, there is no way to achieve full instruction set
>> coverage for all possible machines.
>
> Is the issue really implementation-defined instructions with delay
> slots? If so it sounds like a made-up issue. They're not going to
> occur in real binaries. Certainly a compiler is not going to generate
> implementation-defined instructions, and if you're writing the asm by
> hand, you just don't put floating point instructions in the delay
> slot.

It is not the instruction with delay slot but rather the instruction
in the delay slot itself.

Thanks,
Andrew
  
Rich Felker Oct. 7, 2014, 12:21 a.m. UTC | #13
On Mon, Oct 06, 2014 at 05:11:38PM -0700, Andrew Pinski wrote:
> On Mon, Oct 6, 2014 at 5:05 PM, Rich Felker <dalias@libc.org> wrote:
> > On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
> >> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
> >> >On 10/06/2014 02:58 PM, Rich Felker wrote:
> >> >>On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
> >> [...]
> >> >>This is a huge ill-designed mess.
> >> >
> >> >Amen.
> >> >
> >> >Can the kernel not just emulate the instructions directly?
> >>
> >> In theory it could, but since there can be implementation defined
> >> instructions, there is no way to achieve full instruction set
> >> coverage for all possible machines.
> >
> > Is the issue really implementation-defined instructions with delay
> > slots? If so it sounds like a made-up issue. They're not going to
> > occur in real binaries. Certainly a compiler is not going to generate
> > implementation-defined instructions, and if you're writing the asm by
> > hand, you just don't put floating point instructions in the delay
> > slot.
> 
> It is not the instruction with delay slot but rather the instruction
> in the delay slot itself.

An instruction in the delay slot for the instruction being emulated?
How would that arise? Are there floating point instructions with delay
slots?

Rich
  
Andrew Pinski Oct. 7, 2014, 12:28 a.m. UTC | #14
On Mon, Oct 6, 2014 at 5:21 PM, Rich Felker <dalias@libc.org> wrote:
> On Mon, Oct 06, 2014 at 05:11:38PM -0700, Andrew Pinski wrote:
>> On Mon, Oct 6, 2014 at 5:05 PM, Rich Felker <dalias@libc.org> wrote:
>> > On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
>> >> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>> >> >On 10/06/2014 02:58 PM, Rich Felker wrote:
>> >> >>On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>> >> [...]
>> >> >>This is a huge ill-designed mess.
>> >> >
>> >> >Amen.
>> >> >
>> >> >Can the kernel not just emulate the instructions directly?
>> >>
>> >> In theory it could, but since there can be implementation defined
>> >> instructions, there is no way to achieve full instruction set
>> >> coverage for all possible machines.
>> >
>> > Is the issue really implementation-defined instructions with delay
>> > slots? If so it sounds like a made-up issue. They're not going to
>> > occur in real binaries. Certainly a compiler is not going to generate
>> > implementation-defined instructions, and if you're writing the asm by
>> > hand, you just don't put floating point instructions in the delay
>> > slot.
>>
>> It is not the instruction with delay slot but rather the instruction
>> in the delay slot itself.
>
> An instruction in the delay slot for the instruction being emulated?
> How would that arise? Are there floating point instructions with delay
> slots?

Yes, branches.
  
Andy Lutomirski Oct. 7, 2014, 12:29 a.m. UTC | #15
On Mon, Oct 6, 2014 at 5:28 PM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Mon, Oct 6, 2014 at 5:21 PM, Rich Felker <dalias@libc.org> wrote:
>> On Mon, Oct 06, 2014 at 05:11:38PM -0700, Andrew Pinski wrote:
>>> On Mon, Oct 6, 2014 at 5:05 PM, Rich Felker <dalias@libc.org> wrote:
>>> > On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
>>> >> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>>> >> >On 10/06/2014 02:58 PM, Rich Felker wrote:
>>> >> >>On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>>> >> [...]
>>> >> >>This is a huge ill-designed mess.
>>> >> >
>>> >> >Amen.
>>> >> >
>>> >> >Can the kernel not just emulate the instructions directly?
>>> >>
>>> >> In theory it could, but since there can be implementation defined
>>> >> instructions, there is no way to achieve full instruction set
>>> >> coverage for all possible machines.
>>> >
>>> > Is the issue really implementation-defined instructions with delay
>>> > slots? If so it sounds like a made-up issue. They're not going to
>>> > occur in real binaries. Certainly a compiler is not going to generate
>>> > implementation-defined instructions, and if you're writing the asm by
>>> > hand, you just don't put floating point instructions in the delay
>>> > slot.
>>>
>>> It is not the instruction with delay slot but rather the instruction
>>> in the delay slot itself.
>>
>> An instruction in the delay slot for the instruction being emulated?
>> How would that arise? Are there floating point instructions with delay
>> slots?
>
> Yes branches.

I admit I have no idea what's going on here, but I find it hard to
believe that having the kernel fix this up for new code is desirable.
Unless MIPS can round-trip a trap *very* quickly, performance will be
awful for any code that has this problem.

--Andy
  
David Daney Oct. 7, 2014, 12:32 a.m. UTC | #16
On 10/06/2014 05:29 PM, Andy Lutomirski wrote:
> On Mon, Oct 6, 2014 at 5:28 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>> On Mon, Oct 6, 2014 at 5:21 PM, Rich Felker <dalias@libc.org> wrote:
>>> On Mon, Oct 06, 2014 at 05:11:38PM -0700, Andrew Pinski wrote:
>>>> On Mon, Oct 6, 2014 at 5:05 PM, Rich Felker <dalias@libc.org> wrote:
>>>>> On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
>>>>>> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>>>>>>> On 10/06/2014 02:58 PM, Rich Felker wrote:
>>>>>>>> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>>>>>> [...]
>>>>>>>> This is a huge ill-designed mess.
>>>>>>>
>>>>>>> Amen.
>>>>>>>
>>>>>>> Can the kernel not just emulate the instructions directly?
>>>>>>
>>>>>> In theory it could, but since there can be implementation defined
>>>>>> instructions, there is no way to achieve full instruction set
>>>>>> coverage for all possible machines.
>>>>>
>>>>> Is the issue really implementation-defined instructions with delay
>>>>> slots? If so it sounds like a made-up issue. They're not going to
>>>>> occur in real binaries. Certainly a compiler is not going to generate
>>>>> implementation-defined instructions, and if you're writing the asm by
>>>>> hand, you just don't put floating point instructions in the delay
>>>>> slot.
>>>>
>>>> It is not the instruction with delay slot but rather the instruction
>>>> in the delay slot itself.
>>>
>>> An instruction in the delay slot for the instruction being emulated?
>>> How would that arise? Are there floating point instructions with delay
>>> slots?
>>
>> Yes branches.
>
> I admit I have no idea what's going on here, but I find it hard to
> believe that having the kernel fix this up for new code is desirable.
> Unless MIPS can round-trip a trap *very* quickly, performance will be
> awful for any code that has this problem.
>

It is FPU *emulation*; of course the performance will suck.  We don't
care about performance; we just want it to execute correctly.

David Daney
  
David Daney Oct. 7, 2014, 12:33 a.m. UTC | #17
On 10/06/2014 05:05 PM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
>> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>>> On 10/06/2014 02:58 PM, Rich Felker wrote:
>>>> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>> [...]
>>>> This is a huge ill-designed mess.
>>>
>>> Amen.
>>>
>>> Can the kernel not just emulate the instructions directly?
>>
>> In theory it could, but since there can be implementation defined
>> instructions, there is no way to achieve full instruction set
>> coverage for all possible machines.
>
> Is the issue really implementation-defined instructions with delay
> slots?

It is the instructions in the delay slots, not the branch instructions
themselves, that are of interest.  But, for the sake of argument,
this is not a critical point.

> If so it sounds like a made-up issue.

It is not a made-up issue.

If you want an architecture that has a well defined instruction set, 
stick with x86, Intel will tell you what is good for you and you will 
take whatever they give you.

If you want an architecture where you can add implementation defined 
instructions to do whatever you want, then you use an architecture like 
MIPS.

> They're not going to
> occur in real binaries. Certainly a compiler is not going to generate
> implementation-defined instructions,

Why not?  It will emit any instructions we care to make it emit.  If we 
want it to emit crypto instructions with patented algorithms, then it 
will do that.  But we would still like to use a generic kernel with 
generic FPU support.

The most straightforward way (and the currently implemented way) of
doing this is to execute the instructions in question out-of-line (on
the userspace stack).

The question here is:  What is the best way to get to a non-executable
stack?

The consensus among MIPS developers is that we should continue using the 
out-of-line execution trick, but do it somewhere other than in stack memory.

One way of doing this is to have the kernel magically generate thread 
local memory regions.

Another option is to have userspace manage the out-of-line execution areas.

As is often the case, each approach has different pluses and minuses.
  
Andy Lutomirski Oct. 7, 2014, 12:48 a.m. UTC | #18
On Mon, Oct 6, 2014 at 5:33 PM, David Daney <ddaney@caviumnetworks.com> wrote:
> On 10/06/2014 05:05 PM, Rich Felker wrote:
>>
>> On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
>>>
>>> On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
>>>>
>>>> On 10/06/2014 02:58 PM, Rich Felker wrote:
>>>>>
>>>>> On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
>>>
>>> [...]
>>>>>
>>>>> This is a huge ill-designed mess.
>>>>
>>>>
>>>> Amen.
>>>>
>>>> Can the kernel not just emulate the instructions directly?
>>>
>>>
>>> In theory it could, but since there can be implementation defined
>>> instructions, there is no way to achieve full instruction set
>>> coverage for all possible machines.
>>
>>
>> Is the issue really implementation-defined instructions with delay
>> slots?
>
>
> It is the instructions in the delay slots, not the branch instructions
> themselves that are of interest.  But, for the sake of the arguments, this
> is not a critical point.
>
>> If so it sounds like a made-up issue.
>
>
> It is not a made up issue.
>
> If you want an architecture that has a well defined instruction set, stick
> with x86, Intel will tell you what is good for you and you will take
> whatever they give you.
>
> If you want an architecture where you can add implementation defined
> instructions to do whatever you want, then you use an architecture like
> MIPS.
>
>> They're not going to
>> occur in real binaries. Certainly a compiler is not going to generate
>> implementation-defined instructions,
>
>
> Why not?  It will emit any instructions we care to make it emit.  If we want
> it to emit crypto instructions with patented algorithms, then it will do
> that.  But we would still like to use a generic kernel with generic FPU
> support.
>
> The most straight forward way (and the currently implemented way) of doing
> this is to execute the instructions in question out-of-line (on the
> userspace stack).
>
> The question here is:  What is the best way to get to a non-executable
> stack.
>
> The consensus among MIPS developers is that we should continue using the
> out-of-line execution trick, but do it somewhere other than in stack memory.
>
> One way of doing this is to have the kernel magically generate thread local
> memory regions.
>
> Another option is to have userspace manage the out-of-line execution areas.
>
> As is often the case, each approach has different pluses and minuses.

Your patch is still buggy.  Imagine this sequence:

Daft userspace code does:

emulated fp branch to elsewhere (not taken)
insn 1
insn 2

The kernel shoves insn1 and insn2 in this magic trampoline and
re-enters user code there.

An asynchronous signal happens before insn1 executes.

The signal handler runs similar daft code, gets fixed up and returns
*to the now-overwritten trampoline*.  Boom.  This kind of failure mode
is why using any kind of magic trampoline sucks on all architectures.
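
To make the failure concrete, here is a toy C model of that sequence.
It is only a sketch: xol_slot, emulate_into and the placeholder trap
value are made up for illustration and are neither kernel code nor real
MIPS encodings.  It contrasts one fixed per-thread slot with a scheme
that gives each nested emulation its own frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the out-of-line area: the copied delay-slot
 * instruction plus a marker that hands control back to the kernel. */
struct xol_slot {
    uint32_t insn;      /* copied delay-slot instruction */
    uint32_t ret_trap;  /* placeholder, not a real encoding */
};

/* Models the kernel writing one emulation frame into a slot. */
static void emulate_into(struct xol_slot *slot, uint32_t insn)
{
    assert(slot != NULL);
    slot->insn = insn;
    slot->ret_trap = 0xdeadbeefu;
}

/* One fixed slot per thread: a signal handler that needs emulation
 * overwrites the frame the interrupted code has not executed yet. */
static uint32_t resume_with_fixed_slot(uint32_t outer, uint32_t handler)
{
    struct xol_slot slot;

    emulate_into(&slot, outer);    /* outer FP branch gets emulated */
    /* async signal arrives here, before slot.insn has run */
    emulate_into(&slot, handler);  /* handler's FP branch reuses the slot */
    return slot.insn;              /* what the outer code now executes */
}

/* One frame per nesting level: the outer frame survives the handler. */
static uint32_t resume_with_nested_frames(uint32_t outer, uint32_t handler)
{
    struct xol_slot frames[2];

    emulate_into(&frames[0], outer);
    emulate_into(&frames[1], handler); /* lives at a different address */
    return frames[0].insn;
}
```

With a fixed slot the interrupted code resumes on the handler's
instruction; with per-level frames it resumes on its own.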

Even the current code might have the same bug for all I know -- are
you really updating the stack pointer when you emulate these instructions?
Do you have a redzone for exactly this purpose?  Does the MIPS signal
delivery code check to see whether you're executing off the stack
outside of the ABI-protected region?


Given that this is documented as an ABI change, I'll ask again: can
you demand that user code that wants the ABI-breaking non-executable
stack must not do this?  IOW, binaries that claim to work with
non-executable stacks must not have fp branches (or alternatively must
not have anything other than nops in the delay slots of possibly
emulated FP branches)?  Or you could be polite and explicitly define
the set of instructions that are safe in fp branch delay slots.

(Also, seriously, fp branches have usable delay slots?  Wow!)

--Andy
  
Rich Felker Oct. 7, 2014, 12:49 a.m. UTC | #19
On Mon, Oct 06, 2014 at 05:33:18PM -0700, David Daney wrote:
> On 10/06/2014 05:05 PM, Rich Felker wrote:
> >On Mon, Oct 06, 2014 at 04:48:52PM -0700, David Daney wrote:
> >>On 10/06/2014 04:38 PM, Andy Lutomirski wrote:
> >>>On 10/06/2014 02:58 PM, Rich Felker wrote:
> >>>>On Mon, Oct 06, 2014 at 02:45:29PM -0700, David Daney wrote:
> >>[...]
> >>>>This is a huge ill-designed mess.
> >>>
> >>>Amen.
> >>>
> >>>Can the kernel not just emulate the instructions directly?
> >>
> >>In theory it could, but since there can be implementation defined
> >>instructions, there is no way to achieve full instruction set
> >>coverage for all possible machines.
> >
> >Is the issue really implementation-defined instructions with delay
> >slots?
> 
> It is the instructions in the delay slots, not the branch
> instructions themselves that are of interest.  But, for the sake of
> the arguments, this is not a critical point.

I think it's an important distinction. It means the problem domain is
supporting all possible instructions, not instructions which can
reasonably have delay slots.

> >If so it sounds like a made-up issue.
> 
> It is not a made up issue.
> 
> If you want an architecture that has a well defined instruction set,
> stick with x86, Intel will tell you what is good for you and you
> will take whatever they give you.
> 
> If you want an architecture where you can add implementation defined
> instructions to do whatever you want, then you use an architecture
> like MIPS.

The ability to add arbitrary instructions does not mean that arbitrary
uses of those instructions have to be supported by the ABI. It's
completely reasonable for the ABI to say they cannot be used in delay
slots for coprocessor-conditional branches.

And of course once you're in the realm of custom hardware and software
written to depend on that custom hardware, you know whether you have
an fpu or not anyway. If you have an fpu, you can ignore the
restriction. If you don't, you should follow it. Note that "partial
fpu emulation" (e.g. just denormals) is not relevant here; the issue
only arises if the coprocessor branch instructions have to be
emulated, which means "there's no fpu at all".

> >They're not going to
> >occur in real binaries. Certainly a compiler is not going to generate
> >implementation-defined instructions,
> 
> Why not?  It will emit any instructions we care to make it emit.  If
> we want it to emit crypto instructions with patented algorithms,
> then it will do that.  But we would still like to use a generic
> kernel with generic FPU support.
> 
> The most straight forward way (and the currently implemented way) of
> doing this is to execute the instructions in question out-of-line
> (on the userspace stack).
> 
> The question here is:  What is the best way to get to a
> non-executable stack.
> 
> The consensus among MIPS developers is that we should continue using

My experience has been that hardware and software developers focused
on a particular hardware target are generally unqualified to make
decisions that affect the design and operation of libc or the kernel.
They are not experts in these areas. It was apparent early on in this
thread, when you mentioned the idea that "not all threads would need
fpu support", that you were thinking from a standpoint of custom
low-level software and not a general purpose libc that cannot read the
application author's mind. It seems nobody had thought of the
impossibility of doing lazy setup (inability to handle failure) and
the necessity of always initializing this stuff at pthread_create
time, either. Design issues like this should be run by experts in the
libc area early on, not as an afterthought.
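
As a rough illustration of the eager setup meant here, a thread-creation
path could allocate the area up front and report failure before the
thread ever runs.  This is a sketch under assumptions: struct xol_area
and both helpers are hypothetical names, and the call to the proposed
set_fpuemul_xol_area syscall is left as a comment because that syscall
exists only in the patch under discussion.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical per-thread record; a libc would keep this in its own
 * thread descriptor. */
struct xol_area {
    void  *base;
    size_t len;
};

/* Allocate one read/write/execute page for out-of-line execution.
 * Returns 0 on success and -1 on failure, so the caller (for example
 * a pthread_create implementation) can fail cleanly. */
static int xol_area_create(struct xol_area *a)
{
    long page;
    void *p;

    assert(a != NULL);
    page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    a->base = p;
    a->len = (size_t)page;
    /* The new thread would pass a->base to the proposed syscall here. */
    return 0;
}

/* Release the area when the thread exits. */
static int xol_area_destroy(struct xol_area *a)
{
    int rc;

    assert(a != NULL);
    rc = munmap(a->base, a->len);
    a->base = NULL;
    a->len = 0;
    return rc;
}
```

Every thread pays for one mapping and one unmapping whether or not it
ever executes an emulated FP branch, which is the cost being objected to.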

> the out-of-line execution trick, but do it somewhere other than in
> stack memory.

How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.

> One way of doing this is to have the kernel magically generate
> thread local memory regions.
> 
> Another option is to have userspace manage the out-of-line execution areas.
> 
> As is often the case, each approach has different pluses and minuses.

Having the kernel magically do it would be better, but I'm doubtful
that solution works anyway due to the above signal handler/nesting
issue.

Rich
  
Kevin D. Kissell Oct. 7, 2014, 1:02 a.m. UTC | #20
On 10/06/2014 01:23 PM, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
>
> In order for MIPS to be able to support a non-executable stack, we
> need to supply a method to specify a userspace area that can be used
> for executing emulated branch delay slot instructions.
>
> We add a new system call, sys_set_fpuemul_xol_area so that userspace
> threads that are using the FPU can specify the location of the FPU
> emulation out of line execution area.
>
> Background:
>
> MIPS floating point support requires that any instruction that cannot
> be directly executed by the FPU, be emulated by the kernel.  Part of
> this emulation involves executing non-FPU instructions that fall in
> the delay slots of FP branch instructions.  Since the beginning of
> MIPS/Linux time, this has been done by placing the instructions on the
> userspace thread stack, and executing them there, as the instructions
> must be executed in the MM context of the thread receiving the
> emulation.
Well, actually it doesn't go back to the beginning of MIPS/Linux time - 
I was the b*astard who took the user-mode functional emulator from 
Algorithmics and got it to work in the MIPS Linux kernel context, some 
time in the early 2000s.  It was all pretty straightforward, except for 
the delay-slot-of-an-emulated-FP-conditional-branch problem. As you 
note, it may be a load or store (though not a branch), so it needs to be 
done in the user's MM context, and the user stack has nice properties of 
being intrinsically per-thread and re-entrancy tolerant.
> Because of this, the de facto MIPS Linux userspace ABI requires that
> the userspace thread have an executable stack.  It is de facto,
> because it is not written anywhere that this must be the case, but it
> is never the less a requirement.
IIRC, when I first did the code, we already needed this for signal 
trampolines.  I just extended it.  Is it no longer required for signal 
support?  If not, how are signal trampolines now done, and could this 
not be re-extended to the FP branch delay slot emulation problem?
> Problem:
>
> How do we get MIPS Linux to use a non-executable stack in the face of
> the FPU emulation problem?
>
> Since userspace desires to change the ABI, put some of the onus on the
> userspace code.  Any userspace thread desiring a non-executable stack,
> must allocate a 4-byte aligned area at least 8 bytes long that
> has read/write/execute permissions and pass the address of that area
> to the kernel with the new sys_set_fpuemul_xol_area system call.
>
> This is similar to how we require userspace to notify the kernel of
> the value of the thread local pointer.
It's easy for me to criticise, since I'm no longer responsible for
maintenance, but I hope you'll excuse me for commenting that, while this
seems like a small enough and neat enough patch per se, I find it
disagreeable to break legacy binaries and to see a mechanism whose name
and implementation are so strictly tied to the one purpose.  Even if it's
only used for the FP delay slot emulation today, shouldn't it be
designed/coded/documented as a sort of generic trampoline scratchpad?
And shouldn't we try to design things so that legacy code with FP but no
new magic system call "just works"?  e.g. auto-allocate and initialize
the space for a thread the first time it actually needs to emulate an FP
branch?

/K.
>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>
> First attempt to libc-alpha@ failed due to anti-spam technology,
> reattempting to a reduced list of recipients.
>
> This patch has only been compile tested, and lacks the userspace
> component.  It is presented as an alternate approach to the recently
> proposed MIPS non-executable stack patches posted here:
>
> http://www.linux-mips.org/archives/linux-mips/2014-10/msg00024.html
>
>   arch/mips/include/asm/thread_info.h |  2 ++
>   arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
>   arch/mips/kernel/process.c          |  1 +
>   arch/mips/kernel/scall32-o32.S      |  1 +
>   arch/mips/kernel/scall64-64.S       |  1 +
>   arch/mips/kernel/scall64-n32.S      |  1 +
>   arch/mips/kernel/scall64-o32.S      |  1 +
>   arch/mips/kernel/syscall.c          |  8 ++++++++
>   arch/mips/math-emu/dsemul.c         | 11 +++++++----
>   9 files changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
> index 7de8658..20d47f6 100644
> --- a/arch/mips/include/asm/thread_info.h
> +++ b/arch/mips/include/asm/thread_info.h
> @@ -26,6 +26,7 @@ struct thread_info {
>   	struct exec_domain	*exec_domain;	/* execution domain */
>   	unsigned long		flags;		/* low level flags */
>   	unsigned long		tp_value;	/* thread pointer */
> +	unsigned long		fpu_emul_xol;	/* FPU emul eXecute Out of Line VA */
>   	__u32			cpu;		/* current CPU */
>   	int			preempt_count;	/* 0 => preemptable, <0 => BUG */
>   
> @@ -46,6 +47,7 @@ struct thread_info {
>   	.task		= &tsk,			\
>   	.exec_domain	= &default_exec_domain, \
>   	.flags		= _TIF_FIXADE,		\
> +	.fpu_emul_xol	= ~0ul,			\
>   	.cpu		= 0,			\
>   	.preempt_count	= INIT_PREEMPT_COUNT,	\
>   	.addr_limit	= KERNEL_DS,		\
> diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
> index fdb4923..f1270ee 100644
> --- a/arch/mips/include/uapi/asm/unistd.h
> +++ b/arch/mips/include/uapi/asm/unistd.h
> @@ -375,16 +375,17 @@
>   #define __NR_seccomp			(__NR_Linux + 352)
>   #define __NR_getrandom			(__NR_Linux + 353)
>   #define __NR_memfd_create		(__NR_Linux + 354)
> +#define __NR_set_fpuemul_xol_area	(__NR_Linux + 355)
>   
>   /*
>    * Offset of the last Linux o32 flavoured syscall
>    */
> -#define __NR_Linux_syscalls		354
> +#define __NR_Linux_syscalls		355
>   
>   #endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
>   
>   #define __NR_O32_Linux			4000
> -#define __NR_O32_Linux_syscalls		354
> +#define __NR_O32_Linux_syscalls		355
>   
>   #if _MIPS_SIM == _MIPS_SIM_ABI64
>   
> @@ -707,16 +708,17 @@
>   #define __NR_seccomp			(__NR_Linux + 312)
>   #define __NR_getrandom			(__NR_Linux + 313)
>   #define __NR_memfd_create		(__NR_Linux + 314)
> +#define __NR_set_fpuemul_xol_area	(__NR_Linux + 315)
>   
>   /*
>    * Offset of the last Linux 64-bit flavoured syscall
>    */
> -#define __NR_Linux_syscalls		314
> +#define __NR_Linux_syscalls		315
>   
>   #endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */
>   
>   #define __NR_64_Linux			5000
> -#define __NR_64_Linux_syscalls		314
> +#define __NR_64_Linux_syscalls		315
>   
>   #if _MIPS_SIM == _MIPS_SIM_NABI32
>   
> @@ -1043,15 +1045,16 @@
>   #define __NR_seccomp			(__NR_Linux + 316)
>   #define __NR_getrandom			(__NR_Linux + 317)
>   #define __NR_memfd_create		(__NR_Linux + 318)
> +#define __NR_set_fpuemul_xol_area	(__NR_Linux + 319)
>   
>   /*
>    * Offset of the last N32 flavoured syscall
>    */
> -#define __NR_Linux_syscalls		318
> +#define __NR_Linux_syscalls		319
>   
>   #endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */
>   
>   #define __NR_N32_Linux			6000
> -#define __NR_N32_Linux_syscalls		318
> +#define __NR_N32_Linux_syscalls		319
>   
>   #endif /* _UAPI_ASM_UNISTD_H */
> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> index 636b074..6dde6bb 100644
> --- a/arch/mips/kernel/process.c
> +++ b/arch/mips/kernel/process.c
> @@ -151,6 +151,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>   
>   	if (clone_flags & CLONE_SETTLS)
>   		ti->tp_value = regs->regs[7];
> +	ti->fpu_emul_xol = ~0ul;
>   
>   	return 0;
>   }
> diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
> index 744cd10..8c19a39 100644
> --- a/arch/mips/kernel/scall32-o32.S
> +++ b/arch/mips/kernel/scall32-o32.S
> @@ -579,3 +579,4 @@ EXPORT(sys_call_table)
>   	PTR	sys_seccomp
>   	PTR	sys_getrandom
>   	PTR	sys_memfd_create
> +	PTR	sys_set_fpuemul_xol_area	/* 4355 */
> diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
> index 002b1bc..0b9f72e 100644
> --- a/arch/mips/kernel/scall64-64.S
> +++ b/arch/mips/kernel/scall64-64.S
> @@ -434,4 +434,5 @@ EXPORT(sys_call_table)
>   	PTR	sys_seccomp
>   	PTR	sys_getrandom
>   	PTR	sys_memfd_create
> +	PTR	sys_set_fpuemul_xol_area	/* 5315 */
>   	.size	sys_call_table,.-sys_call_table
> diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
> index ca6cbbe..48f1760 100644
> --- a/arch/mips/kernel/scall64-n32.S
> +++ b/arch/mips/kernel/scall64-n32.S
> @@ -427,4 +427,5 @@ EXPORT(sysn32_call_table)
>   	PTR	sys_seccomp
>   	PTR	sys_getrandom
>   	PTR	sys_memfd_create
> +	PTR	sys_set_fpuemul_xol_area
>   	.size	sysn32_call_table,.-sysn32_call_table
> diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
> index 9e10d11..60def68 100644
> --- a/arch/mips/kernel/scall64-o32.S
> +++ b/arch/mips/kernel/scall64-o32.S
> @@ -564,4 +564,5 @@ EXPORT(sys32_call_table)
>   	PTR	sys_seccomp
>   	PTR	sys_getrandom
>   	PTR	sys_memfd_create
> +	PTR	sys_set_fpuemul_xol_area	/* 4355 */
>   	.size	sys32_call_table,.-sys32_call_table
> diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
> index 4a4f9dd..5f9d9e8 100644
> --- a/arch/mips/kernel/syscall.c
> +++ b/arch/mips/kernel/syscall.c
> @@ -96,6 +96,14 @@ SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
>   	return 0;
>   }
>   
> +SYSCALL_DEFINE1(set_fpuemul_xol_area, unsigned long, addr)
> +{
> +	struct thread_info *ti = task_thread_info(current);
> +
> +	ti->fpu_emul_xol = addr;
> +	return 0;
> +}
> +
>   static inline int mips_atomic_set(unsigned long addr, unsigned long new)
>   {
>   	unsigned long old, tmp;
> diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
> index 4f514f3..bf4ff61 100644
> --- a/arch/mips/math-emu/dsemul.c
> +++ b/arch/mips/math-emu/dsemul.c
> @@ -34,6 +34,7 @@ struct emuframe {
>   int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>   {
>   	extern asmlinkage void handle_dsemulret(void);
> +	struct thread_info *ti = task_thread_info(current);
>   	struct emuframe __user *fr;
>   	int err;
>   
> @@ -64,10 +65,12 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>   	 * branches, but gives us a cleaner interface to the exception
>   	 * handler (single entry point).
>   	 */
> -
> -	/* Ensure that the two instructions are in the same cache line */
> -	fr = (struct emuframe __user *)
> -		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
> +	if (ti->fpu_emul_xol != ~0ul)
> +		fr = (struct emuframe *)ti->fpu_emul_xol;
> +	else
> +		/* Ensure that the two instructions are in the same cache line */
> +		fr = (struct emuframe __user *)
> +			((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
>   
>   	/* Verify that the stack pointer is not competely insane */
>   	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
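
For anyone following the dsemul.c hunk above, the frame-address choice
can be restated as a small standalone C sketch.  pick_frame_addr and
XOL_UNSET are hypothetical names, and the frame size is a parameter
because the layout of struct emuframe is not shown in the hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the ~0ul "no area registered" sentinel used in the patch. */
#define XOL_UNSET (~(uintptr_t)0)

/* Choose where the emulation frame goes:
 *  - the registered out-of-line area, if userspace supplied one;
 *  - otherwise just below the user stack pointer, rounded down to an
 *    8-byte boundary as in the existing code. */
static uintptr_t pick_frame_addr(uintptr_t sp, uintptr_t xol,
                                 uintptr_t frame_size)
{
    if (xol != XOL_UNSET)
        return xol;

    assert(sp >= frame_size); /* precondition of this sketch only */
    return (sp - frame_size) & ~(uintptr_t)0x7;
}
```

The sketch does no validation of a registered address; in the patch the
subsequent access_ok() check is what guards the write.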
  
Rich Felker Oct. 7, 2014, 1:38 a.m. UTC | #21
On Mon, Oct 06, 2014 at 06:02:20PM -0700, Kevin D. Kissell wrote:
> On 10/06/2014 01:23 PM, David Daney wrote:
> >From: David Daney <david.daney@cavium.com>
> >
> >In order for MIPS to be able to support a non-executable stack, we
> >need to supply a method to specify a userspace area that can be used
> >for executing emulated branch delay slot instructions.
> >
> >We add a new system call, sys_set_fpuemul_xol_area so that userspace
> >threads that are using the FPU can specify the location of the FPU
> >emulation out of line execution area.
> >
> >Background:
> >
> >MIPS floating point support requires that any instruction that cannot
> >be directly executed by the FPU, be emulated by the kernel.  Part of
> >this emulation involves executing non-FPU instructions that fall in
> >the delay slots of FP branch instructions.  Since the beginning of
> >MIPS/Linux time, this has been done by placing the instructions on the
> >userspace thread stack, and executing them there, as the instructions
> >must be executed in the MM context of the thread receiving the
> >emulation.
> Well, actually it doesn't go back to the beginning of MIPS/Linux
> time - I was the b*astard who took the user-mode functional emulator
> from Algorithmics and got it to work in the MIPS Linux kernel
> context, some time in the early 2000s.  It was all pretty
> straightforward, except for the
> delay-slot-of-an-emulated-FP-conditional-branch problem. As you
> note, it may be a load or store (though not a branch), so it needs
> to be done in the user's MM context, and the user stack has nice
> properties of being intrinsically per-thread and re-entrancy
> tolerant.

If the space of possible instructions that need to run in the user's
MM context is sufficiently small, perhaps we could emulate the rest in
kernelspace and have a fixed code mapping exposed to userspace
containing each possible MM-context-dependent instruction combination.

As an alternative, the kernel could expose emulator code to run in
userspace as part of the vdso or other magic kernel-provided pages,
and this code would be capable of emulating arbitrary instructions,
which would of course take place in the user MM context.

This does not solve the problem for hardware with custom instructions,
but I still believe it's totally reasonable to say that the ABI does
not allow putting custom instructions in delay slots for coprocessor
branches.

> >Because of this, the de facto MIPS Linux userspace ABI requires that
> >the userspace thread have an executable stack.  It is de facto,
> >because it is not written anywhere that this must be the case, but it
> >is never the less a requirement.
> IIRC, when I first did the code, we already needed this for signal
> trampolines.  I just extended it.  Is it no longer required for
> signal support?  If not, how are signal trampolines now done, and
> could this not be re-extended to the FP branch delay slot emulation
> problem?

Signal trampolines were nonsense to begin with. The code needed is
fixed, not variable per-signal-instance, so it can be provided by libc
or by the kernel in the vdso page or similar.

> >Problem:
> >
> >How do we get MIPS Linux to use a non-executable stack in the face of
> >the FPU emulation problem?
> >
> >Since userspace desires to change the ABI, put some of the onus on the
> >userspace code.  Any userspace thread desiring a non-executable stack,
> >must allocate a 4-byte aligned area at least 8 bytes long with that
> >has read/write/execute permissions and pass the address of that area
> >to the kernel with the new sys_set_fpuemul_xol_area system call.
> >
> >This is similar to how we require userspace to notify the kernel of
> >the value of the thread local pointer.
> It's easy for me to criticise, since I'm no longer responsible for
> maintenance, but I hope you'll excuse me for commenting that, while
> this seems like a small enough and neat enough patch per se,  I find
> it disagreeable to break legacy binaries and to see a mechanism
> whose name and implementation is so strictly tied to the one
> purpose.  Even if it's only used for the FP delay slot emulation
> today, shouldn't it be designed/coded/documented as a sort of
> generic trampoline scratchpad?  And shouldn't we try to design
> things so that legacy code with FP but no new magic system call
> "just works"?  e.g. auto-allocate and initialize the space for a
> thread the first time it actually needs to emulate an FP branch?

"First time it actually needs to emulate" does not work, since it may
be impossible to allocate at that time, and then there's no way the
program can proceed. The allocation must be done at a time when you
can report failure, which means at the time of execve (for the main
thread) and clone (for other threads).
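
As a rough illustration of allocating up front so that failure can be
reported, here is a minimal userspace sketch. It assumes the interface
described in the commit message (an RWX area, 4-byte aligned, at least
8 bytes). register_xol_area() is a hypothetical stand-in for the proposed
set_fpuemul_xol_area call and is stubbed out so the sketch builds on any
system; thread_setup_xol() is likewise an invented name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define XOL_AREA_SIZE 8	/* minimum size stated in the commit message */

/* Hypothetical stand-in for the proposed set_fpuemul_xol_area syscall.
 * It only records the address; ~0 means "not set", as in the patch. */
static uintptr_t registered_xol = ~(uintptr_t)0;

static int register_xol_area(void *addr)
{
	if (((uintptr_t)addr & 3) != 0)
		return -1;	/* the area must be 4-byte aligned */
	registered_xol = (uintptr_t)addr;
	return 0;
}

/* Allocate and register the area while thread creation can still fail
 * cleanly. Returns 0 on success, -1 if the caller should report failure. */
static int thread_setup_xol(void **area_out)
{
	void *area;

	assert(area_out != NULL);
	area = mmap(NULL, XOL_AREA_SIZE,
		    PROT_READ | PROT_WRITE | PROT_EXEC,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return -1;	/* no memory, or RWX mappings are refused */
	if (register_xol_area(area) != 0) {
		munmap(area, XOL_AREA_SIZE);
		return -1;
	}
	*area_out = area;
	return 0;
}
```

A thread-creation path would call thread_setup_xol() before the new thread
runs any FP code and fail the creation if it returns -1.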

Rich
  
David Daney Oct. 7, 2014, 4:32 a.m. UTC | #22
On 10/06/2014 06:02 PM, Kevin D. Kissell wrote:
> On 10/06/2014 01:23 PM, David Daney wrote:
>> From: David Daney <david.daney@cavium.com>
>>
>> In order for MIPS to be able to support a non-executable stack, we
>> need to supply a method to specify a userspace area that can be used
>> for executing emulated branch delay slot instructions.
>>
>> We add a new system call, sys_set_fpuemul_xol_area so that userspace
>> threads that are using the FPU can specify the location of the FPU
>> emulation out of line execution area.
>>
>> Background:
>>
>> MIPS floating point support requires that any instruction that cannot
>> be directly executed by the FPU, be emulated by the kernel. Part of
>> this emulation involves executing non-FPU instructions that fall in
>> the delay slots of FP branch instructions.  Since the beginning of
>> MIPS/Linux time, this has been done by placing the instructions on the
>> userspace thread stack, and executing them there, as the instructions
>> must be executed in the MM context of the thread receiving the
>> emulation.
> Well, actually it doesn't go back to the beginning of MIPS/Linux time 
> - I was the b*astard who took the user-mode functional emulator from 
> Algorithmics and got it to work in the MIPS Linux kernel context, some 
> time in the early 2000s.  It was all pretty straightforward, except 
> for the delay-slot-of-an-emulated-FP-conditional-branch problem. As 
> you note, it may be a load or store (though not a branch), so it needs 
> to be done in the user's MM context, and the user stack has nice 
> properties of being intrinsically per-thread and re-entrancy tolerant.
>> Because of this, the de facto MIPS Linux userspace ABI requires that
>> the userspace thread have an executable stack.  It is de facto,
>> because it is not written anywhere that this must be the case, but it
>> is never the less a requirement.
> IIRC, when I first did the code, we already needed this for signal 
> trampolines.  I just extended it.  Is it no longer required for signal 
> support?  If not, how are signal trampolines now done, and could this 
> not be re-extended to the FP branch delay slot emulation problem?

I moved signal trampolines off the stack quite a few years ago. This is 
the only thing blocking non-executable stack.

The problem with the FP branch delay slot emulation is that the code 
that needs to be executed varies.  The signal trampoline code is known 
at kernel build time.

>> Problem:
>>
>> How do we get MIPS Linux to use a non-executable stack in the face of
>> the FPU emulation problem?
>>
>> Since userspace desires to change the ABI, put some of the onus on the
>> userspace code.  Any userspace thread desiring a non-executable stack,
>> must allocate a 4-byte aligned area at least 8 bytes long with that
>> has read/write/execute permissions and pass the address of that area
>> to the kernel with the new sys_set_fpuemul_xol_area system call.
>>
>> This is similar to how we require userspace to notify the kernel of
>> the value of the thread local pointer.
> It's easy for me to criticise, since I'm no longer responsible for 
> maintenance, but I hope you'll excuse me for commenting that, while 
> this seems like a small enough and neat enough patch per se,  I find 
> it disagreeable to break legacy binaries

It doesn't break legacy binaries.  They continue to use an executable 
stack and the emulation is done there.

This only would change new binaries that explicitly asked for a 
non-executable stack.

> and to see a mechanism whose name and implementation is so strictly 
> tied to the one purpose.  Even if it's only used for the FP delay slot 
> emulation today, shouldn't it be designed/coded/documented as a sort 
> of generic trampoline scratchpad?  And shouldn't we try to design 
> things so that legacy code with FP but no new magic system call "just 
> works"?  e.g. auto-allocate and initialize the space for a thread the 
> first time it actually needs to emulate an FP branch?

The binaries have to be tagged as non-executable stack; this is because 
GCC can, and does, generate trampolines on the stack as part of its 
normal code generation strategy.

That said, there are many problems with both the current code, and my 
proposal.

The main issue, as mentioned by another commenter, is the problem of 
signals and nested emulations.

If the emulated instruction raises a synchronous exception that is 
converted to a signal, what is the EPC in the register state on the 
stack?  Should it be the original location of the instruction, or the 
out-of-line location used by emulation?  Are there userspace runtime 
systems that care about this?

If we are emulating on the stack, a signal stack state could clobber the 
emulation location.

If the kernel automatically allocated the emulation locations, what 
would happen if there were a signal that interrupted the emulation, and 
the signal handler did a longjmp to somewhere else?  How would we clean 
up the now unused emulation memory allocations?

  
David Daney Oct. 7, 2014, 4:50 a.m. UTC | #23
On 10/06/2014 05:49 PM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 05:33:18PM -0700, David Daney wrote:
[...]

>> Why not?  It will emit any instructions we care to make it emit.  If
>> we want it to emit crypto instructions with patented algorithms,
>> then it will do that.  But we would still like to use a generic
>> kernel with generic FPU support.
>>
>> The most straight forward way (and the currently implemented way) of
>> doing this is to execute the instructions in question out-of-line
>> (on the userspace stack).
>>
>> The question here is:  What is the best way to get to a
>> non-executable stack.
>>
>> The consensus among MIPS developers is that we should continue using
> My experience has been that hardware and software developers focused
> on a particular hardware target are generally unqualified to make
> decisions that affect the design and operation of libc or the kernel.
> They are not experts in these areas. It was apparent early on in this
> thread, when you mentioned the idea that "not all threads would need
> fpu support", that you were thinking from a standpoint of custom
> low-level software and not a general purpose libc that cannot read the
> application author's mind.
Not at all, I was thinking of soft-float ABIs, as they never execute FP 
instructions, and are often used on systems with no FPU.  In fact many 
non-FPU systems never execute any hard-float code.  So those systems 
should not suffer large performance regressions from any change made to 
support a non-executable stack.

We use GLibC on many soft-float only systems, and I would posit that 
GLibC is a general purpose libc.

>   It seems nobody had thought of the
> impossibility of doing lazy setup (inability to handle failure) and
> the necessity of always initializing this stuff at pthread_create
> time, either. Design issues like this should be run by experts in the
> libc area early on, not as an afterthought.

I bow down to the experts, as obviously I know nothing about:

1) The Linux kernel
2) The MIPS architecture.
3) Library design.
4) C libraries and their interaction with the kernel, linker and compiler.

>
>> the out-of-line execution trick, but do it somewhere other than in
>> stack memory.
> How do you answer Andy Lutomirski's question about what happens when a
> signal handler interrupts execution while the program counter is
> pointing at this "out-of-line execution" trampoline? This seems like a
> show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a regression 
from current behavior.

>
>> One way of doing this is to have the kernel magically generate
>> thread local memory regions.
>>
>> Another option is to have userspace manage the out-of-line execution areas.
>>
>> As is often the case, each approach has different pluses and minuses.
> Having the kernel magically do it would be better, but I'm doubtful
> that solution works anyway due to the above signal handler/nesting
> issue.

So the perfect is the enemy of the good?  No non-executable stack for 
you, MIPS.

> Rich
>
  
Matthew Fortune Oct. 7, 2014, 9:13 a.m. UTC | #24
> >> the out-of-line execution trick, but do it somewhere other than in
> >> stack memory.
> > How do you answer Andy Lutomirski's question about what happens when a
> > signal handler interrupts execution while the program counter is
> > pointing at this "out-of-line execution" trampoline? This seems like a
> > show-stopper for using anything other than the stack.
> It would be nice to support, but not doing so would not be a regression
> from current behavior.

It seems appropriate to mention another issue which should be addressed as
part of the overall FPU emulation work...

From what I can see the out-of-line execution of delay slot instructions
will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
instructions (inc load/store) as they will have the wrong base. Is there
anything in the current set of proposals that can address this (beyond
adding restrictions to what is ABI allowed in FPU branch delay slots)?

This is an issue whether the stack is executable or not but does directly
relate to the topic of FPU emulation.  It sounds like the kernel would not
be able to emulate a pc-relative load/store even if it was a special case
as it would not run in the correct MM context? [be gentle, I'm no expert
in this area].

Matthew
  
James Hogan Oct. 7, 2014, 10:52 a.m. UTC | #25
On 07/10/14 10:13, Matthew Fortune wrote:
>>>> the out-of-line execution trick, but do it somewhere other than in
>>>> stack memory.
>>> How do you answer Andy Lutomirski's question about what happens when a
>>> signal handler interrupts execution while the program counter is
>>> pointing at this "out-of-line execution" trampoline? This seems like a
>>> show-stopper for using anything other than the stack.
>> It would be nice to support, but not doing so would not be a regression
>> from current behavior.
> 
> It seems appropriate to mention another issue which should be addressed as
> part of the overall FPU emulation work...
> 
> From what I can see the out-of-line execution of delay slot instructions
> will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
> instructions (inc load/store) as they will have the wrong base. Is there
> anything in the current set of proposals that can address this (beyond
> adding restrictions to what is ABI allowed in FPU branch delay slots)?
> 
> This is an issue whether the stack is executable or not but does directly
> relate to the topic of FPU emulation.  It sounds like the kernel would not
> be able to emulate a pc-relative load/store even if it was a special case
> as it would not run in the correct MM context? [be gentle, I'm no expert
> in this area].

I think special casing and emulating them in the kernel would work in
these cases, since it'd be a known set of instructions rather than
arbitrary unknown instructions. The kernel already needs to read and
write safely in the user address space all the time for system calls.
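
To make the wrong-base problem concrete, here is a small sketch. It
assumes an ADDIUPC-style computation (PC plus a sign-extended, word-scaled
19-bit immediate); the helper names and the example addresses are
hypothetical, and the real encodings should be checked against the ISA
manuals. The same instruction yields a different result when run from the
out-of-line frame, whereas an in-kernel emulation can simply pass the
original PC.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 19 bits of imm. */
static int32_t sext19(uint32_t imm)
{
	imm &= 0x7ffff;
	return (int32_t)imm - ((imm & 0x40000) ? 0x80000 : 0);
}

/* Address produced by a PC-relative computation issued at 'pc'.
 * An emulator would call this with the PC of the original instruction,
 * not the address of the out-of-line copy. */
static uint32_t pcrel_target(uint32_t pc, uint32_t imm19)
{
	return pc + (uint32_t)(sext19(imm19) * 4);
}
```

For example, with the instruction originally at 0x00400100 and an
immediate of 4, the intended result is 0x00400110; executed from a copy at
some other address, the same bits produce an unrelated value.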

Cheers
James
  
Rich Felker Oct. 7, 2014, 11:11 a.m. UTC | #26
On Mon, Oct 06, 2014 at 09:50:47PM -0700, David Daney wrote:
> >>the out-of-line execution trick, but do it somewhere other than in
> >>stack memory.
> >How do you answer Andy Lutomirski's question about what happens when a
> >signal handler interrupts execution while the program counter is
> >pointing at this "out-of-line execution" trampoline? This seems like a
> >show-stopper for using anything other than the stack.
> It would be nice to support, but not doing so would not be a
> regression from current behavior.

It's not just "nice" to support, it's mandatory. Otherwise you will
execute essentially *random instructions* in this case, providing a
very nice attack vector that can almost certainly be elevated to
arbitrary code execution via timing of signals during floating point
code.

The current behavior in regards to this is correct: because you have a
*stack*, each trampoline is pushed onto the stack in its own context,
and popped when it's no longer needed. You can have arbitrarily many
such trampolines up to the stack size. Note that each nested signal
handler already requires sizeof(ucontext_t) in stack space, so these
trampolines are a negligible additional cost without major effects on
the number of signal handlers you can nest without overflowing the
stack.
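
A sketch of the address selection may help here. The stack expression
mirrors the one in the patch, (sp - sizeof(struct emuframe)) & ~0x7; the
struct layout, the XOL_UNSET name and the stack pointer values are
assumptions for illustration. It only shows that the stack variant picks a
different frame at each nesting depth while a single fixed area is reused;
whether an outer frame survives signal delivery is a separate question.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only; the real definition lives in
 * arch/mips/math-emu/dsemul.c and is not shown in this thread. */
struct emuframe {
	uint32_t emul;		/* delay slot instruction to execute */
	uint32_t badinst;	/* instruction that traps back to the kernel */
	uint32_t cookie;
	uint32_t epc;		/* where to resume afterwards */
};

#define XOL_UNSET (~(uintptr_t)0)	/* matches the ~0ul default in the patch */

/* Choose the frame address the way the patched mips_dsemul() does. */
static uintptr_t emuframe_addr(uintptr_t sp, uintptr_t fpu_emul_xol)
{
	if (fpu_emul_xol != XOL_UNSET)
		return fpu_emul_xol;	/* one fixed area per thread */
	return (sp - sizeof(struct emuframe)) & ~(uintptr_t)0x7;
}
```

With sp at 0x7fff0000 the frame lands at 0x7ffefff0; an emulation started
from a handler running at sp 0x7ffef000 lands at 0x7ffeeff0. With a fixed
area registered, both calls return the same address.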

> >>One way of doing this is to have the kernel magically generate
> >>thread local memory regions.
> >>
> >>Another option is to have userspace manage the out-of-line execution areas.
> >>
> >>As is often the case, each approach has different pluses and minuses.
> >Having the kernel magically do it would be better, but I'm doubtful
> >that solution works anyway due to the above signal handler/nesting
> >issue.
> 
> So the perfect is the enemy of the good?  No non-executable stack
> for you, MIPS.

No, regressions that make the situation worse than executable-stack
are not "good" to begin with, even if it weren't for the other design
issues and dumping everything on userspace for the sake of being lazy
in the kernel.

Rich
  
Rich Felker Oct. 7, 2014, 11:19 a.m. UTC | #27
On Tue, Oct 07, 2014 at 09:13:22AM +0000, Matthew Fortune wrote:
> From what I can see the out-of-line execution of delay slot instructions
> will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
> instructions (inc load/store) as they will have the wrong base. Is there
> anything in the current set of proposals that can address this (beyond
> adding restrictions to what is ABI allowed in FPU branch delay slots)?

Yes. If a trampoline is being generated to replace the delay slot
instruction, it can just contain more complex code to duplicate what
the PC-relative instruction would have done. Since the ABI already
assumes a stack is available, it can use the stack to backup registers
it needs for scratch space and restore them.

> This is an issue whether the stack is executable or not but does directly
> relate to the topic of FPU emulation.  It sounds like the kernel would not
> be able to emulate a pc-relative load/store even if it was a special case
> as it would not run in the correct MM context? [be gentle, I'm no expert
> in this area].

Really everything should be done in the kernel, and it's not as hard
as people are making it look. The kernel _already_ has to enforce MM
context permissions for every syscall that reads or writes user memory
(e.g. futex with PI mutexes or FUTEX_WAKE_OP, or even simple things
like read/write) so there's no reason it can't do emulated
loads/stores the exact same way.

Rich
  
James Hogan Oct. 7, 2014, 11:53 a.m. UTC | #28
On 07/10/14 05:32, David Daney wrote:
> If the kernel automatically allocated the emulation locations, what
> would happen if there were a signal that interrupted the emulation, and
> the signal handler did a longjmp to somewhere else?  How would we clean
> up the now unused emulation memory allocations?

AFAICT, Leonid's implementation also has this problem, and that has a
separate stack of emuframes per thread managed completely by the kernel.

Essentially the kernel doesn't manage the stack, userland does, and
userland can choose to skip over sigframes and emuframes with siglongjmp
without telling the kernel.

Userland can even switch between contexts (which includes stack) with
setcontext (coroutines etc) which breaks the assumption in Leonid's
patches that emuframes will be completed in reverse order to them being
started, again demonstrating that it is essentially userland that
manages the stack.

I think any attempt by the kernel to keep track of user stacks (e.g. by
storing a stack pointer along with the emuframe so that unused emuframes
can be discarded later when stack pointer goes high again) will be
foiled by setcontext.

Hmm, I can't see a way forward that doesn't involve invasive userland
handling & ABI changes other than giving up with non-executable stacks
or limiting permitted instructions in delay slots to those Linux knows
how to emulate directly.

Cheers
James
  
James Hogan Oct. 7, 2014, 12:22 p.m. UTC | #29
On 07/10/14 12:53, James Hogan wrote:
> On 07/10/14 05:32, David Daney wrote:
>> If the kernel automatically allocated the emulation locations, what
>> would happen if there were a signal that interrupted the emulation, and
>> the signal handler did a longjmp to somewhere else?  How would we clean
>> up the now unused emulation memory allocations?
> 
> AFAICT, Leonid's implementation also has this problem, and that has a
> separate stack of emuframes per thread managed completely by the kernel.
> 
> Essentially the kernel doesn't manage the stack, userland does, and
> userland can choose to skip over sigframes and emuframes with siglongjmp
> without telling the kernel.
> 
> Userland can even switch between contexts (which includes stack) with
> setcontext (coroutines etc) which breaks the assumption in Leonid's
> patches that emuframes will be completed in reverse order to them being
> started, again demonstrating that it is essentially userland that
> manages the stack.
> 
> I think any attempt by the kernel to keep track of user stacks (e.g. by
> storing a stack pointer along with the emuframe so that unused emuframes
> can be discarded later when stack pointer goes high again) will be
> foiled by setcontext.
> 
> Hmm, I can't see a way forward that doesn't involve invasive userland
> handling & ABI changes other than giving up with non-executable stacks
> or limiting permitted instructions in delay slots to those Linux knows
> how to emulate directly.

Would it work for a signal encountered during branch delay slot
emulation (maybe where the PC is pointing at that magic location the
kernel uses for emulation) to be treated as a return from emulation, but
leaving the user PC pointing to the original branch (with Cause.BD=1 I
suppose) prior to handling the signal, so that no more than one emuframe
is needed by each thread at a time?
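
Roughly, the idea could be sketched as below. This is only an
illustration of the question, not a claim that it works: struct
regs_sketch, the function names and the frame size parameter are all
hypothetical, and the sketch ignores the case where the delay slot
instruction has already taken effect before the signal arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the saved user state. */
struct regs_sketch {
	uintptr_t epc;	/* user PC at the time the signal is noticed */
	int cause_bd;	/* 1 if EPC points at a branch with a delay slot */
};

/* Is the user PC inside the thread's emulation frame? */
static int in_emuframe(uintptr_t pc, uintptr_t frame, size_t frame_size)
{
	return pc >= frame && pc - frame < frame_size;
}

/* Before building the signal frame, undo an in-progress emulation so the
 * saved state points back at the original branch and the frame is free. */
static void rewind_before_signal(struct regs_sketch *regs, uintptr_t frame,
				 size_t frame_size, uintptr_t branch_pc)
{
	assert(regs != NULL);
	if (in_emuframe(regs->epc, frame, frame_size)) {
		regs->epc = branch_pc;
		regs->cause_bd = 1;
	}
}
```

On return from the handler the branch would fault again and the emulation
would be restarted from scratch, so only one frame per thread is live.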

Cheers
James
  
David Daney Oct. 7, 2014, 4:04 p.m. UTC | #30
On 10/07/2014 02:13 AM, Matthew Fortune wrote:
>>>> the out-of-line execution trick, but do it somewhere other than in
>>>> stack memory.
>>> How do you answer Andy Lutomirski's question about what happens when a
>>> signal handler interrupts execution while the program counter is
>>> pointing at this "out-of-line execution" trampoline? This seems like a
>>> show-stopper for using anything other than the stack.
>> It would be nice to support, but not doing so would not be a regression
>> from current behavior.
>
> It seems appropriate to mention another issue which should be addressed as
> part of the overall FPU emulation work...
>
>  From what I can see the out-of-line execution of delay slot instructions
> will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
> instructions (inc load/store) as they will have the wrong base. Is there
> anything in the current set of proposals that can address this (beyond
> adding restrictions to what is ABI allowed in FPU branch delay slots)?
>
> This is an issue whether the stack is executable or not but does directly
> relate to the topic of FPU emulation.  It sounds like the kernel would not
> be able to emulate a pc-relative load/store even if it was a special case
> as it would not run in the correct MM context? [be gentle, I'm no expert
> in this area].
>

I haven't studied the r6 ISA in depth.  But you are correct, the r6 ISA 
cannot be supported with the eXecute-Out-of-Line tricks due to the PC 
relative instructions.

So probably the best path forward is to abandon the current method, and 
bite the bullet and write an entire instruction set emulator.  It 
doesn't have to be fast.

David Daney


> Matthew
>
  
David Daney Oct. 7, 2014, 4:08 p.m. UTC | #31
On 10/07/2014 04:11 AM, Rich Felker wrote:
> On Mon, Oct 06, 2014 at 09:50:47PM -0700, David Daney wrote:
>>>> the out-of-line execution trick, but do it somewhere other than in
>>>> stack memory.
>>> How do you answer Andy Lutomirski's question about what happens when a
>>> signal handler interrupts execution while the program counter is
>>> pointing at this "out-of-line execution" trampoline? This seems like a
>>> show-stopper for using anything other than the stack.
>> It would be nice to support, but not doing so would not be a
>> regression from current behavior.
>
> It's not just "nice" to support, it's mandatory. Otherwise you will
> execute essentially *random instructions* in this case, providing a
> very nice attack vector that can almost certainly be elevated to
> arbitrary code execution via timing of signals during floating point
> code.
>
> The current behavior in regards to this is correct: because you have a
> *stack*, each trampoline is pushed onto the stack in its own context,
> and popped when it's no longer needed. You can have arbitrarily many
> such trampolines up to the stack size. Note that each nested signal
> handler already requires sizeof(ucontext_t) in stack space, so these
> trampolines are a negligible additional cost without major effects on
> the number of signal handlers you can nest without overflowing the
> stack.

Yes, the stack takes care of the allocations, but the current 
implementation has many problems:

1) Signals clobber the emulation area.
2) Signals caused by the emulation, have incorrect saved machine state.

We have a low bar to pass: any new solution doesn't have to be perfect, 
it only has to be an improvement.

Keep in mind that we are not starting from a clean slate, there are many 
years of legacy code that has built up here.

David Daney
  
Andy Lutomirski Oct. 7, 2014, 6:16 p.m. UTC | #32
On Oct 7, 2014 9:09 AM, "David Daney" <ddaney@caviumnetworks.com> wrote:
>
> On 10/07/2014 04:11 AM, Rich Felker wrote:
>>
>> On Mon, Oct 06, 2014 at 09:50:47PM -0700, David Daney wrote:
>>>>>
>>>>> the out-of-line execution trick, but do it somewhere other than in
>>>>> stack memory.
>>>>
>>>> How do you answer Andy Lutomirski's question about what happens when a
>>>> signal handler interrupts execution while the program counter is
>>>> pointing at this "out-of-line execution" trampoline? This seems like a
>>>> show-stopper for using anything other than the stack.
>>>
>>> It would be nice to support, but not doing so would not be a
>>> regression from current behavior.
>>
>>
>> It's not just "nice" to support, it's mandatory. Otherwise you will
>> execute essentially *random instructions* in this case, providing a
>> very nice attack vector that can almost certainly be elevated to
>> arbitrary code execution via timing of signals during floating point
>> code.
>>
>> The current behavior in regards to this is correct: because you have a
>> *stack*, each trampoline is pushed onto the stack in its own context,
>> and popped when it's no longer needed. You can have arbitrarily many
>> such trampolines up to the stack size. Note that each nested signal
>> handler already requires sizeof(ucontext_t) in stack space, so these
>> trampolines are a negligible additional cost without major effects on
>> the number of signal handlers you can nest without overflowing the
>> stack.
>
>
> Yes, the stack takes care of the allocations, but the current implementation has many problems:
>
> 1) Signals clobber the emulation area.
> 2) Signals caused by the emulation, have incorrect saved machine state.
>
> We have a low bar to pass, any new solution doesn't have to be perfect, it only has to be an improvement.
>
> Keep in mind that we are not starting from a clean slate, there are many years of legacy code that has built up here.

A lesson I learned when doing the x86 vsyscall stuff: Don't waste time
improving legacy crap without a really good reason.  Especially don't
extend the interface.  Deprecate it (without breaking working user
code) and move on.

--Andy

>
> David Daney
  
Leonid Yegoshin Oct. 7, 2014, 6:32 p.m. UTC | #33
Well, I am not a subscriber to the mailing list, so I am reading this
for the first time. Some notes:

1)  David's approach would likely work for FPU emulation but is unlikely
to work for MIPS Rel 2/Rel 1/MIPS I emulation on the MIPS R6
architecture. The reason is that the first MIPS R2 instruction (removed
from MIPS R6) can be hit long before GLIBC/bionic/etc. can determine how
to properly use a new system call. And that instruction needs to be
emulated. I actually hit this problem with ssh-keygen first and referred
to FPU emulation because I got to it later, during my attempt to salvage
the situation.

2)  The issue of uMIPS ADDIUPC and similar instructions is overblown in
my opinion. None of them are memory-related, and their emulation in a
BD-slot can easily be done in the kernel, which actually accelerates
emulation. Look at this piece of code which I wrote to accelerate
emulation of some instructions in the BD-slot of a JR instruction:

         switch (MIPSInst_OPCODE(ir)) {
         case addiu_op:
                 if (MIPSInst_RT(ir))
                         regs->regs[MIPSInst_RT(ir)] =
                                 (s32)regs->regs[MIPSInst_RS(ir)] +
                                 (s32)MIPSInst_SIMM(ir);
                 return(0);
#ifdef CONFIG_64BIT
         case daddiu_op:
                 if (MIPSInst_RT(ir))
                         regs->regs[MIPSInst_RT(ir)] =
                                 (s64)regs->regs[MIPSInst_RS(ir)] +
                                 (s64)MIPSInst_SIMM(ir);
                 return(0);
#endif

Five lines per instruction.

3)  A signal arriving during execution of an emulated instruction:
signals are under the kernel's control, and we can easily delay a signal
during execution of an emulated instruction until the return from
do_dsemulret. It is not a big deal, in either code or performance. Thank
you for the good point.

4)  On the call for doing all instruction emulation in the kernel: it is
not the MIPS business model to force customers to make the details of
all Coprocessor 2 instructions public. We provide an interface and the
rest is the customer's business. Besides that, it is really painful to
differentiate between Cavium Octeon instructions and some other CPU's
instructions with the same opcode. On the other hand, leaving emulation
of their instructions to them is not wise after having had a good way of
doing that for multiple years.

- Leonid.
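
To make point 2 concrete for the PC-relative case raised earlier in the
thread, here is a minimal sketch of computing an ADDIUPC-style result in
software instead of executing the instruction out of line. It assumes
the R6 form, with a 19-bit signed word offset added to the PC of the
instruction itself; that field width should be checked against the ISA
manual. The helper names sext19 and emulate_addiupc are hypothetical and
are not kernel API.

```c
#include <stdint.h>

/* Sign-extend the low 19 bits of a field (assumed immediate width). */
static int32_t sext19(uint32_t field)
{
	field &= 0x7ffff;
	return (int32_t)(field ^ 0x40000) - 0x40000;
}

/*
 * Hypothetical helper: value an ADDIUPC-style instruction would write to
 * its destination register. insn_pc is the user address of the
 * instruction as it appears in the program, not the address of any
 * out-of-line copy.
 */
static uint64_t emulate_addiupc(uint64_t insn_pc, uint32_t imm19)
{
	int64_t offset = (int64_t)sext19(imm19) * 4;	/* words to bytes */

	return insn_pc + (uint64_t)offset;
}
```

Because the base is the instruction's own user address, the result does
not depend on where a trampoline would have been placed, which is the
property that out-of-line execution loses.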
  
David Daney Oct. 7, 2014, 6:43 p.m. UTC | #34
On 10/07/2014 11:32 AM, Leonid Yegoshin wrote:
> Well, I am not a subscriber to mail-list, so I read it the first time
> and some notes:
>
> 1)  David's approach would likely work for FPU emulation but unlikely
> works for MIPS Rel 2/Rel 1/ MIPS I emulation in MIPS R6 architecture.
> The reason is that the first MIPS R2 instruction (removed from MIPS R6)
> can be hit long before GLIBC/bionic/etc can determine how to use
> properly a new system call. And that instruction needs to be emulated. I
> actually hit this problem with ssh-keygen first and referred to  FPU
> emulation because I got it later, during my attempt to salvage a situation.
>
> 2)  The issue of uMIPS ADDIUPC and similar instructions are overblown in
> my opinion. Never of them are memory-related and their emulation in
> BD-slot can be easily done in kernel and that actually accelerates an
> emulation. Look at piece of code which I wrote to accelerate an
> emulation of some instructions in BD-slot of JR instruction:
>
>          switch (MIPSInst_OPCODE(ir)) {
>          case addiu_op:
>                  if (MIPSInst_RT(ir))
>                          regs->regs[MIPSInst_RT(ir)] =
>                                  (s32)regs->regs[MIPSInst_RS(ir)] +
>                                  (s32)MIPSInst_SIMM(ir);
>                  return(0);
> #ifdef CONFIG_64BIT
>          case daddiu_op:
>                  if (MIPSInst_RT(ir))
>                          regs->regs[MIPSInst_RT(ir)] =
>                                  (s64)regs->regs[MIPSInst_RS(ir)] +
>                                  (s64)MIPSInst_SIMM(ir);
>                  return(0);
> #endif
>
> Five lines per instruction.

But you must be able to emulate them, so you need an emulator, not XOL.

>
> 3)  The signal happened during execution of emulated instruction -
> signals are under control of kernel and we can easily delay a signal
> during execution of emulated instruction until return from do_dsemulret.
> It is not a big deal - nor code, nor performance. Thank you for good point.
>

The problem is what to do with synchronous signals (SIGSEGV).  Those 
cannot be held off, and you really want the EPC value saved in the 
register state to be correct.

> 4)  The voice for doing any instruction emulation in kernel - it is not
> a MIPS business model to force customer to put details of all
> Coprocessor 2 instructions public. We provide an interface and the rest
> is a customer business. Besides that it is really painful to make a
> differentiation between Cavium Octeon and some another CPU instructions
> with the same opcode. On other side, leaving emulation of their
> instructions to them is not a wise after having some good way doing that
> multiple years.
>

With all the new information we have begun to understand, it seems like 
the only sane thing to do is get rid of this XOL approach and go to full 
emulation of the entire instruction set.

David Daney
  
Andy Lutomirski Oct. 7, 2014, 6:44 p.m. UTC | #35
On Tue, Oct 7, 2014 at 11:32 AM, Leonid Yegoshin
<Leonid.Yegoshin@imgtec.com> wrote:
> Well, I am not a subscriber to mail-list, so I read it the first time and
> some notes:
>

>
> 3)  The signal happened during execution of emulated instruction - signals
> are under control of kernel and we can easily delay a signal during
> execution of emulated instruction until return from do_dsemulret. It is not
> a big deal - nor code, nor performance. Thank you for good point.

If you go down this particular rabbit hole, you will never come back out.

What happens if one of those out-of-line instructions causes a
synchronous trap?  What if SIGSTOP arrives before ret?  What if
another thread removes the magic ret sequence?

>
> 4)  The voice for doing any instruction emulation in kernel - it is not a
> MIPS business model to force customer to put details of all Coprocessor 2
> instructions public. We provide an interface and the rest is a customer
> business. Besides that it is really painful to make a differentiation
> between Cavium Octeon and some another CPU instructions with the same
> opcode. On other side, leaving emulation of their instructions to them is
> not a wise after having some good way doing that multiple years.

IMO this is all backwards.  If MIPS customers put proprietary
instructions into their ISA, they leave out the FPU, and they put a
proprietary insn in a branch delay slot, then I think that they
deserve a fatal signal.

There's a really easy solution for new systems: fix the toolchain.
Teach the assembler to disallow any proprietary instructions in an FP
branch delay slot.

--Andy
  
David Daney Oct. 7, 2014, 6:50 p.m. UTC | #36
On 10/07/2014 11:44 AM, Andy Lutomirski wrote:
> On Tue, Oct 7, 2014 at 11:32 AM, Leonid Yegoshin
> <Leonid.Yegoshin@imgtec.com> wrote:
>> Well, I am not a subscriber to mail-list, so I read it the first time and
>> some notes:
>>
>
>>
>> 3)  The signal happened during execution of emulated instruction - signals
>> are under control of kernel and we can easily delay a signal during
>> execution of emulated instruction until return from do_dsemulret. It is not
>> a big deal - nor code, nor performance. Thank you for good point.
>
> If you go down this particular rabbit hole, you will never come back out.
>
> What happens if one of those out-of-line instructions causes a
> synchronous trap?  What if SIGSTOP arrives before ret?  What if
> another thread removes the magic ret sequence?
>
>>
>> 4)  The voice for doing any instruction emulation in kernel - it is not a
>> MIPS business model to force customer to put details of all Coprocessor 2
>> instructions public. We provide an interface and the rest is a customer
>> business. Besides that it is really painful to make a differentiation
>> between Cavium Octeon and some another CPU instructions with the same
>> opcode. On other side, leaving emulation of their instructions to them is
>> not a wise after having some good way doing that multiple years.
>
> IMO this is all backwards.  If MIPS customers put proprietary
> instructions into their ISA, they leave out the FPU, and they put a
> proprietary insn in a branch delay slot, then I think that they
> deserve a fatal signal.
>
> There's a really easy solution for new systems: fix the toolchain.
> Teach the assembler to disallow any proprietary instructions in an FP
> branch delay slot.
>

Yes, gas for MIPS already has an instruction attribute for instructions 
that cannot be placed in delay slots.  It should be a fairly simple 
matter to extend this to instructions that cannot be emulated.

Thanks,
David Daney


> --Andy
>
  
Rich Felker Oct. 7, 2014, 7:09 p.m. UTC | #37
On Tue, Oct 07, 2014 at 11:44:35AM -0700, Andy Lutomirski wrote:
> > 4)  The voice for doing any instruction emulation in kernel - it is not a
> > MIPS business model to force customer to put details of all Coprocessor 2
> > instructions public. We provide an interface and the rest is a customer
> > business. Besides that it is really painful to make a differentiation
> > between Cavium Octeon and some another CPU instructions with the same
> > opcode. On other side, leaving emulation of their instructions to them is
> > not a wise after having some good way doing that multiple years.
> 
> IMO this is all backwards.  If MIPS customers put proprietary
> instructions into their ISA, they leave out the FPU, and they put a
> proprietary insn in a branch delay slot, then I think that they
> deserve a fatal signal.

I agree completely here. We should not break things (or, as it seems,
leave them broken) for common usage cases that affect everyone just to
coddle proprietary vendor-specific instructions. The latter just
should not be used in delay slots unless the chip vendor also promises
to provide fpu branch in hardware.

Rich
  
Leonid Yegoshin Oct. 7, 2014, 7:13 p.m. UTC | #38
(repeat it because of some e-mail failure, sorry)

On 10/07/2014 11:43 AM, David Daney wrote:

>> Five lines per instruction.
> But you must be able to emulate them, so you need an emulator, not XOL.

I feel I wasn't clear: emulation of ADDIUPC (on a second look, it is the
only instruction that requires special handling) is FIVE LINES OF CODE.
At least in MIPS R2 it would require 5 lines. In the MIPS R2 emulator I
have a routine (50 lines) which checks the BD-slot instruction for some
popular opcodes, emulates those, and leaves other opcodes to dsemul().

The same can be done for the FPU emulator.

> The problem is what to do with synchronous signals (SIGSEGV) Those
> cannot be held off, and you really want the EPC value saved in the
> register state to be correct.

A synchronous exception is not a problem: we know that emulation in the
VDSO (read: today, the stack) is running and can take care of it. We can
easily change EPC before we start delivering the signal and pretend that
the problem happened in the correct place.

The async signals seem to be something of a problem, at least until I
finish looking into the common Linux kernel code, I think.


On 10/07/2014 11:44 AM, Andy Lutomirski wrote:

> What happens if one of those out-of-line instructions causes a synchronous trap?

If we need to deliver that as a signal, then we change EPC to the proper
value from emulframe->epc. If we are doing a nested emulation, we
continue.

  > What if SIGSTOP arrives before ret?

I am looking into a way to delay asynchronous signals until the emulated
instruction has finished. Signals are not time-accurate and never have
been, so it is not a big deal to delay them.


  > What if another thread removes the magic ret sequence?

It can't do that in my approach: emulation is done in a write-protected
area, and it is done in per-thread memory space.

- Leonid.
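
The EPC fix-up described above for synchronous exceptions could look
roughly like the sketch below. It is only an illustration under stated
assumptions: struct xol_frame and both helpers are hypothetical
stand-ins for whatever the kernel records about an emulation in
progress (the thread only mentions emulframe->epc), and a real
implementation would presumably also have to get Cause.BD right.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one out-of-line emulation in progress. */
struct xol_frame {
	uint64_t base;		/* user address of the trampoline */
	uint64_t len;		/* trampoline length in bytes */
	uint64_t saved_epc;	/* address of the original branch */
};

/* True if pc lies inside the trampoline [base, base + len). */
static bool pc_in_frame(const struct xol_frame *f, uint64_t pc)
{
	return pc >= f->base && pc - f->base < f->len;
}

/* PC to report in the signal context for a fault taken at fault_pc. */
static uint64_t epc_for_signal(const struct xol_frame *f, uint64_t fault_pc)
{
	return pc_in_frame(f, fault_pc) ? f->saved_epc : fault_pc;
}
```

A fault inside the trampoline is then reported at the original branch,
while a fault anywhere else is reported unchanged.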
  
Leonid Yegoshin Oct. 7, 2014, 7:16 p.m. UTC | #39
On 10/07/2014 12:09 PM, Rich Felker wrote:
> I agree completely here. We should not break things (or, as it seems, 
> leave them broken) for common usage cases that affect everyone just to 
> coddle proprietary vendor-specific instructions. The latter just 
> should not be used in delay slots unless the chip vendor also promises 
> to provide fpu branch in hardware. Rich 
And what do you propose: remove the current in-stack emulation? And you 
still think that doesn't break the status quo?
  
Rich Felker Oct. 7, 2014, 7:21 p.m. UTC | #40
On Tue, Oct 07, 2014 at 12:16:59PM -0700, Leonid Yegoshin wrote:
> On 10/07/2014 12:09 PM, Rich Felker wrote:
> >I agree completely here. We should not break things (or, as it
> >seems, leave them broken) for common usage cases that affect
> >everyone just to coddle proprietary vendor-specific instructions.
> >The latter just should not be used in delay slots unless the chip
> >vendor also promises to provide fpu branch in hardware. Rich
> And what do you propose - remove a current in-stack emulation and
> you still think it doesn't break a status-quo?

The in-stack trampoline support could be left but used only for
emulating instructions the kernel doesn't know. This would make all
normal binaries immediately usable with non-executable stack, and
would avoid the only potential source of regressions. Ultimately I
think the "xol" stuff should be removed, but that could be a long term
goal.

Rich
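
The split proposed here can be written down as a small decision
function. This is a sketch only, not code from any patch in the thread:
the enum, the function and both flags are hypothetical names, and the
SIGILL fallback for a non-executable stack is an assumption taken from
the "fatal signal" suggestion made earlier in the discussion.

```c
#include <stdbool.h>

/* How a delay-slot instruction of an emulated branch gets handled. */
enum ds_action {
	DS_EMULATE,	/* kernel computes the result itself */
	DS_XOL_STACK,	/* legacy path: execute a copy on the user stack */
	DS_SIGILL,	/* cannot be handled safely */
};

static enum ds_action choose_ds_action(bool kernel_can_emulate,
				       bool stack_executable)
{
	if (kernel_can_emulate)
		return DS_EMULATE;	/* needs no executable stack */
	if (stack_executable)
		return DS_XOL_STACK;	/* unknown insn, old behaviour kept */
	return DS_SIGILL;		/* unknown insn, stack is not executable */
}
```

With this ordering, binaries that only use instructions the kernel knows
never touch the trampoline, so they can run with a non-executable stack,
and the legacy path stays available to anything that relied on it.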
  
Leonid Yegoshin Oct. 7, 2014, 7:27 p.m. UTC | #41
On 10/07/2014 12:21 PM, Rich Felker wrote:
> The in-stack trampoline support could be left but used only for 
> emulating instructions the kernel doesn't know. This would make all 
> normal binaries immediately usable with non-executable stack, and 
> would avoid the only potential source of regressions. Ultimately I 
> think the "xol" stuff should be removed, but that could be a long term 
> goal. 
Thank you, that is exactly what I am doing in the patch series named
"[PATCH 0/3] MIPS executable stack protection".
I just set up a special stack for that.
  
Andy Lutomirski Oct. 7, 2014, 7:28 p.m. UTC | #42
On Tue, Oct 7, 2014 at 12:21 PM, Rich Felker <dalias@libc.org> wrote:
> On Tue, Oct 07, 2014 at 12:16:59PM -0700, Leonid Yegoshin wrote:
>> On 10/07/2014 12:09 PM, Rich Felker wrote:
>> >I agree completely here. We should not break things (or, as it
>> >seems, leave them broken) for common usage cases that affect
>> >everyone just to coddle proprietary vendor-specific instructions.
>> >The latter just should not be used in delay slots unless the chip
>> >vendor also promises to provide fpu branch in hardware. Rich
>> And what do you propose - remove a current in-stack emulation and
>> you still think it doesn't break a status-quo?
>
> The in-stack trampoline support could be left but used only for
> emulating instructions the kernel doesn't know. This would make all
> normal binaries immediately usable with non-executable stack, and
> would avoid the only potential source of regressions. Ultimately I
> think the "xol" stuff should be removed, but that could be a long term
> goal.

Does anything break if the xol stuff is disabled for PT_GNU_STACK tasks?

>
> Rich
  
Matthew Fortune Oct. 7, 2014, 7:40 p.m. UTC | #43
> >
> > 4)  The voice for doing any instruction emulation in kernel - it is not a
> > MIPS business model to force customer to put details of all Coprocessor 2
> > instructions public. We provide an interface and the rest is a customer
> > business. Besides that it is really painful to make a differentiation
> > between Cavium Octeon and some another CPU instructions with the same
> > opcode. On other side, leaving emulation of their instructions to them is
> > not a wise after having some good way doing that multiple years.
> 
> IMO this is all backwards.  If MIPS customers put proprietary
> instructions into their ISA, they leave out the FPU, and they put a
> proprietary insn in a branch delay slot, then I think that they
> deserve a fatal signal.
> 
> There's a really easy solution for new systems: fix the toolchain.
> Teach the assembler to disallow any proprietary instructions in an FP
> branch delay slot.


I think I'd be mostly in favour of this from a toolchain perspective but
only from the perspective of FP branch instructions. This still leaves a
problem for normal branches should any of them get removed and need emulating.
The general form of bltzal and bgezal would be the example here of branches
which are removed in R6 (The special case of using $0 remains). This is
really niche but my point is more about how we would deal with such a thing
if it happened. The answer may be just to scream and shout and discourage the
removal of such instructions from the architecture.

Matthew
  
David Daney Oct. 7, 2014, 8:03 p.m. UTC | #44
On 10/07/2014 12:28 PM, Andy Lutomirski wrote:
> On Tue, Oct 7, 2014 at 12:21 PM, Rich Felker <dalias@libc.org> wrote:
>> On Tue, Oct 07, 2014 at 12:16:59PM -0700, Leonid Yegoshin wrote:
>>> On 10/07/2014 12:09 PM, Rich Felker wrote:
>>>> I agree completely here. We should not break things (or, as it
>>>> seems, leave them broken) for common usage cases that affect
>>>> everyone just to coddle proprietary vendor-specific instructions.
>>>> The latter just should not be used in delay slots unless the chip
>>>> vendor also promises to provide fpu branch in hardware. Rich
>>> And what do you propose - remove a current in-stack emulation and
>>> you still think it doesn't break a status-quo?
>>
>> The in-stack trampoline support could be left but used only for
>> emulating instructions the kernel doesn't know. This would make all
>> normal binaries immediately usable with non-executable stack, and
>> would avoid the only potential source of regressions. Ultimately I
>> think the "xol" stuff should be removed, but that could be a long term
>> goal.
>
> Does anything break if the xol stuff is disabled for PT_GNU_STACK tasks?
>

The instructions must be executed.  If you turn on a non-executable 
stack, you cannot execute them on the stack, so they must be handled in 
another way, which is the subject of this thread.

Options:

1a) XOL kernel manages the memory
1b) XOL userspace manages the memory
2) Emulate the instructions.
3) I don't think there is a 3rd. option.

As the imgtec people have said, you have to do #2 for their new r6 ISA, 
as it uses PC relative instructions.

I really think we should bite the bullet and do #2 for everything; it 
will be the cleanest long-term solution.

David Daney
  
Ralf Baechle Oct. 7, 2014, 11:20 p.m. UTC | #45
On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:

> >As an alternative, if the space of possible instruction with a delay
> >slot is sufficiently small, all such instructions could be mapped as
> >immutable code in a shared mapping, each at a fixed offset in the
> >mapping. I suspect this would be borderline-impractical (multiple
> >megabytes?), but it is the cleanest solution otherwise.
> >
> 
> Yes, there are 2^32 possible instructions.  Each one is 4 bytes, plus you
> need a way to exit after the instruction has executed, which would require
> another instruction.  So you would need 32GB of memory to hold all those
> instructions, larger than the 32-bit virtual address space.

Plus errata support for some older CPUs requires no other instructions
that might cause an exception to be present in the same cache line inflating
the size to 32 bytes per instruction.

I've contemplated a full emulation - but that would require an emulator that
is capable of most of the instruction set.  With all the random ASEs around
that would be hard to implement while the FPU emulator trampoline as currently
used has the advantage of automatically supporting ASEs, known and unknown.
So it's a huge bonus for maintenance.

  Ralf
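
The sizes in this exchange are easy to check. A minimal sketch, assuming
one table slot per 32-bit instruction encoding; xol_table_bytes and its
slot_bytes parameter are made up for illustration and do not correspond
to anything in the kernel:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for a table with one slot per 32-bit instruction
 * encoding. slot_bytes is the space reserved per encoding.
 */
static uint64_t xol_table_bytes(uint64_t slot_bytes)
{
	const uint64_t encodings = UINT64_C(1) << 32;	/* 2^32 encodings */

	/* Guard against overflow of the 64-bit product. */
	assert(slot_bytes <= UINT64_MAX / encodings);
	return encodings * slot_bytes;
}
```

With 8 bytes per slot (the instruction plus one exit instruction) this
gives 2^35 bytes, the 32GB quoted above. With one 32-byte cache line per
slot it gives 2^37 bytes, i.e. 128GB. Either figure is far beyond the
2^32 bytes a 32-bit address can reach.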
  
David Daney Oct. 7, 2014, 11:59 p.m. UTC | #46
On 10/07/2014 04:20 PM, Ralf Baechle wrote:
> On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
>
>>> As an alternative, if the space of possible instruction with a delay
>>> slot is sufficiently small, all such instructions could be mapped as
>>> immutable code in a shared mapping, each at a fixed offset in the
>>> mapping. I suspect this would be borderline-impractical (multiple
>>> megabytes?), but it is the cleanest solution otherwise.
>>>
>>
>> Yes, there are 2^32 possible instructions.  Each one is 4 bytes, plus you
>> need a way to exit after the instruction has executed, which would require
>> another instruction.  So you would need 32GB of memory to hold all those
>> instructions, larger than the 32-bit virtual address space.
>
> Plus errata support for some older CPUs requires no other instructions
> that might cause an exception to be present in the same cache line inflating
> the size to 32 bytes per instruction.
>
> I've contemplated a full emulation - but that would require an emulator that
> is capable of most of the instruction set.  With all the random ASEs around
> that would be hard to implement while the FPU emulator trampoline as currently
> used has the advantage of automatically supporting ASEs, known and unknown.
> So it's a huge bonus for maintenance.
>

Unfortunately it breaks when our friends at Imgtec introduce their PC 
relative instructions in mipsr6, so an emulator may be unavoidable.

David Daney
  
Chuck Ebbert Oct. 8, 2014, 12:18 a.m. UTC | #47
On Tue, 7 Oct 2014 16:59:03 -0700
David Daney <ddaney@caviumnetworks.com> wrote:

> On 10/07/2014 04:20 PM, Ralf Baechle wrote:
> > On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
> >
> >>> As an alternative, if the space of possible instruction with a delay
> >>> slot is sufficiently small, all such instructions could be mapped as
> >>> immutable code in a shared mapping, each at a fixed offset in the
> >>> mapping. I suspect this would be borderline-impractical (multiple
> >>> megabytes?), but it is the cleanest solution otherwise.
> >>>
> >>
> >> Yes, there are 2^32 possible instructions.  Each one is 4 bytes, plus you
> >> need a way to exit after the instruction has executed, which would require
> >> another instruction.  So you would need 32GB of memory to hold all those
> >> instructions, larger than the 32-bit virtual address space.
> >
> > Plus errata support for some older CPUs requires no other instructions
> > that might cause an exception to be present in the same cache line inflating
> > the size to 32 bytes per instruction.
> >
> > I've contemplated a full emulation - but that would require an emulator that
> > is capable of most of the instruction set.  With all the random ASEs around
> > that would be hard to implement while the FPU emulator trampoline as currently
> > used has the advantage of automatically supporting ASEs, known and unknown.
> > So it's a huge bonus for maintenance.
> >
> 
> Unfortunately it breaks when our friends at Imgtec introduce their PC 
> relative instructions in mipsr6, so an emulator may be unavoidable.
> 

The x86 kprobes code deals with executing relocated insns with
PC-relative offsets by adjusting the offset in a relocated instruction
before executing it.

See arch/x86/kernel/kprobes/core.c::__copy_instruction()
  
Andy Lutomirski Oct. 8, 2014, 12:22 a.m. UTC | #48
On Oct 7, 2014 1:03 PM, "David Daney" <ddaney.cavm@gmail.com> wrote:
>
> On 10/07/2014 12:28 PM, Andy Lutomirski wrote:
>>
>> On Tue, Oct 7, 2014 at 12:21 PM, Rich Felker <dalias@libc.org> wrote:
>>>
>>> On Tue, Oct 07, 2014 at 12:16:59PM -0700, Leonid Yegoshin wrote:
>>>>
>>>> On 10/07/2014 12:09 PM, Rich Felker wrote:
>>>>>
>>>>> I agree completely here. We should not break things (or, as it
>>>>> seems, leave them broken) for common usage cases that affect
>>>>> everyone just to coddle proprietary vendor-specific instructions.
>>>>> The latter just should not be used in delay slots unless the chip
>>>>> vendor also promises to provide fpu branch in hardware. Rich
>>>>
>>>> And what do you propose - remove a current in-stack emulation and
>>>> you still think it doesn't break a status-quo?
>>>
>>>
>>> The in-stack trampoline support could be left but used only for
>>> emulating instructions the kernel doesn't know. This would make all
>>> normal binaries immediately usable with non-executable stack, and
>>> would avoid the only potential source of regressions. Ultimately I
>>> think the "xol" stuff should be removed, but that could be a long term
>>> goal.
>>
>>
>> Does anything break if the xol stuff is disabled for PT_GNU_STACK tasks?
>>
>
> The instructions must be executed, if you turn on a non-executable stack, you cannot execute them on the stack, so they must be handled in another way, which is the subject of this thread.
>
> Options:
>
> 1a) XOL kernel manages the memory
> 1b) XOL userspace manages the memory
> 2) Emulate the instructions.
> 3) I don't think there is a 3rd. option.

4) SIGILL

5) single-step or use an HW breakpoint if available


But, yes, 3 seems reasonable.

--Andy
  
Rich Felker Oct. 8, 2014, 2:37 a.m. UTC | #49
On Tue, Oct 07, 2014 at 07:18:33PM -0500, Chuck Ebbert wrote:
> On Tue, 7 Oct 2014 16:59:03 -0700
> David Daney <ddaney@caviumnetworks.com> wrote:
> 
> > On 10/07/2014 04:20 PM, Ralf Baechle wrote:
> > > On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
> > >
> > >>> As an alternative, if the space of possible instruction with a delay
> > >>> slot is sufficiently small, all such instructions could be mapped as
> > >>> immutable code in a shared mapping, each at a fixed offset in the
> > >>> mapping. I suspect this would be borderline-impractical (multiple
> > >>> megabytes?), but it is the cleanest solution otherwise.
> > >>>
> > >>
> > >> Yes, there are 2^32 possible instructions.  Each one is 4 bytes, plus you
> > >> need a way to exit after the instruction has executed, which would require
> > >> another instruction.  So you would need 32GB of memory to hold all those
> > >> instructions, larger than the 32-bit virtual address space.
> > >
> > > Plus errata support for some older CPUs requires no other instructions
> > > that might cause an exception to be present in the same cache line inflating
> > > the size to 32 bytes per instruction.
> > >
> > > I've contemplated a full emulation - but that would require an emulator that
> > > is capable of most of the instruction set.  With all the random ASEs around
> > > that would be hard to implement while the FPU emulator trampoline as currently
> > > used has the advantage of automatically supporting ASEs, known and unknown.
> > > So it's a huge bonus for maintenance.
> > >
> > 
> > Unfortunately it breaks when our friends at Imgtec introduce their PC 
> > relative instructions in mipsr6, so an emulator may be unavoidable.
> > 
> 
> The x86 kprobes code deals with executing relocated insns with
> PC-relative offsets by adjusting the offset in a relocated instruction
> before executing it.
> 
> See arch/x86/kernel/kprobes/core.c::__copy_instruction()

This only works if you have an ISA that can represent the full range
of possible relative addresses. It does not work for MIPS where the
range is quite restricted by the fixed instruction size.

Rich
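
The range limit can be illustrated with a short check. This is a sketch
under assumptions, not a description of any real fix-up code: it models
a PC-relative instruction whose immediate is a signed word offset of
field_bits bits (expected to be between 2 and 32), and the function name
and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Can an instruction at orig_pc with byte offset orig_off be re-encoded
 * to reach the same target after being copied to new_pc?
 */
static bool pcrel_fixup_fits(uint64_t orig_pc, uint64_t new_pc,
			     int64_t orig_off, unsigned int field_bits)
{
	/* Absolute target the original instruction referred to. */
	uint64_t target = orig_pc + (uint64_t)orig_off;
	/* Byte offset needed from the relocated copy (two's complement). */
	int64_t new_off = (int64_t)(target - new_pc);
	/* Reach of a signed field_bits-wide word offset, in bytes. */
	int64_t max_off = ((INT64_C(1) << (field_bits - 1)) - 1) * 4;
	int64_t min_off = -(INT64_C(1) << (field_bits - 1)) * 4;

	if (new_off % 4 != 0)
		return false;	/* not expressible as a word offset */
	return new_off >= min_off && new_off <= max_off;
}
```

With a 19-bit field the reach is about 1MB either way, so a copy placed
a few hundred bytes from the original can be fixed up, but a copy on a
stack near 0x7fff0000 cannot reach a target near 0x400000. A 32-bit
field would cover that distance, which is the difference from the x86
case.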
  
Paul Burton Oct. 8, 2014, 10:31 a.m. UTC | #50
On Tue, Oct 07, 2014 at 04:59:03PM -0700, David Daney wrote:
> On 10/07/2014 04:20 PM, Ralf Baechle wrote:
> >On Mon, Oct 06, 2014 at 02:18:19PM -0700, David Daney wrote:
> >
> >>>As an alternative, if the space of possible instruction with a delay
> >>>slot is sufficiently small, all such instructions could be mapped as
> >>>immutable code in a shared mapping, each at a fixed offset in the
> >>>mapping. I suspect this would be borderline-impractical (multiple
> >>>megabytes?), but it is the cleanest solution otherwise.
> >>>
> >>
> >>Yes, there are 2^32 possible instructions.  Each one is 4 bytes, plus you
> >>need a way to exit after the instruction has executed, which would require
> >>another instruction.  So you would need 32GB of memory to hold all those
> >>instructions, larger than the 32-bit virtual address space.
> >
> >Plus errata support for some older CPUs requires that no other instruction
> >that might cause an exception be present in the same cache line, inflating
> >the size to 32 bytes per instruction.
> >
> >I've contemplated a full emulation - but that would require an emulator that
> >is capable of most of the instruction set.  With all the random ASEs around
> >that would be hard to implement while the FPU emulator trampoline as currently
> >used has the advantage of automatically supporting ASEs, known and unknown.
> >So it's a huge bonus for maintenance.
> >
> 
> Unfortunately, it breaks when our friends at Imgtec introduce their
> PC-relative instructions in mipsr6, so an emulator may be unavoidable.
> 
> David Daney

Just to note, this was also discussed when I submitted my much older
patch with a similar goal:

  http://patchwork.linux-mips.org/patch/6125/

...and the conclusion there also began converging towards full ISA
emulation (or at least, the subset of the ISA which userland can
execute):

  http://www.linux-mips.org/archives/linux-mips/2014-07/msg00034.html

For the record my preference is for emulation. It is in some ways more
work, but it's also much cleaner. Given that more instructions will need
to be emulated to run pre-R6 binaries on R6 systems anyway, the emulator
would only become increasingly useful.

Paul
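
The table-size figures quoted earlier in this message can be checked with a
few lines of C. This is only an arithmetic sketch; the helper name is
hypothetical, and the slot sizes are the ones stated in the thread (one
4-byte instruction plus one 4-byte exit instruction, or one 32-byte cache
line per instruction for the errata case).

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes needed for a table holding one slot of
 * `slot_bytes` bytes for every possible 32-bit instruction encoding.
 */
static uint64_t table_bytes(uint64_t slot_bytes)
{
	return ((uint64_t)1 << 32) * slot_bytes;
}
```

table_bytes(8) is 2^35 bytes (32 GiB) and table_bytes(32) is 2^37 bytes
(128 GiB); both exceed the 2^32 bytes (4 GiB) of a 32-bit virtual address
space.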
  

Patch

diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index 7de8658..20d47f6 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -26,6 +26,7 @@  struct thread_info {
 	struct exec_domain	*exec_domain;	/* execution domain */
 	unsigned long		flags;		/* low level flags */
 	unsigned long		tp_value;	/* thread pointer */
+	unsigned long		fpu_emul_xol;	/* FPU emul eXecute Out of Line VA */
 	__u32			cpu;		/* current CPU */
 	int			preempt_count;	/* 0 => preemptable, <0 => BUG */
 
@@ -46,6 +47,7 @@  struct thread_info {
 	.task		= &tsk,			\
 	.exec_domain	= &default_exec_domain, \
 	.flags		= _TIF_FIXADE,		\
+	.fpu_emul_xol	= ~0ul,			\
 	.cpu		= 0,			\
 	.preempt_count	= INIT_PREEMPT_COUNT,	\
 	.addr_limit	= KERNEL_DS,		\
diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index fdb4923..f1270ee 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -375,16 +375,17 @@ 
 #define __NR_seccomp			(__NR_Linux + 352)
 #define __NR_getrandom			(__NR_Linux + 353)
 #define __NR_memfd_create		(__NR_Linux + 354)
+#define __NR_set_fpuemul_xol_area	(__NR_Linux + 355)
 
 /*
  * Offset of the last Linux o32 flavoured syscall
  */
-#define __NR_Linux_syscalls		354
+#define __NR_Linux_syscalls		355
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
 
 #define __NR_O32_Linux			4000
-#define __NR_O32_Linux_syscalls		354
+#define __NR_O32_Linux_syscalls		355
 
 #if _MIPS_SIM == _MIPS_SIM_ABI64
 
@@ -707,16 +708,17 @@ 
 #define __NR_seccomp			(__NR_Linux + 312)
 #define __NR_getrandom			(__NR_Linux + 313)
 #define __NR_memfd_create		(__NR_Linux + 314)
+#define __NR_set_fpuemul_xol_area	(__NR_Linux + 315)
 
 /*
  * Offset of the last Linux 64-bit flavoured syscall
  */
-#define __NR_Linux_syscalls		314
+#define __NR_Linux_syscalls		315
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */
 
 #define __NR_64_Linux			5000
-#define __NR_64_Linux_syscalls		314
+#define __NR_64_Linux_syscalls		315
 
 #if _MIPS_SIM == _MIPS_SIM_NABI32
 
@@ -1043,15 +1045,16 @@ 
 #define __NR_seccomp			(__NR_Linux + 316)
 #define __NR_getrandom			(__NR_Linux + 317)
 #define __NR_memfd_create		(__NR_Linux + 318)
+#define __NR_set_fpuemul_xol_area	(__NR_Linux + 319)
 
 /*
  * Offset of the last N32 flavoured syscall
  */
-#define __NR_Linux_syscalls		318
+#define __NR_Linux_syscalls		319
 
 #endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */
 
 #define __NR_N32_Linux			6000
-#define __NR_N32_Linux_syscalls		318
+#define __NR_N32_Linux_syscalls		319
 
 #endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 636b074..6dde6bb 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -151,6 +151,7 @@  int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	if (clone_flags & CLONE_SETTLS)
 		ti->tp_value = regs->regs[7];
+	ti->fpu_emul_xol = ~0ul;
 
 	return 0;
 }
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 744cd10..8c19a39 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -579,3 +579,4 @@  EXPORT(sys_call_table)
 	PTR	sys_seccomp
 	PTR	sys_getrandom
 	PTR	sys_memfd_create
+	PTR	sys_set_fpuemul_xol_area	/* 4355 */
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index 002b1bc..0b9f72e 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -434,4 +434,5 @@  EXPORT(sys_call_table)
 	PTR	sys_seccomp
 	PTR	sys_getrandom
 	PTR	sys_memfd_create
+	PTR	sys_set_fpuemul_xol_area	/* 5315 */
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index ca6cbbe..48f1760 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -427,4 +427,5 @@  EXPORT(sysn32_call_table)
 	PTR	sys_seccomp
 	PTR	sys_getrandom
 	PTR	sys_memfd_create
+	PTR	sys_set_fpuemul_xol_area
 	.size	sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 9e10d11..60def68 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -564,4 +564,5 @@  EXPORT(sys32_call_table)
 	PTR	sys_seccomp
 	PTR	sys_getrandom
 	PTR	sys_memfd_create
+	PTR	sys_set_fpuemul_xol_area	/* 4355 */
 	.size	sys32_call_table,.-sys32_call_table
diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
index 4a4f9dd..5f9d9e8 100644
--- a/arch/mips/kernel/syscall.c
+++ b/arch/mips/kernel/syscall.c
@@ -96,6 +96,14 @@  SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
 	return 0;
 }
 
+SYSCALL_DEFINE1(set_fpuemul_xol_area, unsigned long, addr)
+{
+	struct thread_info *ti = task_thread_info(current);
+
+	ti->fpu_emul_xol = addr;
+	return 0;
+}
+
 static inline int mips_atomic_set(unsigned long addr, unsigned long new)
 {
 	unsigned long old, tmp;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 4f514f3..bf4ff61 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -34,6 +34,7 @@  struct emuframe {
 int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 {
 	extern asmlinkage void handle_dsemulret(void);
+	struct thread_info *ti = task_thread_info(current);
 	struct emuframe __user *fr;
 	int err;
 
@@ -64,10 +65,12 @@  int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 	 * branches, but gives us a cleaner interface to the exception
 	 * handler (single entry point).
 	 */
-
-	/* Ensure that the two instructions are in the same cache line */
-	fr = (struct emuframe __user *)
-		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+	if (ti->fpu_emul_xol != ~0ul)
+		fr = (struct emuframe __user *)ti->fpu_emul_xol;
+	else
+		/* Ensure that the two instructions are in the same cache line */
+		fr = (struct emuframe __user *)
+			((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
 
 	/* Verify that the stack pointer is not competely insane */
 	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))