Patchwork posix: Optimize Linux posix_spawn

login
register
mail settings
Submitter Adhemerval Zanella Netto
Date May 6, 2019, 7:31 p.m.
Message ID <20190506193119.20505-1-adhemerval.zanella@linaro.org>
Download mbox | patch
Permalink /patch/32573/
State New
Headers show

Comments

Adhemerval Zanella Netto - May 6, 2019, 7:31 p.m.
The current internal posix_spawn symbol for Linux (__spawni) requires
to allocate a dynamic stack based on input arguments to handle the
SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
as a shell script if execve call return ENOEXEC (to execute shell
scripts with an initial shebang).

This is done only for compatibility mode and the generic case does not
require the extra calculation plus the potential large mmap/munmap
call.  For default case, a pre-defined buffer is sufficed to use on the
clone call instead.

This patch optimizes Linux spawni by allocating a dynamic stack only
for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
a stack allocated buffer is used instead.

Checked x86_64-linux-gnu and i686-linux-gnu.

	* sysdeps/unix/sysv/linux/spawni.c (posix_spawn_args): Remove
	argc member.
	(maybe_script_execute): Remove function.
	(execve_compat, __spawni_clone, __spawnix_compat): New function.
	(__spawni_child): Remove maybe_script_execute call.
	(__spawnix): Remove magic stack slack constant with stack_slack
	identifier.
	(__spawni): Only allocates a variable stack when
	SPAWN_XFLAGS_TRY_SHELL is used.
---
 sysdeps/unix/sysv/linux/spawni.c | 205 ++++++++++++++++++-------------
 1 file changed, 118 insertions(+), 87 deletions(-)
Florian Weimer - May 7, 2019, 8:33 a.m.
* Adhemerval Zanella:

> The current internal posix_spawn symbol for Linux (__spawni) requires
> to allocate a dynamic stack based on input arguments to handle the
> SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
> as a shell script if execve call return ENOEXEC (to execute shell
> scripts with an initial shebang).
>
> This is done only for compatibility mode and the generic case does not
> require the extra calculation plus the potential large mmap/munmap
> call.  For default case, a pre-defined buffer is sufficed to use on the
> clone call instead.
>
> This patch optimizes Linux spawni by allocating a dynamic stack only
> for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
> a stack allocated buffer is used instead.

I'm still not convinced that stack switching is required here.

Thanks,
Florian
Adhemerval Zanella Netto - May 7, 2019, 1:24 p.m.
On 07/05/2019 05:33, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> The current internal posix_spawn symbol for Linux (__spawni) requires
>> to allocate a dynamic stack based on input arguments to handle the
>> SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
>> as a shell script if execve call return ENOEXEC (to execute shell
>> scripts with an initial shebang).
>>
>> This is done only for compatibility mode and the generic case does not
>> require the extra calculation plus the potential large mmap/munmap
>> call.  For default case, a pre-defined buffer is sufficed to use on the
>> clone call instead.
>>
>> This patch optimizes Linux spawni by allocating a dynamic stack only
>> for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
>> a stack allocated buffer is used instead.
> 
> I'm still not convinced that stack switching is required here.

- Performance-wise it would be just a stack increment (which tends to
  better than a mmap/unmap), 

- It also should work with any stack hardening (--enable-stack-protector=all
  shows no issue) and the stack uses uses the calling process stack as well
  (which should trigger overflow due guard pages).
 
  It also should issue the stackoverflow earlier since it allocated the
  stack prior helper process usage.

- We do not enable cancellation, neither its uses PLT resolution, so stack
  usage should be limited to around the functions usage (such as the execvpe
  provision).

- It does not require a arch-specific way to query the stack pointer 
  (__builtin_frame_address (0) should work, but it would require some testing
  to see it is safe on all supported architectures and at least for ia64 we
  would need to set up a artificial maximum size which might overflow depending of
  the stack usage on the call).

I am more inclined to last point, CLONE would be a function call and it will
itself require some stack allocation in most ABIs. To add a real-safe clone
call, we will need to add a variant of clone which uses the current stack
pointer on clone call itself, not as an argument. And I really think this
is overkill for this change, taking on consideration the little gain for this
specific usage and required work.
Florian Weimer - May 7, 2019, 3:38 p.m.
* Adhemerval Zanella:

> On 07/05/2019 05:33, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> The current internal posix_spawn symbol for Linux (__spawni) requires
>>> to allocate a dynamic stack based on input arguments to handle the
>>> SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
>>> as a shell script if execve call return ENOEXEC (to execute shell
>>> scripts with an initial shebang).
>>>
>>> This is done only for compatibility mode and the generic case does not
>>> require the extra calculation plus the potential large mmap/munmap
>>> call.  For default case, a pre-defined buffer is sufficed to use on the
>>> clone call instead.
>>>
>>> This patch optimizes Linux spawni by allocating a dynamic stack only
>>> for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
>>> a stack allocated buffer is used instead.
>> 
>> I'm still not convinced that stack switching is required here.
>
> - Performance-wise it would be just a stack increment (which tends to
>   better than a mmap/unmap), 
>
> - It also should work with any stack hardening (--enable-stack-protector=all
>   shows no issue) and the stack uses uses the calling process stack as well
>   (which should trigger overflow due guard pages).

No, since there is no guard page as part of the on-stack array, we do
not detect any overflow into other parts of the stack frame if it
happens to be stored below the frame.  So this still disables some of
the hardening checks.

>   It also should issue the stackoverflow earlier since it allocated the
>   stack prior helper process usage.

That is true, but we won't know if the region is large enough.

> - We do not enable cancellation, neither its uses PLT resolution, so stack
>   usage should be limited to around the functions usage (such as the execvpe
>   provision).

Hmm.  Right, we should not trigger lazy binding if we applied PLT
avoidance successfully.  So this is not a problem here.

> - It does not require a arch-specific way to query the stack pointer 
>   (__builtin_frame_address (0) should work, but it would require some testing
>   to see it is safe on all supported architectures and at least for ia64 we
>   would need to set up a artificial maximum size which might overflow depending of
>   the stack usage on the call).

It's also not clear to me how we could compute a safe stack adjustment
from C code.

> I am more inclined to last point, CLONE would be a function call and it will
> itself require some stack allocation in most ABIs. To add a real-safe clone
> call, we will need to add a variant of clone which uses the current stack
> pointer on clone call itself, not as an argument. And I really think this
> is overkill for this change, taking on consideration the little gain for this
> specific usage and required work.

I agree that this work should be undertaken as part of a clone_samestack
or similar implementation.

Thanks,
Florian
Adhemerval Zanella Netto - May 7, 2019, 5:12 p.m.
On 07/05/2019 12:38, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 07/05/2019 05:33, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> The current internal posix_spawn symbol for Linux (__spawni) requires
>>>> to allocate a dynamic stack based on input arguments to handle the
>>>> SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
>>>> as a shell script if execve call return ENOEXEC (to execute shell
>>>> scripts with an initial shebang).
>>>>
>>>> This is done only for compatibility mode and the generic case does not
>>>> require the extra calculation plus the potential large mmap/munmap
>>>> call.  For default case, a pre-defined buffer is sufficed to use on the
>>>> clone call instead.
>>>>
>>>> This patch optimizes Linux spawni by allocating a dynamic stack only
>>>> for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
>>>> a stack allocated buffer is used instead.
>>>
>>> I'm still not convinced that stack switching is required here.
>>
>> - Performance-wise it would be just a stack increment (which tends to
>>   better than a mmap/unmap), 
>>
>> - It also should work with any stack hardening (--enable-stack-protector=all
>>   shows no issue) and the stack uses uses the calling process stack as well
>>   (which should trigger overflow due guard pages).
> 
> No, since there is no guard page as part of the on-stack array, we do
> not detect any overflow into other parts of the stack frame if it
> happens to be stored below the frame.  So this still disables some of
> the hardening checks.

Using a stack allocated array is essentially the same as using the same
stack of the caller: stack overflow in created helper thread will trigger
a fault based on the guard page of the calling thread (since stack is
allocated object will be based on caller thread). I give you this is sketchy
standard-wise.

To fully provide stack hardening we will need to use current mmap mechanism
with a guard page as well, maybe by using the allocatestack code for
pthread. However I really doubtful that is really a net gain in this case,
since the semantic is meant that new process is still running in caller
memory context and stack overflows should be checked based on caller
context (i.e, stack overflow should be guarded by the caller guard page).

> 
>>   It also should issue the stackoverflow earlier since it allocated the
>>   stack prior helper process usage.
> 
> That is true, but we won't know if the region is large enough.

That's why I think we should rely on caller stack hardning and thread the
helper process as function call.  The dynamic path for compatibility mode
I still think it would be better served by mmap/unmap (maybe adding a guard
page as well).

> 
>> - We do not enable cancellation, neither its uses PLT resolution, so stack
>>   usage should be limited to around the functions usage (such as the execvpe
>>   provision).
> 
> Hmm.  Right, we should not trigger lazy binding if we applied PLT
> avoidance successfully.  So this is not a problem here.
> 
>> - It does not require a arch-specific way to query the stack pointer 
>>   (__builtin_frame_address (0) should work, but it would require some testing
>>   to see it is safe on all supported architectures and at least for ia64 we
>>   would need to set up a artificial maximum size which might overflow depending of
>>   the stack usage on the call).
> 
> It's also not clear to me how we could compute a safe stack adjustment
> from C code.

The slack code is indeed based on heuristic, as we are currently doing while
adjusting the mmap size value.

> 
>> I am more inclined to last point, CLONE would be a function call and it will
>> itself require some stack allocation in most ABIs. To add a real-safe clone
>> call, we will need to add a variant of clone which uses the current stack
>> pointer on clone call itself, not as an argument. And I really think this
>> is overkill for this change, taking on consideration the little gain for this
>> specific usage and required work.
> 
> I agree that this work should be undertaken as part of a clone_samestack
> or similar implementation.

But do you think it should be a blocker for this change?
Florian Weimer - May 8, 2019, 1:12 p.m.
* Adhemerval Zanella:

> On 07/05/2019 12:38, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> On 07/05/2019 05:33, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>> The current internal posix_spawn symbol for Linux (__spawni) requires
>>>>> to allocate a dynamic stack based on input arguments to handle the
>>>>> SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
>>>>> as a shell script if execve call return ENOEXEC (to execute shell
>>>>> scripts with an initial shebang).
>>>>>
>>>>> This is done only for compatibility mode and the generic case does not
>>>>> require the extra calculation plus the potential large mmap/munmap
>>>>> call.  For default case, a pre-defined buffer is sufficed to use on the
>>>>> clone call instead.
>>>>>
>>>>> This patch optimizes Linux spawni by allocating a dynamic stack only
>>>>> for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
>>>>> a stack allocated buffer is used instead.
>>>>
>>>> I'm still not convinced that stack switching is required here.
>>>
>>> - Performance-wise it would be just a stack increment (which tends to
>>>   better than a mmap/unmap), 
>>>
>>> - It also should work with any stack hardening (--enable-stack-protector=all
>>>   shows no issue) and the stack uses uses the calling process stack as well
>>>   (which should trigger overflow due guard pages).
>> 
>> No, since there is no guard page as part of the on-stack array, we do
>> not detect any overflow into other parts of the stack frame if it
>> happens to be stored below the frame.  So this still disables some of
>> the hardening checks.
>
> Using a stack allocated array is essentially the same as using the same
> stack of the caller: stack overflow in created helper thread will trigger
> a fault based on the guard page of the calling thread (since stack is
> allocated object will be based on caller thread). I give you this is sketchy
> standard-wise.

The difference is this.  f1 is the outer function (that prepares the new
stack, f2 is the function which runs with a switch stacked.

With a stack reuse in an assembler function, it looks like this:

   +-----------------------------
   | Activation record for f1
   |
   |
   +-----------------------------
   | Activation record for f2
   |
   :
   +=== Guard ===================

So it just looks like an ordinary function call.  Stack overflow in f2
will hit the guard (if built with -fstack-clash-protection).

With an on-stack array for the stack, it looks like this:

   +-----------------------------
   | Activation record for f1
   |
   | [ Array
   | [ +-----------------------------
   | [ | Activation record for f2
   | [ |
   | [ |
   |
   | (*)
   |
   :
   +=== Guard ===================


The problem is that technically, the activation record of f1 continues
after the array, indicated by (*), and we won't know what the compiler
will store there.  This means that even if f2 is built with
-fstack-clash-protection, stack overflow detection is unreliable because
the overflow into f1's activation record does not hit a guard page.

Sorry if this is totally obvious.  If it is and you still maintain that
both cases are equivalent, then perhaps I'm confused.

Thanks,
Florian
Adhemerval Zanella Netto - May 8, 2019, 3:10 p.m.
On 08/05/2019 10:12, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 07/05/2019 12:38, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> On 07/05/2019 05:33, Florian Weimer wrote:
>>>>> * Adhemerval Zanella:
>>>>>
>>>>>> The current internal posix_spawn symbol for Linux (__spawni) requires
>>>>>> to allocate a dynamic stack based on input arguments to handle the
>>>>>> SPAWN_XFLAGS_USE_PATH internal flag, which re-issue the input binary
>>>>>> as a shell script if execve call return ENOEXEC (to execute shell
>>>>>> scripts with an initial shebang).
>>>>>>
>>>>>> This is done only for compatibility mode and the generic case does not
>>>>>> require the extra calculation plus the potential large mmap/munmap
>>>>>> call.  For default case, a pre-defined buffer is sufficed to use on the
>>>>>> clone call instead.
>>>>>>
>>>>>> This patch optimizes Linux spawni by allocating a dynamic stack only
>>>>>> for compatibility symbol (SPAWN_XFLAGS_USE_PATH).  For generic case,
>>>>>> a stack allocated buffer is used instead.
>>>>>
>>>>> I'm still not convinced that stack switching is required here.
>>>>
>>>> - Performance-wise it would be just a stack increment (which tends to
>>>>   better than a mmap/unmap), 
>>>>
>>>> - It also should work with any stack hardening (--enable-stack-protector=all
>>>>   shows no issue) and the stack uses uses the calling process stack as well
>>>>   (which should trigger overflow due guard pages).
>>>
>>> No, since there is no guard page as part of the on-stack array, we do
>>> not detect any overflow into other parts of the stack frame if it
>>> happens to be stored below the frame.  So this still disables some of
>>> the hardening checks.
>>
>> Using a stack allocated array is essentially the same as using the same
>> stack of the caller: stack overflow in created helper thread will trigger
>> a fault based on the guard page of the calling thread (since stack is
>> allocated object will be based on caller thread). I give you this is sketchy
>> standard-wise.
> 
> The difference is this.  f1 is the outer function (that prepares the new
> stack, f2 is the function which runs with a switch stacked.
> 
> With a stack reuse in an assembler function, it looks like this:
> 
>    +-----------------------------
>    | Activation record for f1
>    |
>    |
>    +-----------------------------
>    | Activation record for f2
>    |
>    :
>    +=== Guard ===================
> 
> So it just looks like an ordinary function call.  Stack overflow in f2
> will hit the guard (if built with -fstack-clash-protection).
> 
> With an on-stack array for the stack, it looks like this:
> 
>    +-----------------------------
>    | Activation record for f1
>    |
>    | [ Array
>    | [ +-----------------------------
>    | [ | Activation record for f2
>    | [ |
>    | [ |
>    |
>    | (*)
>    |
>    :
>    +=== Guard ===================
> 
> 
> The problem is that technically, the activation record of f1 continues
> after the array, indicated by (*), and we won't know what the compiler
> will store there.  This means that even if f2 is built with
> -fstack-clash-protection, stack overflow detection is unreliable because
> the overflow into f1's activation record does not hit a guard page.

Thanks for explanation, but I still failing to see exactly why even with 
-fstack-clash-protection f2 probing will fail to hit the guard page.

The f2 probing will query page size based on stack address based on f1 stack
address, so the action record (which I assume you meant to be the prologue 
probe based on stack allocation added by compiler) a at f2 will essentially 
probe f1 stack memory.  Following your diagram, a potentially stack overflow 
on f2 would do:

   +-----------------------------
   | Activation record for f1
   |
   | [ Array
   | [ +-----------------------------
   | [ | Activation record for f2
   | [ |
   | [ |
   |
   | (f1act)
   |
   :
   +=== Guard ===================
   : (f2act)

My understanding is even -fstack-protector would work as well, since a
buffer overrun on f2 will still overwrite f1 canary. I am not seeing why
this specific *clone* usage is different than a function call and if it 
is indeed it might be troublesome since it s a potentially vector of 
attack.

> 
> Sorry if this is totally obvious.  If it is and you still maintain that
> both cases are equivalent, then perhaps I'm confused.

It is not a simple topic and you are probably more involved in recently
compiler hardening for stack-clash than me. Thanks for your patience though.

> 
> Thanks,
> Florian
>
Florian Weimer - May 8, 2019, 3:26 p.m.
* Adhemerval Zanella:

>> With an on-stack array for the stack, it looks like this:
>> 
>>    +-----------------------------
>>    | Activation record for f1
>>    |
>>    | [ Array
>>    | [ +-----------------------------
>>    | [ | Activation record for f2
>>    | [ |
>>    | [ |
>>    |
>>    | (*)
>>    |
>>    :
>>    +=== Guard ===================
>> 
>> 
>> The problem is that technically, the activation record of f1 continues
>> after the array, indicated by (*), and we won't know what the compiler
>> will store there.  This means that even if f2 is built with
>> -fstack-clash-protection, stack overflow detection is unreliable because
>> the overflow into f1's activation record does not hit a guard page.
>
> Thanks for explanation, but I still failing to see exactly why even with 
> -fstack-clash-protection f2 probing will fail to hit the guard page.
>
> The f2 probing will query page size based on stack address based on f1 stack
> address, so the action record (which I assume you meant to be the prologue 
> probe based on stack allocation added by compiler) a at f2 will essentially 
> probe f1 stack memory.  Following your diagram, a potentially stack overflow 
> on f2 would do:
>
>    +-----------------------------
>    | Activation record for f1
>    |
>    | [ Array
>    | [ +-----------------------------
>    | [ | Activation record for f2
>    | [ |
>    | [ |
>    |
>    | (f1act)
>    |
>    :
>    +=== Guard ===================
>    : (f2act)
>
> My understanding is even -fstack-protector would work as well, since a
> buffer overrun on f2 will still overwrite f1 canary. I am not seeing why
> this specific *clone* usage is different than a function call and if it 
> is indeed it might be troublesome since it s a potentially vector of 
> attack.

The problem is not the case where f2 actually hits the guard page.  You
are correct that this is detected as before.

What is not detected is a partial overflow that hits the activation
record of f1 *outside the array*, but stops short of extending to the
guard page.  It's impossible to tell what will happen in this case
because we do not know the frame layout that GCC has produced for f1.

The issue can only be avoided reliably by writing f1 in assembler, I
think.  INTERNAL_SYSCALL is probably not safe for vfork.

Thanks,
Florian
Adhemerval Zanella Netto - May 8, 2019, 5:46 p.m.
On 08/05/2019 12:26, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>>> With an on-stack array for the stack, it looks like this:
>>>
>>>    +-----------------------------
>>>    | Activation record for f1
>>>    |
>>>    | [ Array
>>>    | [ +-----------------------------
>>>    | [ | Activation record for f2
>>>    | [ |
>>>    | [ |
>>>    |
>>>    | (*)
>>>    |
>>>    :
>>>    +=== Guard ===================
>>>
>>>
>>> The problem is that technically, the activation record of f1 continues
>>> after the array, indicated by (*), and we won't know what the compiler
>>> will store there.  This means that even if f2 is built with
>>> -fstack-clash-protection, stack overflow detection is unreliable because
>>> the overflow into f1's activation record does not hit a guard page.
>>
>> Thanks for explanation, but I still failing to see exactly why even with 
>> -fstack-clash-protection f2 probing will fail to hit the guard page.
>>
>> The f2 probing will query page size based on stack address based on f1 stack
>> address, so the action record (which I assume you meant to be the prologue 
>> probe based on stack allocation added by compiler) a at f2 will essentially 
>> probe f1 stack memory.  Following your diagram, a potentially stack overflow 
>> on f2 would do:
>>
>>    +-----------------------------
>>    | Activation record for f1
>>    |
>>    | [ Array
>>    | [ +-----------------------------
>>    | [ | Activation record for f2
>>    | [ |
>>    | [ |
>>    |
>>    | (f1act)
>>    |
>>    :
>>    +=== Guard ===================
>>    : (f2act)
>>
>> My understanding is even -fstack-protector would work as well, since a
>> buffer overrun on f2 will still overwrite f1 canary. I am not seeing why
>> this specific *clone* usage is different than a function call and if it 
>> is indeed it might be troublesome since it s a potentially vector of 
>> attack.
> 
> The problem is not the case where f2 actually hits the guard page.  You
> are correct that this is detected as before.
> 
> What is not detected is a partial overflow that hits the activation
> record of f1 *outside the array*, but stops short of extending to the
> guard page.  It's impossible to tell what will happen in this case
> because we do not know the frame layout that GCC has produced for f1.

But would in this case protected as well with -fstack-protector, where
f2 would overwrite f1 canary?

> 
> The issue can only be avoided reliably by writing f1 in assembler, I
> think.  INTERNAL_SYSCALL is probably not safe for vfork.

In any case, I still think the specific scenario is quite an overkill 
one to block this optimization. Specially because the possible adjustments
is somewhat infeasible (rewrite the helper spawn function in assembly,
or the impossibility to use INTERNAL_SYSCALL which may itself cause a
stack allocation) or add a arch-specific vfork variant to handle this
specific case.
Florian Weimer - May 8, 2019, 6:03 p.m.
* Adhemerval Zanella:

> On 08/05/2019 12:26, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>>> With an on-stack array for the stack, it looks like this:
>>>>
>>>>    +-----------------------------
>>>>    | Activation record for f1
>>>>    |
>>>>    | [ Array
>>>>    | [ +-----------------------------
>>>>    | [ | Activation record for f2
>>>>    | [ |
>>>>    | [ |
>>>>    |
>>>>    | (*)
>>>>    |
>>>>    :
>>>>    +=== Guard ===================
>>>>
>>>>
>>>> The problem is that technically, the activation record of f1 continues
>>>> after the array, indicated by (*), and we won't know what the compiler
>>>> will store there.  This means that even if f2 is built with
>>>> -fstack-clash-protection, stack overflow detection is unreliable because
>>>> the overflow into f1's activation record does not hit a guard page.
>>>
>>> Thanks for explanation, but I still failing to see exactly why even with 
>>> -fstack-clash-protection f2 probing will fail to hit the guard page.
>>>
>>> The f2 probing will query page size based on stack address based on f1 stack
>>> address, so the action record (which I assume you meant to be the prologue 
>>> probe based on stack allocation added by compiler) a at f2 will essentially 
>>> probe f1 stack memory.  Following your diagram, a potentially stack overflow 
>>> on f2 would do:
>>>
>>>    +-----------------------------
>>>    | Activation record for f1
>>>    |
>>>    | [ Array
>>>    | [ +-----------------------------
>>>    | [ | Activation record for f2
>>>    | [ |
>>>    | [ |
>>>    |
>>>    | (f1act)
>>>    |
>>>    :
>>>    +=== Guard ===================
>>>    : (f2act)
>>>
>>> My understanding is even -fstack-protector would work as well, since a
>>> buffer overrun on f2 will still overwrite f1 canary. I am not seeing why
>>> this specific *clone* usage is different than a function call and if it 
>>> is indeed it might be troublesome since it s a potentially vector of 
>>> attack.
>> 
>> The problem is not the case where f2 actually hits the guard page.  You
>> are correct that this is detected as before.
>> 
>> What is not detected is a partial overflow that hits the activation
>> record of f1 *outside the array*, but stops short of extending to the
>> guard page.  It's impossible to tell what will happen in this case
>> because we do not know the frame layout that GCC has produced for f1.
>
> But would in this case protected as well with -fstack-protector, where
> f2 would overwrite f1 canary?

No, the canary is at the top of the activation record, and it would not
be checked upon return from f2 anyway.

>> The issue can only be avoided reliably by writing f1 in assembler, I
>> think.  INTERNAL_SYSCALL is probably not safe for vfork.
>
> In any case, I still think the specific scenario is quite an overkill 
> one to block this optimization. Specially because the possible adjustments
> is somewhat infeasible (rewrite the helper spawn function in assembly,
> or the impossibility to use INTERNAL_SYSCALL which may itself cause a
> stack allocation) or add a arch-specific vfork variant to handle this
> specific case.

I do not have a strong opinion regarding this matter.  I'd still prefer
using vfork here.  I really don't understand your reluctance in this
matter.

On the other hand, an argument could be made to keep the mmap allocation
and use MAP_SHARED, so that it's possible to communicate the correct
error for vfork-as-fork environments.

Thanks,
Florian
Adhemerval Zanella Netto - May 8, 2019, 6:37 p.m.
On 08/05/2019 15:03, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 08/05/2019 12:26, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>>> With an on-stack array for the stack, it looks like this:
>>>>>
>>>>>    +-----------------------------
>>>>>    | Activation record for f1
>>>>>    |
>>>>>    | [ Array
>>>>>    | [ +-----------------------------
>>>>>    | [ | Activation record for f2
>>>>>    | [ |
>>>>>    | [ |
>>>>>    |
>>>>>    | (*)
>>>>>    |
>>>>>    :
>>>>>    +=== Guard ===================
>>>>>
>>>>>
>>>>> The problem is that technically, the activation record of f1 continues
>>>>> after the array, indicated by (*), and we won't know what the compiler
>>>>> will store there.  This means that even if f2 is built with
>>>>> -fstack-clash-protection, stack overflow detection is unreliable because
>>>>> the overflow into f1's activation record does not hit a guard page.
>>>>
>>>> Thanks for explanation, but I still failing to see exactly why even with 
>>>> -fstack-clash-protection f2 probing will fail to hit the guard page.
>>>>
>>>> The f2 probing will query page size based on stack address based on f1 stack
>>>> address, so the action record (which I assume you meant to be the prologue 
>>>> probe based on stack allocation added by compiler) a at f2 will essentially 
>>>> probe f1 stack memory.  Following your diagram, a potentially stack overflow 
>>>> on f2 would do:
>>>>
>>>>    +-----------------------------
>>>>    | Activation record for f1
>>>>    |
>>>>    | [ Array
>>>>    | [ +-----------------------------
>>>>    | [ | Activation record for f2
>>>>    | [ |
>>>>    | [ |
>>>>    |
>>>>    | (f1act)
>>>>    |
>>>>    :
>>>>    +=== Guard ===================
>>>>    : (f2act)
>>>>
>>>> My understanding is even -fstack-protector would work as well, since a
>>>> buffer overrun on f2 will still overwrite f1 canary. I am not seeing why
>>>> this specific *clone* usage is different than a function call and if it 
>>>> is indeed it might be troublesome since it s a potentially vector of 
>>>> attack.
>>>
>>> The problem is not the case where f2 actually hits the guard page.  You
>>> are correct that this is detected as before.
>>>
>>> What is not detected is a partial overflow that hits the activation
>>> record of f1 *outside the array*, but stops short of extending to the
>>> guard page.  It's impossible to tell what will happen in this case
>>> because we do not know the frame layout that GCC has produced for f1.
>>
>> But would in this case protected as well with -fstack-protector, where
>> f2 would overwrite f1 canary?
> 
> No, the canary is at the top of the activation record, and it would not
> be checked upon return from f2 anyway.

But f2 canary would be checked anyway (if it is also built with
stack hardening). I give you that it still might be an issue for 
a vfork interface which might issue user provided callbacks, but I
am not sure the really net gain for this specific scenario.

> 
>>> The issue can only be avoided reliably by writing f1 in assembler, I
>>> think.  INTERNAL_SYSCALL is probably not safe for vfork.
>>
>> In any case, I still think the specific scenario is quite an overkill 
>> one to block this optimization. Specially because the possible adjustments
>> is somewhat infeasible (rewrite the helper spawn function in assembly,
>> or the impossibility to use INTERNAL_SYSCALL which may itself cause a
>> stack allocation) or add a arch-specific vfork variant to handle this
>> specific case.
> 
> I do not have a strong opinion regarding this matter.  I'd still prefer
> using vfork here.  I really don't understand your reluctance in this
> matter.

Besides the optimization, I am not really convinced that the security gain
is really expressive in this case, the current scenario that invokes
mmap is overkill for specific usage, since there no guard page current
scenario is not really better security-wise, and the other scenarios
you are suggestion adds even more complexity.

> 
> On the other hand, an argument could be made to keep the mmap allocation
> and use MAP_SHARED, so that it's possible to communicate the correct
> error for vfork-as-fork environments.

If the idea is to remove CLONE_VM and use MAP_SHARED instead, it would
better to just use fork (which has it own scalability issues).
Adhemerval Zanella Netto - May 8, 2019, 9:37 p.m.
On 08/05/2019 15:37, Adhemerval Zanella wrote:
> 
> 
> On 08/05/2019 15:03, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> On 08/05/2019 12:26, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>>> With an on-stack array for the stack, it looks like this:
>>>>>>
>>>>>>    +-----------------------------
>>>>>>    | Activation record for f1
>>>>>>    |
>>>>>>    | [ Array
>>>>>>    | [ +-----------------------------
>>>>>>    | [ | Activation record for f2
>>>>>>    | [ |
>>>>>>    | [ |
>>>>>>    |
>>>>>>    | (*)
>>>>>>    |
>>>>>>    :
>>>>>>    +=== Guard ===================
>>>>>>
>>>>>>
>>>>>> The problem is that technically, the activation record of f1 continues
>>>>>> after the array, indicated by (*), and we won't know what the compiler
>>>>>> will store there.  This means that even if f2 is built with
>>>>>> -fstack-clash-protection, stack overflow detection is unreliable because
>>>>>> the overflow into f1's activation record does not hit a guard page.
>>>>>
>>>>> Thanks for explanation, but I still failing to see exactly why even with 
>>>>> -fstack-clash-protection f2 probing will fail to hit the guard page.
>>>>>
>>>>> The f2 probing will query page size based on stack address based on f1 stack
>>>>> address, so the action record (which I assume you meant to be the prologue 
>>>>> probe based on stack allocation added by compiler) a at f2 will essentially 
>>>>> probe f1 stack memory.  Following your diagram, a potentially stack overflow 
>>>>> on f2 would do:
>>>>>
>>>>>    +-----------------------------
>>>>>    | Activation record for f1
>>>>>    |
>>>>>    | [ Array
>>>>>    | [ +-----------------------------
>>>>>    | [ | Activation record for f2
>>>>>    | [ |
>>>>>    | [ |
>>>>>    |
>>>>>    | (f1act)
>>>>>    |
>>>>>    :
>>>>>    +=== Guard ===================
>>>>>    : (f2act)
>>>>>
>>>>> My understanding is even -fstack-protector would work as well, since a
>>>>> buffer overrun on f2 will still overwrite f1 canary. I am not seeing why
>>>>> this specific *clone* usage is different than a function call and if it 
>>>>> is indeed it might be troublesome since it s a potentially vector of 
>>>>> attack.
>>>>
>>>> The problem is not the case where f2 actually hits the guard page.  You
>>>> are correct that this is detected as before.
>>>>
>>>> What is not detected is a partial overflow that hits the activation
>>>> record of f1 *outside the array*, but stops short of extending to the
>>>> guard page.  It's impossible to tell what will happen in this case
>>>> because we do not know the frame layout that GCC has produced for f1.
>>>
>>> But would in this case protected as well with -fstack-protector, where
>>> f2 would overwrite f1 canary?
>>
>> No, the canary is at the top of the activation record, and it would not
>> be checked upon return from f2 anyway.
> 
> But f2 canary would be checked anyway (if it is also built with
> stack hardening). I give you that it still might be an issue for 
> a vfork interface which might issue user provided callbacks, but I
> am not sure the really net gain for this specific scenario.
> 
>>
>>>> The issue can only be avoided reliably by writing f1 in assembler, I
>>>> think.  INTERNAL_SYSCALL is probably not safe for vfork.
>>>
>>> In any case, I still think the specific scenario is quite an overkill 
>>> one to block this optimization. Specially because the possible adjustments
>>> is somewhat infeasible (rewrite the helper spawn function in assembly,
>>> or the impossibility to use INTERNAL_SYSCALL which may itself cause a
>>> stack allocation) or add a arch-specific vfork variant to handle this
>>> specific case.
>>
>> I do not have a strong opinion regarding this matter.  I'd still prefer
>> using vfork here.  I really don't understand your reluctance in this
>> matter.
> 
> Besides the optimization, I am not really convinced that the security gain
> is really expressive in this case, the current scenario that invokes
> mmap is overkill for specific usage, since there no guard page current
> scenario is not really better security-wise, and the other scenarios
> you are suggestion adds even more complexity.

Also, for current case on posix_spawn it is a fixed code path with fixed
stack usage in help process, so assuming a large enough stack buffer
(which I will double check with -fstack-usage) it would never overflow.

It also does not prevent to adapt to the vfork-like interface you are 
suggesting once it is implemented (which does seems appealing after
this discussion).

Patch

diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index c1abf3f960..822cce2a77 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -74,6 +74,11 @@ 
 # define STACK(__stack, __stack_size) (__stack + __stack_size)
 #endif
 
+/* Additional stack size added on required space to execute the clone child
+   function (__spawni_child).  */
+enum {
+  stack_slack = 512
+};
 
 struct posix_spawn_args
 {
@@ -83,35 +88,39 @@  struct posix_spawn_args
   const posix_spawn_file_actions_t *fa;
   const posix_spawnattr_t *restrict attr;
   char *const *argv;
-  ptrdiff_t argc;
   char *const *envp;
   int xflags;
   int err;
 };
 
-/* Older version requires that shell script without shebang definition
-   to be called explicitly using /bin/sh (_PATH_BSHELL).  */
-static void
-maybe_script_execute (struct posix_spawn_args *args)
+/* This is compatibility function required to enable posix_spawn run
+   script without shebang definition for older posix_spawn versions
+   (2.15).  */
+static int
+execve_compat (const char *filename, char *const argv[], char *const envp[])
 {
-  if (SHLIB_COMPAT (libc, GLIBC_2_2, GLIBC_2_15)
-      && (args->xflags & SPAWN_XFLAGS_TRY_SHELL) && errno == ENOEXEC)
+  __execve (filename, argv, envp);
+
+  if (errno == ENOEXEC)
     {
-      char *const *argv = args->argv;
-      ptrdiff_t argc = args->argc;
+      char *const *cargv = argv;
+      ptrdiff_t argc = 0;
+      while (cargv[argc++] != NULL);
 
       /* Construct an argument list for the shell.  */
       char *new_argv[argc + 2];
       new_argv[0] = (char *) _PATH_BSHELL;
-      new_argv[1] = (char *) args->file;
+      new_argv[1] = (char *) filename;
       if (argc > 1)
 	memcpy (new_argv + 2, argv + 1, argc * sizeof (char *));
       else
 	new_argv[2] = NULL;
 
       /* Execute the shell.  */
-      args->exec (new_argv[0], new_argv, args->envp);
+      __execve (new_argv[0], new_argv, envp);
     }
+
+  return -1;
 }
 
 /* Function used in the clone call to setup the signals mask, posix_spawn
@@ -291,11 +300,6 @@  __spawni_child (void *arguments)
 
   args->exec (args->file, args->argv, args->envp);
 
-  /* This is compatibility function required to enable posix_spawn run
-     script without shebang definition for older posix_spawn versions
-     (2.15).  */
-  maybe_script_execute (args);
-
 fail:
   /* errno should have an appropriate non-zero value; otherwise,
      there's a bug in glibc or the kernel.  For lack of an error code
@@ -306,18 +310,61 @@  fail:
   _exit (SPAWN_ERROR);
 }
 
-/* Spawn a new process executing PATH with the attributes describes in *ATTRP.
-   Before running the process perform the actions described in FILE-ACTIONS. */
 static int
-__spawnix (pid_t * pid, const char *file,
-	   const posix_spawn_file_actions_t * file_actions,
-	   const posix_spawnattr_t * attrp, char *const argv[],
-	   char *const envp[], int xflags,
-	   int (*exec) (const char *, char *const *, char *const *))
+__spawni_clone (struct posix_spawn_args *args, void *stack, size_t stack_size,
+		pid_t *pid)
 {
-  pid_t new_pid;
-  struct posix_spawn_args args;
   int ec;
+  pid_t new_pid;
+
+  /* The clone flags used will create a new child that will run in the same
+     memory space (CLONE_VM) and the execution of calling thread will be
+     suspend until the child calls execve or _exit.
+
+     Also since the calling thread execution will be suspend, there is not
+     need for CLONE_SETTLS.  Although parent and child share the same TLS
+     namespace, there will be no concurrent access for TLS variables (errno
+     for instance).  */
+  new_pid = CLONE (__spawni_child, STACK (stack, stack_size), stack_size,
+		   CLONE_VM | CLONE_VFORK | SIGCHLD, args);
+
+  /* It needs to collect the case where the auxiliary process was created
+     but failed to execute the file (due either any preparation step or
+     for execve itself).  */
+  if (new_pid > 0)
+    {
+      /* Also, it handles the unlikely case where the auxiliary process was
+	 terminated before calling execve as if it was successfully.  The
+	 args.err is set to 0 as default and changed to a positive value
+	 only in case of failure, so in case of premature termination
+	 due a signal args.err will remain zeroed and it will be up to
+	 caller to actually collect it.  */
+      ec = args->err;
+      if (ec > 0)
+	/* There still an unlikely case where the child is cancelled after
+	   setting args.err, due to a positive error value.  Also there is
+	   possible pid reuse race (where the kernel allocated the same pid
+	   to an unrelated process).  Unfortunately due synchronization
+	   issues where the kernel might not have the process collected
+	   the waitpid below can not use WNOHANG.  */
+	__waitpid (new_pid, NULL, 0);
+    }
+  else
+    ec = -new_pid;
+
+  if ((ec == 0) && (pid != NULL))
+    *pid = new_pid;
+
+  return ec;
+}
+
+/* Allocates a stack using mmap to call clone.  The stack size is based on
+   number of arguments since it would be used on compat mode which may call
+   execvpe/execve_compat.  */
+static int
+__spawnix_compat (struct posix_spawn_args *args, pid_t *pid)
+{
+  char *const *argv = args->argv;
 
   /* To avoid imposing hard limits on posix_spawn{p} the total number of
      arguments is first calculated to allocate a mmap to hold all possible
@@ -340,7 +387,8 @@  __spawnix (pid_t * pid, const char *file,
 	     | ((GL (dl_stack_flags) & PF_X) ? PROT_EXEC : 0));
 
   /* Add a slack area for child's stack.  */
-  size_t argv_size = (argc * sizeof (void *)) + 512;
+  size_t argv_size = (argc * sizeof (void *)) + stack_slack;
+
   /* We need at least a few pages in case the compiler's stack checking is
      enabled.  In some configs, it is known to use at least 24KiB.  We use
      32KiB to be "safe" from anything the compiler might do.  Besides, the
@@ -352,64 +400,61 @@  __spawnix (pid_t * pid, const char *file,
   if (__glibc_unlikely (stack == MAP_FAILED))
     return errno;
 
+  int ec = __spawni_clone (args, stack, stack_size, pid);
+
+  __munmap (stack, stack_size);
+
+  return ec;
+}
+
+/* Spawn a new process executing PATH with the attributes describes in *ATTRP.
+   Before running the process perform the actions described in FILE-ACTIONS. */
+int
+__spawni (pid_t * pid, const char *file,
+	  const posix_spawn_file_actions_t * file_actions,
+	  const posix_spawnattr_t * attrp, char *const argv[],
+	  char *const envp[], int xflags)
+{
+  /* For SPAWN_XFLAGS_TRY_SHELL we need to execute a script even without
+     a shebang.  To accomplish it we pass as callback to __spawni_child
+     __execvpe (which call maybe_script_execute for such case) or
+     execve_compat (which mimics the semantic using execve).  */
+  int (*exec) (const char *, char *const *, char *const *) =
+    xflags & SPAWN_XFLAGS_TRY_SHELL
+    ? xflags & SPAWN_XFLAGS_USE_PATH ? __execvpe  : execve_compat
+    : xflags & SPAWN_XFLAGS_USE_PATH ? __execvpex : __execve;
+
+  /* Child must set args.err to something non-negative - we rely on
+     the parent and child sharing VM.  */
+  struct posix_spawn_args args = {
+    .err = 0,
+    .file = file,
+    .exec = exec,
+    .fa = file_actions,
+    .attr = attrp ? attrp : &(const posix_spawnattr_t) { 0 },
+    .argv = argv,
+    .envp = envp,
+    .xflags = xflags
+  };
+
   /* Disable asynchronous cancellation.  */
   int state;
   __libc_ptf_call (__pthread_setcancelstate,
                    (PTHREAD_CANCEL_DISABLE, &state), 0);
 
-  /* Child must set args.err to something non-negative - we rely on
-     the parent and child sharing VM.  */
-  args.err = 0;
-  args.file = file;
-  args.exec = exec;
-  args.fa = file_actions;
-  args.attr = attrp ? attrp : &(const posix_spawnattr_t) { 0 };
-  args.argv = argv;
-  args.argc = argc;
-  args.envp = envp;
-  args.xflags = xflags;
-
   __libc_signal_block_all (&args.oldmask);
 
-  /* The clone flags used will create a new child that will run in the same
-     memory space (CLONE_VM) and the execution of calling thread will be
-     suspend until the child calls execve or _exit.
-
-     Also since the calling thread execution will be suspend, there is not
-     need for CLONE_SETTLS.  Although parent and child share the same TLS
-     namespace, there will be no concurrent access for TLS variables (errno
-     for instance).  */
-  new_pid = CLONE (__spawni_child, STACK (stack, stack_size), stack_size,
-		   CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
+  int ec;
 
-  /* It needs to collect the case where the auxiliary process was created
-     but failed to execute the file (due either any preparation step or
-     for execve itself).  */
-  if (new_pid > 0)
+  if (__glibc_likely ((xflags & SPAWN_XFLAGS_TRY_SHELL) == 0))
     {
-      /* Also, it handles the unlikely case where the auxiliary process was
-	 terminated before calling execve as if it was successfully.  The
-	 args.err is set to 0 as default and changed to a positive value
-	 only in case of failure, so in case of premature termination
-	 due a signal args.err will remain zeroed and it will be up to
-	 caller to actually collect it.  */
-      ec = args.err;
-      if (ec > 0)
-	/* There still an unlikely case where the child is cancelled after
-	   setting args.err, due to a positive error value.  Also there is
-	   possible pid reuse race (where the kernel allocated the same pid
-	   to an unrelated process).  Unfortunately due synchronization
-	   issues where the kernel might not have the process collected
-	   the waitpid below can not use WNOHANG.  */
-	__waitpid (new_pid, NULL, 0);
+      /* execvpe allocates at least (NAME_MAX + 1) + PATH_MAX to create the
+	 combination of PATH entry and program name.  */
+      char stack[(NAME_MAX + 1) + PATH_MAX + stack_slack];
+      ec = __spawni_clone (&args, stack, sizeof stack, pid);
     }
   else
-    ec = -new_pid;
-
-  __munmap (stack, stack_size);
-
-  if ((ec == 0) && (pid != NULL))
-    *pid = new_pid;
+    ec = __spawnix_compat (&args, pid);
 
   __libc_signal_restore_set (&args.oldmask);
 
@@ -417,17 +462,3 @@  __spawnix (pid_t * pid, const char *file,
 
   return ec;
 }
-
-/* Spawn a new process executing PATH with the attributes describes in *ATTRP.
-   Before running the process perform the actions described in FILE-ACTIONS. */
-int
-__spawni (pid_t * pid, const char *file,
-	  const posix_spawn_file_actions_t * acts,
-	  const posix_spawnattr_t * attrp, char *const argv[],
-	  char *const envp[], int xflags)
-{
-  /* It uses __execvpex to avoid run ENOEXEC in non compatibility mode (it
-     will be handled by maybe_script_execute).  */
-  return __spawnix (pid, file, acts, attrp, argv, envp, xflags,
-		    xflags & SPAWN_XFLAGS_USE_PATH ? __execvpex :__execve);
-}