[v7,0/4] arm64: Enable BTI for the executable as well as the interpreter

Message ID 20211115152714.3205552-1-broonie@kernel.org
Headers
Series arm64: Enable BTI for the executable as well as the interpreter |

Message

Mark Brown Nov. 15, 2021, 3:27 p.m. UTC
  Deployments of BTI on arm64 have run into issues interacting with
systemd's MemoryDenyWriteExecute feature.  Currently for dynamically
linked executables the kernel will only handle architecture specific
properties like BTI for the interpreter, the expectation is that the
interpreter will then handle any properties on the main executable.
For BTI this means remapping the executable segments PROT_EXEC |
PROT_BTI.

This interacts poorly with MemoryDenyWriteExecute since that is
implemented using a seccomp filter which prevents setting PROT_EXEC on
already mapped memory and lacks the context to be able to detect that
memory is already mapped with PROT_EXEC.  This series resolves this by
handling the BTI property for both the interpreter and the main
executable.

This does mean that we may get more code with BTI enabled if running on
a system without BTI support in the dynamic linker, this is expected to
be a safe configuration and testing seems to confirm that. It also
reduces the flexibility userspace has to disable BTI but it is expected
that for cases where there are problems which require BTI to be disabled
it is more likely that it will need to be disabled on a system level.

v7:
 - Rebase onto v5.16-rc1.
v6:
 - Rebase onto v5.15-rc1.
v5:
 - Rebase onto v5.14-rc2.
 - Tweak changelog on patch 1.
 - Use the helper for interpreter/executable flag in elf.h as well.
v4:
 - Rebase onto v5.14-rc1.
v3:
 - Fix passing of properties for parsing by the main executable.
 - Drop has_interp from arch_parse_elf_property().
 - Coding style tweaks.
v2:
 - Add a patch dropping has_interp from arch_adjust_elf_prot()
 - Fix bisection issue with static executables on arm64 in the first
   patch.

Mark Brown (4):
  elf: Allow architectures to parse properties on the main executable
  arm64: Enable BTI for main executable as well as the interpreter
  elf: Remove has_interp property from arch_adjust_elf_prot()
  elf: Remove has_interp property from arch_parse_elf_property()

 arch/arm64/include/asm/elf.h | 14 ++++++++++++--
 arch/arm64/kernel/process.c  | 16 +++-------------
 fs/binfmt_elf.c              | 29 ++++++++++++++++++++---------
 include/linux/elf.h          |  8 +++++---
 4 files changed, 40 insertions(+), 27 deletions(-)


base-commit: fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf
  

Comments

Catalin Marinas Dec. 8, 2021, 6:23 p.m. UTC | #1
On Mon, Nov 15, 2021 at 03:27:10PM +0000, Mark Brown wrote:
> Deployments of BTI on arm64 have run into issues interacting with
> systemd's MemoryDenyWriteExecute feature.  Currently for dynamically
> linked executables the kernel will only handle architecture specific
> properties like BTI for the interpreter, the expectation is that the
> interpreter will then handle any properties on the main executable.
> For BTI this means remapping the executable segments PROT_EXEC |
> PROT_BTI.
> 
> This interacts poorly with MemoryDenyWriteExecute since that is
> implemented using a seccomp filter which prevents setting PROT_EXEC on
> already mapped memory and lacks the context to be able to detect that
> memory is already mapped with PROT_EXEC.  This series resolves this by
> handling the BTI property for both the interpreter and the main
> executable.
> 
> This does mean that we may get more code with BTI enabled if running on
> a system without BTI support in the dynamic linker, this is expected to
> be a safe configuration and testing seems to confirm that. It also
> reduces the flexibility userspace has to disable BTI but it is expected
> that for cases where there are problems which require BTI to be disabled
> it is more likely that it will need to be disabled on a system level.

Given the silence on this series over the past months, I propose we drop
it. It's a bit unfortunate that systemd's MemoryDenyWriteExecute cannot
work with BTI but I also think the former is a pretty blunt hardening
mechanism (rejecting any mprotect(PROT_EXEC) regardless of the previous
attributes).

I'm not a security expert to assess whether MDWX is more important than
BTI (hardware availability also influences the distros decision). My
suggestion would be to look at a better way to support the MDWX on the
long run that does not interfere with BTI.
  
Szabolcs Nagy Dec. 9, 2021, 11:10 a.m. UTC | #2
The 12/08/2021 18:23, Catalin Marinas wrote:
> On Mon, Nov 15, 2021 at 03:27:10PM +0000, Mark Brown wrote:
> > Deployments of BTI on arm64 have run into issues interacting with
> > systemd's MemoryDenyWriteExecute feature.  Currently for dynamically
> > linked executables the kernel will only handle architecture specific
> > properties like BTI for the interpreter, the expectation is that the
> > interpreter will then handle any properties on the main executable.
> > For BTI this means remapping the executable segments PROT_EXEC |
> > PROT_BTI.
> > 
> > This interacts poorly with MemoryDenyWriteExecute since that is
> > implemented using a seccomp filter which prevents setting PROT_EXEC on
> > already mapped memory and lacks the context to be able to detect that
> > memory is already mapped with PROT_EXEC.  This series resolves this by
> > handling the BTI property for both the interpreter and the main
> > executable.
> > 
> > This does mean that we may get more code with BTI enabled if running on
> > a system without BTI support in the dynamic linker, this is expected to
> > be a safe configuration and testing seems to confirm that. It also
> > reduces the flexibility userspace has to disable BTI but it is expected
> > that for cases where there are problems which require BTI to be disabled
> > it is more likely that it will need to be disabled on a system level.
> 
> Given the silence on this series over the past months, I propose we drop
> it. It's a bit unfortunate that systemd's MemoryDenyWriteExecute cannot
> work with BTI but I also think the former is a pretty blunt hardening
> mechanism (rejecting any mprotect(PROT_EXEC) regardless of the previous
> attributes).
> 
> I'm not a security expert to assess whether MDWX is more important than
> BTI (hardware availability also influences the distros decision). My
> suggestion would be to look at a better way to support the MDWX on the
> long run that does not interfere with BTI.

i still think it would be better if the kernel dealt with
PROT_BTI for the exe loaded by the kernel.

things work without this series as glibc will continue to
apply mprotect and ignore the failure, but with the series
security is tighter (neither mdwx nor bti are compromised).

for unrelated reasons bti is not widely deployed currently
so i guess there is no urgent interest in this.
  
Mark Brown Jan. 4, 2022, 5:32 p.m. UTC | #3
On Thu, Dec 09, 2021 at 11:10:48AM +0000, Szabolcs Nagy wrote:
> The 12/08/2021 18:23, Catalin Marinas wrote:
> > On Mon, Nov 15, 2021 at 03:27:10PM +0000, Mark Brown wrote:

> > > memory is already mapped with PROT_EXEC.  This series resolves this by
> > > handling the BTI property for both the interpreter and the main
> > > executable.

> > Given the silence on this series over the past months, I propose we drop
> > it. It's a bit unfortunate that systemd's MemoryDenyWriteExecute cannot
> > work with BTI but I also think the former is a pretty blunt hardening
> > mechanism (rejecting any mprotect(PROT_EXEC) regardless of the previous
> > attributes).

> i still think it would be better if the kernel dealt with
> PROT_BTI for the exe loaded by the kernel.

The above message from Catalin isn't quite the full story here - my
understanding from backchannel is that there's concern from others that
we might be creating future issues by enabling PROT_BTI, especially in
the case where the same permissions issue prevents the dynamic linker
disabling PROT_BTI.  They'd therefore rather stick with the status quo
and not create any new ABI.  Unfortunately that's not something people
have been willing to say on the list, hopefully the above captures the
thinking well enough.

Personally I'm a bit ambivalent on this, I do see the potential issue
but I'm having trouble constructing an actual scenario and my instinct
is that since we handle PROT_EXEC we should also handle PROT_BTI for
consistency.
  
Jeremy Linton Jan. 5, 2022, 10:42 p.m. UTC | #4
Hi,

On 1/4/22 11:32, Mark Brown wrote:
> On Thu, Dec 09, 2021 at 11:10:48AM +0000, Szabolcs Nagy wrote:
>> The 12/08/2021 18:23, Catalin Marinas wrote:
>>> On Mon, Nov 15, 2021 at 03:27:10PM +0000, Mark Brown wrote:
> 
>>>> memory is already mapped with PROT_EXEC.  This series resolves this by
>>>> handling the BTI property for both the interpreter and the main
>>>> executable.
> 
>>> Given the silence on this series over the past months, I propose we drop
>>> it. It's a bit unfortunate that systemd's MemoryDenyWriteExecute cannot
>>> work with BTI but I also think the former is a pretty blunt hardening
>>> mechanism (rejecting any mprotect(PROT_EXEC) regardless of the previous
>>> attributes).
> 
>> i still think it would be better if the kernel dealt with
>> PROT_BTI for the exe loaded by the kernel.
> 
> The above message from Catalin isn't quite the full story here - my
> understanding from backchannel is that there's concern from others that
> we might be creating future issues by enabling PROT_BTI, especially in
> the case where the same permissions issue prevents the dynamic linker
> disabling PROT_BTI.  They'd therefore rather stick with the status quo
> and not create any new ABI.  Unfortunately that's not something people
> have been willing to say on the list, hopefully the above captures the
> thinking well enough.
> 
> Personally I'm a bit ambivalent on this, I do see the potential issue
> but I'm having trouble constructing an actual scenario and my instinct
> is that since we handle PROT_EXEC we should also handle PROT_BTI for
> consistency.
> 

I'm hardly a security expert, but it seems to me that BTI hardens 
against a wider set of possible exploits than MDWE. Yet, we are silently 
turning it off for systemd services which are considered some of the 
most security critical things in the machine right now (ex:logind, etc). 
So despite 'systemd-analyze secuirty` flagging those services as the 
most secure ones on a system, they might actually be less secure.

It also seems that getting BTI turned on earlier, as this patch is doing 
is itself a win.

So, mentally i'm having a hard time balancing the hypothetical problem 
laid out, as it should only really exist in an environment similar to 
the MDWE one, since AFAIK, its possible today to just flip it back off 
unless MDWE stops that from happening.

What are the the remaining alternatives? A new syscall? But that is by 
definition a new ABI, and wouldn't benefit from having BTI turned on as 
early as this patch is doing. Should we disable MDWE on a BTI machine? 
I'm not sure that is a good look, particularly if MDWE happens to 
successfully stop some exploit. AFAIK, MDWE+BTI are a good strong 
combination, it seems a shame if we can't get them both working together.

I hesitate to suggest it, but maybe this patch should be conditional 
somehow, that way !systemd/MDWE machines can behave as they do today, 
and systemd/MDWE machines can request BTI be turned on by the kernel 
automatically?
  
Catalin Marinas Jan. 6, 2022, 11 a.m. UTC | #5
Hi Jeremy,

On Wed, Jan 05, 2022 at 04:42:01PM -0600, Jeremy Linton wrote:
> On 1/4/22 11:32, Mark Brown wrote:
> > On Thu, Dec 09, 2021 at 11:10:48AM +0000, Szabolcs Nagy wrote:
> > > The 12/08/2021 18:23, Catalin Marinas wrote:
> > > > On Mon, Nov 15, 2021 at 03:27:10PM +0000, Mark Brown wrote:
> > > > > memory is already mapped with PROT_EXEC.  This series resolves this by
> > > > > handling the BTI property for both the interpreter and the main
> > > > > executable.
> > > > 
> > > > Given the silence on this series over the past months, I propose we drop
> > > > it. It's a bit unfortunate that systemd's MemoryDenyWriteExecute cannot
> > > > work with BTI but I also think the former is a pretty blunt hardening
> > > > mechanism (rejecting any mprotect(PROT_EXEC) regardless of the previous
> > > > attributes).
> > > 
> > > i still think it would be better if the kernel dealt with
> > > PROT_BTI for the exe loaded by the kernel.
> > 
> > The above message from Catalin isn't quite the full story here - my
> > understanding from backchannel is that there's concern from others that
> > we might be creating future issues by enabling PROT_BTI, especially in
> > the case where the same permissions issue prevents the dynamic linker
> > disabling PROT_BTI.  They'd therefore rather stick with the status quo
> > and not create any new ABI.  Unfortunately that's not something people
> > have been willing to say on the list, hopefully the above captures the
> > thinking well enough.
> > 
> > Personally I'm a bit ambivalent on this, I do see the potential issue
> > but I'm having trouble constructing an actual scenario and my instinct
> > is that since we handle PROT_EXEC we should also handle PROT_BTI for
> > consistency.
> 
> I'm hardly a security expert, but it seems to me that BTI hardens against a
> wider set of possible exploits than MDWE.

They are complementary features.

> Yet, we are silently turning it
> off for systemd services which are considered some of the most security
> critical things in the machine right now (ex:logind, etc). So despite
> 'systemd-analyze secuirty` flagging those services as the most secure ones
> on a system, they might actually be less secure.

Well, that's a distro decision. MDWE/MDWX is not something imposed by
the kernel but rather a seccomp bpf filter set up by systemd.

> It also seems that getting BTI turned on earlier, as this patch is doing is
> itself a win.
> 
> So, mentally i'm having a hard time balancing the hypothetical problem laid
> out, as it should only really exist in an environment similar to the MDWE
> one, since AFAIK, its possible today to just flip it back off unless MDWE
> stops that from happening.

That's a user ABI change and given that the first attempt was shown to
break with some combination of old loader and new main executable (or
the other way around), I'd rather keep things as they are.

> What are the the remaining alternatives? A new syscall? But that is by
> definition a new ABI,

A new ABI is better than changing the current ABI.

> and wouldn't benefit from having BTI turned on as early as this patch
> is doing.

In the absence of MDWX, it's not relevant how early the kernel turns BTI
on for the main executable. The dynamic loader would do this with an
mprotect() before actually executing any of the main code. Of course, we
assume there are no security bugs in the dynamic loader.

> Should we disable MDWE on a BTI machine? I'm
> not sure that is a good look, particularly if MDWE happens to successfully
> stop some exploit. AFAIK, MDWE+BTI are a good strong combination, it seems a
> shame if we can't get them both working together.

AFAICT MDWX wants (one of the filters) to prevent a previously writable
mapping from becoming executable through mprotect(PROT_EXEC). How common
is mprotect(PROT_EXEC|PROT_BTI) outside of the dynamic loader? I doubt
it is, especially in an MDWX environment. So can we not change the
filter to allow PROT_EXEC|PROT_BTI? If your code is already exploitable
to allow random syscalls, all bets are off anyway.

> I hesitate to suggest it, but maybe this patch should be conditional
> somehow, that way !systemd/MDWE machines can behave as they do today, and
> systemd/MDWE machines can request BTI be turned on by the kernel
> automatically?

That would be some big knob sysctl but I'm still not keen on toggling
the ABI like this.
  
Jeremy Linton Jan. 6, 2022, 4:09 p.m. UTC | #6
Hi,

On 1/6/22 05:00, Catalin Marinas wrote:
> Hi Jeremy,
> 
> On Wed, Jan 05, 2022 at 04:42:01PM -0600, Jeremy Linton wrote:
>> On 1/4/22 11:32, Mark Brown wrote:
>>> On Thu, Dec 09, 2021 at 11:10:48AM +0000, Szabolcs Nagy wrote:
>>>> The 12/08/2021 18:23, Catalin Marinas wrote:
>>>>> On Mon, Nov 15, 2021 at 03:27:10PM +0000, Mark Brown wrote:
>>>>>> memory is already mapped with PROT_EXEC.  This series resolves this by
>>>>>> handling the BTI property for both the interpreter and the main
>>>>>> executable.
>>>>>
>>>>> Given the silence on this series over the past months, I propose we drop
>>>>> it. It's a bit unfortunate that systemd's MemoryDenyWriteExecute cannot
>>>>> work with BTI but I also think the former is a pretty blunt hardening
>>>>> mechanism (rejecting any mprotect(PROT_EXEC) regardless of the previous
>>>>> attributes).
>>>>
>>>> i still think it would be better if the kernel dealt with
>>>> PROT_BTI for the exe loaded by the kernel.
>>>
>>> The above message from Catalin isn't quite the full story here - my
>>> understanding from backchannel is that there's concern from others that
>>> we might be creating future issues by enabling PROT_BTI, especially in
>>> the case where the same permissions issue prevents the dynamic linker
>>> disabling PROT_BTI.  They'd therefore rather stick with the status quo
>>> and not create any new ABI.  Unfortunately that's not something people
>>> have been willing to say on the list, hopefully the above captures the
>>> thinking well enough.
>>>
>>> Personally I'm a bit ambivalent on this, I do see the potential issue
>>> but I'm having trouble constructing an actual scenario and my instinct
>>> is that since we handle PROT_EXEC we should also handle PROT_BTI for
>>> consistency.
>>
>> I'm hardly a security expert, but it seems to me that BTI hardens against a
>> wider set of possible exploits than MDWE.
> 
> They are complementary features.
> 
>> Yet, we are silently turning it
>> off for systemd services which are considered some of the most security
>> critical things in the machine right now (ex:logind, etc). So despite
>> 'systemd-analyze secuirty` flagging those services as the most secure ones
>> on a system, they might actually be less secure.
> 
> Well, that's a distro decision. MDWE/MDWX is not something imposed by
> the kernel but rather a seccomp bpf filter set up by systemd.
> 
>> It also seems that getting BTI turned on earlier, as this patch is doing is
>> itself a win.
>>
>> So, mentally i'm having a hard time balancing the hypothetical problem laid
>> out, as it should only really exist in an environment similar to the MDWE
>> one, since AFAIK, its possible today to just flip it back off unless MDWE
>> stops that from happening.
> 
> That's a user ABI change and given that the first attempt was shown to
> break with some combination of old loader and new main executable (or
> the other way around), I'd rather keep things as they are.

This should only change the behavior for for binaries which conform to 
the new ABI containing the BTI note. So outside of the tiny window of 
things built with BTI, but run on !BTI hardware or older kernel+glibc, 
this shouldn't be a problem. (Unless i'm missing something) Put another 
way, now is the time to make a change, before there is a legacy BTI 
ecosystem we have to deal with.


> 
>> What are the the remaining alternatives? A new syscall? But that is by
>> definition a new ABI,
> 
> A new ABI is better than changing the current ABI.
> 
>> and wouldn't benefit from having BTI turned on as early as this patch
>> is doing.
> 
> In the absence of MDWX, it's not relevant how early the kernel turns BTI
> on for the main executable. The dynamic loader would do this with an
> mprotect() before actually executing any of the main code. Of course, we
> assume there are no security bugs in the dynamic loader.
> 
>> Should we disable MDWE on a BTI machine? I'm
>> not sure that is a good look, particularly if MDWE happens to successfully
>> stop some exploit. AFAIK, MDWE+BTI are a good strong combination, it seems a
>> shame if we can't get them both working together.
> 
> AFAICT MDWX wants (one of the filters) to prevent a previously writable
> mapping from becoming executable through mprotect(PROT_EXEC). How common
> is mprotect(PROT_EXEC|PROT_BTI) outside of the dynamic loader? I doubt
> it is, especially in an MDWX environment. So can we not change the
> filter to allow PROT_EXEC|PROT_BTI? If your code is already exploitable
> to allow random syscalls, all bets are off anyway.

I would expect JITs to be twittling EXEC|BTI but, those wouldn't be able 
to trivially run under MDWE anyway. So rarely?

Changing the filter to allow PROT_EXEC|PROT_BTI defeats the purpose 
because the hypothetical exploit would just add the BTI tags and turn 
BTI on as well. The filter, as is, also provides additional BTI 
protections because it makes it more difficult to disable BTI. Without 
that filter it seems likely someone could come up with a way to use an 
existing PROT_EXEC as a gadget to disable BTI anywhere they choose.

So, to your point before, BTI+MDWE are complementary, the combination 
seems considerably more robust than either by itself.

> 
>> I hesitate to suggest it, but maybe this patch should be conditional
>> somehow, that way !systemd/MDWE machines can behave as they do today, and
>> systemd/MDWE machines can request BTI be turned on by the kernel
>> automatically?
> 
> That would be some big knob sysctl but I'm still not keen on toggling
> the ABI like this.
>
  
Catalin Marinas Jan. 6, 2022, 6:13 p.m. UTC | #7
On Thu, Jan 06, 2022 at 10:09:35AM -0600, Jeremy Linton wrote:
> On 1/6/22 05:00, Catalin Marinas wrote:
> > On Wed, Jan 05, 2022 at 04:42:01PM -0600, Jeremy Linton wrote:
> > > So, mentally i'm having a hard time balancing the hypothetical problem laid
> > > out, as it should only really exist in an environment similar to the MDWE
> > > one, since AFAIK, its possible today to just flip it back off unless MDWE
> > > stops that from happening.
> > 
> > That's a user ABI change and given that the first attempt was shown to
> > break with some combination of old loader and new main executable (or
> > the other way around), I'd rather keep things as they are.
> 
> This should only change the behavior for for binaries which conform to the
> new ABI containing the BTI note. So outside of the tiny window of things
> built with BTI, but run on !BTI hardware or older kernel+glibc, this
> shouldn't be a problem. (Unless i'm missing something) Put another way, now
> is the time to make a change, before there is a legacy BTI ecosystem we have
> to deal with.

The concern is that the loader may decide in the future to not enable
(or turn off) BTI for some reason (e.g. mixed libraries, old glibc on
BTI hardware). If we force BTI on the main executable, we'd take this
option away. Note also that it's not only glibc here, there are other
loaders.

> > AFAICT MDWX wants (one of the filters) to prevent a previously writable
> > mapping from becoming executable through mprotect(PROT_EXEC). How common
> > is mprotect(PROT_EXEC|PROT_BTI) outside of the dynamic loader? I doubt
> > it is, especially in an MDWX environment. So can we not change the
> > filter to allow PROT_EXEC|PROT_BTI? If your code is already exploitable
> > to allow random syscalls, all bets are off anyway.
> 
> I would expect JITs to be twittling EXEC|BTI but, those wouldn't be able to
> trivially run under MDWE anyway. So rarely?
> 
> Changing the filter to allow PROT_EXEC|PROT_BTI defeats the purpose because
> the hypothetical exploit would just add the BTI tags and turn BTI on as
> well. The filter, as is, also provides additional BTI protections because it
> makes it more difficult to disable BTI. Without that filter it seems likely
> someone could come up with a way to use an existing PROT_EXEC as a gadget to
> disable BTI anywhere they choose.

To me, MDWX is more about protecting against making some (previously)
writable buffers executable (either inadvertently or as a programmer
decision). It's not meant to prevent the hijacking of the instruction
flow or data corruption that ends up as an mprotect(PROT_EXEC|PROT_BTI)
call. If that's possible, the MDWX mitigation is really negligible.

That said, the programmer may learn that passing PROT_BTI avoids the
filter and start using it, though it's not trivial due to the need for
landing pads.

> So, to your point before, BTI+MDWE are complementary, the combination seems
> considerably more robust than either by itself.

Before we come up with a better solution, I find allowing PROT_BTI an
acceptable compromise. I don't think we lose much with the current
codebase.

As for a better solution, we need to track the state of a memory range
as BPF denying PROT_EXEC anywhere is a pretty big hammer. We could add a
VM_WAS_WRITE flag and, in combination with a prctl() to opt in to the
feature, prevent mprotect(PROT_EXEC) on such vmas. No need for seccomp
bpf filters.
  
Mark Brown Jan. 6, 2022, 7:07 p.m. UTC | #8
On Thu, Jan 06, 2022 at 06:13:37PM +0000, Catalin Marinas wrote:
> On Thu, Jan 06, 2022 at 10:09:35AM -0600, Jeremy Linton wrote:

> > This should only change the behavior for for binaries which conform to the
> > new ABI containing the BTI note. So outside of the tiny window of things
> > built with BTI, but run on !BTI hardware or older kernel+glibc, this
> > shouldn't be a problem. (Unless i'm missing something) Put another way, now
> > is the time to make a change, before there is a legacy BTI ecosystem we have
> > to deal with.

> The concern is that the loader may decide in the future to not enable
> (or turn off) BTI for some reason (e.g. mixed libraries, old glibc on
> BTI hardware). If we force BTI on the main executable, we'd take this
> option away. Note also that it's not only glibc here, there are other
> loaders.

Neither of those examples should be concerns - BTI is per page so you
can mix BTI and non-BTI freely in a process (as will happen now for the
case where the dynamic loader is built for BTI but the main executable
is not, and the dynamic loader should do if there's a mix of BTI and
non-BTI libraries).  The main case where there might be a change would
be the case where there's individual excutables with incorrect BTI
annotations which are run under this seccomp MWDE, then the dynamic
loader might have support for disabling BTI based on some configuration
but wouldn't be able to due to the MWDE.

Note also that we're only taking the option of disabiling BTI away in
the case where there's something like this seccomp filter disabling
permission changes.
  
Catalin Marinas Jan. 7, 2022, 12:01 p.m. UTC | #9
On Thu, Jan 06, 2022 at 07:07:46PM +0000, Mark Brown wrote:
> On Thu, Jan 06, 2022 at 06:13:37PM +0000, Catalin Marinas wrote:
> > On Thu, Jan 06, 2022 at 10:09:35AM -0600, Jeremy Linton wrote:
> > > This should only change the behavior for for binaries which conform to the
> > > new ABI containing the BTI note. So outside of the tiny window of things
> > > built with BTI, but run on !BTI hardware or older kernel+glibc, this
> > > shouldn't be a problem. (Unless i'm missing something) Put another way, now
> > > is the time to make a change, before there is a legacy BTI ecosystem we have
> > > to deal with.
> 
> > The concern is that the loader may decide in the future to not enable
> > (or turn off) BTI for some reason (e.g. mixed libraries, old glibc on
> > BTI hardware). If we force BTI on the main executable, we'd take this
> > option away. Note also that it's not only glibc here, there are other
> > loaders.
> 
> Neither of those examples should be concerns - BTI is per page so you
> can mix BTI and non-BTI freely in a process (as will happen now for the
> case where the dynamic loader is built for BTI but the main executable
> is not, and the dynamic loader should do if there's a mix of BTI and
> non-BTI libraries).  The main case where there might be a change would
> be the case where there's individual excutables with incorrect BTI
> annotations which are run under this seccomp MWDE, then the dynamic
> loader might have support for disabling BTI based on some configuration
> but wouldn't be able to due to the MWDE.
> 
> Note also that we're only taking the option of disabiling BTI away in
> the case where there's something like this seccomp filter disabling
> permission changes.

Thanks for the clarification. I think we can look at this from two
angles:

1. Ignoring MDWE, should whoever does the original mmap() also honour
   PROT_BTI? We do this for static binaries but, for consistency, should
   we extend it to dynamic executable?

2. A 'simple' fix to allow MDWE together with BTI.

Regarding (1), I don't remember whether we decided to do it this way
because it was more complicated to handle it in the kernel (like the 4
more patches in this series) or because we wanted to leave the option to
the dynamic loader. It would be good to clarify this and we may have a
small window, as Jeremy said, where changing the ABI won't cause
problems (well, hopefully, there's still a risk).

If we only want (2), I just think we are approaching it wrongly. Looking
at mprotect(), this systemd feature is not even MDWE but rather "deny
mprotect execute" disregarding the previous attributes. If we want this,
something like below would do (conditional on some personality flag or a
prctl):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..4262e6f1c14e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -616,6 +616,11 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 			goto out;
 		}

+		if ((newflags & VM_EXEC) && !(vma->vm_flags & VM_EXEC)) {
+			error = -EACCES;
+			goto out;
+		}
+
 		/* Allow architectures to sanity-check the new flags */
 		if (!arch_validate_flags(newflags)) {
 			error = -EINVAL;

But we could also do a proper MDWE covering both mmap() and mprotect()
and keep track on whether a vma was previously writable. A personality
flag seems like a better option as we have READ_IMPLIES_EXEC, we could
add a DENY_WRITE_EXEC.
  
Mark Brown Jan. 7, 2022, 1:10 p.m. UTC | #10
On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:

> Regarding (1), I don't remember whether we decided to do it this way
> because it was more complicated to handle it in the kernel (like the 4
> more patches in this series) or because we wanted to leave the option to
> the dynamic loader. It would be good to clarify this and we may have a
> small window, as Jeremy said, where changing the ABI won't cause
> problems (well, hopefully, there's still a risk).

My understanding is that it was basically just a "let's defer everything
to userspace" thing.  It means that userspace is responsible for turning
on BTI and is therefore responsible for any workarounds which are needed
for problematic binaries, it's the absolute minimum the kernel can be
responsible for.  This all predates my involvement though.
  
Catalin Marinas Jan. 17, 2022, 5:54 p.m. UTC | #11
On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:
> I think we can look at this from two angles:
> 
> 1. Ignoring MDWE, should whoever does the original mmap() also honour
>    PROT_BTI? We do this for static binaries but, for consistency, should
>    we extend it to dynamic executable?
> 
> 2. A 'simple' fix to allow MDWE together with BTI.

Thinking about it, (1) is not that different from the kernel setting
PROT_EXEC on the main executable when the dynamic loader could've done
it as well. There is a case for making this more consistent: whoever
does the mmap() should use the full attributes.

Question for the toolchain people: would the compiler ever generate
relocations in the main executable that the linker needs to resolve via
an mprotect(READ|WRITE) followed by mprotect(READ|EXEC)? If yes, we'd
better go for a proper MDWE implementation in the kernel.
  
Adhemerval Zanella Jan. 17, 2022, 6:16 p.m. UTC | #12
On 17/01/2022 14:54, Catalin Marinas via Libc-alpha wrote:
> On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:
>> I think we can look at this from two angles:
>>
>> 1. Ignoring MDWE, should whoever does the original mmap() also honour
>>    PROT_BTI? We do this for static binaries but, for consistency, should
>>    we extend it to dynamic executable?
>>
>> 2. A 'simple' fix to allow MDWE together with BTI.
> 
> Thinking about it, (1) is not that different from the kernel setting
> PROT_EXEC on the main executable when the dynamic loader could've done
> it as well. There is a case for making this more consistent: whoever
> does the mmap() should use the full attributes.
> 
> Question for the toolchain people: would the compiler ever generate
> relocations in the main executable that the linker needs to resolve via
> an mprotect(READ|WRITE) followed by mprotect(READ|EXEC)? If yes, we'd
> better go for a proper MDWE implementation in the kernel.
> 

Yes, text relocations.  However these are deprecated (some libcs even do
not support it) and have a lot of drawbacks.
  
H.J. Lu Jan. 17, 2022, 7:01 p.m. UTC | #13
On Mon, Jan 17, 2022 at 10:17 AM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
>
>
> On 17/01/2022 14:54, Catalin Marinas via Libc-alpha wrote:
> > On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:
> >> I think we can look at this from two angles:
> >>
> >> 1. Ignoring MDWE, should whoever does the original mmap() also honour
> >>    PROT_BTI? We do this for static binaries but, for consistency, should
> >>    we extend it to dynamic executable?
> >>
> >> 2. A 'simple' fix to allow MDWE together with BTI.
> >
> > Thinking about it, (1) is not that different from the kernel setting
> > PROT_EXEC on the main executable when the dynamic loader could've done
> > it as well. There is a case for making this more consistent: whoever
> > does the mmap() should use the full attributes.
> >
> > Question for the toolchain people: would the compiler ever generate
> > relocations in the main executable that the linker needs to resolve via
> > an mprotect(READ|WRITE) followed by mprotect(READ|EXEC)? If yes, we'd
> > better go for a proper MDWE implementation in the kernel.
> >
>
> Yes, text relocations.  However these are deprecated (some libcs even do
> not support it) and have a lot of drawbacks.

We are taking a different approach for CET enabling.   CET will be
changed to be enabled from user space:

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/cet/enable

and the CET kernel no longer enables CET automatically:

https://github.com/hjl-tools/linux/tree/hjl/cet%2F5.16.0-v4
  
Szabolcs Nagy Jan. 18, 2022, 11:02 a.m. UTC | #14
The 01/17/2022 17:54, Catalin Marinas wrote:
> On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:
> > I think we can look at this from two angles:
> > 
> > 1. Ignoring MDWE, should whoever does the original mmap() also honour
> >    PROT_BTI? We do this for static binaries but, for consistency, should
> >    we extend it to dynamic executable?
> > 
> > 2. A 'simple' fix to allow MDWE together with BTI.
> 
> Thinking about it, (1) is not that different from the kernel setting
> PROT_EXEC on the main executable when the dynamic loader could've done
> it as well. There is a case for making this more consistent: whoever
> does the mmap() should use the full attributes.
> 

Yeah that was my original idea that it should be consistent.
One caveat is that protection flags are normally specified
in the program header, but the BTI marking is in
PT_GNU_PROPERTY which is harder to get to, so glibc does not
try to get it right for the initial mapping either: it has
to re-mmap or mprotect. (In principle we could use read
syscalls to parse the ELF headers and notes before mmap,
but that's more complicated with additional failure modes.)

i.e. if (2) is fixed then mprotect can be used for library
mapping too which is simpler than re-mmap.

> Question for the toolchain people: would the compiler ever generate
> relocations in the main executable that the linker needs to resolve via
> an mprotect(READ|WRITE) followed by mprotect(READ|EXEC)? If yes, we'd
> better go for a proper MDWE implementation in the kernel.

There is some support for text relocations in glibc, but it's not
expected to be common (e.g. it is a bug if a distro binary has it).

For static PIE we made -z text ldflag the default so text relocs
are rejected (i think glibc cannot self-relocate those, so ld.so
cannot have them either), but otherwise certain text relocs work
(static relocations are not supported).

$ cat a.c
int x = 42;
__attribute__((section(".text")))
int *y = &x;
int main(){ return *y; }
$ gcc a.c
/tmp/ccOrpMPD.s: Assembler messages:
/tmp/ccOrpMPD.s:12: Warning: ignoring changed section attributes for .text
$ readelf -aW ./a.out |grep TEXTREL
 0x0000000000000016 (TEXTREL)            0x0
 0x000000000000001e (FLAGS)              TEXTREL BIND_NOW
$ strace -e mprotect ./a.out
mprotect(0xffff839ee000, 65536, PROT_NONE) = 0
mprotect(0xffff839fe000, 12288, PROT_READ) = 0
mprotect(0xaaaac2096000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
mprotect(0xaaaac2096000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0xaaaac20a6000, 4096, PROT_READ) = 0
mprotect(0xffff83a55000, 4096, PROT_READ) = 0
+++ exited with 42 +++
  
Szabolcs Nagy Jan. 18, 2022, 11:22 a.m. UTC | #15
The 01/17/2022 11:01, H.J. Lu via Libc-alpha wrote:
> We are taking a different approach for CET enabling.   CET will be
> changed to be enabled from user space:
> 
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/cet/enable
> 
> and the CET kernel no longer enables CET automatically:
> 
> https://github.com/hjl-tools/linux/tree/hjl/cet%2F5.16.0-v4

we considered userspace handling of BTI in static exe
and ld.so too. at the time we wanted the protection to
be on whenever BTI marked code is executed, so it has
to be enabled at program entry.

i no longer think that the entry code protection is very
important, but delaying mprotect for static exe does
not fix our mprotect(*|PROT_EXEC) problem with systemd.

i also don't immediately see where you deal with shadow
stack allocation for the main stack if it is userspace
enabled, i expected that to require kernel assistance
if you want the main stack protected all the way up.
  
H.J. Lu Jan. 18, 2022, 12:55 p.m. UTC | #16
On Tue, Jan 18, 2022 at 3:22 AM Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>
> The 01/17/2022 11:01, H.J. Lu via Libc-alpha wrote:
> > We are taking a different approach for CET enabling.   CET will be
> > changed to be enabled from user space:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/cet/enable
> >
> > and the CET kernel no longer enables CET automatically:
> >
> > https://github.com/hjl-tools/linux/tree/hjl/cet%2F5.16.0-v4
>
> we considered userspace handling of BTI in static exe
> and ld.so too. at the time we wanted the protection to
> be on whenever BTI marked code is executed, so it has
> to be enabled at program entry.
>
> i no longer think that the entry code protection is very
> important, but delaying mprotect for static exe does
> not fix our mprotect(*|PROT_EXEC) problem with systemd.
>
> i also don't immediately see where you deal with shadow
> stack allocation for the main stack if it is userspace
> enabled, i expected that to require kernel assistance
> if you want the main stack protected all the way up.

We enable shadow stack in user space as soon as possible:

https://gitlab.com/x86-glibc/glibc/-/commit/211abce607a9f6e4cd1cadefb87561413dd8fae9
  
Catalin Marinas Jan. 27, 2022, 12:24 p.m. UTC | #17
(Mark posted another series but I'm replying here to clarify some
aspects)

On Tue, Jan 18, 2022 at 11:02:55AM +0000, Szabolcs Nagy wrote:
> The 01/17/2022 17:54, Catalin Marinas wrote:
> > On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:
> > > I think we can look at this from two angles:
> > > 
> > > 1. Ignoring MDWE, should whoever does the original mmap() also honour
> > >    PROT_BTI? We do this for static binaries but, for consistency, should
> > >    we extend it to dynamic executable?
> > > 
> > > 2. A 'simple' fix to allow MDWE together with BTI.
> > 
> > Thinking about it, (1) is not that different from the kernel setting
> > PROT_EXEC on the main executable when the dynamic loader could've done
> > it as well. There is a case for making this more consistent: whoever
> > does the mmap() should use the full attributes.
> 
> Yeah that was my original idea that it should be consistent.
> One caveat is that protection flags are normally specified
> in the program header, but the BTI marking is in
> PT_GNU_PROPERTY which is harder to get to, so glibc does not
> try to get it right for the initial mapping either: it has
> to re-mmap or mprotect. (In principle we could use read
> syscalls to parse the ELF headers and notes before mmap,
> but that's more complicated with additional failure modes.)
> 
> i.e. if (2) is fixed then mprotect can be used for library
> mapping too which is simpler than re-mmap.

I lost track of the userspace fixes here, was glibc changed to attempt a
re-mmap of the dynamic libraries instead of mprotect()?

It looks like (2) is a simpler fix and (1) could still be added for
consistency, it's complementary.

> > Question for the toolchain people: would the compiler ever generate
> > relocations in the main executable that the linker needs to resolve via
> > an mprotect(READ|WRITE) followed by mprotect(READ|EXEC)? If yes, we'd
> > better go for a proper MDWE implementation in the kernel.
> 
> There is some support for text relocations in glibc, but it's not
> expected to be common (e.g. it is a bug if a distro binary has it).

Thanks for the clarification.
  
Szabolcs Nagy Jan. 27, 2022, 2:48 p.m. UTC | #18
The 01/27/2022 12:24, Catalin Marinas wrote:
> (Mark posted another series but I'm replying here to clarify some
> aspects)
> 
> On Tue, Jan 18, 2022 at 11:02:55AM +0000, Szabolcs Nagy wrote:
> > The 01/17/2022 17:54, Catalin Marinas wrote:
> > > On Fri, Jan 07, 2022 at 12:01:17PM +0000, Catalin Marinas wrote:
> > > > I think we can look at this from two angles:
> > > > 
> > > > 1. Ignoring MDWE, should whoever does the original mmap() also honour
> > > >    PROT_BTI? We do this for static binaries but, for consistency, should
> > > >    we extend it to dynamic executable?
> > > > 
> > > > 2. A 'simple' fix to allow MDWE together with BTI.
> > > 
> > > Thinking about it, (1) is not that different from the kernel setting
> > > PROT_EXEC on the main executable when the dynamic loader could've done
> > > it as well. There is a case for making this more consistent: whoever
> > > does the mmap() should use the full attributes.
> > 
> > Yeah that was my original idea that it should be consistent.
> > One caveat is that protection flags are normally specified
> > in the program header, but the BTI marking is in
> > PT_GNU_PROPERTY which is harder to get to, so glibc does not
> > try to get it right for the initial mapping either: it has
> > to re-mmap or mprotect. (In principle we could use read
> > syscalls to parse the ELF headers and notes before mmap,
> > but that's more complicated with additional failure modes.)
> > 
> > i.e. if (2) is fixed then mprotect can be used for library
> > mapping too which is simpler than re-mmap.
> 
> I lost track of the userspace fixes here, was glibc changed to attempt a
> re-mmap of the dynamic libraries instead of mprotect()?

yes (so under mdwe, bti is lost on the exe but not on libs)

see the commit message for the fix
https://sourceware.org/bugzilla/show_bug.cgi?id=26831

> 
> It looks like (2) is a simpler fix and (1) could still be added for
> consistency, it's complementary.

i agree.

if (2) is fixed then i would change glibc to use
mprotect and handle the failure (this will require an
update to systemd and disabling mdwe on old kernels)

if (1) is fixed then i would probably still keep
doing mprotect on the main exe so bti protection
works on old kernels.