--enable-stack-protector for glibc, v4, now with arm

Message ID 56D76644.8050906@linaro.org
State Not applicable
Headers

Commit Message

Adhemerval Zanella March 2, 2016, 10:16 p.m. UTC
  On 02-03-2016 15:28, Nix wrote:
> On 28 Feb 2016, nix@esperi.org.uk spake thusly:
> 
>> This is version 4 of the stack-protected glibc patch, incorporating all review
>> comments to date (unless I missed some), and adding a degree of arm support
>> (i.e. "I know nothing about the platform but the tests passed").
> 
> So... other than the changelog, is there anything else I need to do to
> get review of the trickier parts of this? I've exhausted all the
> platforms I have access to and fixed all regressions on such platforms,
> but am happy to test/fix on more if people give me access to them.
> 

I did not review the patch themselves in detail, but I gave them a try on
aarch64 and ppc64le using the default --enable-stack-protector. I saw 
only one regression, which I also saw on i686 and x86_64 and (and makes
me question either why you did not see it or if you just overlook it):

FAIL: elf/check-localplt

$ cat elf/check-localplt.out 
Extra PLT reference: libc.so: __stack_chk_fail

Since there is no information indicating __stack_chk_fail is local,
the linker creates internal PLT calls.  We can solve it by the same
strategy we used on memcpy/memmove compiler call generation by using
asm symbol hack directive:


This seems to fix x86_64 and powerpc64le (also if we decide to add this
patch please add a comments explaining the issue).

And I also agree with Szabolcs Nagy about patch 15/16, we need to understand
better what is happening on the mentioned assembly routines before adding
hacks. Have you tried remove the assembly implementations for i386 and
let is use the default C implementation to check if it is something
related to the asm routines themselves?
  

Comments

Nix March 3, 2016, 5:34 p.m. UTC | #1
On 2 Mar 2016, Adhemerval Zanella stated:

> On 02-03-2016 15:28, Nix wrote:
>> On 28 Feb 2016, nix@esperi.org.uk spake thusly:
>> 
>>> This is version 4 of the stack-protected glibc patch, incorporating all review
>>> comments to date (unless I missed some), and adding a degree of arm support
>>> (i.e. "I know nothing about the platform but the tests passed").
>> 
>> So... other than the changelog, is there anything else I need to do to
>> get review of the trickier parts of this? I've exhausted all the
>> platforms I have access to and fixed all regressions on such platforms,
>> but am happy to test/fix on more if people give me access to them.
>
> I did not review the patch themselves in detail, but I gave them a try on
> aarch64 and ppc64le using the default --enable-stack-protector. I saw 
> only one regression,

OK, if it worked that well not only on 64-bit ARM but also on a platform
I never went near, I think we can say all the major holes are covered!
(I honestly expected a bit of per-arch protection to be needed on most
arches, as was needed on SPARC and x86_64 both -- but none was needed on
32-bit ARM either, so maybe I was being too paranoid.)

>                      which I also saw on i686 and x86_64 and (and makes
> me question either why you did not see it or if you just overlook it):
>
> FAIL: elf/check-localplt
>
> $ cat elf/check-localplt.out 
> Extra PLT reference: libc.so: __stack_chk_fail
>
> Since there is no information indicating __stack_chk_fail is local,
> the linker creates internal PLT calls.

I left that in specifically because I wasn't sure which way to go (we
see it on all platforms, including SPARC and ARM). It is quite possibly
desirable to de-PLTify this (because these calls are frequently made),
but it's *also* quite possibly desirable not to do that, because, like
malloc(), __stack_chk_fail() might reasonably be something that people
might want to intercept, to change what happens on stack overrun.

I'm not sure which is preferable. I guess if the fortification hooks
aren't overrideable, neither should these be.  BUt then, we can't really
use __fortify_fail() as a model, because glibc *implements*
fortification, so it contains all the calls to __fortify_fail() itself:
the same is not true of stack-protection, where random user programs can
and will also call __stack_chk_fail().

>                                        We can solve it by the same
> strategy we used on memcpy/memmove compiler call generation by using
> asm symbol hack directive:

Yeah, that would work. __stack_chk_fail would still be there, and
callable by everyone else, and overrideable for users outside of glibc
(because __stack_chk_fail() still *has* a PLT). Of course, if any users
do override it, they'll see rather inconsistent behaviour, because libc
will still use the old one, but libresolv, for example, will use the
overridden one! Is that inconsistency something that we should just
ignore?

> This seems to fix x86_64 and powerpc64le (also if we decide to add this
> patch please add a comments explaining the issue).

Definitely!

> And I also agree with Szabolcs Nagy about patch 15/16, we need to understand
> better what is happening on the mentioned assembly routines before adding
> hacks. Have you tried remove the assembly implementations for i386 and
> let is use the default C implementation to check if it is something
> related to the asm routines themselves?

It must be: other platforms don't have those implementations, and also
don't need the hacks (they didn't exist before I started working on this
patch series, and x86_64 was fine: only x86_32 saw test failures).

In the case of __sigjmp_save, the cause is clear: it's sibcalled to from
assembler and clearly must be (One of my worries is that other arches
may have assembler that does sibcalls in similar fashion without
mentioning it in comments or anything so my greps failed to find it...
but I hope that if any such exists, it'll pop up as a bug so we can fix
it, or someone here will know and mention it.)

Hm.

Looking at the prototypes of the functions in question, we see (dropping
my additions):

int
internal_function attribute_hidden
__pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)

void
internal_function
__pthread_mutex_cond_lock_adjust (pthread_mutex_t *mutex)

and looking at the assembler in
sysdeps/unix/sysv/linux/i386/pthread_cond_timedwait.S, it is obvious
enough even to someone with my extremely minimal x86 asm experience that
this thing is not exactly feeling constrained by the normal way one
calls functions: the fast unlock path is calling
__pthread_mutex_unlock_usercnt() without having at any point adjusted
%esp downwards or anything, it does that afterwards on a slow path --
essentially the call is overlaid on top of the stack frame of
pthread_cond_timedwait() stack frame, like a weird handrolled sibcall
which returns conentionally or something. I'm not surprised that adding
a canary to something called like *that* makes things explode: I suspect
that adding any extra local variables to
__pthread_mutex_unlock_usercnt() that weren't optimized out would do the
same thing. (If so, there should probably be a warning comment in that
function!)

__pthread_mutex_cond_lock_adjust() can also be called in a similar
fashion (jumping via labels 1, 2, 5): __pthread_mutex_cond_lock() is
also called the same way but the function returns immediately
afterwards, and it also gets called more conventionally, so I guess no
harm is done there.)

Now if anyone who's actually done more than a trivial amount of x86
assembler in the last fifteen years could confirm that, I'd be very
happy... since it's quite possible I'm misreading it all. :)
  
Adhemerval Zanella March 3, 2016, 6:25 p.m. UTC | #2
On 03-03-2016 14:34, Nix wrote:
> On 2 Mar 2016, Adhemerval Zanella stated:
> 
>> On 02-03-2016 15:28, Nix wrote:
>>> On 28 Feb 2016, nix@esperi.org.uk spake thusly:
>>>
>>>> This is version 4 of the stack-protected glibc patch, incorporating all review
>>>> comments to date (unless I missed some), and adding a degree of arm support
>>>> (i.e. "I know nothing about the platform but the tests passed").
>>>
>>> So... other than the changelog, is there anything else I need to do to
>>> get review of the trickier parts of this? I've exhausted all the
>>> platforms I have access to and fixed all regressions on such platforms,
>>> but am happy to test/fix on more if people give me access to them.
>>
>> I did not review the patch themselves in detail, but I gave them a try on
>> aarch64 and ppc64le using the default --enable-stack-protector. I saw 
>> only one regression,
> 
> OK, if it worked that well not only on 64-bit ARM but also on a platform
> I never went near, I think we can say all the major holes are covered!
> (I honestly expected a bit of per-arch protection to be needed on most
> arches, as was needed on SPARC and x86_64 both -- but none was needed on
> 32-bit ARM either, so maybe I was being too paranoid.)

I thought too, although based on the patchset the level of arch-specific
hackery is still limited on some specific implementations.

I see also there are some arch specific implementations that issues
stack allocation (for instance strcspn for powerpc64) which won't have
instrumented code.  However I do not see how to accomplish unless it
it is being explicit added on such implementations.  I do no see this
as an impeding for patch, but I think it might require some notes in
documentation.

> 
>>                      which I also saw on i686 and x86_64 and (and makes
>> me question either why you did not see it or if you just overlook it):
>>
>> FAIL: elf/check-localplt
>>
>> $ cat elf/check-localplt.out 
>> Extra PLT reference: libc.so: __stack_chk_fail
>>
>> Since there is no information indicating __stack_chk_fail is local,
>> the linker creates internal PLT calls.
> 
> I left that in specifically because I wasn't sure which way to go (we
> see it on all platforms, including SPARC and ARM). It is quite possibly
> desirable to de-PLTify this (because these calls are frequently made),
> but it's *also* quite possibly desirable not to do that, because, like
> malloc(), __stack_chk_fail() might reasonably be something that people
> might want to intercept, to change what happens on stack overrun.
> 
> I'm not sure which is preferable. I guess if the fortification hooks
> aren't overrideable, neither should these be.  BUt then, we can't really
> use __fortify_fail() as a model, because glibc *implements*
> fortification, so it contains all the calls to __fortify_fail() itself:
> the same is not true of stack-protection, where random user programs can
> and will also call __stack_chk_fail().

I am not sure either, although imho I would prefer to not allow user to
override *internal* GLIBC stack overrun.  So I am inclined in de-PLTify
internal libc.so usage of __stack_chk_fail().

> 
>>                                        We can solve it by the same
>> strategy we used on memcpy/memmove compiler call generation by using
>> asm symbol hack directive:
> 
> Yeah, that would work. __stack_chk_fail would still be there, and
> callable by everyone else, and overrideable for users outside of glibc
> (because __stack_chk_fail() still *has* a PLT). Of course, if any users
> do override it, they'll see rather inconsistent behaviour, because libc
> will still use the old one, but libresolv, for example, will use the
> overridden one! Is that inconsistency something that we should just
> ignore?

One option is add a private symbol in libc.so, say __stack_chk_fail_glibc,
and add the alias for other libraries to this symbol.  This won't prevent
an user to really override with non-default options (like implementing
the __stack_chk_fail_glibc as well).

Another option will be to link each library with stack_chk_fail.o and
pulling a local copy of it and thus preventing overriding. 

> 
>> This seems to fix x86_64 and powerpc64le (also if we decide to add this
>> patch please add a comments explaining the issue).
> 
> Definitely!
> 
>> And I also agree with Szabolcs Nagy about patch 15/16, we need to understand
>> better what is happening on the mentioned assembly routines before adding
>> hacks. Have you tried remove the assembly implementations for i386 and
>> let is use the default C implementation to check if it is something
>> related to the asm routines themselves?
> 
> It must be: other platforms don't have those implementations, and also
> don't need the hacks (they didn't exist before I started working on this
> patch series, and x86_64 was fine: only x86_32 saw test failures).
> 
> In the case of __sigjmp_save, the cause is clear: it's sibcalled to from
> assembler and clearly must be (One of my worries is that other arches
> may have assembler that does sibcalls in similar fashion without
> mentioning it in comments or anything so my greps failed to find it...
> but I hope that if any such exists, it'll pop up as a bug so we can fix
> it, or someone here will know and mention it.)
> 
> Hm.
> 
> Looking at the prototypes of the functions in question, we see (dropping
> my additions):
> 
> int
> internal_function attribute_hidden
> __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
> 
> void
> internal_function
> __pthread_mutex_cond_lock_adjust (pthread_mutex_t *mutex)
> 
> and looking at the assembler in
> sysdeps/unix/sysv/linux/i386/pthread_cond_timedwait.S, it is obvious
> enough even to someone with my extremely minimal x86 asm experience that
> this thing is not exactly feeling constrained by the normal way one
> calls functions: the fast unlock path is calling
> __pthread_mutex_unlock_usercnt() without having at any point adjusted
> %esp downwards or anything, it does that afterwards on a slow path --
> essentially the call is overlaid on top of the stack frame of
> pthread_cond_timedwait() stack frame, like a weird handrolled sibcall
> which returns conentionally or something. I'm not surprised that adding
> a canary to something called like *that* makes things explode: I suspect
> that adding any extra local variables to
> __pthread_mutex_unlock_usercnt() that weren't optimized out would do the
> same thing. (If so, there should probably be a warning comment in that
> function!)
> 
> __pthread_mutex_cond_lock_adjust() can also be called in a similar
> fashion (jumping via labels 1, 2, 5): __pthread_mutex_cond_lock() is
> also called the same way but the function returns immediately
> afterwards, and it also gets called more conventionally, so I guess no
> harm is done there.)
> 
> Now if anyone who's actually done more than a trivial amount of x86
> assembler in the last fifteen years could confirm that, I'd be very
> happy... since it's quite possible I'm misreading it all. :)
> 

IMHO I would just prefer to get rid of this specific assembly 
implementation.  In fact there are some patches that do intend to do
it (new cond var from Tovarlds and my cancellation refactor) and your
work is another reason to avoid large reimplementation in assembly
with a very strong performance reason to do so.
  
Nix March 3, 2016, 7:45 p.m. UTC | #3
On 3 Mar 2016, Adhemerval Zanella outgrape:

> On 03-03-2016 14:34, Nix wrote:
>> OK, if it worked that well not only on 64-bit ARM but also on a platform
>> I never went near, I think we can say all the major holes are covered!
>> (I honestly expected a bit of per-arch protection to be needed on most
>> arches, as was needed on SPARC and x86_64 both -- but none was needed on
>> 32-bit ARM either, so maybe I was being too paranoid.)
>
> I thought too, although based on the patchset the level of arch-specific
> hackery is still limited on some specific implementations.
>
> I see also there are some arch specific implementations that issues
> stack allocation (for instance strcspn for powerpc64) which won't have
> instrumented code.  However I do not see how to accomplish unless it
> it is being explicit added on such implementations.  I do no see this
> as an impeding for patch, but I think it might require some notes in
> documentation.

Yeah, there are plenty of those. All the accelerated assembler
implementations of everything will not get stack-protection, at least
not yet, which includes a great many string functions. But this is a
start, and the remaining pieces can probably be filled in piecemeal by
someone who knows the assembler in question :)

>> I'm not sure which is preferable. I guess if the fortification hooks
>> aren't overrideable, neither should these be.  BUt then, we can't really
>> use __fortify_fail() as a model, because glibc *implements*
>> fortification, so it contains all the calls to __fortify_fail() itself:
>> the same is not true of stack-protection, where random user programs can
>> and will also call __stack_chk_fail().
>
> I am not sure either, although imho I would prefer to not allow user to
> override *internal* GLIBC stack overrun.  So I am inclined in de-PLTify
> internal libc.so usage of __stack_chk_fail().

OK, I'll make that change. (My worry is that people using things like X
servers, where if it dies the whole system might end up more or less
locking up, might want to replace __stack_chk_fail() in order to
possibly switch stacks and return the hardware to a sane state. This is
less relevant now that most X servers don't pound on the hardware
directly, though...)

>> Yeah, that would work. __stack_chk_fail would still be there, and
>> callable by everyone else, and overrideable for users outside of glibc
>> (because __stack_chk_fail() still *has* a PLT). Of course, if any users
>> do override it, they'll see rather inconsistent behaviour, because libc
>> will still use the old one, but libresolv, for example, will use the
>> overridden one! Is that inconsistency something that we should just
>> ignore?
>
> One option is add a private symbol in libc.so, say __stack_chk_fail_glibc,
> and add the alias for other libraries to this symbol.  This won't prevent
> an user to really override with non-default options (like implementing
> the __stack_chk_fail_glibc as well).

Yeah.

> Another option will be to link each library with stack_chk_fail.o and
> pulling a local copy of it and thus preventing overriding. 

I was trying to avoid that sort of duplication. It seemed to be a recipe
for trouble later on -- but it *does* have that one advantage.

I am neutral on this, pick one :)

>> and looking at the assembler in
>> sysdeps/unix/sysv/linux/i386/pthread_cond_timedwait.S, it is obvious
>> enough even to someone with my extremely minimal x86 asm experience that
>> this thing is not exactly feeling constrained by the normal way one
>> calls functions: the fast unlock path is calling
>> __pthread_mutex_unlock_usercnt() without having at any point adjusted
>> %esp downwards or anything, it does that afterwards on a slow path --
>> essentially the call is overlaid on top of the stack frame of
>> pthread_cond_timedwait() stack frame, like a weird handrolled sibcall
>> which returns conentionally or something. I'm not surprised that adding
>> a canary to something called like *that* makes things explode: I suspect
>> that adding any extra local variables to
>> __pthread_mutex_unlock_usercnt() that weren't optimized out would do the
>> same thing. (If so, there should probably be a warning comment in that
>> function!)
>> 
>> __pthread_mutex_cond_lock_adjust() can also be called in a similar
>> fashion (jumping via labels 1, 2, 5): __pthread_mutex_cond_lock() is
>> also called the same way but the function returns immediately
>> afterwards, and it also gets called more conventionally, so I guess no
>> harm is done there.)
>> 
>> Now if anyone who's actually done more than a trivial amount of x86
>> assembler in the last fifteen years could confirm that, I'd be very
>> happy... since it's quite possible I'm misreading it all. :)
>
> IMHO I would just prefer to get rid of this specific assembly 
> implementation.  In fact there are some patches that do intend to do
> it (new cond var from Tovarlds and my cancellation refactor) and your
> work is another reason to avoid large reimplementation in assembly
> with a very strong performance reason to do so.

I'd guess the strong performance reason was probably to make the
uncontended path blazingly fast (look at it, it's hardly any
instructions at all). Isn't that the whole point of futexes?

Anyway, I'm not ditching the implementation in this patch series!
  
Nix March 4, 2016, 11:12 a.m. UTC | #4
On 3 Mar 2016, nix@esperi.org.uk said:

> I left that in specifically because I wasn't sure which way to go (we
> see it on all platforms, including SPARC and ARM). It is quite possibly
> desirable to de-PLTify this (because these calls are frequently made),
> but it's *also* quite possibly desirable not to do that, because, like
> malloc(), __stack_chk_fail() might reasonably be something that people
> might want to intercept, to change what happens on stack overrun.

"It's plausible" is of course code for "I have no idea and am guessing".

I just had a look on codesearch.debian.net. Overriders (ignoring other
libcs and early runtimes in things like openbios and grub which do
not link against a libc anyway):

physfs, on MacOS X only, because libssp is missing;
wine, overriding it with nothing just to placate the linker;
dmtcp, just prints an error and aborts, would be happy with ours;
       run in bizarre circumstances out of an mmap()ed memory region and
       also includes stubs for a lot of libgcc
condor, provides __stack_chk_fail itself iff glibc is old, so clearly
        *expects* glibc to provide it

*Nobody* overrides __stack_chk_fail with anything significant enough
that they'd be unhappy if we provided an implementation, so I'm going to
go with what you suggested and just de-PLTify it for local within-glibc
calls, leaving it unchanged for everything else.


I also note this code in the binutils ld testsuite:

    set flags "--defsym __stack_chk_fail=0"

which gives me confidence that my independent evolution of the same
horrendous hack in the librtld mapfile computation is not all that bad :)
  
Nix March 7, 2016, 4:27 p.m. UTC | #5
On 3 Mar 2016, nix@esperi.org.uk spake thusly:
> Looking at the prototypes of the functions in question, we see (dropping
> my additions):
>
> int
> internal_function attribute_hidden
> __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
>
> void
> internal_function
> __pthread_mutex_cond_lock_adjust (pthread_mutex_t *mutex)
>
> and looking at the assembler in
> sysdeps/unix/sysv/linux/i386/pthread_cond_timedwait.S, it is obvious
> enough even to someone with my extremely minimal x86 asm experience that
> this thing is not exactly feeling constrained by the normal way one
> calls functions

Well, no. That's because internal_function expands to -mregparm on x86,
which I quite forgot -- and the cause depends on this.

Look at line 313 of pthread_cond_wait.S. Here's the block around it:

        /* We need to go back to futex_wait.  If we're using requeue_pi, then
           release the mutex we had acquired and go back.  */
22:     movl    16(%esp), %edx
        test    %edx, %edx
        jz      8b

        /* Adjust the mutex values first and then unlock it.  The unlock
           should always succeed or else the kernel did not lock the mutex
           correctly.  */
        movl    dep_mutex(%ebx), %eax
        call    __pthread_mutex_cond_lock_adjust
        xorl    %edx, %edx
        call    __pthread_mutex_unlock_usercnt
        jmp     8b

Now that second call to __pthread_mutex_unlock_usercnt() is interesting,
because *no* effort is made to set up the first parameter, the mutex, in
%eax: it's assumed to persist from __pthread_mutex_cond_lock_adjust().
And indeed if you look at the generated x86 assembler for the
non-stack-protected version of __pthread_mutex_cond_lock_adjust(),
you'll see that %eax is consulted but is never modified at any point.
However, when stack-protecting, %eax is used in the epilogue to test the
stack canary, and is smashed in the process, wrecking the call to
__pthread_mutex_unlock_usercnt(). (The second argument, passed in %edx,
is zeroed right there, so no such dependency exists for it.)

pthread_cond_timedwait.S has identical code in label 15.

How should I put it, this is dodgy. This is really dodgy. It's depending
on the compiler's register allocation for regparm functions never
changing to clobber %eax, which, well, we've already seen how reliable
*that* is. It's caller-saved and the caller is not saving it!

Adding a single instruction to both files to reload %eax fixes it, at
minimal cost (this is already a fallback path), and lets us stack-
protect the entire C-side mutex unlocking code path, eliminating the
mystery. I'll do that in the next patch series.

I have also audited all the x86 code in glibc for signs of the same
malaise, and found one instance, in
sysdeps/mach/hurd/i386/____longjmp_chk.S: I fixed that by decorating
_hurd_self_sigstate with inhibit_stack_protector, both because I can't
test it easily and because it seemed far simpler than the converse (that
function is normally inlined in any case). That'll also be in the
next series.

(Testing now, on the usual combination of x86-32/64, sparc-32/64 and
ARM.)
  

Patch

diff --git a/sysdeps/generic/symbol-hacks.h b/sysdeps/generic/symbol-hacks.h
index ce576c9..12829cc 100644
--- a/sysdeps/generic/symbol-hacks.h
+++ b/sysdeps/generic/symbol-hacks.h
@@ -5,3 +5,7 @@  asm ("memmove = __GI_memmove");
 asm ("memset = __GI_memset");
 asm ("memcpy = __GI_memcpy");
 #endif
+
+#if !defined __ASSEMBLER__ && IS_IN (libc)
+asm ("__stack_chk_fail = __stack_chk_fail_local");
+#endif