Patchwork [05/18] Open-code the memcpy() at static TLS initialization time.

login
register
mail settings
Submitter Nix
Date March 10, 2016, 1 a.m.
Message ID <87y49rkuye.fsf@esperi.org.uk>
Download mbox | patch
Permalink /patch/11297/
State New
Headers show

Comments

Nix - March 10, 2016, 1 a.m.
On 9 Mar 2016, Mike Frysinger uttered the following:

> On 08 Mar 2016 13:50, Nix wrote:
>> This one is a bit nasty.  Now that we are initializing TLS earlier for
>> the stack canary's sake, existing memcpy() implementations become
>> problematic.  We can use the multiarch implementations, but they might
>> not always be present, and even if they are present they might not always
>> be in assembler, so might be compiled with stack-protection.  We cannot
>> use posix/memcpy.c without marking both it and */wordcopy.c as non-stack-
>> protected, which for memcpy() of all things seems like a seriously bad
>> idea: if any function in glibc should be stack-protected, it's memcpy()
>> (though stack-protecting the many optimized assembly versions is not done
>> in this patch series).
>> 
>> So we have two real options: hack up the guts of posix/memcpy.c and
>> */wordcopy.c so that they can be #included (renamed and declared static)
>> inside libc-tls.c, or simply open-code the memcpy().  For simplicity's
>> sake, this patch open-codes it, on the grounds that static binaries are
>> relatively rare and quasi-deprecated anyway, and static binaries with
>> large TLS sections are yet rarer, and not worth the complexity of hacking
>> up all the arch-dependent wordcopy files.
>
> can we utilize _HAVE_STRING_ARCH_memcpy here ?  the string includes will
> sometimes provide inlined asm versions ...

I think so, *if* we can be sure that arches that define
_HAVE_STRING_ARCH_memcpy will never expand memcpy() to macros that
sometimes call functions, but only to pure inline asm stuff. Since there
is only one definer of that (x86), I'm not sure that's guaranteed...

(This does explain why my original implementation, which didn't do any
of this, also didn't have any problems on x86!)

How's this instead? (Seems to work just as well on x86, which is after
all the only arch this change will affect at all.)

From 3427487ee08586f76f711758795dba31b62c4238 Mon Sep 17 00:00:00 2001
From: Nick Alcock <nick.alcock@oracle.com>
Date: Tue, 23 Feb 2016 11:08:38 +0000
Subject: [PATCH] Open-code the memcpy() at static TLS initialization time.

This one is a bit nasty.  Now that we are initializing TLS earlier for
the stack canary's sake, existing memcpy() implementations become
problematic.  We can use the multiarch implementations, but they might
not always be present, and even if they are present they might not always
be in assembler, so might be compiled with stack-protection.  We cannot
use posix/memcpy.c without marking both it and */wordcopy.c as non-stack-
protected, which for memcpy() of all things seems like a seriously bad
idea: if any function in glibc should be stack-protected, it's memcpy()
(though stack-protecting the many optimized assembly versions is not done
in this patch series).

So we have two real options: hack up the guts of posix/memcpy.c and
*/wordcopy.c so that they can be #included (renamed and declared static)
inside libc-tls.c, or simply open-code the memcpy().  For simplicity's
sake, this patch open-codes it, on the grounds that static binaries are
relatively rare and quasi-deprecated anyway, and static binaries with
large TLS sections are yet rarer, and not worth the complexity of hacking
up all the arch-dependent wordcopy files.  There is one exception: if
the arch provides an inline assembler memcpy() implementation, we can
use that in preference.

(This was not revealed when testing on x86 because on that platform
GCC was open-coding the memcpy() for us.)

v2: New, lets us remove the memcpy() -fno-stack-protection, which wasn't
    enough in any case.
v4: Add an inhibit_loop_to_libcall to prevent GCC from turning the loop
    back into a memcpy() again.  Wrap long lines.
v6: Use the inline assembler ARCH_memcpy if available.

	* csu/libc-tls.c (__libc_setup_tls): Add inhibit_loop_to_libcall
	to avoid calls to potentially ifunced or stack-protected memcpy.
	Open-code the TLS-initialization memcpy.
---
 csu/libc-tls.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)
Mike Frysinger - March 10, 2016, 2:28 a.m.
On 10 Mar 2016 01:00, Nix wrote:
> On 9 Mar 2016, Mike Frysinger uttered the following:
> > On 08 Mar 2016 13:50, Nix wrote:
> >> This one is a bit nasty.  Now that we are initializing TLS earlier for
> >> the stack canary's sake, existing memcpy() implementations become
> >> problematic.  We can use the multiarch implementations, but they might
> >> not always be present, and even if they are present they might not always
> >> be in assembler, so might be compiled with stack-protection.  We cannot
> >> use posix/memcpy.c without marking both it and */wordcopy.c as non-stack-
> >> protected, which for memcpy() of all things seems like a seriously bad
> >> idea: if any function in glibc should be stack-protected, it's memcpy()
> >> (though stack-protecting the many optimized assembly versions is not done
> >> in this patch series).
> >> 
> >> So we have two real options: hack up the guts of posix/memcpy.c and
> >> */wordcopy.c so that they can be #included (renamed and declared static)
> >> inside libc-tls.c, or simply open-code the memcpy().  For simplicity's
> >> sake, this patch open-codes it, on the grounds that static binaries are
> >> relatively rare and quasi-deprecated anyway, and static binaries with
> >> large TLS sections are yet rarer, and not worth the complexity of hacking
> >> up all the arch-dependent wordcopy files.
> >
> > can we utilize _HAVE_STRING_ARCH_memcpy here ?  the string includes will
> > sometimes provide inlined asm versions ...
> 
> I think so, *if* we can be sure that arches that define
> _HAVE_STRING_ARCH_memcpy will never expand memcpy() to macros that
> sometimes call functions, but only to pure inline asm stuff. Since there
> is only one definer of that (x86), I'm not sure that's guaranteed...

i guess look at the other arch inlines to see what they do.  i honestly
don't know where or how well these APIs are defined.

> +  /* sbrk gives us zero'd memory, so we don't need to clear the remainder.
> +
> +     Use inlined asm implementation if available: otherwise, copy by hand,
> +     because memcpy() is stack-protected and is often multiarch too.  */
> +
> +#if defined _HAVE_STRING_ARCH_memcpy
>    memcpy (_dl_static_dtv[2].pointer.val, initimage, filesz);

should this also depend on SSP not being enabled ?
#if defined _HAVE_STRING_ARCH_memcpy || !defined <whatever SSP define>

otherwise, you'd want to use "#ifdef".
-mike
Adhemerval Zanella Netto - March 10, 2016, 3:02 a.m.
On 10-03-2016 09:28, Mike Frysinger wrote:
> On 10 Mar 2016 01:00, Nix wrote:
>> On 9 Mar 2016, Mike Frysinger uttered the following:
>>> On 08 Mar 2016 13:50, Nix wrote:
>>>> This one is a bit nasty.  Now that we are initializing TLS earlier for
>>>> the stack canary's sake, existing memcpy() implementations become
>>>> problematic.  We can use the multiarch implementations, but they might
>>>> not always be present, and even if they are present they might not always
>>>> be in assembler, so might be compiled with stack-protection.  We cannot
>>>> use posix/memcpy.c without marking both it and */wordcopy.c as non-stack-
>>>> protected, which for memcpy() of all things seems like a seriously bad
>>>> idea: if any function in glibc should be stack-protected, it's memcpy()
>>>> (though stack-protecting the many optimized assembly versions is not done
>>>> in this patch series).
>>>>
>>>> So we have two real options: hack up the guts of posix/memcpy.c and
>>>> */wordcopy.c so that they can be #included (renamed and declared static)
>>>> inside libc-tls.c, or simply open-code the memcpy().  For simplicity's
>>>> sake, this patch open-codes it, on the grounds that static binaries are
>>>> relatively rare and quasi-deprecated anyway, and static binaries with
>>>> large TLS sections are yet rarer, and not worth the complexity of hacking
>>>> up all the arch-dependent wordcopy files.
>>>
>>> can we utilize _HAVE_STRING_ARCH_memcpy here ?  the string includes will
>>> sometimes provide inlined asm versions ...
>>
>> I think so, *if* we can be sure that arches that define
>> _HAVE_STRING_ARCH_memcpy will never expand memcpy() to macros that
>> sometimes call functions, but only to pure inline asm stuff. Since there
>> is only one definer of that (x86), I'm not sure that's guaranteed...
> 
> i guess look at the other arch inlines to see what they do.  i honestly
> don't know where or how well these APIs are defined.
> 
>> +  /* sbrk gives us zero'd memory, so we don't need to clear the remainder.
>> +
>> +     Use inlined asm implementation if available: otherwise, copy by hand,
>> +     because memcpy() is stack-protected and is often multiarch too.  */
>> +
>> +#if defined _HAVE_STRING_ARCH_memcpy
>>    memcpy (_dl_static_dtv[2].pointer.val, initimage, filesz);
> 
> should this also depend on SSP not being enabled ?
> #if defined _HAVE_STRING_ARCH_memcpy || !defined <whatever SSP define>
> 
> otherwise, you'd want to use "#ifdef".
> -mike
> 

Is _HAVE_STRING_ARCH_memcpy really doing a better job than gcc? I would prefer
to just use the open memcpy and let it optimize it.
Nix - March 10, 2016, 10:20 a.m.
On 10 Mar 2016, Adhemerval Zanella stated:

> On 10-03-2016 09:28, Mike Frysinger wrote:
>> otherwise, you'd want to use "#ifdef".
>> -mike
>
> Is _HAVE_STRING_ARCH_memcpy really doing a better job than gcc? I would prefer
> to just use the open memcpy and let it optimize it.

In most cases I'd agree with you but the thing about
_HAVE_STRING_ARCH_memcpy is that GCC can decide *not* to optimize it:
though we have suppressed generation of libcalls in this function, not
even a call to __builtin_memcpy() is guaranteed to generate inline asm
with no function calls, as far as I know.

(We could start messing about with
__attribute__((__optimize__("-mmemcpy-strategy=STRATEGY"))), but this is
really starting to get a bit over the top, I think... particularly given
that the only platform that works on is *also* the only platform we have
ARCH_memcpy for, so the only platform we don't need that sort of thing
on.)
Nix - March 10, 2016, 10:29 a.m.
On 10 Mar 2016, Mike Frysinger verbalised:

> On 10 Mar 2016 01:00, Nix wrote:
>> I think so, *if* we can be sure that arches that define
>> _HAVE_STRING_ARCH_memcpy will never expand memcpy() to macros that
>> sometimes call functions, but only to pure inline asm stuff. Since there
>> is only one definer of that (x86), I'm not sure that's guaranteed...
>
> i guess look at the other arch inlines to see what they do.  i honestly
> don't know where or how well these APIs are defined.

The only platform that defines *any* string arch operations is x86. None
of those fall back to out-of-line functions at all, even for huge inputs
(which for some functions and some larger inputs strikes me as a
terrible implementation choice, but ah well.)

>> +  /* sbrk gives us zero'd memory, so we don't need to clear the remainder.
>> +
>> +     Use inlined asm implementation if available: otherwise, copy by hand,
>> +     because memcpy() is stack-protected and is often multiarch too.  */
>> +
>> +#if defined _HAVE_STRING_ARCH_memcpy
>>    memcpy (_dl_static_dtv[2].pointer.val, initimage, filesz);
>
> should this also depend on SSP not being enabled ?
> #if defined _HAVE_STRING_ARCH_memcpy || !defined <whatever SSP define>

Probably a good idea: it makes things less invasive for the common case.
(Though your belief that there is just *one* SSP define is alas untrue:
we'll have to check all three.)

Adding.
Nix - March 10, 2016, 3:13 p.m.
On 10 Mar 2016, nix@esperi.org.uk stated:

> On 10 Mar 2016, Mike Frysinger verbalised:
>> should this also depend on SSP not being enabled ?
>> #if defined _HAVE_STRING_ARCH_memcpy || !defined <whatever SSP define>
>
> Probably a good idea: it makes things less invasive for the common case.
> (Though your belief that there is just *one* SSP define is alas untrue:
> we'll have to check all three.)

Heh. This has additional fun complications: all of csu/ is built without
stack protection, so we cannot rely on the macros the compiler
predefines to indicate that stack protection is on: for csu/, it's not,
ever.

So I do have to introduce a new #define here, to indicate that
stack-protection is globally enabled even if it's off for some file in
particular (easy to do in configure.ac etc: folding that part into patch
1).

Patch

diff --git a/csu/libc-tls.c b/csu/libc-tls.c
index 3d67a64..af0d3e6 100644
--- a/csu/libc-tls.c
+++ b/csu/libc-tls.c
@@ -102,6 +102,7 @@  init_static_tls (size_t memsz, size_t align)
 }
 
 void
+inhibit_loop_to_libcall
 __libc_setup_tls (size_t tcbsize, size_t tcbalign)
 {
   void *tlsblock;
@@ -176,8 +177,22 @@  __libc_setup_tls (size_t tcbsize, size_t tcbalign)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
   _dl_static_dtv[2].pointer.is_static = true;
-  /* sbrk gives us zero'd memory, so we don't need to clear the remainder.  */
+
+  /* sbrk gives us zero'd memory, so we don't need to clear the remainder.
+
+     Use inlined asm implementation if available: otherwise, copy by hand,
+     because memcpy() is stack-protected and is often multiarch too.  */
+
+#if defined _HAVE_STRING_ARCH_memcpy
   memcpy (_dl_static_dtv[2].pointer.val, initimage, filesz);
+#else
+  char *dst = (char *) _dl_static_dtv[2].pointer.val;
+  char *src = (char *) initimage;
+  size_t i;
+
+  for (i = 0; i < filesz; dst++, src++, i++)
+    *dst = *src;
+#endif
 
   /* Install the pointer to the dtv.  */