Compile elf/rtld.c with -fno-tree-loop-distribute-patterns.

Message ID 20191121021040.14554-1-sandra@codesourcery.com
State Committed
Headers

Commit Message

Sandra Loosemore Nov. 21, 2019, 2:10 a.m. UTC
  In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns.
This optimization causes GCC to "helpfully" convert the hand-written
loop in _dl_start into a call to memset, which is not available that
early in program startup.  Similar problems in other places in GLIBC
have been addressed by explicitly building with
-fno-tree-loop-distribute-patterns, but this one may have been
overlooked previously because it only affects targets where
HAVE_BUILTIN_MEMSET is not defined.

This patch fixes a bug observed on nios2-linux-gnu target that caused
all programs to segv on startup.
---
 elf/Makefile | 6 ++++++
 1 file changed, 6 insertions(+)
  

Comments

Florian Weimer Nov. 23, 2019, 1:25 p.m. UTC | #1
* Sandra Loosemore:

> In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns.
> This optimization causes GCC to "helpfully" convert the hand-written
> loop in _dl_start into a call to memset, which is not available that
> early in program startup.  Similar problems in other places in GLIBC
> have been addressed by explicitly building with
> -fno-tree-loop-distribute-patterns, but this one may have been
> overlooked previously because it only affects targets where
> HAVE_BUILTIN_MEMSET is not defined.

Thanks for pointing out this problem.

> +# On targets without __builtin_memset, rtld.c uses a hand-coded loop
> +# in _dl_start.  Make sure this isn't turned into a call to regular memset.
> +ifeq (yes,$(have-loop-to-function))
> +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns
> +endif

Is it possible to do this via a pragma?  If I understand things
correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets
(where initialization of the dynamic loader is simpler).
  
Sandra Loosemore Nov. 23, 2019, 10:54 p.m. UTC | #2
On 11/23/19 6:25 AM, Florian Weimer wrote:
> * Sandra Loosemore:
> 
>> In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns.
>> This optimization causes GCC to "helpfully" convert the hand-written
>> loop in _dl_start into a call to memset, which is not available that
>> early in program startup.  Similar problems in other places in GLIBC
>> have been addressed by explicitly building with
>> -fno-tree-loop-distribute-patterns, but this one may have been
>> overlooked previously because it only affects targets where
>> HAVE_BUILTIN_MEMSET is not defined.
> 
> Thanks for pointing out this problem.
> 
>> +# On targets without __builtin_memset, rtld.c uses a hand-coded loop
>> +# in _dl_start.  Make sure this isn't turned into a call to regular memset.
>> +ifeq (yes,$(have-loop-to-function))
>> +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns
>> +endif
> 
> Is it possible to do this via a pragma?  If I understand things
> correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets
> (where initialization of the dynamic loader is simpler).

I see that the definitions of memset and memmove use 
"inhibit_loop_to_libcall" (which expands into an optimize attribute) to 
prevent recursion, but I didn't think the header where that is defined 
(include/libc-symbols.h) is supposed to be included in the dynamic 
linker?  Also, already in elf/Makefile there is another instance where 
it adds -fno-tree-loop-distribute-patterns to the CFLAGS, so I just 
copied that.  I don't work with glibc internals enough to have a good 
feel for what the preferred solution is but I'll test a different 
solution if this one isn't good enough.

-Sandra
  
Florian Weimer Nov. 25, 2019, 8:08 a.m. UTC | #3
* Sandra Loosemore:

>> Is it possible to do this via a pragma?  If I understand things
>> correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets
>> (where initialization of the dynamic loader is simpler).
>
> I see that the definitions of memset and memmove use 
> "inhibit_loop_to_libcall" (which expands into an optimize attribute) to 
> prevent recursion, but I didn't think the header where that is defined 
> (include/libc-symbols.h) is supposed to be included in the dynamic 
> linker?

elf/rtld.os is built with -include ../include/libc-symbols.h, so the
declaration should be in scope.  I don't know why it is not effective.
It probably only applies to the implementations of memset and memmove
themselves (if the generic ones written in C are used).

>  Also, already in elf/Makefile there is another instance where 
> it adds -fno-tree-loop-distribute-patterns to the CFLAGS, so I just 
> copied that.  I don't work with glibc internals enough to have a good 
> feel for what the preferred solution is but I'll test a different 
> solution if this one isn't good enough.

I had hoped we could write something like this at the start of
elf/rtld.c:

#ifndef PI_STATIC_AND_HIDDEN
# pragma GCC optimize ("no-tree-loop-distribute-patterns")
#endif

Then the optimization would still be applied on the targets where it
is safe to do so.

But I don't have a strong opinion about this and would appreciate
feedback from others.
  
Adhemerval Zanella Nov. 25, 2019, 7:18 p.m. UTC | #4
On 25/11/2019 05:08, Florian Weimer wrote:
> * Sandra Loosemore:
> 
>>> Is it possible to do this via a pragma?  If I understand things
>>> correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets
>>> (where initialization of the dynamic loader is simpler).
>>
>> I see that the definitions of memset and memmove use 
>> "inhibit_loop_to_libcall" (which expands into an optimize attribute) to 
>> prevent recursion, but I didn't think the header where that is defined 
>> (include/libc-symbols.h) is supposed to be included in the dynamic 
>> linker?
> 
> elf/rtld.os is built with -include ../include/libc-symbols.h, so the
> declaration should be in scope.  I don't know why it is not effective.
> It probably only applies to the implementations of memset and memmove
> themselves (if the generic ones written in C are used).
> 
>>  Also, already in elf/Makefile there is another instance where 
>> it adds -fno-tree-loop-distribute-patterns to the CFLAGS, so I just 
>> copied that.  I don't work with glibc internals enough to have a good 
>> feel for what the preferred solution is but I'll test a different 
>> solution if this one isn't good enough.
> 
> I had hoped we could write something like this at the start of
> elf/rtld.c:
> 
> #ifndef PI_STATIC_AND_HIDDEN
> # pragma GCC optimize ("no-tree-loop-distribute-patterns")
> #endif
> 
> Then the optimization would still be applied on the targets where it
> is safe to do so.
> 
> But I don't have a strong opinion about this and would appreciate
> feedback from others.
> 

We already have a similar code to handle a similar issue:

 484   /* Partly clean the `bootstrap_map' structure up.  Don't use
 485      `memset' since it might not be built in or inlined and we cannot
 486      make function calls at this point.  Use '__builtin_memset' if we
 487      know it is available.  We do not have to clear the memory if we
 488      do not have to use the temporary bootstrap_map.  Global variables
 489      are initialized to zero by default.  */
 490 #ifndef DONT_USE_BOOTSTRAP_MAP
 491 # ifdef HAVE_BUILTIN_MEMSET
 492   __builtin_memset (bootstrap_map.l_info, '\0', sizeof (bootstrap_map.l_info));
 493 # else
 494   for (size_t cnt = 0;
 495        cnt < sizeof (bootstrap_map.l_info) / sizeof (bootstrap_map.l_info[0]);
 496        ++cnt)
 497     bootstrap_map.l_info[cnt] = 0;
 498 # endif
 499 #endif

The HAVE_BUILTIN_MEMSET on configure.ac check if __builtin_memset itself
calls memset and it is within DONT_USE_BOOTSTRAP_MAP mainly because it is
stack allocated for !PI_STATIC_AND_HIDDEN.

(As a side-note, I really think these kind of micro-optimization is just
over-complicate for minimal gain)

However, I am not sure there is really a restriction regarding
PI_STATIC_AND_HIDDEN and internal function calls.  Just after this memset
call it has:

 516   if (bootstrap_map.l_addr || ! bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])                  
 517     { 
 518       /* Relocate ourselves so we can do normal function calls and                               
 519          data access using the global offset table.  */                                          
 520       
 521       ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0);                                            
 522     }
 523   bootstrap_map.l_relocated = 1; 

So it should be safe to call memcpy/memset calls in _dl_start after this
point (as some ports that with !PI_STATIC_AND_HIDDEN does with memcpy) .
Is gcc creating a mem* calls before ld.so reallocate itself? If it were, 
where exactly?
  
Florian Weimer Nov. 25, 2019, 8:05 p.m. UTC | #5
* Adhemerval Zanella:

> We already have a similar code to handle a similar issue:
>
>  484   /* Partly clean the `bootstrap_map' structure up.  Don't use
>  485      `memset' since it might not be built in or inlined and we cannot
>  486      make function calls at this point.  Use '__builtin_memset' if we
>  487      know it is available.  We do not have to clear the memory if we
>  488      do not have to use the temporary bootstrap_map.  Global variables
>  489      are initialized to zero by default.  */
>  490 #ifndef DONT_USE_BOOTSTRAP_MAP
>  491 # ifdef HAVE_BUILTIN_MEMSET
>  492 __builtin_memset (bootstrap_map.l_info, '\0', sizeof
> (bootstrap_map.l_info));
>  493 # else
>  494   for (size_t cnt = 0;
>  495 cnt < sizeof (bootstrap_map.l_info) / sizeof
> (bootstrap_map.l_info[0]);
>  496        ++cnt)
>  497     bootstrap_map.l_info[cnt] = 0;
>  498 # endif
>  499 #endif
>
> The HAVE_BUILTIN_MEMSET on configure.ac check if __builtin_memset itself
> calls memset and it is within DONT_USE_BOOTSTRAP_MAP mainly because it is
> stack allocated for !PI_STATIC_AND_HIDDEN.
>
> (As a side-note, I really think these kind of micro-optimization is just
> over-complicate for minimal gain)
>
> However, I am not sure there is really a restriction regarding
> PI_STATIC_AND_HIDDEN and internal function calls.  Just after this memset
> call it has:
>
>  516 if (bootstrap_map.l_addr || !
> bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])
>  517     { 
>  518 /* Relocate ourselves so we can do normal function calls and
>  519 data access using the global offset table.  */
>  520       
>  521 ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0);
>  522     }
>  523   bootstrap_map.l_relocated = 1; 
>
> So it should be safe to call memcpy/memset calls in _dl_start after this
> point (as some ports that with !PI_STATIC_AND_HIDDEN does with memcpy) .
> Is gcc creating a mem* calls before ld.so reallocate itself? If it were, 
> where exactly?

In the code you quoted.  It reintroduces the memset call we so
carefully tried to remove.
  
Sandra Loosemore Nov. 25, 2019, 8:45 p.m. UTC | #6
On 11/25/19 1:08 AM, Florian Weimer wrote:
> 
> I had hoped we could write something like this at the start of
> elf/rtld.c:
> 
> #ifndef PI_STATIC_AND_HIDDEN
> # pragma GCC optimize ("no-tree-loop-distribute-patterns")
> #endif
> 
> Then the optimization would still be applied on the targets where it
> is safe to do so.

Well, I could certainly hack up and test a new patch to do that, if that 
is the recommended approach.  But I see no other existing uses of 
"pragma GCC optimize" anywhere in glibc other than in a few test cases, 
while elf/Makefile already specifies -fno-tree-loop-distribute-patterns 
on dl-tunables.c in the same way I added it for rtld.c, so my original 
patch seems more consistent with current practice.

> But I don't have a strong opinion about this and would appreciate
> feedback from others.

Same here.  I just want this bug fixed and will do it in whatever way 
the community agrees on.  As I said, the dynamic linker is currently 
completely broken on nios2 when built with GCC 10.

-Sandra
  
Adhemerval Zanella Nov. 25, 2019, 9:01 p.m. UTC | #7
On 25/11/2019 17:05, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> We already have a similar code to handle a similar issue:
>>
>>  484   /* Partly clean the `bootstrap_map' structure up.  Don't use
>>  485      `memset' since it might not be built in or inlined and we cannot
>>  486      make function calls at this point.  Use '__builtin_memset' if we
>>  487      know it is available.  We do not have to clear the memory if we
>>  488      do not have to use the temporary bootstrap_map.  Global variables
>>  489      are initialized to zero by default.  */
>>  490 #ifndef DONT_USE_BOOTSTRAP_MAP
>>  491 # ifdef HAVE_BUILTIN_MEMSET
>>  492 __builtin_memset (bootstrap_map.l_info, '\0', sizeof
>> (bootstrap_map.l_info));
>>  493 # else
>>  494   for (size_t cnt = 0;
>>  495 cnt < sizeof (bootstrap_map.l_info) / sizeof
>> (bootstrap_map.l_info[0]);
>>  496        ++cnt)
>>  497     bootstrap_map.l_info[cnt] = 0;
>>  498 # endif
>>  499 #endif
>>
>> The HAVE_BUILTIN_MEMSET on configure.ac check if __builtin_memset itself
>> calls memset and it is within DONT_USE_BOOTSTRAP_MAP mainly because it is
>> stack allocated for !PI_STATIC_AND_HIDDEN.
>>
>> (As a side-note, I really think these kind of micro-optimization is just
>> over-complicate for minimal gain)
>>
>> However, I am not sure there is really a restriction regarding
>> PI_STATIC_AND_HIDDEN and internal function calls.  Just after this memset
>> call it has:
>>
>>  516 if (bootstrap_map.l_addr || !
>> bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])
>>  517     { 
>>  518 /* Relocate ourselves so we can do normal function calls and
>>  519 data access using the global offset table.  */
>>  520       
>>  521 ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0);
>>  522     }
>>  523   bootstrap_map.l_relocated = 1; 
>>
>> So it should be safe to call memcpy/memset calls in _dl_start after this
>> point (as some ports that with !PI_STATIC_AND_HIDDEN does with memcpy) .
>> Is gcc creating a mem* calls before ld.so reallocate itself? If it were, 
>> where exactly?
> 
> In the code you quoted.  It reintroduces the memset call we so
> carefully tried to remove.
> 

Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc
for instance, slight larger on mips). I wonder if we could just create a
static global on rtld.c to simplify it, since the stack usage won't be
released anyway. Also, isn't the memset on !DONT_USE_BOOTSTRAP_MAP redundant?
  
Florian Weimer Nov. 25, 2019, 10:28 p.m. UTC | #8
* Adhemerval Zanella:

> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc
> for instance, slight larger on mips). I wonder if we could just create a
> static global on rtld.c to simplify it, since the stack usage won't be
> released anyway.

I thought the stack is reset by the startup stub?

At one point, we need to rewrite the bootstrap relocation in something
else besides C, but I suppose we can just stick in
-fno-tree-loop-distribute-patterns today.

> Also, isn't the memset on !DONT_USE_BOOTSTRAP_MAP redundant?

Not sure if I understand the question.  The memset is likely redundant
on Linux because the stack is fresh and unperturbed.  If the startup
stub allocated it, clearing it wouldn't be necessary.  But the
situation is a bit iffy at the C level.
  
Adhemerval Zanella Nov. 26, 2019, 1:36 p.m. UTC | #9
On 25/11/2019 19:28, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc
>> for instance, slight larger on mips). I wonder if we could just create a
>> static global on rtld.c to simplify it, since the stack usage won't be
>> released anyway.
> 
> I thought the stack is reset by the startup stub?

It is and also it seems we are stuck on using stack allocated for bootstrap_map
for some architectures. For instance, the powerpc32/fpic always generate a GOT 
access even for static variables (and some comments on rltd.c about it do not
grasp all architecture requirements).

> 
> At one point, we need to rewrite the bootstrap relocation in something
> else besides C, but I suppose we can just stick in
> -fno-tree-loop-distribute-patterns today.

Maybe with some reorganization and refactoring we can simplify the include 
hack done with dynamic-link.h while taking care of not generating the
required relocations (maybe a compiler flag could help us, at least to
warn when such relocation are being emitted?).  But I see your point where
we are using C to produce code that is a specific subset of language 
standard.

> 
>> Also, isn't the memset on !DONT_USE_BOOTSTRAP_MAP redundant?
> 
> Not sure if I understand the question.  The memset is likely redundant
> on Linux because the stack is fresh and unperturbed.  If the startup
> stub allocated it, clearing it wouldn't be necessary.  But the
> situation is a bit iffy at the C level.
> 

Never mind, my confusion here.
  
Florian Weimer Nov. 26, 2019, 2:14 p.m. UTC | #10
* Adhemerval Zanella:

> On 25/11/2019 19:28, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc
>>> for instance, slight larger on mips). I wonder if we could just create a
>>> static global on rtld.c to simplify it, since the stack usage won't be
>>> released anyway.
>> 
>> I thought the stack is reset by the startup stub?
>
> It is and also it seems we are stuck on using stack allocated for
> bootstrap_map for some architectures. For instance, the
> powerpc32/fpic always generate a GOT access even for static
> variables (and some comments on rltd.c about it do not grasp all
> architecture requirements).

So should we just apply Sandra's patch?

>> At one point, we need to rewrite the bootstrap relocation in something
>> else besides C, but I suppose we can just stick in
>> -fno-tree-loop-distribute-patterns today.
>
> Maybe with some reorganization and refactoring we can simplify the include 
> hack done with dynamic-link.h while taking care of not generating the
> required relocations (maybe a compiler flag could help us, at least to
> warn when such relocation are being emitted?).  But I see your point where
> we are using C to produce code that is a specific subset of language 
> standard.

It's even worse—we try to restrict what we feed to the front end,
hoping that what comes out of the back end, assembler and linker
matches our requirements.  That's not how we should do things today.
  
Adhemerval Zanella Nov. 26, 2019, 5:35 p.m. UTC | #11
On 26/11/2019 11:14, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 25/11/2019 19:28, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc
>>>> for instance, slight larger on mips). I wonder if we could just create a
>>>> static global on rtld.c to simplify it, since the stack usage won't be
>>>> released anyway.
>>>
>>> I thought the stack is reset by the startup stub?
>>
>> It is and also it seems we are stuck on using stack allocated for
>> bootstrap_map for some architectures. For instance, the
>> powerpc32/fpic always generate a GOT access even for static
>> variables (and some comments on rltd.c about it do not grasp all
>> architecture requirements).
> 
> So should we just apply Sandra's patch?

I don't have a strong preference: either disabling for whole rltd.c (Sandra's
patch), your suggestion to restrict !PI_STATIC_AND_HIDDEN, or to use  
inhibit_loop_to_libcall on _dl_start only. I don't think 
-ftree-loop-distribute-patterns would have a large performance difference.

> 
>>> At one point, we need to rewrite the bootstrap relocation in something
>>> else besides C, but I suppose we can just stick in
>>> -fno-tree-loop-distribute-patterns today.
>>
>> Maybe with some reorganization and refactoring we can simplify the include 
>> hack done with dynamic-link.h while taking care of not generating the
>> required relocations (maybe a compiler flag could help us, at least to
>> warn when such relocation are being emitted?).  But I see your point where
>> we are using C to produce code that is a specific subset of language 
>> standard.
> 
> It's even worse—we try to restrict what we feed to the front end,
> hoping that what comes out of the back end, assembler and linker
> matches our requirements.  That's not how we should do things today.

Yeah, this specific code is fragile and simple changes such as changing
a variable scope can break it. I think we ca at least document the
restrictions and organize better the code (the include in the middle of
the function is really hacky).
  
Florian Weimer Nov. 26, 2019, 6:41 p.m. UTC | #12
* Sandra Loosemore:

> In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns.
> This optimization causes GCC to "helpfully" convert the hand-written
> loop in _dl_start into a call to memset, which is not available that
> early in program startup.  Similar problems in other places in GLIBC
> have been addressed by explicitly building with
> -fno-tree-loop-distribute-patterns, but this one may have been
> overlooked previously because it only affects targets where
> HAVE_BUILTIN_MEMSET is not defined.
>
> This patch fixes a bug observed on nios2-linux-gnu target that caused
> all programs to segv on startup.
> ---
>  elf/Makefile | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/elf/Makefile b/elf/Makefile
> index 0668818..b05af5c 100644
> --- a/elf/Makefile
> +++ b/elf/Makefile
> @@ -65,6 +65,12 @@ CFLAGS-dl-runtime.c += -fexceptions
> -fasynchronous-unwind-tables
>  CFLAGS-dl-lookup.c += -fexceptions -fasynchronous-unwind-tables
>  CFLAGS-dl-iterate-phdr.c += $(uses-callbacks)
>  
> +# On targets without __builtin_memset, rtld.c uses a hand-coded loop
> +# in _dl_start.  Make sure this isn't turned into a call to regular memset.
> +ifeq (yes,$(have-loop-to-function))
> +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns
> +endif
> +
>  # Compile rtld itself without stack protection.
>  # Also compile all routines in the static library that are elided from
>  # the shared libc because they are in libc.a in the same way.

I've pushed this now, given that Adhemerval had not objections either,
and it's a solution we have available today.  Thanks.
  

Patch

diff --git a/elf/Makefile b/elf/Makefile
index 0668818..b05af5c 100644
--- a/elf/Makefile
+++ b/elf/Makefile
@@ -65,6 +65,12 @@  CFLAGS-dl-runtime.c += -fexceptions -fasynchronous-unwind-tables
 CFLAGS-dl-lookup.c += -fexceptions -fasynchronous-unwind-tables
 CFLAGS-dl-iterate-phdr.c += $(uses-callbacks)
 
+# On targets without __builtin_memset, rtld.c uses a hand-coded loop
+# in _dl_start.  Make sure this isn't turned into a call to regular memset.
+ifeq (yes,$(have-loop-to-function))
+CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns
+endif
+
 # Compile rtld itself without stack protection.
 # Also compile all routines in the static library that are elided from
 # the shared libc because they are in libc.a in the same way.