
elf/dl-deps.c: Make _dl_build_local_scope breadth first

Message ID 20211209235354.1558088-1-raj.khem@gmail.com
State New
Series elf/dl-deps.c: Make _dl_build_local_scope breadth first

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Khem Raj Dec. 9, 2021, 11:53 p.m. UTC
From: Mark Hatle <mark.hatle@windriver.com>

According to the ELF specification:

When resolving symbolic references, the dynamic linker examines the symbol
tables with a breadth-first search.

This function was using a depth-first search.  By doing so, the conflict
resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
was incorrect.  This caused problems when there were various circular
dependencies between libraries.  The problem usually manifested itself as
the wrong IFUNC being executed.

A similar issue has been reported here [1].

[BZ# 20488]

[1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html

Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
Signed-off-by: Khem Raj <raj.khem@gmail.com>
---
 elf/dl-deps.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
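For reference outside the loader, the two traversals can be compared in a standalone sketch.  The `node` struct below is a toy stand-in for glibc's `struct link_map` (`deps` mirrors `l_initfini`, `reserved` mirrors `l_reserved`); only the traversal logic mirrors the old and new code in the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct link_map: deps mirrors l_initfini (entry 0 is
   the map itself, list is NULL-terminated), reserved mirrors l_reserved.  */
struct node {
  const char *name;
  struct node **deps;
  int reserved;
};

/* Old logic: recurse into each dependency before looking at its
   siblings (depth-first).  */
static size_t build_scope_dfs(struct node **list, struct node *map)
{
  struct node **p = list, **q;
  *p++ = map;
  map->reserved = 1;
  if (map->deps)
    for (q = map->deps + 1; *q; ++q)
      if (!(*q)->reserved)
        p += build_scope_dfs(p, *q);
  return p - list;
}

/* Patched logic: treat LIST itself as a queue and append unseen
   dependencies level by level (breadth-first).  */
static size_t build_scope_bfs(struct node **list, struct node *map)
{
  struct node **p = list, **q, **r;
  *p++ = map;
  map->reserved = 1;
  for (r = list; r < p; ++r)
    if ((*r)->deps)
      for (q = (*r)->deps + 1; *q; ++q)
        if (!(*q)->reserved)
          {
            *p++ = *q;
            (*q)->reserved = 1;
          }
  return p - list;
}
```

For a graph where main depends on A and B, and A depends on C, the depth-first variant yields the scope main, A, C, B while the breadth-first variant yields main, A, B, C.  That reordering of B and C is exactly what changes which definition a symbol lookup sees first.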

Comments

Adhemerval Zanella Jan. 11, 2022, 7:26 p.m. UTC | #1
On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
> From: Mark Hatle <mark.hatle@windriver.com>
> 
> According to the ELF specification:
> 
> When resolving symbolic references, the dynamic linker examines the symbol
> tables with a breadth-first search.
> 
> This function was using a depth first search.  By doing so the conflict
> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
> was incorrect.  This caused problems when their were various circular
> dependencies between libraries.  The problem usually manifested itself by
> the wrong IFUNC being executed.
> 
> Similar issue has been reported here [1]
> 
> [BZ# 20488]
> 
> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
> 
> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
> Signed-off-by: Khem Raj <raj.khem@gmail.com>

I am trying to understand why this is only an issue for LD_TRACE_PRELINKING=1.
Do we have a testcase that stresses it for a default use case?

When you say 'the wrong IFUNC being executed', what exactly do you mean here?
Could we build a testcase based on this?

> ---
>  elf/dl-deps.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/elf/dl-deps.c b/elf/dl-deps.c
> index 237d9636c5..e15f7f83d8 100644
> --- a/elf/dl-deps.c
> +++ b/elf/dl-deps.c
> @@ -73,13 +73,19 @@ _dl_build_local_scope (struct link_map **list, struct link_map *map)
>  {
>    struct link_map **p = list;
>    struct link_map **q;
> +  struct link_map **r;
>  
>    *p++ = map;
>    map->l_reserved = 1;
> -  if (map->l_initfini)
> -    for (q = map->l_initfini + 1; *q; ++q)
> -      if (! (*q)->l_reserved)
> -	p += _dl_build_local_scope (p, *q);
> +
> +  for (r = list; r < p; ++r)
> +    if ((*r)->l_initfini)
> +      for (q = (*r)->l_initfini + 1; *q; ++q)
> +	if (! (*q)->l_reserved)
> +	  {
> +	    *p++ = *q;
> +	    (*q)->l_reserved = 1;
> +	  }
>    return p - list;
>  }
>
Mark Hatle Jan. 12, 2022, 7:08 p.m. UTC | #2
On 1/11/22 1:26 PM, Adhemerval Zanella wrote:
> 
> 
> On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
>> From: Mark Hatle <mark.hatle@windriver.com>
>>
>> According to the ELF specification:
>>
>> When resolving symbolic references, the dynamic linker examines the symbol
>> tables with a breadth-first search.
>>
>> This function was using a depth first search.  By doing so the conflict
>> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
>> was incorrect.  This caused problems when their were various circular
>> dependencies between libraries.  The problem usually manifested itself by
>> the wrong IFUNC being executed.
>>
>> Similar issue has been reported here [1]
>>
>> [BZ# 20488]
>>
>> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
>>
>> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
>> Signed-off-by: Khem Raj <raj.khem@gmail.com>
> 
> I am trying to understand why it this only an issue for LD_TRACE_PRELINKING=1,
> do we have a testcase that stress it for a default usercase?

The underlying issue here is that resolution is happening depth-first and not 
breadth-first.  According to the ELF spec, all resolution should be breadth-first.


As noted in the item above, the prelinker just happens to be a way to actually 
show that the behavior is incorrect.  (There even appears to be a related defect 
with a reproducer.)


When taking the values from LD_TRACE_PRELINKING=1, various addresses and 
conflict resolutions are specified.  When you compare what is reported vs. what 
happens vs. what the spec says, they don't align as they should.


> When you say the 'wrong IFUNC being executed' what exactly you mean here?
> Could we use a testcase based on this?


With the prelinker (and possibly in general), the IFUNC address used is the one 
from the wrong library scope.  I personally have never tried to reproduce this 
outside of the prelinking use case, but based on the referenced report and the 
code at the time of the change, it is believed this could happen (though rarely) 
without a prelinked system, in a very complex case with multiple libraries 
providing the same functions.

The main issue (my memory is sketchy, sorry, this might be wrong) is that if one 
or more functions is an IFUNC and one or more functions is NOT, then you can 
get into a situation where a conflict causes the wrong function to be called 
using the wrong mechanism.

Definitely in the prelink case, the resolver gave the address of a regular 
function which was then placed into an IFUNC (or maybe it was the other way 
around), triggering the runtime segfault.
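For readers following the IFUNC angle, a minimal sketch of the mechanism being discussed (GNU-specific, ELF targets only; all names here are made up for illustration):

```c
#include <assert.h>

/* One candidate implementation; a real case would have several
   (e.g. per-CPU variants).  */
static int impl_one(void) { return 1; }

/* The IFUNC resolver runs at relocation time and returns the address
   of the implementation the symbol f should bind to.  */
void *f_resolver(void) { return (void *) impl_one; }

/* f has no body of its own; its GOT entry is filled with whatever the
   resolver returned.  A plain (non-IFUNC) function of the same name in
   another DSO would instead be bound by ordinary scope order, which is
   where the search-order question in this thread starts to matter.  */
int f(void) __attribute__((ifunc ("f_resolver")));
```

Calling `f()` here goes through the resolver-selected implementation, so it returns 1.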

--Mark

>> ---
>>   elf/dl-deps.c | 14 ++++++++++----
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/elf/dl-deps.c b/elf/dl-deps.c
>> index 237d9636c5..e15f7f83d8 100644
>> --- a/elf/dl-deps.c
>> +++ b/elf/dl-deps.c
>> @@ -73,13 +73,19 @@ _dl_build_local_scope (struct link_map **list, struct link_map *map)
>>   {
>>     struct link_map **p = list;
>>     struct link_map **q;
>> +  struct link_map **r;
>>   
>>     *p++ = map;
>>     map->l_reserved = 1;
>> -  if (map->l_initfini)
>> -    for (q = map->l_initfini + 1; *q; ++q)
>> -      if (! (*q)->l_reserved)
>> -	p += _dl_build_local_scope (p, *q);
>> +
>> +  for (r = list; r < p; ++r)
>> +    if ((*r)->l_initfini)
>> +      for (q = (*r)->l_initfini + 1; *q; ++q)
>> +	if (! (*q)->l_reserved)
>> +	  {
>> +	    *p++ = *q;
>> +	    (*q)->l_reserved = 1;
>> +	  }
>>     return p - list;
>>   }
>>
Adhemerval Zanella Jan. 12, 2022, 8:12 p.m. UTC | #3
On 12/01/2022 16:08, Mark Hatle wrote:
> 
> 
> On 1/11/22 1:26 PM, Adhemerval Zanella wrote:
>>
>>
>> On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
>>> From: Mark Hatle <mark.hatle@windriver.com>
>>>
>>> According to the ELF specification:
>>>
>>> When resolving symbolic references, the dynamic linker examines the symbol
>>> tables with a breadth-first search.
>>>
>>> This function was using a depth first search.  By doing so the conflict
>>> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
>>> was incorrect.  This caused problems when their were various circular
>>> dependencies between libraries.  The problem usually manifested itself by
>>> the wrong IFUNC being executed.
>>>
>>> Similar issue has been reported here [1]
>>>
>>> [BZ# 20488]
>>>
>>> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
>>>
>>> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
>>> Signed-off-by: Khem Raj <raj.khem@gmail.com>
>>
>> I am trying to understand why it this only an issue for LD_TRACE_PRELINKING=1,
>> do we have a testcase that stress it for a default usercase?
> 
> The underlying issue here is that resolution is happening depth first and not breadth first.  According to the ELF spec, all resolution should be breadth-first.
> 
> 
> As noted in item in above, the prelinker just happens to be a way to actually show that the behavior is incorrect.  (There even appears to be a related defect with a reproducer.)
> 
> 
> When taking the values from LD_TRACE_PRELINKING=1, various addresses and conflict resolutions are specified.  When you compare what is reported, vs what happens, vs what the spec says they don't align as they should.

That was pretty clear from the bug report; what I am trying to understand is 
why it seems to not have been reported before in default non-prelinked usage.

Also, the patch only changes the semantics for prelinked binaries, which is
at least troublesome: this semantic difference at symbol resolution is
a potential source of issues where users see different behavior depending
on whether prelink was used or not. 

I am also worried because making such a change might potentially trigger
some hard-to-diagnose breakage.

> 
> 
>> When you say the 'wrong IFUNC being executed' what exactly you mean here?
>> Could we use a testcase based on this?
> 
> 
> The prelinker (and possibly just in general), the IFUNC address used is the one from the wrong library scope.  I personally have never tried to reproduce this outside of the prelinking use-case, but based on the referenced report and the code at the time of the change, it is believed this could happen (but rarely) without a prelinked system in a very complex case with multiple libraries providing the same functions.
> 
> The main issue (my memory is sketchy sorry this might be wrong) is that if one or more functions is an IFUNC and one or more functions is NOT, then the you can get into a situation with a conflict of the wrong function being called using the wrong mechanism.
> 
> Definitely in the prelink case, the resolver gave the address of a regular function which was then placed into an IFUNC (or maybe it was the other way around) triggering the runtime segfault.

I asked about ifunc because it should not really depend on whether an ifunc
is used or not; what might be happening is that this issue triggers the
long-standing ifunc ordering issue more often.

To summarize: although this change only affects prelinked binaries, I think
it would be better to change the default semantics as well.  We can make
it a tunable for the transition, like the new DSO sorting algorithm with
glibc.rtld.dynamic_sort, evaluate it on Fedora rawhide, and then hit
the switch to make it the default.

We will also need regression testcases, although I am not sure what kind
of coverage we will need to provide (our own testsuite currently does not
trigger any issue, though).

> 
> --Mark
> 
>>> ---
>>>   elf/dl-deps.c | 14 ++++++++++----
>>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/elf/dl-deps.c b/elf/dl-deps.c
>>> index 237d9636c5..e15f7f83d8 100644
>>> --- a/elf/dl-deps.c
>>> +++ b/elf/dl-deps.c
>>> @@ -73,13 +73,19 @@ _dl_build_local_scope (struct link_map **list, struct link_map *map)
>>>   {
>>>     struct link_map **p = list;
>>>     struct link_map **q;
>>> +  struct link_map **r;
>>>       *p++ = map;
>>>     map->l_reserved = 1;
>>> -  if (map->l_initfini)
>>> -    for (q = map->l_initfini + 1; *q; ++q)
>>> -      if (! (*q)->l_reserved)
>>> -    p += _dl_build_local_scope (p, *q);
>>> +
>>> +  for (r = list; r < p; ++r)
>>> +    if ((*r)->l_initfini)
>>> +      for (q = (*r)->l_initfini + 1; *q; ++q)
>>> +    if (! (*q)->l_reserved)
>>> +      {
>>> +        *p++ = *q;
>>> +        (*q)->l_reserved = 1;
>>> +      }
>>>     return p - list;
>>>   }
>>>
Mark Hatle Jan. 12, 2022, 8:41 p.m. UTC | #4
On 1/12/22 2:12 PM, Adhemerval Zanella wrote:
> 
> 
> On 12/01/2022 16:08, Mark Hatle wrote:
>>
>>
>> On 1/11/22 1:26 PM, Adhemerval Zanella wrote:
>>>
>>>
>>> On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
>>>> From: Mark Hatle <mark.hatle@windriver.com>
>>>>
>>>> According to the ELF specification:
>>>>
>>>> When resolving symbolic references, the dynamic linker examines the symbol
>>>> tables with a breadth-first search.
>>>>
>>>> This function was using a depth first search.  By doing so the conflict
>>>> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
>>>> was incorrect.  This caused problems when their were various circular
>>>> dependencies between libraries.  The problem usually manifested itself by
>>>> the wrong IFUNC being executed.
>>>>
>>>> Similar issue has been reported here [1]
>>>>
>>>> [BZ# 20488]
>>>>
>>>> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
>>>>
>>>> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
>>>> Signed-off-by: Khem Raj <raj.khem@gmail.com>
>>>
>>> I am trying to understand why it this only an issue for LD_TRACE_PRELINKING=1,
>>> do we have a testcase that stress it for a default usercase?
>>
>> The underlying issue here is that resolution is happening depth first and not breadth first.  According to the ELF spec, all resolution should be breadth-first.
>>
>>
>> As noted in item in above, the prelinker just happens to be a way to actually show that the behavior is incorrect.  (There even appears to be a related defect with a reproducer.)
>>
>>
>> When taking the values from LD_TRACE_PRELINKING=1, various addresses and conflict resolutions are specified.  When you compare what is reported, vs what happens, vs what the spec says they don't align as they should.
> 
> That was pretty clear from bug report, what I am trying to understand is
> why it seems to not being reported before in default non-prelinked usage.
> 
> Also, the patch only changes the semantic to prelinked binaries which is
> at least troublesome: this semantic difference at symbol resolution is
> a potential source of issues where user see different behavior depending
> whether prelinked was used or not.
> 
> I am also worried because making such change might potentially trigger
> some hard to diagnostic breakage.
> 
>>
>>
>>> When you say the 'wrong IFUNC being executed' what exactly you mean here?
>>> Could we use a testcase based on this?
>>
>>
>> The prelinker (and possibly just in general), the IFUNC address used is the one from the wrong library scope.  I personally have never tried to reproduce this outside of the prelinking use-case, but based on the referenced report and the code at the time of the change, it is believed this could happen (but rarely) without a prelinked system in a very complex case with multiple libraries providing the same functions.
>>
>> The main issue (my memory is sketchy sorry this might be wrong) is that if one or more functions is an IFUNC and one or more functions is NOT, then the you can get into a situation with a conflict of the wrong function being called using the wrong mechanism.
>>
>> Definitely in the prelink case, the resolver gave the address of a regular function which was then placed into an IFUNC (or maybe it was the other way around) triggering the runtime segfault.
> 
> I asked about ifunc because it should not really be dependent where ifunc
> is used or not, what might be happening is this issue triggers the long
> standing ifunc ordering issue more often.

IFUNC is the only place we ever reproduced it.  At least one of the functions 
had to be an ifunc.  (Or maybe a better way to say it is that at least one 
function was an ifunc and one wasn't, but they both had the same name.)

> To summarize, although this change only affects prelinked binaries I think
> it would be better to change for default semantic as well.  We can make
> it a tunable for the transition, like new DSO sorting algorithm with
> glibc.rtld.dynamic_sort, and evaluate on Fedora rawhide, and then hit
> the switch to make it the default.

It absolutely affects prelinked binaries.  At the time we believed it COULD 
affect regular binaries, but we had never seen a failure in the wild.

> We will also need regression testcases, although not sure what kind
> of coverage we will need to provide (our own testsuite currently does not
> trigger any issue though).

This is the test case that was developed at the time for inclusion in the 
prelinker:

https://git.yoctoproject.org/prelink-cross/commit/testsuite?id=8f55afd84b3580b1f1d6af904e8c2a39221055b7

It essentially makes three libraries, with two functions (two of the libraries 
provide the same function).  The resulting value expects the breadth-first 
loading behavior in order to generate the correct output.

The main purpose of the prelink test cases was to ensure the same behavior 
before and after prelinking.  The same should be true with this change to 
glibc: everything should work the same before and after the change, unless the 
behavior before was an error (wrong function used).
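That winner-takes-first behavior can be modelled abstractly: whichever library appears first in the local scope wins the lookup.  A hedged sketch (the library names come from the 'order' test; the scope orderings below are assumptions about what depth-first vs. breadth-first traversal would produce for its dependency graph):

```c
#include <assert.h>
#include <stddef.h>

/* A DSO either provides value() or it does not.  */
struct lib {
  const char *name;
  int defines_value;
};

/* ELF symbol lookup reduced to its ordering rule: scan the local scope
   in order and bind to the first provider found.  */
static const struct lib *resolve_value(const struct lib *const *scope,
                                       size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (scope[i]->defines_value)
      return scope[i];
  return NULL;
}
```

If both orderlib2.so and orderlib3.so define value(), a depth-first scope that reaches orderlib2.so before orderlib3.so binds value to orderlib2.so, while a breadth-first scope that puts the executable's direct dependency orderlib3.so earlier binds it to orderlib3.so, which is what the LD_TRACE_PRELINKING output later in the thread shows.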

We have also been using this patch in the Yocto Project since 2016, and we've 
never had a report of an incompatibility or failure.  So it really is quite low 
risk, but I'll never say it's "no risk".

>>
>> --Mark
>>
>>>> ---
>>>>    elf/dl-deps.c | 14 ++++++++++----
>>>>    1 file changed, 10 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/elf/dl-deps.c b/elf/dl-deps.c
>>>> index 237d9636c5..e15f7f83d8 100644
>>>> --- a/elf/dl-deps.c
>>>> +++ b/elf/dl-deps.c
>>>> @@ -73,13 +73,19 @@ _dl_build_local_scope (struct link_map **list, struct link_map *map)
>>>>    {
>>>>      struct link_map **p = list;
>>>>      struct link_map **q;
>>>> +  struct link_map **r;
>>>>        *p++ = map;
>>>>      map->l_reserved = 1;
>>>> -  if (map->l_initfini)
>>>> -    for (q = map->l_initfini + 1; *q; ++q)
>>>> -      if (! (*q)->l_reserved)
>>>> -    p += _dl_build_local_scope (p, *q);
>>>> +
>>>> +  for (r = list; r < p; ++r)
>>>> +    if ((*r)->l_initfini)
>>>> +      for (q = (*r)->l_initfini + 1; *q; ++q)
>>>> +    if (! (*q)->l_reserved)
>>>> +      {
>>>> +        *p++ = *q;
>>>> +        (*q)->l_reserved = 1;
>>>> +      }
>>>>      return p - list;
>>>>    }
>>>>
Adhemerval Zanella Jan. 13, 2022, 11:52 a.m. UTC | #5
On 12/01/2022 17:41, Mark Hatle wrote:
> 
> 
> On 1/12/22 2:12 PM, Adhemerval Zanella wrote:
>>
>>
>> On 12/01/2022 16:08, Mark Hatle wrote:
>>>
>>>
>>> On 1/11/22 1:26 PM, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
>>>>> From: Mark Hatle <mark.hatle@windriver.com>
>>>>>
>>>>> According to the ELF specification:
>>>>>
>>>>> When resolving symbolic references, the dynamic linker examines the symbol
>>>>> tables with a breadth-first search.
>>>>>
>>>>> This function was using a depth first search.  By doing so the conflict
>>>>> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
>>>>> was incorrect.  This caused problems when their were various circular
>>>>> dependencies between libraries.  The problem usually manifested itself by
>>>>> the wrong IFUNC being executed.
>>>>>
>>>>> Similar issue has been reported here [1]
>>>>>
>>>>> [BZ# 20488]
>>>>>
>>>>> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
>>>>>
>>>>> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
>>>>> Signed-off-by: Khem Raj <raj.khem@gmail.com>
>>>>
>>>> I am trying to understand why it this only an issue for LD_TRACE_PRELINKING=1,
>>>> do we have a testcase that stress it for a default usercase?
>>>
>>> The underlying issue here is that resolution is happening depth first and not breadth first.  According to the ELF spec, all resolution should be breadth-first.
>>>
>>>
>>> As noted in item in above, the prelinker just happens to be a way to actually show that the behavior is incorrect.  (There even appears to be a related defect with a reproducer.)
>>>
>>>
>>> When taking the values from LD_TRACE_PRELINKING=1, various addresses and conflict resolutions are specified.  When you compare what is reported, vs what happens, vs what the spec says they don't align as they should.
>>
>> That was pretty clear from bug report, what I am trying to understand is
>> why it seems to not being reported before in default non-prelinked usage.
>>
>> Also, the patch only changes the semantic to prelinked binaries which is
>> at least troublesome: this semantic difference at symbol resolution is
>> a potential source of issues where user see different behavior depending
>> whether prelinked was used or not.
>>
>> I am also worried because making such change might potentially trigger
>> some hard to diagnostic breakage.
>>
>>>
>>>
>>>> When you say the 'wrong IFUNC being executed' what exactly you mean here?
>>>> Could we use a testcase based on this?
>>>
>>>
>>> The prelinker (and possibly just in general), the IFUNC address used is the one from the wrong library scope.  I personally have never tried to reproduce this outside of the prelinking use-case, but based on the referenced report and the code at the time of the change, it is believed this could happen (but rarely) without a prelinked system in a very complex case with multiple libraries providing the same functions.
>>>
>>> The main issue (my memory is sketchy sorry this might be wrong) is that if one or more functions is an IFUNC and one or more functions is NOT, then the you can get into a situation with a conflict of the wrong function being called using the wrong mechanism.
>>>
>>> Definitely in the prelink case, the resolver gave the address of a regular function which was then placed into an IFUNC (or maybe it was the other way around) triggering the runtime segfault.
>>
>> I asked about ifunc because it should not really be dependent where ifunc
>> is used or not, what might be happening is this issue triggers the long
>> standing ifunc ordering issue more often.
> 
> IFUNC is the only place we ever reproduced it.  At least one of the functions had to be an ifunc.  (Or maybe a better way to say it is at least one function was an ifunc and one wasn't, but they both had the same name.)
> 
>> To summarize, although this change only affects prelinked binaries I think
>> it would be better to change for default semantic as well.  We can make
>> it a tunable for the transition, like new DSO sorting algorithm with
>> glibc.rtld.dynamic_sort, and evaluate on Fedora rawhide, and then hit
>> the switch to make it the default.
> 
> It absolutely affects prelinked binaries.  At the time we believed it COULD affect regular binaries, but we had never seen a failure in the wild.
> 
>> We will also need regression testcases, although not sure what kind
>> of coverage we will need to provide (our own testsuite currently does not
>> trigger any issue though).
> 
> This is the test case that was developed at the time for including inside of the prelinker:
> 
> https://git.yoctoproject.org/prelink-cross/commit/testsuite?id=8f55afd84b3580b1f1d6af904e8c2a39221055b7
> 
> It essentially makes three libraries, with two functions (two libraries have the same function).  The resulting value expects the breadth-first loading behavior in order to generate the correct output.
> 
> The main purpose of the prelink test cases was to ensure the same behavior before and after the prelinking.  The same should be true with this change to glibc, everything should work the same before and after the change, unless the stuff before was an error (wrong function used).

The test does not really exercise IFUNC; however, from the patch description it
seems that it actually tries to fix an upstream bug that we already fixed [1].

And checking the prelink output with this test, the patch does not really seem
to be changing the binding of the 'value' function:

* master

$ LD_TRACE_PRELINKING=1 [...] ./order
[...]
        ./order => ./order (0x00007f10eca07000, 0x00007f10eca07000)
        orderlib.so => ./orderlib.so (0x00007f10eca00000, 0x00007f10eca00000)
 [***]  orderlib3.so => ./orderlib3.so (0x00007f10ec9fb000, 0x00007f10ec9fb000)
        libc.so.6 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so.6 (0x00007f10ec7db000, 0x00007f10ec7db000) TLS(0x1, 0x0000000000000090)
        orderlib1.so => ./orderlib1.so (0x00007f10ec7d4000, 0x00007f10ec7d4000)
        orderlib2.so => ./orderlib2.so (0x00007f10ec7cf000, 0x00007f10ec7cf000)
        /lib64/ld-linux-x86-64.so.2 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld-linux-x86-64.so.2 (0x00007f10eca0c000, 0x00007f10eca0c000)
[...]        
        lookup 0x00007f10eca07000 0x0000000000000468 -> 0x00007f10ec9fb000 0x0000000000001100 /1 value
[...]

* patch

$ LD_TRACE_PRELINKING=1 [...] ./order
[...]
        ./order => ./order (0x00007f30cf770000, 0x00007f30cf770000)
        orderlib.so => ./orderlib.so (0x00007f30cf769000, 0x00007f30cf769000)
 [***]  orderlib3.so => ./orderlib3.so (0x00007f30cf764000, 0x00007f30cf764000) *
        libc.so.6 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so.6 (0x00007f30cf544000, 0x00007f30cf544000) TLS(0x1, 0x0000000000000090)
        orderlib1.so => ./orderlib1.so (0x00007f30cf53d000, 0x00007f30cf53d000)
        orderlib2.so => ./orderlib2.so (0x00007f30cf538000, 0x00007f30cf538000)
        /lib64/ld-linux-x86-64.so.2 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld-linux-x86-64.so.2 (0x00007f30cf775000, 0x00007f30cf775000)
[...]
lookup 0x00007f30cf770000 0x0000000000000468 -> 0x00007f30cf764000 0x0000000000001100 /1 value
[...]

So in both cases 'value' binds to orderlib3.so, which is the expected value.  I really
think we need to come up with a testcase that actually stresses it.

At least _dl_build_local_scope is localized to prelink support, which I think should
be safer to change.

> 
> We have also been using this patch in the Yocto Project since 2016.  And we've never had a report of an incompatibility/failure.  So it really is quite low risk, but I'll never say it's "no risk".
> 

Recently Florian asked about prelink support [2], and Joseph's 
answer suggests that Yocto still provides support for it [3].  IMHO I 
would just deprecate it, so we can eventually clean this up: it
requires a lot of hacks within the loader, its support is not straightforward
for newer ports, and it makes even less sense with ASLR and PIE executables.

It would be good to know whether the performance optimization it brings
pays off against the clutter and complexity it adds to the loader code.


[1] https://sourceware.org/bugzilla/show_bug.cgi?id=19861
[2] https://sourceware.org/pipermail/libc-alpha/2021-August/130404.html
[3] https://git.yoctoproject.org/prelink-cross/
Mark Hatle Jan. 13, 2022, 4:33 p.m. UTC | #6
On 1/13/22 5:52 AM, Adhemerval Zanella wrote:
> 
> 
> On 12/01/2022 17:41, Mark Hatle wrote:
>>
>>
>> On 1/12/22 2:12 PM, Adhemerval Zanella wrote:
>>>
>>>
>>> On 12/01/2022 16:08, Mark Hatle wrote:
>>>>
>>>>
>>>> On 1/11/22 1:26 PM, Adhemerval Zanella wrote:
>>>>>
>>>>>
>>>>> On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
>>>>>> From: Mark Hatle <mark.hatle@windriver.com>
>>>>>>
>>>>>> According to the ELF specification:
>>>>>>
>>>>>> When resolving symbolic references, the dynamic linker examines the symbol
>>>>>> tables with a breadth-first search.
>>>>>>
>>>>>> This function was using a depth first search.  By doing so the conflict
>>>>>> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
>>>>>> was incorrect.  This caused problems when their were various circular
>>>>>> dependencies between libraries.  The problem usually manifested itself by
>>>>>> the wrong IFUNC being executed.
>>>>>>
>>>>>> Similar issue has been reported here [1]
>>>>>>
>>>>>> [BZ# 20488]
>>>>>>
>>>>>> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
>>>>>>
>>>>>> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
>>>>>> Signed-off-by: Khem Raj <raj.khem@gmail.com>
>>>>>
>>>>> I am trying to understand why it this only an issue for LD_TRACE_PRELINKING=1,
>>>>> do we have a testcase that stress it for a default usercase?
>>>>
>>>> The underlying issue here is that resolution is happening depth first and not breadth first.  According to the ELF spec, all resolution should be breadth-first.
>>>>
>>>>
>>>> As noted in item in above, the prelinker just happens to be a way to actually show that the behavior is incorrect.  (There even appears to be a related defect with a reproducer.)
>>>>
>>>>
>>>> When taking the values from LD_TRACE_PRELINKING=1, various addresses and conflict resolutions are specified.  When you compare what is reported, vs what happens, vs what the spec says they don't align as they should.
>>>
>>> That was pretty clear from bug report, what I am trying to understand is
>>> why it seems to not being reported before in default non-prelinked usage.
>>>
>>> Also, the patch only changes the semantic to prelinked binaries which is
>>> at least troublesome: this semantic difference at symbol resolution is
>>> a potential source of issues where user see different behavior depending
>>> whether prelinked was used or not.
>>>
>>> I am also worried because making such change might potentially trigger
>>> some hard to diagnostic breakage.
>>>
>>>>
>>>>
>>>>> When you say the 'wrong IFUNC being executed' what exactly you mean here?
>>>>> Could we use a testcase based on this?
>>>>
>>>>
>>>> The prelinker (and possibly just in general), the IFUNC address used is the one from the wrong library scope.  I personally have never tried to reproduce this outside of the prelinking use-case, but based on the referenced report and the code at the time of the change, it is believed this could happen (but rarely) without a prelinked system in a very complex case with multiple libraries providing the same functions.
>>>>
>>>> The main issue (my memory is sketchy sorry this might be wrong) is that if one or more functions is an IFUNC and one or more functions is NOT, then the you can get into a situation with a conflict of the wrong function being called using the wrong mechanism.
>>>>
>>>> Definitely in the prelink case, the resolver gave the address of a regular function which was then placed into an IFUNC (or maybe it was the other way around) triggering the runtime segfault.
>>>
>>> I asked about ifunc because it should not really be dependent where ifunc
>>> is used or not, what might be happening is this issue triggers the long
>>> standing ifunc ordering issue more often.
>>
>> IFUNC is the only place we ever reproduced it.  At least one of the functions had to be an ifunc.  (Or maybe a better way to say it is at least one function was an ifunc and one wasn't, but they both had the same name.)
>>
>>> To summarize, although this change only affects prelinked binaries I think
>>> it would be better to change for default semantic as well.  We can make
>>> it a tunable for the transition, like new DSO sorting algorithm with
>>> glibc.rtld.dynamic_sort, and evaluate on Fedora rawhide, and then hit
>>> the switch to make it the default.
>>
>> It absolutely affects prelinked binaries.  At the time we believed it COULD affect regular binaries, but we had never seen a failure in the wild.
>>
>>> We will also need regression testcases, although not sure what kind
>>> of coverage we will need to provide (our own testsuite currently does not
>>> trigger any issue though).
>>
>> This is the test case that was developed at the time for including inside of the prelinker:
>>
>> https://git.yoctoproject.org/prelink-cross/commit/testsuite?id=8f55afd84b3580b1f1d6af904e8c2a39221055b7
>>
>> It essentially makes three libraries, with two functions (two libraries have the same function).  The resulting value expects the breadth-first loading behavior in order to generate the correct output.
>>
>> The main purpose of the prelink test cases was to ensure the same behavior before and after the prelinking.  The same should be true with this change to glibc, everything should work the same before and after the change, unless the stuff before was an error (wrong function used).
> 
> The test does not really exercises IFUNC, however on the patch description it
> seems that actually tries to fix a upstream bug that we already fixed [1].
> 
> And checking prelink output with this test, the patch does not really seem
> to be changing the binding of the 'value' function:
> 
> * master

The master branch is for conventional prelink; patches are cross-ported with the 
cross_prelink branch.   cross_prelink is where this was developed, so the chunk 
of the commit message about changing the linker has nothing to do with the bit I 
was showing you.  (Find the corresponding patch in cross_prelink, and you'll see 
the RTLD side of things.)

> $ LD_TRACE_PRELINKING=1 [...] ./order
> [...]
>          ./order => ./order (0x00007f10eca07000, 0x00007f10eca07000)
>          orderlib.so => ./orderlib.so (0x00007f10eca00000, 0x00007f10eca00000)
>   [***]  orderlib3.so => ./orderlib3.so (0x00007f10ec9fb000, 0x00007f10ec9fb000)
>          libc.so.6 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so.6 (0x00007f10ec7db000, 0x00007f10ec7db000) TLS(0x1, 0x0000000000000090)
>          orderlib1.so => ./orderlib1.so (0x00007f10ec7d4000, 0x00007f10ec7d4000)
>          orderlib2.so => ./orderlib2.so (0x00007f10ec7cf000, 0x00007f10ec7cf000)
>          /lib64/ld-linux-x86-64.so.2 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld-linux-x86-64.so.2 (0x00007f10eca0c000, 0x00007f10eca0c000)
> [...]
>          lookup 0x00007f10eca07000 0x0000000000000468 -> 0x00007f10ec9fb000 0x0000000000001100 /1 value
> [...]
> 
> * patch
> 
> $ LD_TRACE_PRELINKING=1 [...] ./order
> [...]
>          ./order => ./order (0x00007f30cf770000, 0x00007f30cf770000)
>          orderlib.so => ./orderlib.so (0x00007f30cf769000, 0x00007f30cf769000)
>   [***]  orderlib3.so => ./orderlib3.so (0x00007f30cf764000, 0x00007f30cf764000) *
>          libc.so.6 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so.6 (0x00007f30cf544000, 0x00007f30cf544000) TLS(0x1, 0x0000000000000090)
>          orderlib1.so => ./orderlib1.so (0x00007f30cf53d000, 0x00007f30cf53d000)
>          orderlib2.so => ./orderlib2.so (0x00007f30cf538000, 0x00007f30cf538000)
>          /lib64/ld-linux-x86-64.so.2 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld-linux-x86-64.so.2 (0x00007f30cf775000, 0x00007f30cf775000)
> [...]
> lookup 0x00007f30cf770000 0x0000000000000468 -> 0x00007f30cf764000 0x0000000000001100 /1 value
> [...]
> 
> So in both cases 'value' binds to orderlib3.so, which is the expected value.  I really
> think we need to come up with a testcase to actually stress it.

It's more than just LD_TRACE_PRELINKING=1; the environment used is usually 
similar to:

LD_TRACE_PRELINKING=1
LD_BIND_NOW=1
LD_TRACE_LOADED_OBJECTS=1

Then for each library and executable it runs ld.so and captures this 
information.
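Roughly, the capture step looks like this (a hypothetical Python sketch, not the actual prelink code; the ld.so path and the `./order` binary are placeholders):

```python
import os
import subprocess

def trace_env():
    """Environment a prelink-style capture run would use (sketch)."""
    env = dict(os.environ)
    env["LD_TRACE_PRELINKING"] = "1"       # emit lookup/conflict records
    env["LD_BIND_NOW"] = "1"               # resolve all symbols eagerly
    env["LD_TRACE_LOADED_OBJECTS"] = "1"   # trace instead of executing
    return env

def trace_prelink(ldso, binary):
    """Run the dynamic loader in trace mode on one binary, return its report."""
    return subprocess.run([ldso, binary], env=trace_env(),
                          capture_output=True, text=True).stdout

# e.g.: trace_prelink("/lib64/ld-linux-x86-64.so.2", "./order")
```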

> At least  _dl_build_local_scope is localized to prelink support, which I think it should
> be safer to change.

Due to relocation and other changes, I don't know if you can run the prelinker 
for the test cases, but if you do -- it's pretty easy to show the issue in this 
case.
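For illustration, the breadth-first vs depth-first difference can be modeled outside of glibc.  This is a hypothetical sketch (toy library names mirroring the prelink testcase, not the real _dl_build_local_scope code) showing how a circular dependency changes which provider of a duplicated symbol is searched first:

```python
from collections import deque

def scope_bfs(deps, root):
    # Breadth-first: the root, then all direct deps, then their deps...
    order, seen, queue = [], {root}, deque([root])
    while queue:
        obj = queue.popleft()
        order.append(obj)
        for d in deps.get(obj, []):
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return order

def scope_dfs(deps, root, seen=None):
    # Depth-first: follow each dependency chain to its end before backtracking.
    if seen is None:
        seen = set()
    seen.add(root)
    order = [root]
    for d in deps.get(root, []):
        if d not in seen:
            order.extend(scope_dfs(deps, d, seen))
    return order

# Hypothetical layout: the executable links liborder1 and liborder3;
# liborder1 pulls in liborder2, which circularly depends on liborder3.
# Suppose both liborder2 and liborder3 define a symbol `value`.
deps = {
    "exe": ["liborder1", "liborder3"],
    "liborder1": ["liborder2"],
    "liborder2": ["liborder3"],
    "liborder3": ["liborder2"],
}

print(scope_bfs(deps, "exe"))  # liborder3 is searched before liborder2
print(scope_dfs(deps, "exe"))  # liborder2 is searched before liborder3
```

Since a duplicated symbol binds to the first object in search order, the two traversals pick different providers of `value` here, which is exactly the kind of divergence the conflict list exposes.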

>>
>> We have also been using this patch in the Yocto Project since 2016.  And we've never had a report of an incompatibility/failure.  So it really is quite low risk, but I'll never say it's "no risk".
>>
> 
> Recently Florian has asked about prelink support [1], and Joseph's
> answer suggests that Yocto still provides support for it [3].  IMHO I
> would just deprecate it, so we can eventually clean this up: it
> requires a lot of hacks within the loader, its support is not straightforward
> for newer ports, and it makes even less sense with ASLR and PIE executables.

I have stopped maintaining the prelinker at this time, as I don't have the 
knowledge for all of the latest fixups.  Over the last 3 or so years, the 
prelinker itself has fallen into a state of disrepair as additional relocations, 
functions, etc have been added to the system.  I don't have the time or more 
importantly knowledge to deal with the recent changes.

I would be happy to resume some of the maintenance activities if others would 
provide the fixes necessary for the latest compiler/library implementations -- 
but the number of people who understand ELF at that level AND have any desire 
for prelinking can probably be counted on one hand.

Even trying to use the prelinker with a somewhat modern Ubuntu system reports 
errors like:

../src/prelink: ./order: Could not parse `lookup 0x00007fffd8b0a000 
0xfffffffffffcd180 -> 0x00007fffd8b0a000 0x00000000000008e0 /0 __vdso_clock_gettime'

While I'm sure that one wouldn't be too hard to fix, then there is another and 
another...  (I believe the issue is that the VDSO items are presented BEFORE the 
list of libraries, which is a change in behavior....)

(I did try to just skip those lookups, and it results in the prelinked code 
crashing..)

> It would be good to know if the performance optimization it brings
> pays off the clutter and complexity it adds to the loader code.

When I last profiled this, roughly 3 1/2 years ago, the run-time linking speedup 
was huge.  There were two main advantages to this:

* Run Time linking speedup - primarily helped initial application loads.  System 
boot times went from 10-15 seconds down to 4-5 seconds.  For embedded systems 
this was massive.

* Memory usage.  The COW page usage for runtime linking can be significant on 
memory constrained systems.  Prelinking dropped the COW page usage in the 
systems I was looking at to about 10% of what it was prior.  This is believed to 
have further contributed to the boot time optimizations.

Last I looked at this, only about 20-30% of the system is capable of prelinking 
anymore due to the evolutionary changes in the various toolchain elements, 
introducing new relocations and related components.  Even things like 
re-ordering sections (done a couple years ago in binutils) has broken the 
prelinker in mysterious ways.

Add to this the IMHO mistaken belief that ASLR is some magic security device.  I 
see it more as a component of security that needs to be part of a broader system, 
one you can decide to trade off against load performance (including memory). 
But the broad "if you don't use ASLR your device is vulnerable" mentality has 
seriously restricted the desire for people to use, improve and contribute back 
to the prelinker.

> 
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=19861
> [2] https://sourceware.org/pipermail/libc-alpha/2021-August/130404.html
> [3] https://git.yoctoproject.org/prelink-cross/
>
Adhemerval Zanella Jan. 13, 2022, 5:20 p.m. UTC | #7
On 13/01/2022 13:33, Mark Hatle wrote:
> 
> 
> On 1/13/22 5:52 AM, Adhemerval Zanella wrote:
>>
>>
>> On 12/01/2022 17:41, Mark Hatle wrote:
>>>
>>>
>>> On 1/12/22 2:12 PM, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 12/01/2022 16:08, Mark Hatle wrote:
>>>>>
>>>>>
>>>>> On 1/11/22 1:26 PM, Adhemerval Zanella wrote:
>>>>>>
>>>>>>
>>>>>> On 09/12/2021 20:53, Khem Raj via Libc-alpha wrote:
>>>>>>> From: Mark Hatle <mark.hatle@windriver.com>
>>>>>>>
>>>>>>> According to the ELF specification:
>>>>>>>
>>>>>>> When resolving symbolic references, the dynamic linker examines the symbol
>>>>>>> tables with a breadth-first search.
>>>>>>>
>>>>>>> This function was using a depth first search.  By doing so the conflict
>>>>>>> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
>>>>>>> was incorrect.  This caused problems when their were various circular
>>>>>>> dependencies between libraries.  The problem usually manifested itself by
>>>>>>> the wrong IFUNC being executed.
>>>>>>>
>>>>>>> Similar issue has been reported here [1]
>>>>>>>
>>>>>>> [BZ# 20488]
>>>>>>>
>>>>>>> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
>>>>>>>
>>>>>>> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
>>>>>>> Signed-off-by: Khem Raj <raj.khem@gmail.com>
>>>>>>
>>>>>> I am trying to understand why this is only an issue for LD_TRACE_PRELINKING=1;
>>>>>> do we have a testcase that stresses it for a default use case?
>>>>>
>>>>> The underlying issue here is that resolution is happening depth first and not breadth first.  According to the ELF spec, all resolution should be breadth-first.
>>>>>
>>>>>
>>>>> As noted in item in above, the prelinker just happens to be a way to actually show that the behavior is incorrect.  (There even appears to be a related defect with a reproducer.)
>>>>>
>>>>>
>>>>> When taking the values from LD_TRACE_PRELINKING=1, various addresses and conflict resolutions are specified.  When you compare what is reported, vs what happens, vs what the spec says, they don't align as they should.
>>>>
>>>> That was pretty clear from the bug report; what I am trying to understand is
>>>> why it seems to not have been reported before in default non-prelinked usage.
>>>>
>>>> Also, the patch only changes the semantics for prelinked binaries, which is
>>>> at least troublesome: this semantic difference at symbol resolution is
>>>> a potential source of issues where users see different behavior depending
>>>> on whether prelink was used or not.
>>>>
>>>> I am also worried because making such a change might potentially trigger
>>>> some hard-to-diagnose breakage.
>>>>
>>>>>
>>>>>
>>>>>> When you say the 'wrong IFUNC being executed', what exactly do you mean here?
>>>>>> Could we use a testcase based on this?
>>>>>
>>>>>
>>>>> The prelinker (and possibly just in general), the IFUNC address used is the one from the wrong library scope.  I personally have never tried to reproduce this outside of the prelinking use-case, but based on the referenced report and the code at the time of the change, it is believed this could happen (but rarely) without a prelinked system in a very complex case with multiple libraries providing the same functions.
>>>>>
>>>>> The main issue (my memory is sketchy, sorry, this might be wrong) is that if one or more functions is an IFUNC and one or more functions is NOT, then you can get into a situation with a conflict where the wrong function is called using the wrong mechanism.
>>>>>
>>>>> Definitely in the prelink case, the resolver gave the address of a regular function which was then placed into an IFUNC (or maybe it was the other way around) triggering the runtime segfault.
>>>>
>>>> I asked about ifunc because it should not really depend on whether ifunc
>>>> is used or not; what might be happening is that this issue triggers the long
>>>> standing ifunc ordering issue more often.
>>>
>>> IFUNC is the only place we ever reproduced it.  At least one of the functions had to be an ifunc.  (Or maybe a better way to say it is at least one function was an ifunc and one wasn't, but they both had the same name.)
>>>
>>>> To summarize, although this change only affects prelinked binaries, I think
>>>> it would be better to change the default semantics as well.  We can make
>>>> it a tunable for the transition, like the new DSO sorting algorithm with
>>>> glibc.rtld.dynamic_sort, evaluate it on Fedora rawhide, and then hit
>>>> the switch to make it the default.
>>>
>>> It absolutely affects prelinked binaries.  At the time we believed it COULD affect regular binaries, but we had never seen a failure in the wild.
>>>
>>>> We will also need regression testcases, although not sure what kind
>>>> of coverage we will need to provide (our own testsuite currently does not
>>>> trigger any issue though).
>>>
>>> This is the test case that was developed at the time for including inside of the prelinker:
>>>
>>> https://git.yoctoproject.org/prelink-cross/commit/testsuite?id=8f55afd84b3580b1f1d6af904e8c2a39221055b7
>>>
>>> It essentially makes three libraries, with two functions (two libraries have the same function).  The resulting value expects the breadth-first loading behavior in order to generate the correct output.
>>>
>>> The main purpose of the prelink test cases was to ensure the same behavior before and after the prelinking.  The same should be true with this change to glibc, everything should work the same before and after the change, unless the stuff before was an error (wrong function used).
>>
>> The test does not really exercise IFUNC; however, from the patch description it
>> seems it actually tries to fix an upstream bug that we already fixed [1].
>>
>> And checking prelink output with this test, the patch does not really seem
>> to be changing the binding of the 'value' function:
>>
>> * master
> 
> The master branch is for conventional prelink; patches are cross-ported with the cross_prelink branch.   cross_prelink is where this was developed, so the chunk of the commit message about changing the linker has nothing to do with the bit I was showing you.  (Find the corresponding patch in cross_prelink, and you'll see the RTLD side of things.)

I was not using your branch or repo, sorry if I was not clear.  This is
glibc master and the patch is glibc master plus your patch.

> 
>> $ LD_TRACE_PRELINKING=1 [...] ./order
>> [...]
>>          ./order => ./order (0x00007f10eca07000, 0x00007f10eca07000)
>>          orderlib.so => ./orderlib.so (0x00007f10eca00000, 0x00007f10eca00000)
>>   [***]  orderlib3.so => ./orderlib3.so (0x00007f10ec9fb000, 0x00007f10ec9fb000)
>>          libc.so.6 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so.6 (0x00007f10ec7db000, 0x00007f10ec7db000) TLS(0x1, 0x0000000000000090)
>>          orderlib1.so => ./orderlib1.so (0x00007f10ec7d4000, 0x00007f10ec7d4000)
>>          orderlib2.so => ./orderlib2.so (0x00007f10ec7cf000, 0x00007f10ec7cf000)
>>          /lib64/ld-linux-x86-64.so.2 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld-linux-x86-64.so.2 (0x00007f10eca0c000, 0x00007f10eca0c000)
>> [...]
>>          lookup 0x00007f10eca07000 0x0000000000000468 -> 0x00007f10ec9fb000 0x0000000000001100 /1 value
>> [...]
>>
>> * patch
>>
>> $ LD_TRACE_PRELINKING=1 [...] ./order
>> [...]
>>          ./order => ./order (0x00007f30cf770000, 0x00007f30cf770000)
>>          orderlib.so => ./orderlib.so (0x00007f30cf769000, 0x00007f30cf769000)
>>   [***]  orderlib3.so => ./orderlib3.so (0x00007f30cf764000, 0x00007f30cf764000) *
>>          libc.so.6 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so.6 (0x00007f30cf544000, 0x00007f30cf544000) TLS(0x1, 0x0000000000000090)
>>          orderlib1.so => ./orderlib1.so (0x00007f30cf53d000, 0x00007f30cf53d000)
>>          orderlib2.so => ./orderlib2.so (0x00007f30cf538000, 0x00007f30cf538000)
>>          /lib64/ld-linux-x86-64.so.2 => /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld-linux-x86-64.so.2 (0x00007f30cf775000, 0x00007f30cf775000)
>> [...]
>> lookup 0x00007f30cf770000 0x0000000000000468 -> 0x00007f30cf764000 0x0000000000001100 /1 value
>> [...]
>>
>> So in both cases 'value' binds to orderlib3.so, which is the expected value.  I really
>> think we need to come up with a testcase to actually stress it.
> 
> It's more than just LD_TRACE_PRELINKING=1; the environment used is usually similar to:
> 
> LD_TRACE_PRELINKING=1
> LD_BIND_NOW=1
> LD_TRACE_LOADED_OBJECTS=1
> 
> Then for each library and executable it runs ld.so and captures this information.

LD_TRACE_PRELINKING implies LD_TRACE_LOADED_OBJECTS:

2773         case 16:
2774           /* The mode of the dynamic linker can be set.  */
2775           if (memcmp (envline, "TRACE_PRELINKING", 16) == 0)
2776             {
2777               state->mode = rtld_mode_trace;
2778               GLRO(dl_verbose) = 1;
2779               GLRO(dl_debug_mask) |= DL_DEBUG_PRELINK;
2780               GLRO(dl_trace_prelink) = &envline[17];
2781             }
2782           break;
2783 
2784         case 20:
2785           /* The mode of the dynamic linker can be set.  */
2786           if (memcmp (envline, "TRACE_LOADED_OBJECTS", 20) == 0)
2787             state->mode = rtld_mode_trace;
2788           break;

That's why it is used in glibc's own prelink tests (tst-prelink, for instance).  And
LD_BIND_NOW=1 does not change the test outcome either.

> 
>> At least  _dl_build_local_scope is localized to prelink support, which I think it should
>> be safer to change.
> 
> Due to relocation and other changes, I don't know if you can run the prelinker for the test cases, but if you do -- it's pretty easy to show the issue in this case.

It would not be straightforward (nor desirable) to add prelink as a test
requirement, but AFAIU the LD_TRACE_PRELINKING output is what prelink consumes.

> 
>>>
>>> We have also been using this patch in the Yocto Project since 2016.  And we've never had a report of an incompatibility/failure.  So it really is quite low risk, but I'll never say it's "no risk".
>>>
>>
>> Recently Florian has asked about prelink support [1], and Joseph's
>> answer suggests that Yocto still provides support for it [3].  IMHO I
>> would just deprecate it, so we can eventually clean this up: it
>> requires a lot of hacks within the loader, its support is not straightforward
>> for newer ports, and it makes even less sense with ASLR and PIE executables.
> 
> I have stopped maintaining the prelinker at this time, as I don't have the knowledge for all of the latest fixups.  Over the last 3 or so years, the prelinker itself has fallen into a state of disrepair as additional relocations, functions, etc have been added to the system.  I don't have the time or more importantly knowledge to deal with the recent changes.
> 
> I would be happy to resume some of the maintenance activities if others would provide the fixes necessary for the latest compiler/library implementations -- but the number of people who understand ELF at that level AND have any desire for prelinking can probably be counted on one hand.
> 
> Even trying to use the prelinker with a somewhat modern Ubuntu system reports errors like:
> 
> ../src/prelink: ./order: Could not parse `lookup 0x00007fffd8b0a000 0xfffffffffffcd180 -> 0x00007fffd8b0a000 0x00000000000008e0 /0 __vdso_clock_gettime'
> 
> While I'm sure that one wouldn't be too hard to fix, then there is another and another...  (I believe the issue is that the VDSO items are presented BEFORE the list of libraries, which is a change in behavior....)
> 
> (I did try to just skip those lookups, and it results in the prelinked code crashing..)
> 
>> It would be good to know if the performance optimization it brings
>> pays off the clutter and complexity it adds to the loader code.
> 
> When I last profiled this, roughly 3 1/2 years ago, the run-time linking speedup was huge.  There were two main advantages to this:
> 
> * Run Time linking speedup - primarily helped initial application loads.  System boot times went from 10-15 seconds down to 4-5 seconds.  For embedded systems this was massive.

Right, this is interesting.  Any profile data showing where exactly the speedup is
coming from?  I wonder if we could get any gain by optimizing the normal path without
the need to resort to prelink.

> 
> * Memory usage.  The COW page usage for runtime linking can be significant on memory constrained systems.  Prelinking dropped the COW page usage in the systems I was looking at to about 10% of what it was prior.  This is believed to have further contributed to the boot time optimizations.

Interesting, why exactly does prelinking help with COW usage?  I would expect memory
utilization to be roughly the same; is prelinking helping to align the segments
in a better way?

> 
> Last I looked at this, only about 20-30% of the system is capable of prelinking anymore due to the evolutionary changes in the various toolchain elements, introducing new relocations and related components.  Even things like re-ordering sections (done a couple years ago in binutils) has broken the prelinker in mysterious ways.

Yes, and it is even harder for a project that depends on both the static
and the dynamic linker to have out-of-tree development without a
proper ABI definition.  That's why I think prelink is currently a hackish
solution with niche usage that adds a lot of complexity to the code base.

For instance, we are aiming to support DT_RELR, which would help 
decrease the relocation segment size for PIE binaries.  It would 
probably be another feature that prelink lacks support for.

In fact, this information you provided that only 20-30% of all binaries
are supported makes me even more willing to really deprecate prelink.

> 
> Add to this the IMHO mistaken belief that ASLR is some magic security device.  I see it more as a component of security that needs to be part of a broader system that you can decide to trade off against load performance (including memory). But the broad "if you don't use ASLR your device is vulnerable" mentality has seriously restricted the desire for people to use, improve and contribute back to the prelinker.

ASLR/PIE is not a silver bullet, especially with limited mmap entropy on
32-bit systems.  But it is a gradual improvement among the multiple security
features we support (from generic ones such as relro and malloc safe-linking
to arch-specific ones such as x86_64 CET or aarch64 BTI and PAC/RET).

My point is more that what we usually see in generic distributions is the
adoption of broader security features.  I am not sure about embedded though.

> 
>>
>> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=19861
>> [2] https://sourceware.org/pipermail/libc-alpha/2021-August/130404.html
>> [3] https://git.yoctoproject.org/prelink-cross/
>>
Mark Hatle Jan. 13, 2022, 6 p.m. UTC | #8
On 1/13/22 11:20 AM, Adhemerval Zanella wrote:
>> When I last profiled this, roughly 3 1/2 years ago, the run-time linking speedup was huge.  There were two main advantages to this:
>>
>> * Run Time linking speedup - primarily helped initial application loads.  System boot times went from 10-15 seconds down to 4-5 seconds.  For embedded systems this was massive.
> 
> Right, this is interesting.  Any profile data showing where exactly the speedup is coming
> from?  I wonder if we could get any gain by optimizing the normal path without
> the need to resort to prelink.

glibc's runtime linker is very efficient; I honestly don't expect many speedups 
at this point.

This is partially from memory, so I may have a few details wrong... but

LD_DEBUG=statistics

On my Ubuntu machine, just setting that and running /bin/bash results in:

     334067:	
     334067:	runtime linker statistics:
     334067:	  total startup time in dynamic loader: 252415 cycles
     334067:	            time needed for relocation: 119006 cycles (47.1%)
     334067:	                 number of relocations: 412
     334067:	      number of relocations from cache: 3
     334067:	        number of relative relocations: 5100
     334067:	           time needed to load objects: 92655 cycles (36.7%)
     334068:	
     334068:	runtime linker statistics:
     334068:	  total startup time in dynamic loader: 125018 cycles
     334068:	            time needed for relocation: 40554 cycles (32.4%)
     334068:	                 number of relocations: 176
     334068:	      number of relocations from cache: 3
     334068:	        number of relative relocations: 1534
     334068:	           time needed to load objects: 45882 cycles (36.7%)
     334069:	
     334069:	runtime linker statistics:
     334069:	  total startup time in dynamic loader: 121500 cycles
     334069:	            time needed for relocation: 39067 cycles (32.1%)
     334069:	                 number of relocations: 136
     334069:	      number of relocations from cache: 3
     334069:	        number of relative relocations: 1274
     334069:	           time needed to load objects: 47505 cycles (39.0%)
     334071:	
     334071:	runtime linker statistics:
     334071:	  total startup time in dynamic loader: 111850 cycles
     334071:	            time needed for relocation: 35089 cycles (31.3%)
     334071:	                 number of relocations: 135
     334071:	      number of relocations from cache: 3
     334071:	        number of relative relocations: 1272
     334071:	           time needed to load objects: 45746 cycles (40.8%)
     334072:	
     334072:	runtime linker statistics:
     334072:	  total startup time in dynamic loader: 109827 cycles
     334072:	            time needed for relocation: 34863 cycles (31.7%)
     334072:	                 number of relocations: 145
     334072:	      number of relocations from cache: 3
     334072:	        number of relative relocations: 1351
     334072:	           time needed to load objects: 45565 cycles (41.4%)

(Why so many?  Because it runs through the profile and other bash startup 
files, which end up running additional items.)
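If you want to compare runs like the ones above mechanically, a small parser helps.  This is just a sketch against the LD_DEBUG=statistics text format, which is not a stable interface:

```python
import re

def parse_ld_statistics(text):
    """Collect the numeric fields from LD_DEBUG=statistics output.

    Returns one dict per 'runtime linker statistics:' block, keyed by
    field name, with cycle counts and relocation counts as ints.
    """
    blocks = []
    for chunk in text.split("runtime linker statistics:")[1:]:
        stats = {}
        # Field names are lowercase words; values are plain integers
        # (trailing "cycles (NN.N%)" annotations are ignored).
        for field, value in re.findall(r"([a-z][a-z ]+): (\d+)", chunk):
            stats[field.strip()] = int(value)
        blocks.append(stats)
    return blocks

sample = """
    334067: runtime linker statistics:
    334067:   total startup time in dynamic loader: 252415 cycles
    334067:             time needed for relocation: 119006 cycles (47.1%)
    334067:                  number of relocations: 412
"""
stats = parse_ld_statistics(sample)[0]
print(stats["number of relocations"])  # 412
```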

When the prelinker worked, the number of relocations (and especially cycles) 
required dropped to about 1-10% of the original application.  This, compounded by 
the large number of executables loaded at boot (think of sysvinit with all of 
the shells started and destroyed), turned into a massive speedup during the early 
boot process.

As a normal "user" behavior, the speedup is negligible, because the amount of 
time spent loading vs running is nothing... but in automated processing where 
something like bash is started, runs for a fraction of a second, exits, and 
repeats 1000s of times, it really becomes a massive part of the time scale.

So back to the above: I know that in one instance bash would end up with 
about 4 relocations, with 400+ from cache, with the prelinker -- resulting in the 
cycles required for relocations being around 10% of overall load time, with the 
time needed to load objects being roughly 90%.

>>
>> * Memory usage.  The COW page usage for runtime linking can be significant on memory constrained systems.  Prelinking dropped the COW page usage in the systems I was looking at to about 10% of what it was prior.  This is believed to have further contributed to the boot time optimizations.
> 
> Interesting, why exactly does prelinking help with COW usage?  I would expect memory
> utilization to be roughly the same; is prelinking helping to align the segments
> in a better way?

Each time a relocation occurs, the runtime linker needs to write into a page 
with that address.  No relocation, no runtime write, no COW page created.

Add to this mmap usage between applications, and you can run say 100 bash 
sessions and each session would use a fraction of the COW pages that it would 
without prelinking.

At one point I had statistics on this, but I don't even remember how this was 
calculated or done anymore.  (I had help from some kernel people to show me 
kernel memory use, contiguous pages, etc..)
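On Linux the effect is still easy to observe per process: relocation writes land in private dirty pages, which show up as Private_Dirty in /proc/&lt;pid&gt;/smaps.  A rough sketch of summing them (assumes the standard smaps field layout, values in kB):

```python
def private_dirty_kb(smaps_text):
    """Sum Private_Dirty across all mappings in smaps-format text.

    Pages the dynamic linker wrote relocations into count here; pages
    still shared clean with the page cache do not.
    """
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Private_Dirty:"):
            total += int(line.split()[1])  # value is in kB
    return total

# Usage on a live process (Linux only):
# with open("/proc/self/smaps") as f:
#     print(private_dirty_kb(f.read()), "kB")
```

Running this across many short-lived processes, with and without prelink, would quantify the COW savings described above.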

>>
>> Last I looked at this, only about 20-30% of the system is capable of prelinking anymore due to the evolutionary changes in the various toolchain elements, introducing new relocations and related components.  Even things like re-ordering sections (done a couple of years ago in binutils) have broken the prelinker in mysterious ways.
> 
> Yes, and it is even harder for a project that depends on both the static
> and the dynamic linker to have out-of-tree development without a
> proper ABI definition.  That's why I think prelink is currently a hackish
> solution with niche usage that adds a lot of complexity to the code base.
> 
> For instance, we are aiming to support DT_RELR, which would help
> decrease the relocation segment size for PIE binaries.  It would
> probably be another feature that prelink lacks support for.
> 
> In fact, this information you provided that only 20-30% of all binaries
> are supported makes me even more willing to really deprecate prelink.

prelink has a huge advantage on embedded systems -- but it hasn't worked well 
for about 3 years now.  I was hoping that, beyond life support, someone would 
step up and contribute, and it never really happened.  There were a few 
bugs/fixes sent by Mentor that kept things going on a few platforms -- but even 
that eventually dried up.  (This is meant to thank them for the code and 
contributions they did!)

>>
>> Add to this the IMHO mistaken belief that ASLR is some magic security device.  I see it more as a component of security that needs to be part of a broader system that you can decide to trade off against load performance (including memory). But the broad "if you don't use ASLR your device is vulnerable" mentality has seriously restricted the desire for people to use, improve and contribute back to the prelinker.
> 
> ASLR/PIE is not a silver bullet, especially with limited mmap entropy on
> 32-bit systems.  But it is a gradual improvement among the multiple security
> features we support (from generic ones such as relro and malloc safe-linking
> to arch-specific ones such as x86_64 CET or aarch64 BTI and PAC/RET).

Exactly -- multiple security features work together for a purpose.  But 
everyone got convinced ASLR was a silver bullet, and that is what started the 
final death spiral of the prelinker (as it is today).

> My point is more that what we usually see in generic distributions is the
> adoption of broader security features.  I am not sure about embedded though.

Embedded needs security, no doubt... but with the limited entropy (even on 
64-bit, the entropy is truly limited -- great, I now have to run my attack 15 
times instead of 5, that really isn't much of an improvement!) ASLR has become 
a checklist item for some security consultant to approve a product release.

Things like CET and BTI / PAC/RET have a much larger real-world security impact, 
IMHO.

So in the end, the embedded development that I've been involved with has always 
had a series of trade-offs: "these are our options; in a perfect world we'd use 
them all -- but we don't have the memory (prelink helped), we've got disk space 
limits (can't use PAC/RET, binaries get bigger), we need to be able to upgrade 
the software (prelink on the device? send pre-prelinked images to all devices?), 
we've got industry requirements (not all devices should have the same memory 
map; have prelink randomize addresses?), we've got maximum boot time 
requirements, etc."  It's not cut and dried which combination of those 
requirements, and which technologies (such as the prelinker), should be used to 
meet them.   As we have fewer operating system engineers, the preference is 
moving away from tools like prelink and lots of simple utilities toward 
alternatives like "jumbo do it all" binaries that only get loaded once: avoiding 
initscript systems and packing system initialization into those binaries, or 
even moving to other libcs that have less relocation pressure (due to smaller 
libraries, feature sets, etc.).

If you declare prelink dead, then it's dead -- nobody will be bringing it back. 
But technology-wise I do still believe it's a good fit for embedded 
systems (remember, embedded doesn't mean "small") to help them meet specific 
integration needs.  But without help from people with the appropriate knowledge 
to implement new features, like DT_RELR, in the prelinker, there is little 
chance that it is anything but on life support.

--Mark

>>
>>>
>>> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=19861
>>> [2] https://sourceware.org/pipermail/libc-alpha/2021-August/130404.html
>>> [3] https://git.yoctoproject.org/prelink-cross/
>>>
Adhemerval Zanella Jan. 13, 2022, 6:37 p.m. UTC | #9
On 13/01/2022 15:00, Mark Hatle wrote:
> 
> 
> On 1/13/22 11:20 AM, Adhemerval Zanella wrote:
>>> When I last profiled this, roughly 3 1/2 years ago, the run-time linking speedup was huge.  There were two main advantages to this:
>>>
>>> * Run Time linking speedup - primarily helped initial application loads.  System boot times went from 10-15 seconds down to 4-5 seconds.  For embedded systems this was massive.
>>
>> Right, this is interesting.  Any profile data showing where exactly the speedup is coming
>> from?  I wonder if we could get any gain by optimizing the normal path without
>> the need to resort to prelink.
> 
> glibc's runtime linker is very efficient; I honestly don't expect many speedups at this point.
> 
> This is partially from memory, so I may have a few details wrong... but
> 
> LD_DEBUG=statistics
> 
> On my Ubuntu machine, just setting that and running /bin/bash results in:
> 
>     334067:   
>     334067:    runtime linker statistics:
>     334067:      total startup time in dynamic loader: 252415 cycles
>     334067:                time needed for relocation: 119006 cycles (47.1%)
>     334067:                     number of relocations: 412
>     334067:          number of relocations from cache: 3
>     334067:            number of relative relocations: 5100
>     334067:               time needed to load objects: 92655 cycles (36.7%)
>     334068:   
>     334068:    runtime linker statistics:
>     334068:      total startup time in dynamic loader: 125018 cycles
>     334068:                time needed for relocation: 40554 cycles (32.4%)
>     334068:                     number of relocations: 176
>     334068:          number of relocations from cache: 3
>     334068:            number of relative relocations: 1534
>     334068:               time needed to load objects: 45882 cycles (36.7%)
>     334069:   
>     334069:    runtime linker statistics:
>     334069:      total startup time in dynamic loader: 121500 cycles
>     334069:                time needed for relocation: 39067 cycles (32.1%)
>     334069:                     number of relocations: 136
>     334069:          number of relocations from cache: 3
>     334069:            number of relative relocations: 1274
>     334069:               time needed to load objects: 47505 cycles (39.0%)
>     334071:   
>     334071:    runtime linker statistics:
>     334071:      total startup time in dynamic loader: 111850 cycles
>     334071:                time needed for relocation: 35089 cycles (31.3%)
>     334071:                     number of relocations: 135
>     334071:          number of relocations from cache: 3
>     334071:            number of relative relocations: 1272
>     334071:               time needed to load objects: 45746 cycles (40.8%)
>     334072:   
>     334072:    runtime linker statistics:
>     334072:      total startup time in dynamic loader: 109827 cycles
>     334072:                time needed for relocation: 34863 cycles (31.7%)
>     334072:                     number of relocations: 145
>     334072:          number of relocations from cache: 3
>     334072:            number of relative relocations: 1351
>     334072:               time needed to load objects: 45565 cycles (41.4%)
> 
> (Why so many?  Because it's running through the profile and other bash startup files, which end up running additional items.)
> 
> When the prelinker worked...  the number of relocations (and especially cycles) required dropped to about 1-10% of the original application.  Compounded by the large number of executables loaded at boot (think of sysvinit, with all of the shells started and destroyed), this turned into a massive speedup during the early boot process.
> 
> As a normal "user" behavior, the speedup is negligible, because the amount of time spent loading vs running is nothing... but in automated processing, where something like bash is started, runs for a fraction of a second, exits, and repeats thousands of times, it really becomes a massive part of the time scale.
> 
> So back to the above: I know that in one instance bash would end up with about 4 relocations, with 400+ from cache, with the prelinker -- resulting in the cycles required for relocations being around 10% of overall load time, with the time needed to load objects being roughly 90%.
> 

Right, the compound improvements over all binaries make sense.
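(For reference, statistics like the ones quoted above can be reproduced on a glibc system; the `grep` pattern here is just one way to trim the output down to the interesting lines.)

```shell
# LD_DEBUG=statistics makes ld.so print relocation counts and cycle
# timings to stderr for each process it starts.
LD_DEBUG=statistics /bin/bash -c true 2>&1 | grep -E 'startup time|relocations'
```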

>>>
>>> * Memory usage.  The COW page usage for runtime linking can be significant on memory constrained systems.  Prelinking dropped the COW page usage in the systems I was looking at to about 10% of what it was prior.  This is believed to have further contributed to the boot time optimizations.
>>
>> Interesting, why exactly does prelinking help with COW usage?  I would expect memory
>> utilization to be roughly the same; is prelinking helping by aligning the segments
>> in a better way?
> 
> Each time a relocation occurs, the runtime linker needs to write into the page containing that address.  No relocation, no runtime write, no COW page created.
> 
> Add to this mmap usage between applications, and you can run say 100 bash sessions and each session would use a fraction of the COW pages that it would without prelinking.
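(A rough way to observe this on a live Linux system: the Private_Dirty counters in /proc/<pid>/smaps count the pages that writes, including relocation writes, have actually COW-copied for that process.)

```shell
# Sum the private-dirty pages of the current shell -- a coarse proxy
# for its COW cost.  Values in smaps are reported in kB.
awk '/^Private_Dirty:/ { sum += $2 } END { print sum, "kB private dirty" }' "/proc/$$/smaps"
```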

Yes, but without taking into consideration TEXTREL or a writable PLT, I am 
not seeing how COW would help, since the GOT (where most if not all 
relocations would happen) is in anonymous mappings.

> 
> At one point I had statistics on this, but I don't even remember how this was calculated or done anymore.  (I had help from some kernel people to show me kernel memory use, contiguous pages, etc..)
> 
>>>
>>> Last I looked at this, only about 20-30% of the system is capable of prelinking anymore, due to the evolutionary changes in the various toolchain elements introducing new relocations and related components.  Even things like re-ordering sections (done a couple of years ago in binutils) have broken the prelinker in mysterious ways.
>>
>> Yes, and it is even harder for a project that depends on both the
>> static and dynamic linker to have out-of-tree development without a
>> proper ABI definition.  That's why I think prelink is currently a hackish
>> solution with a niche usage that adds a lot of complexity to the code base.
>>
>> For instance, we are aiming to support DT_RELR, which would help to
>> decrease the relocation segment size for PIE binaries.  It will
>> probably be another feature that prelink lacks support for.
>>
>> In fact, this information you provided, that only 20-30% of all binaries
>> are supported, makes me even more willing to really deprecate prelink.
> 
> prelink has a huge advantage on embedded systems -- but it hasn't worked well for about 3 years now...  I was hoping that, rather than just life support, someone would step up and contribute, and it never really happened.  There were a few bugs/fixes sent by Mentor that kept things going on a few platforms -- but even that eventually dried up.  (This is meant as thanks for the code and contributions they made!)
> 
>>>
>>> Add to this the IMHO mistaken belief that ASLR is some magic security device.  I see it more as a component of security that needs to be part of a broader system, one that you can decide to trade off against load performance (including memory).  But the broad "if you don't use ASLR your device is vulnerable" mentality has seriously restricted the desire for people to use, improve, and contribute back to the prelinker.
>>
>> ASLR/PIE is not a silver bullet, especially with limited mmap entropy on
>> 32-bit systems.  But it is a gradual improvement alongside the multiple security
>> features we support (from generic ones such as relro and malloc safe-linking
>> to arch-specific ones such as x86_64 CET or aarch64 BTI and PAC/RET).
> 
> Exactly, it's multiple security features working together for a purpose.  But everyone got convinced ASLR was a silver bullet, and that is what started the final death spiral of the prelinker (as it is today).
> 
>> My point is more that what we usually see is that generic distributions
>> use broader security features.  I am not sure about embedded, though.
> 
> Embedded needs security, no doubt... but with the limited entropy (even on 64-bit, the entropy is truly limited... great, I now have to run my attack 15 times instead of 5... that really isn't much of an improvement!) ASLR has become a checklist item for some security consultant to approve a product release.
> 
> Things like CET and BTI / PAC/RET have a much larger real-world security impact, IMHO.
> 
> So in the end, the embedded development that I've been involved with has always had a series of "these are our options; in a perfect world we'd use them all -- but we don't have the memory (prelink helped), we've got disk space limits (can't use PAC/RET, binaries get bigger), we need to be able to upgrade the software (prelink on the device?  send pre-prelinked to all devices), we've got industry requirements (not all devices should have the same memory map; prelink randomizes addresses?), we've got maximum boot time requirements, etc."  It's not cut and dried which combination of those requirements, and which technologies (such as the prelinker), should be used to meet them.  As we have fewer operating system engineers, the preference is going away from using tools like prelink and lots of simple utilities toward alternatives like "jumbo do it all" binaries that only get loaded once: avoiding initscript systems and packing system initialization into those binaries, or even moving
> to other libcs that have less relocation pressure (due to smaller libraries, feature sets, etc.)
> 
> If you declare prelink dead, then it's dead... nobody will be bringing it back. But I do still believe, technology-wise, it's a good technology for embedded systems (remember embedded doesn't mean "small") to help them meet specific integration needs.  But without help from people with the appropriate knowledge to implement new features, like DT_RELR, in the prelinker -- there is little chance that it is anything but on life support.
> 

More and more I see that proper static linking is a *much* simpler
solution than prelink, with the advantage that it also decreases code
complexity and attack surface, and can keep up with ABI extensions much
more easily.  For instance, static PIE is now supported on both glibc
and musl.

And it is not that I declare it dead, but it will become dead-weight
support that we will need to provide for the sake of a handful of specific
usages that, due to lack of maintenance, will have subtle issues and missing
support for new ABI extensions.
Carlos O'Donell Jan. 13, 2022, 7:01 p.m. UTC | #10
On 1/13/22 13:37, Adhemerval Zanella wrote:
> And it is not that I declare it dead, but it will become dead-weight
> support that we will need to provide for the sake of a handful of specific
> usages that, due to lack of maintenance, will have subtle issues and missing
> support for new ABI extensions.
 
*I* declare prelink dead because there is no interest in prelink.

Interest in prelink would entail IHVs, ISVs, or volunteers supporting the feature
upstream in *all* impacted components.

Lack of interest means the feature is going to be removed to reduce the
maintenance burden overall.
Carlos O'Donell Jan. 13, 2022, 7:02 p.m. UTC | #11
On 12/9/21 18:53, Khem Raj via Libc-alpha wrote:
> From: Mark Hatle <mark.hatle@windriver.com>
> 
> According to the ELF specification:
> 
> When resolving symbolic references, the dynamic linker examines the symbol
> tables with a breadth-first search.
> 
> This function was using a depth first search.  By doing so the conflict
> resolution reported to the prelinker (when LD_TRACE_PRELINKING=1 is set)
> was incorrect.  This caused problems when there were various circular
> dependencies between libraries.  The problem usually manifested itself by
> the wrong IFUNC being executed.
> 
> Similar issue has been reported here [1]
> 
> [BZ# 20488]
> 
> [1] https://sourceware.org/legacy-ml/libc-alpha/2016-05/msg00034.html
> 
> Signed-off-by: Mark Hatle <mark.hatle@kernel.crashing.org>
> Signed-off-by: Khem Raj <raj.khem@gmail.com>
This needs much more discussion before inclusion.

At the very least we need clear test cases that show the problem.

The alternative patch I would support is prelink support removal.
diff mbox series

Patch

diff --git a/elf/dl-deps.c b/elf/dl-deps.c
index 237d9636c5..e15f7f83d8 100644
--- a/elf/dl-deps.c
+++ b/elf/dl-deps.c
@@ -73,13 +73,19 @@  _dl_build_local_scope (struct link_map **list, struct link_map *map)
 {
   struct link_map **p = list;
   struct link_map **q;
+  struct link_map **r;
 
   *p++ = map;
   map->l_reserved = 1;
-  if (map->l_initfini)
-    for (q = map->l_initfini + 1; *q; ++q)
-      if (! (*q)->l_reserved)
-	p += _dl_build_local_scope (p, *q);
+
+  for (r = list; r < p; ++r)
+    if ((*r)->l_initfini)
+      for (q = (*r)->l_initfini + 1; *q; ++q)
+	if (! (*q)->l_reserved)
+	  {
+	    *p++ = *q;
+	    (*q)->l_reserved = 1;
+	  }
   return p - list;
 }