elf: Fix TLS modid reuse generation assignment

Message ID 20231128-tls-modid-reuse-v1-1-431c73f37fc7@marcan.st
State Committed
Commit 3921c5b40f293c57cb326f58713c924b0662ef59
Headers
Series elf: Fix TLS modid reuse generation assignment |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 warning Patch is already merged

Commit Message

Hector Martin Nov. 28, 2023, 6:23 a.m. UTC
  _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does
*not* do anything to the generation counter. The first time this
happens, the generation is zero and map_generation() returns the current
generation to be used during relocation processing. However, if a
slotinfo entry is later reused, it will already have a generation
assigned. If this generation has fallen behind the current global max
generation, then this causes an obsolete generation to be assigned
during relocation processing, as map_generation() returns this
generation if nonzero. _dl_add_to_slotinfo() eventually resets the
generation, but by then it is too late. This causes DTV updates to be
skipped, leading to NULL or broken TLS slot pointers and segfaults.

Fix this by resetting the generation to zero in _dl_assign_tls_modid(),
so it behaves the same as the first time a slot is assigned.
_dl_add_to_slotinfo() will still assign the correct static generation
later during module load, but relocation processing will no longer use
an obsolete generation.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039
---
 elf/dl-tls.c | 1 +
 1 file changed, 1 insertion(+)


---
base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec
change-id: 20231128-tls-modid-reuse-0a7a903a1f7e

Best regards,
  

Comments

Szabolcs Nagy Nov. 28, 2023, 10:53 a.m. UTC | #1
The 11/28/2023 15:23, Hector Martin wrote:
> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does
> *not* do anything to the generation counter. The first time this
> happens, the generation is zero and map_generation() returns the current
> generation to be used during relocation processing. However, if a
> slotinfo entry is later reused, it will already have a generation
> assigned. If this generation has fallen behind the current global max
> generation, then this causes an obsolete generation to be assigned
> during relocation processing, as map_generation() returns this
> generation if nonzero. _dl_add_to_slotinfo() eventually resets the
> generation, but by then it is too late. This causes DTV updates to be
> skipped, leading to NULL or broken TLS slot pointers and segfaults.
> 
> Fix this by resetting the generation to zero in _dl_assign_tls_modid(),
> so it behaves the same as the first time a slot is assigned.
> _dl_add_to_slotinfo() will still assign the correct static generation
> later during module load, but relocation processing will no longer use
> an obsolete generation.
> 
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039

Thanks, this looks good to me.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

i'd note that only TLSDESC relocation processing is affected
and in practice modid reuse happens after a dlclose.

we usually only mention the bug number in the commit message
not the bugzilla url.

i can commit your patch with these changes if that's ok with you.


i think the bug is present since

  commit 572bd547d57a39b6cf0ea072545dc4048921f4c3
  Author:     Szabolcs Nagy <szabolcs.nagy@arm.com>
  CommitDate: 2021-05-11 17:16:37 +0100

  elf: Fix DTV gap reuse logic [BZ #27135]

which got reverted and an updated version committed

  commit ba33937be210da5d07f7f01709323743f66011ce
  Author:     Adhemerval Zanella <adhemerval.zanella@linaro.org>
  CommitDate: 2021-07-14 15:10:27 -0300

  elf: Fix DTV gap reuse logic (BZ #27135)

which is part of the 2.34 release, so we will have to
backport up to that version.

before that, modid reuse was hard to trigger (only dlopen
failure would do it and the failure would have to happen
after gen counter assignment for it to be problematic).

> ---
>  elf/dl-tls.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index c192b5a13a94..70446e71a8c4 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l)
>  	      {
>  		/* Mark the entry as used, so any dependency see it.  */
>  		atomic_store_relaxed (&runp->slotinfo[result - disp].map, l);
> +		atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0);
>  		break;
>  	      }
>  
> 
> ---
> base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec
> change-id: 20231128-tls-modid-reuse-0a7a903a1f7e
> 
> Best regards,
> -- 
> Hector Martin <marcan@marcan.st>
>
  
Adhemerval Zanella Netto Nov. 28, 2023, 12:34 p.m. UTC | #2
On 28/11/23 07:53, Szabolcs Nagy wrote:
> The 11/28/2023 15:23, Hector Martin wrote:
>> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does
>> *not* do anything to the generation counter. The first time this
>> happens, the generation is zero and map_generation() returns the current
>> generation to be used during relocation processing. However, if a
>> slotinfo entry is later reused, it will already have a generation
>> assigned. If this generation has fallen behind the current global max
>> generation, then this causes an obsolete generation to be assigned
>> during relocation processing, as map_generation() returns this
>> generation if nonzero. _dl_add_to_slotinfo() eventually resets the
>> generation, but by then it is too late. This causes DTV updates to be
>> skipped, leading to NULL or broken TLS slot pointers and segfaults.
>>
>> Fix this by resetting the generation to zero in _dl_assign_tls_modid(),
>> so it behaves the same as the first time a slot is assigned.
>> _dl_add_to_slotinfo() will still assign the correct static generation
>> later during module load, but relocation processing will no longer use
>> an obsolete generation.
>>
>> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039
> 
> Thanks, this looks good to me.
> 
> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
> 
> i'd note that only TLSDESC relocation processing is affected
> and in practice modid reuse happens after a dlclose.
> 
> we usually only mention the bug number in the commit message
> not the bugzilla url.
> 
> i can commit your patch with these changes if that's ok with you.
> 
> 
> i think the bug is present since
> 
>   commit 572bd547d57a39b6cf0ea072545dc4048921f4c3
>   Author:     Szabolcs Nagy <szabolcs.nagy@arm.com>
>   CommitDate: 2021-05-11 17:16:37 +0100
> 
>   elf: Fix DTV gap reuse logic [BZ #27135]
> 
> which got reverted and an updated version committed
> 
>   commit ba33937be210da5d07f7f01709323743f66011ce
>   Author:     Adhemerval Zanella <adhemerval.zanella@linaro.org>
>   CommitDate: 2021-07-14 15:10:27 -0300
> 
>   elf: Fix DTV gap reuse logic (BZ #27135)
> 
> which is part of the 2.34 release, so we will have to
> backport up to that version.
> 
> before that, modid reuse was hard to trigger (only dlopen
> failure would do it and the failure would have to happen
> after gen counter assignment for it to be problematic).

Would be possible to create a regression testcase for this issue? From RH
bug report [1], it seems to describe a somewhat feasible scenario.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2251557

> 
>> ---
>>  elf/dl-tls.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>> index c192b5a13a94..70446e71a8c4 100644
>> --- a/elf/dl-tls.c
>> +++ b/elf/dl-tls.c
>> @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l)
>>  	      {
>>  		/* Mark the entry as used, so any dependency see it.  */
>>  		atomic_store_relaxed (&runp->slotinfo[result - disp].map, l);
>> +		atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0);
>>  		break;
>>  	      }
>>  
>>
>> ---
>> base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec
>> change-id: 20231128-tls-modid-reuse-0a7a903a1f7e
>>
>> Best regards,
>> -- 
>> Hector Martin <marcan@marcan.st>
>>
  
Hector Martin Nov. 28, 2023, 2:04 p.m. UTC | #3
On 2023/11/28 19:53, Szabolcs Nagy wrote:
> The 11/28/2023 15:23, Hector Martin wrote:
>> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does
>> *not* do anything to the generation counter. The first time this
>> happens, the generation is zero and map_generation() returns the current
>> generation to be used during relocation processing. However, if a
>> slotinfo entry is later reused, it will already have a generation
>> assigned. If this generation has fallen behind the current global max
>> generation, then this causes an obsolete generation to be assigned
>> during relocation processing, as map_generation() returns this
>> generation if nonzero. _dl_add_to_slotinfo() eventually resets the
>> generation, but by then it is too late. This causes DTV updates to be
>> skipped, leading to NULL or broken TLS slot pointers and segfaults.
>>
>> Fix this by resetting the generation to zero in _dl_assign_tls_modid(),
>> so it behaves the same as the first time a slot is assigned.
>> _dl_add_to_slotinfo() will still assign the correct static generation
>> later during module load, but relocation processing will no longer use
>> an obsolete generation.
>>
>> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039
> 
> Thanks, this looks good to me.
> 
> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
> 
> i'd note that only TLSDESC relocation processing is affected
> and in practice modid reuse happens after a dlclose.
> 
> we usually only mention the bug number in the commit message
> not the bugzilla url.
> 
> i can commit your patch with these changes if that's ok with you.

Works for me, thanks :)

> 
> 
> i think the bug is present since
> 
>   commit 572bd547d57a39b6cf0ea072545dc4048921f4c3
>   Author:     Szabolcs Nagy <szabolcs.nagy@arm.com>
>   CommitDate: 2021-05-11 17:16:37 +0100
> 
>   elf: Fix DTV gap reuse logic [BZ #27135]
> 
> which got reverted and an updated version committed
> 
>   commit ba33937be210da5d07f7f01709323743f66011ce
>   Author:     Adhemerval Zanella <adhemerval.zanella@linaro.org>
>   CommitDate: 2021-07-14 15:10:27 -0300
> 
>   elf: Fix DTV gap reuse logic (BZ #27135)
> 
> which is part of the 2.34 release, so we will have to
> backport up to that version.
> 
> before that, modid reuse was hard to trigger (only dlopen
> failure would do it and the failure would have to happen
> after gen counter assignment for it to be problematic).
> 
>> ---
>>  elf/dl-tls.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>> index c192b5a13a94..70446e71a8c4 100644
>> --- a/elf/dl-tls.c
>> +++ b/elf/dl-tls.c
>> @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l)
>>  	      {
>>  		/* Mark the entry as used, so any dependency see it.  */
>>  		atomic_store_relaxed (&runp->slotinfo[result - disp].map, l);
>> +		atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0);
>>  		break;
>>  	      }
>>  
>>
>> ---
>> base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec
>> change-id: 20231128-tls-modid-reuse-0a7a903a1f7e
>>
>> Best regards,
>> -- 
>> Hector Martin <marcan@marcan.st>
>>
> 

- Hector
  
Szabolcs Nagy Nov. 28, 2023, 4:43 p.m. UTC | #4
The 11/28/2023 09:34, Adhemerval Zanella Netto wrote:
> 
> 
> On 28/11/23 07:53, Szabolcs Nagy wrote:
> > The 11/28/2023 15:23, Hector Martin wrote:
> >> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does
> >> *not* do anything to the generation counter. The first time this
> >> happens, the generation is zero and map_generation() returns the current
> >> generation to be used during relocation processing. However, if a
> >> slotinfo entry is later reused, it will already have a generation
> >> assigned. If this generation has fallen behind the current global max
> >> generation, then this causes an obsolete generation to be assigned
> >> during relocation processing, as map_generation() returns this
> >> generation if nonzero. _dl_add_to_slotinfo() eventually resets the
> >> generation, but by then it is too late. This causes DTV updates to be
> >> skipped, leading to NULL or broken TLS slot pointers and segfaults.
> >>
> >> Fix this by resetting the generation to zero in _dl_assign_tls_modid(),
> >> so it behaves the same as the first time a slot is assigned.
> >> _dl_add_to_slotinfo() will still assign the correct static generation
> >> later during module load, but relocation processing will no longer use
> >> an obsolete generation.
> >>
> >> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039
> > 
> > Thanks, this looks good to me.
> > 
> > Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
> > 
> > i'd note that only TLSDESC relocation processing is affected
> > and in practice modid reuse happens after a dlclose.
> > 
> > we usually only mention the bug number in the commit message
> > not the bugzilla url.
> > 
> > i can commit your patch with these changes if that's ok with you.
> > 
> > 
> > i think the bug is present since
> > 
> >   commit 572bd547d57a39b6cf0ea072545dc4048921f4c3
> >   Author:     Szabolcs Nagy <szabolcs.nagy@arm.com>
> >   CommitDate: 2021-05-11 17:16:37 +0100
> > 
> >   elf: Fix DTV gap reuse logic [BZ #27135]
> > 
> > which got reverted and an updated version committed
> > 
> >   commit ba33937be210da5d07f7f01709323743f66011ce
> >   Author:     Adhemerval Zanella <adhemerval.zanella@linaro.org>
> >   CommitDate: 2021-07-14 15:10:27 -0300
> > 
> >   elf: Fix DTV gap reuse logic (BZ #27135)
> > 
> > which is part of the 2.34 release, so we will have to
> > backport up to that version.
> > 
> > before that, modid reuse was hard to trigger (only dlopen
> > failure would do it and the failure would have to happen
> > after gen counter assignment for it to be problematic).
> 
> Would be possible to create a regression testcase for this issue? From RH
> bug report [1], it seems to describe a somewhat feasible scenario.
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=2251557


note that the bug only happens if the code goes through
_dl_make_tlsdesc_dynamic (and later accesses the tls object
via _dl_tlsdesc_dynamic, not dlsym or __tls_get_addr)

because tlsdesc is optimized to use static tls, it should
not happen up to 512 byte tls (so naive test code won't
reproduce the bug).

(it also requires specific conditions to get an out of date
gen count in _dl_make_tlsdesc_dynamic and get a thread dtv
where that causes trouble.)

however this means we can recommend large surplus tls as
a workaround, e.g.:

GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4000

which is useful anyway: dynamic tlsdesc fast path is slower
than the fixed offset tlsdesc access, so it makes tls access
faster too.

i will add this to the commit message and can look at adding
a regression test later.
  

Patch

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index c192b5a13a94..70446e71a8c4 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -154,6 +154,7 @@  _dl_assign_tls_modid (struct link_map *l)
 	      {
 		/* Mark the entry as used, so any dependency see it.  */
 		atomic_store_relaxed (&runp->slotinfo[result - disp].map, l);
+		atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0);
 		break;
 	      }