[v2,08/14] elf: Fix DTV gap reuse logic [BZ #27135]

Message ID 299d28c6695cd2e76f222b0d36b17b7124c549e7.1618301209.git.szabolcs.nagy@arm.com
State Committed
Commit 572bd547d57a39b6cf0ea072545dc4048921f4c3
Delegated to: Adhemerval Zanella Netto
Series Dynamic TLS related data race fixes

Commit Message

Szabolcs Nagy April 13, 2021, 8:20 a.m. UTC
For some reason only dlopen failure caused dtv gaps to be reused.

It is possible that the intent was to never reuse modids for a
different module, but after a dlopen failure all gaps are reused,
not just the ones caused by the unfinished dlopen.

So the code already has to handle reused modids, which seems to
work. However, the data races at thread creation and TLS access
(see bug 19329 and bug 27111) may be more severe if slots are
reused, so this is scheduled after those fixes. I think fixing
the races is not simpler if reuse is disallowed, and reuse has
other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
removed from the middle of the slotinfo list. The value does
not have to be correct: an incorrect true value causes the next
modid query to do a slotinfo walk; an incorrect false value leaves
gaps, and new entries are added at the end.

Fixes bug 27135.
---
 elf/dl-close.c |  6 +++++-
 elf/dl-open.c  | 10 ----------
 elf/dl-tls.c   |  5 +----
 3 files changed, 6 insertions(+), 15 deletions(-)
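
The claim that the flag is only a hint can be illustrated with a toy
model. This is a minimal sketch, not glibc code: struct toy_tls,
toy_alloc_modid and toy_free_modid are hypothetical names, a plain
array stands in for the slotinfo list, and the toy marks a slot as
used as soon as it hands it out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLOTS 16

/* Toy stand-in for the slotinfo list: used[i] says whether modid i is
   taken.  Index 0 is unused because modids start at 1.  */
struct toy_tls
{
  bool used[MAX_SLOTS];
  size_t max_idx;   /* Plays the role of GL(dl_tls_max_dtv_idx).  */
  bool gaps;        /* Plays the role of GL(dl_tls_dtv_gaps).  */
  size_t walks;     /* Counts slot walks, for illustration only.  */
};

/* Return a modid and mark it used.  */
size_t
toy_alloc_modid (struct toy_tls *t)
{
  if (t->gaps)
    {
      t->walks++;
      for (size_t i = 1; i <= t->max_idx; i++)
        if (!t->used[i])
          {
            t->used[i] = true;
            return i;
          }
      /* The hint was stale: no gap exists, so stop walking.  */
      t->gaps = false;
    }
  /* No known gaps: append at the end.  */
  assert (t->max_idx + 1 < MAX_SLOTS);
  t->max_idx++;
  t->used[t->max_idx] = true;
  return t->max_idx;
}

/* Free a modid; set the hint only when the hole is in the middle.  */
void
toy_free_modid (struct toy_tls *t, size_t id)
{
  t->used[id] = false;
  if (id != t->max_idx)
    t->gaps = true;
  else
    while (t->max_idx > 0 && !t->used[t->max_idx])
      t->max_idx--;
}
```

In this model a stale true hint costs one extra walk and is then
cleared, while a stale false hint only means a free slot in the middle
stays unused and the new modid is appended.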
  

Comments

Adhemerval Zanella April 15, 2021, 7:45 p.m. UTC | #1
On 13/04/2021 05:20, Szabolcs Nagy via Libc-alpha wrote:
> For some reason only dlopen failure caused dtv gaps to be reused.
> 
> It is possible that the intent was to never reuse modids for a
> different module, but after dlopen failure all gaps are reused
> not just the ones caused by the unfinished dlopened.
> 
> So the code has to handle reused modids already which seems to
> work, however the data races at thread creation and tls access
> (see bug 19329 and bug 27111) may be more severe if slots are
> reused so this is scheduled after those fixes. I think fixing
> the races are not simpler if reuse is disallowed and reuse has
> other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
> removed from the middle of the slotinfo list. The value does
> not have to be correct: incorrect true value causes the next
> modid query to do a slotinfo walk, incorrect false will leave
> gaps and new entries are added at the end.
> 
> Fixes bug 27135.

LGTM, thanks.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  elf/dl-close.c |  6 +++++-
>  elf/dl-open.c  | 10 ----------
>  elf/dl-tls.c   |  5 +----
>  3 files changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/elf/dl-close.c b/elf/dl-close.c
> index 3720e47dd1..9f31532f41 100644
> --- a/elf/dl-close.c
> +++ b/elf/dl-close.c
> @@ -88,7 +88,11 @@ remove_slotinfo (size_t idx, struct dtv_slotinfo_list *listp, size_t disp,
>        /* If this is not the last currently used entry no need to look
>  	 further.  */
>        if (idx != GL(dl_tls_max_dtv_idx))
> -	return true;
> +	{
> +	  /* There is an unused dtv entry in the middle.  */
> +	  GL(dl_tls_dtv_gaps) = true;
> +	  return true;
> +	}
>      }
>  
>    while (idx - disp > (disp == 0 ? 1 + GL(dl_tls_static_nelem) : 0))

Ok.

> diff --git a/elf/dl-open.c b/elf/dl-open.c
> index 83b8e96a5c..661f26977e 100644
> --- a/elf/dl-open.c
> +++ b/elf/dl-open.c
> @@ -890,16 +890,6 @@ no more namespaces available for dlmopen()"));
>  	 state if relocation failed, for example.  */
>        if (args.map)
>  	{
> -	  /* Maybe some of the modules which were loaded use TLS.
> -	     Since it will be removed in the following _dl_close call
> -	     we have to mark the dtv array as having gaps to fill the
> -	     holes.  This is a pessimistic assumption which won't hurt
> -	     if not true.  There is no need to do this when we are
> -	     loading the auditing DSOs since TLS has not yet been set
> -	     up.  */
> -	  if ((mode & __RTLD_AUDIT) == 0)
> -	    GL(dl_tls_dtv_gaps) = true;
> -
>  	  _dl_close_worker (args.map, true);
>  
>  	  /* All l_nodelete_pending objects should have been deleted

Ok.

> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index c4466bd9fc..b0257185e9 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -187,10 +187,7 @@ _dl_next_tls_modid (void)
>  size_t
>  _dl_count_modids (void)
>  {
> -  /* It is rare that we have gaps; see elf/dl-open.c (_dl_open) where
> -     we fail to load a module and unload it leaving a gap.  If we don't
> -     have gaps then the number of modids is the current maximum so
> -     return that.  */
> +  /* The count is the max unless dlclose or failed dlopen created gaps.  */
>    if (__glibc_likely (!GL(dl_tls_dtv_gaps)))
>      return GL(dl_tls_max_dtv_idx);
>  
> 

Ok.
  
Florian Weimer June 24, 2021, 9:48 a.m. UTC | #2
* Szabolcs Nagy via Libc-alpha:

> For some reason only dlopen failure caused dtv gaps to be reused.
>
> It is possible that the intent was to never reuse modids for a
> different module, but after dlopen failure all gaps are reused
> not just the ones caused by the unfinished dlopened.
>
> So the code has to handle reused modids already which seems to
> work, however the data races at thread creation and tls access
> (see bug 19329 and bug 27111) may be more severe if slots are
> reused so this is scheduled after those fixes. I think fixing
> the races are not simpler if reuse is disallowed and reuse has
> other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
> removed from the middle of the slotinfo list. The value does
> not have to be correct: incorrect true value causes the next
> modid query to do a slotinfo walk, incorrect false will leave
> gaps and new entries are added at the end.
>
> Fixes bug 27135.
> ---
>  elf/dl-close.c |  6 +++++-
>  elf/dl-open.c  | 10 ----------
>  elf/dl-tls.c   |  5 +----
>  3 files changed, 6 insertions(+), 15 deletions(-)

Apparently this broke GNOME Shell:

  <https://bugzilla.redhat.com/show_bug.cgi?id=1974970>

I'm trying to figure out why.

Thanks,
Florian
  
Florian Weimer June 24, 2021, 12:27 p.m. UTC | #3
* Florian Weimer:

> * Szabolcs Nagy via Libc-alpha:
>
>> For some reason only dlopen failure caused dtv gaps to be reused.
>>
>> It is possible that the intent was to never reuse modids for a
>> different module, but after dlopen failure all gaps are reused
>> not just the ones caused by the unfinished dlopened.
>>
>> So the code has to handle reused modids already which seems to
>> work, however the data races at thread creation and tls access
>> (see bug 19329 and bug 27111) may be more severe if slots are
>> reused so this is scheduled after those fixes. I think fixing
>> the races are not simpler if reuse is disallowed and reuse has
>> other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
>> removed from the middle of the slotinfo list. The value does
>> not have to be correct: incorrect true value causes the next
>> modid query to do a slotinfo walk, incorrect false will leave
>> gaps and new entries are added at the end.
>>
>> Fixes bug 27135.
>> ---
>>  elf/dl-close.c |  6 +++++-
>>  elf/dl-open.c  | 10 ----------
>>  elf/dl-tls.c   |  5 +----
>>  3 files changed, 6 insertions(+), 15 deletions(-)
>
> Apparently this broke GNOME Shell:
>
>   <https://bugzilla.redhat.com/show_bug.cgi?id=1974970>
>
> I'm trying to figure out why.

The bug is that if there is a gap, _dl_next_tls_modid does not update
the slotinfo list to mark the modid to be returned as reserved, so
multiple calls in a single dlopen operation keep returning the same
modid.

I'm not yet sure what the proper fix is for that.
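
The failure mode described above can be shown with a small toy model.
This is only a sketch under assumptions: struct toy_slots,
toy_find_gap and toy_commit are hypothetical names, not glibc
functions, and the array is a stand-in for the slotinfo list. It shows
a lookup that does not record its result and bookkeeping that happens
later; it does not propose a fix.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 8

/* Hypothetical stand-in for the slotinfo list: owner[i] names the
   module that owns modid i, or is NULL when the slot is free.  */
struct toy_slots
{
  const char *owner[NSLOTS];
};

/* Lookup only: return the first free modid (starting at 1), or 0 if
   there is none.  The result is not recorded anywhere.  */
size_t
toy_find_gap (const struct toy_slots *s)
{
  for (size_t i = 1; i < NSLOTS; i++)
    if (s->owner[i] == NULL)
      return i;
  return 0;
}

/* Deferred bookkeeping: record the owner of a modid.  */
void
toy_commit (struct toy_slots *s, size_t modid, const char *name)
{
  assert (modid > 0 && modid < NSLOTS);
  s->owner[modid] = name;
}
```

Calling toy_find_gap twice before any toy_commit returns the same
modid both times, which is the pattern described for several modules
loaded by one dlopen.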

Thanks,
Florian
  
Adhemerval Zanella June 24, 2021, 12:57 p.m. UTC | #4
On 24/06/2021 09:27, Florian Weimer via Libc-alpha wrote:
> * Florian Weimer:
> 
>> * Szabolcs Nagy via Libc-alpha:
>>
>>> For some reason only dlopen failure caused dtv gaps to be reused.
>>>
>>> It is possible that the intent was to never reuse modids for a
>>> different module, but after dlopen failure all gaps are reused
>>> not just the ones caused by the unfinished dlopened.
>>>
>>> So the code has to handle reused modids already which seems to
>>> work, however the data races at thread creation and tls access
>>> (see bug 19329 and bug 27111) may be more severe if slots are
>>> reused so this is scheduled after those fixes. I think fixing
>>> the races are not simpler if reuse is disallowed and reuse has
>>> other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
>>> removed from the middle of the slotinfo list. The value does
>>> not have to be correct: incorrect true value causes the next
>>> modid query to do a slotinfo walk, incorrect false will leave
>>> gaps and new entries are added at the end.
>>>
>>> Fixes bug 27135.
>>> ---
>>>  elf/dl-close.c |  6 +++++-
>>>  elf/dl-open.c  | 10 ----------
>>>  elf/dl-tls.c   |  5 +----
>>>  3 files changed, 6 insertions(+), 15 deletions(-)
>>
>> Apparently this broke GNOME Shell:
>>
>>   <https://bugzilla.redhat.com/show_bug.cgi?id=1974970>
>>
>> I'm trying to figure out why.
> 
> The bug is that if there is a gap, _dl_next_tls_modid does not update
> the slotinfo list to mark the modid to be returned as reserved, so
> multiple calls in a single dlopen operation keep returning the same
> modid.
> 
> I'm not yet sure what the proper fix is for that.

How hard would it be to create a test case for this?
  
Florian Weimer June 24, 2021, 2:20 p.m. UTC | #5
* Adhemerval Zanella via Libc-alpha:

> On 24/06/2021 09:27, Florian Weimer via Libc-alpha wrote:
>> * Florian Weimer:
>> 
>>> * Szabolcs Nagy via Libc-alpha:
>>>
>>>> For some reason only dlopen failure caused dtv gaps to be reused.
>>>>
>>>> It is possible that the intent was to never reuse modids for a
>>>> different module, but after dlopen failure all gaps are reused
>>>> not just the ones caused by the unfinished dlopened.
>>>>
>>>> So the code has to handle reused modids already which seems to
>>>> work, however the data races at thread creation and tls access
>>>> (see bug 19329 and bug 27111) may be more severe if slots are
>>>> reused so this is scheduled after those fixes. I think fixing
>>>> the races are not simpler if reuse is disallowed and reuse has
>>>> other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
>>>> removed from the middle of the slotinfo list. The value does
>>>> not have to be correct: incorrect true value causes the next
>>>> modid query to do a slotinfo walk, incorrect false will leave
>>>> gaps and new entries are added at the end.
>>>>
>>>> Fixes bug 27135.
>>>> ---
>>>>  elf/dl-close.c |  6 +++++-
>>>>  elf/dl-open.c  | 10 ----------
>>>>  elf/dl-tls.c   |  5 +----
>>>>  3 files changed, 6 insertions(+), 15 deletions(-)
>>>
>>> Apparently this broke GNOME Shell:
>>>
>>>   <https://bugzilla.redhat.com/show_bug.cgi?id=1974970>
>>>
>>> I'm trying to figure out why.
>> 
>> The bug is that if there is a gap, _dl_next_tls_modid does not update
>> the slotinfo list to mark the modid to be returned as reserved, so
>> multiple calls in a single dlopen operation keep returning the same
>> modid.
>> 
>> I'm not yet sure what the proper fix is for that.
>
> How hard would be to create a testcase for this?

Not particularly hard, I think.

We need six modules (mod1 to mod6), all using dynamic TLS with different
symbols (sym1 to sym6).  mod4 depends on mod5 and mod6; there are no
other dependencies.

dlopen mod1
dlopen mod2
dlopen mod3
dlclose mod2 # create modid gap
dlclose mod1 # more modid gap
dlopen mod4

Then check that all six TLS variables have different addresses.  If the
bug is present, sym4 to sym6 should all have the same address because
the modid is the same.
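
The expected outcome of that sequence can be sketched with a toy model
before writing the real test. Everything here is an assumption made
for illustration: the toy_* names are hypothetical, the table always
scans from modid 1 (it ignores the gaps flag and the maximum index),
and the reserve switch is just a way to compare a non-reserving lookup
with a reserving one. It is not the glibc fix. Since mod1 and mod2 are
closed by then, the sketch only compares the modids of the modules
still loaded at the end (mod3 to mod6).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 16

/* Toy slot table: used[i] is true when modid i is recorded as taken.
   Modids start at 1.  */
struct toy_table
{
  bool used[NSLOTS];
};

/* Return the lowest free modid.  When RESERVE is true the slot is
   marked immediately; when false the caller records it later.  */
size_t
toy_next_modid (struct toy_table *t, bool reserve)
{
  for (size_t i = 1; i < NSLOTS; i++)
    if (!t->used[i])
      {
        if (reserve)
          t->used[i] = true;
        return i;
      }
  assert (0);   /* Toy table full.  */
  return 0;
}

/* Model one dlopen that loads N modules: all modids are queried first,
   and the table is updated only after every module has its modid.  */
void
toy_dlopen (struct toy_table *t, size_t n, size_t *ids, bool reserve)
{
  for (size_t i = 0; i < n; i++)
    ids[i] = toy_next_modid (t, reserve);
  for (size_t i = 0; i < n; i++)
    t->used[ids[i]] = true;
}

void
toy_dlclose (struct toy_table *t, size_t id)
{
  t->used[id] = false;
}

/* Run the sequence from the message; IDS receives the modids of
   mod1..mod6.  Return true if mod3..mod6 have pairwise different
   modids.  */
bool
toy_scenario (bool reserve, size_t ids[6])
{
  struct toy_table t = { { false } };
  toy_dlopen (&t, 1, &ids[0], reserve);   /* dlopen mod1 */
  toy_dlopen (&t, 1, &ids[1], reserve);   /* dlopen mod2 */
  toy_dlopen (&t, 1, &ids[2], reserve);   /* dlopen mod3 */
  toy_dlclose (&t, ids[1]);               /* dlclose mod2 */
  toy_dlclose (&t, ids[0]);               /* dlclose mod1 */
  toy_dlopen (&t, 3, &ids[3], reserve);   /* dlopen mod4, mod5, mod6 */

  for (size_t i = 2; i < 6; i++)
    for (size_t j = i + 1; j < 6; j++)
      if (ids[i] == ids[j])
        return false;
  return true;
}
```

With reserve set to false the last three modids in the toy collide,
matching the symptom described above; with reserve set to true they
differ. A real test would replace the table with actual dlopen and
dlclose calls and compare TLS variable addresses instead of modids.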

I have not written a test yet and won't get to it today.

Thanks,
Florian
  
Szabolcs Nagy June 24, 2021, 6:58 p.m. UTC | #6
The 06/24/2021 14:27, Florian Weimer wrote:
> * Florian Weimer:
> 
> > * Szabolcs Nagy via Libc-alpha:
> >
> >> For some reason only dlopen failure caused dtv gaps to be reused.
> >>
> >> It is possible that the intent was to never reuse modids for a
> >> different module, but after dlopen failure all gaps are reused
> >> not just the ones caused by the unfinished dlopened.
> >>
> >> So the code has to handle reused modids already which seems to
> >> work, however the data races at thread creation and tls access
> >> (see bug 19329 and bug 27111) may be more severe if slots are
> >> reused so this is scheduled after those fixes. I think fixing
> >> the races are not simpler if reuse is disallowed and reuse has
> >> other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are
> >> removed from the middle of the slotinfo list. The value does
> >> not have to be correct: incorrect true value causes the next
> >> modid query to do a slotinfo walk, incorrect false will leave
> >> gaps and new entries are added at the end.
> >>
> >> Fixes bug 27135.
> >> ---
> >>  elf/dl-close.c |  6 +++++-
> >>  elf/dl-open.c  | 10 ----------
> >>  elf/dl-tls.c   |  5 +----
> >>  3 files changed, 6 insertions(+), 15 deletions(-)
> >
> > Apparently this broke GNOME Shell:
> >
> >   <https://bugzilla.redhat.com/show_bug.cgi?id=1974970>
> >
> > I'm trying to figure out why.
> 
> The bug is that if there is a gap, _dl_next_tls_modid does not update
> the slotinfo list to mark the modid to be returned as reserved, so
> multiple calls in a single dlopen operation keep returning the same
> modid.
> 
> I'm not yet sure what the proper fix is for that.

this patch is not critical for the other tls issues i fixed
for 2.34, so it should be safe to revert.

this might have been the reason why gap reuse was not enabled.

but if _dl_next_tls_modid is broken then likely a failed
dlopen can trigger that on old glibc too, so a fix would be
nice.

thanks for debugging this.
  

Patch

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 3720e47dd1..9f31532f41 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -88,7 +88,11 @@  remove_slotinfo (size_t idx, struct dtv_slotinfo_list *listp, size_t disp,
       /* If this is not the last currently used entry no need to look
 	 further.  */
       if (idx != GL(dl_tls_max_dtv_idx))
-	return true;
+	{
+	  /* There is an unused dtv entry in the middle.  */
+	  GL(dl_tls_dtv_gaps) = true;
+	  return true;
+	}
     }
 
   while (idx - disp > (disp == 0 ? 1 + GL(dl_tls_static_nelem) : 0))
diff --git a/elf/dl-open.c b/elf/dl-open.c
index 83b8e96a5c..661f26977e 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -890,16 +890,6 @@  no more namespaces available for dlmopen()"));
 	 state if relocation failed, for example.  */
       if (args.map)
 	{
-	  /* Maybe some of the modules which were loaded use TLS.
-	     Since it will be removed in the following _dl_close call
-	     we have to mark the dtv array as having gaps to fill the
-	     holes.  This is a pessimistic assumption which won't hurt
-	     if not true.  There is no need to do this when we are
-	     loading the auditing DSOs since TLS has not yet been set
-	     up.  */
-	  if ((mode & __RTLD_AUDIT) == 0)
-	    GL(dl_tls_dtv_gaps) = true;
-
 	  _dl_close_worker (args.map, true);
 
 	  /* All l_nodelete_pending objects should have been deleted
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index c4466bd9fc..b0257185e9 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -187,10 +187,7 @@  _dl_next_tls_modid (void)
 size_t
 _dl_count_modids (void)
 {
-  /* It is rare that we have gaps; see elf/dl-open.c (_dl_open) where
-     we fail to load a module and unload it leaving a gap.  If we don't
-     have gaps then the number of modids is the current maximum so
-     return that.  */
+  /* The count is the max unless dlclose or failed dlopen created gaps.  */
   if (__glibc_likely (!GL(dl_tls_dtv_gaps)))
     return GL(dl_tls_max_dtv_idx);