[BZ#17090/17620/17621] : fix DTV race, assert, and DTV_SURPLUS Static TLS limit

Message ID orzimyknzx.fsf@livre.home
State New, archived

Commit Message

Alexandre Oliva Sept. 24, 2016, 2:52 a.m. UTC
  On Sep 23, 2016, Florian Weimer <fw@deneb.enyo.de> wrote:

> * Alexandre Oliva:
>> On Sep 15, 2016, Andreas Schwab <schwab@suse.de> wrote:
>>> This breaks TLS in static programs on armv7, see BZ #19826.

>> Do we want the fix for BZ#19826 backported to older branches?

> That would make sense, yes.

Here's a backport of the patch.  It applies cleanly on 2.24, 2.23, and
2.22 (the first release affected by the regression), and I tested it by
applying it on top of the tag of each of these releases.

Would it make sense to install it in all 3 release branches?


Attentive readers will notice that the patch removes a comment that
alludes to a race, and brings back the code that had been replaced by
the comment.  When I posted the patch for the trunk, I restored the DTV
initializers to all 3 functions, although only one of them actually
needed them.  I'm afraid my sense of aesthetics overruled my sense of
thread-safety, because I couldn't figure out what the race was,
investigated the wrong potential race and got convinced it wouldn't hit.
But now, just before posting this message, I see I was mistaken: the
patch reintroduced a race on the trunk, which I'll proceed to fix.


The race amounts to the following scenario: one thread loads a module
that has a TLS block that happens to be assigned to static TLS, getting
nptl/allocatestack.c's __pthread_init_static_tls called through
GL(dl_init_static_tls) to initialize the TLS block for the module in all
threads' static TLS blocks.

__pthread_init_static_tls calls init_one_static_tls for each active
thread, initializing the DTV entry and copying the TLS initialization
block to the static TLS block of each thread.

IIRC the race arises when one thread decides to resize its DTV while the
initializer above is modifying it.  (if it decides to update it without
resizing, we also have a race, but both threads end up writing the same
final value, which AFAIK is not so much of a problem)

The end result is that the initializer may write to stale/freed memory,
that may even have already been reassigned to some other purpose.
That's very bad indeed.


So, although I post the entire backported patch, because that's what I
tested, and the comment I refer to above is in a part of the patch that,
if applied, would restore the described race, I immediately withdraw it,
and propose instead to revert the changes to elf/dl-reloc.c and
nptl/allocatestack.c that I'd installed earlier this week to the master
branch.  They're not necessary for use with the regular __tls_get_addr,
and they won't affect the alternate __tls_get_addr used in static
programs, because the Static TLS block for the main program (the only
module it cares about) and its DTV entry are always set up by
__libc_setup_tls (main thread) or _dl_allocate_tls_init (other threads):
it will never be the case that we will dlopen the main program after the
program starts and have to initialize its static TLS blocks (and DTV
entry) for all running threads, because we never dlopen the main
program.

For the master branch, I'll post a separate patch momentarily, reverting
the incorrect portion of the recent change, and expanding the comments
about the race condition, so that neither myself nor anyone else ever
mistakes another (harmless?) race for that one: that one thread
initializes other threads' TLS blocks while holding the rtld lock, but
later on the other threads access the initialized TLS blocks without
taking any locks or any form of memory synchronization that would
clearly establish, memory-wise, the implied happens-before.


For the earlier branches, I propose we install only the elf/dl-tls.c
change from the patch below; the elf/dl-reloc.c change could stay, but
I'm now more inclined to take it out, for uniformity.

I'll retest and repost the patches for master and branches early next
week, if not over the weekend.


Andreas, could you please confirm that you tested the dl-tls.c change by
itself on ARM, and that it was enough to fix the problem on its own?
Thanks,


WARNING: the patch below is broken, do not use!

[PR19826] fix non-LE TLS in static programs

An earlier fix for TLS dropped early initialization of DTV entries for
modules using static TLS, leaving it for __tls_get_addr to set them
up.  That worked on platforms that require the GD access model to be
relaxed to LE in the main executable, but it caused a regression on
platforms that allow GD in the main executable, particularly in
statically-linked programs: they use a custom __tls_get_addr that does
not update the DTV, which fails when the DTV early initialization is
not performed.

In static programs, __libc_setup_tls performs the DTV initialization
for the main thread, but the DTV of other threads is set up in
_dl_allocate_tls_init, so that's the fix that matters.

Restoring the initialization in the remaining functions modified by
this patch was just for uniformity.  It's not clear that it is ever
needed: even on platforms that allow GD in the main executable, the
dynamically-linked version of __tls_get_addr would set up the DTV
entries, even for static TLS modules, while updating the DTV counter.

for ChangeLog

	[BZ #19826]
	* elf/dl-tls.c (_dl_allocate_tls_init): Restore DTV early
	initialization of static TLS entries.
	* elf/dl-reloc.c (_dl_nothread_init_static_tls): Likewise.
	* nptl/allocatestack.c (init_one_static_tls): Likewise.
---
 elf/dl-reloc.c       |    6 ++++++
 elf/dl-tls.c         |    5 +++++
 nptl/allocatestack.c |    9 ++++++---
 3 files changed, 17 insertions(+), 3 deletions(-)
  

Comments

Torvald Riegel Sept. 24, 2016, 10:52 a.m. UTC | #1
On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
> (if it decides to update it without
> resizing, we also have a race, but both threads end up writing the same
> final value, which AFAIK is not so much of a problem)

Can this still happen after your patches?  If so, then this needs to be
fixed.  Data races are a problem.  You must use atomics to actually
make writes of the same value, concurrently by more than one thread,
well-defined.

Also, for code comments at least, we need to be less ambiguous regarding
our use of the word "race".  I think we should avoid it unless we
specifically mean a data race (as defined by C11), or something that's
bad and needs to be fixed.  If we want to say that something is not
ordered by happens-before, we should say exactly that.  This allows us
to see the term "race" as a red flag.
If we need to use atomics to avoid data races, they will not be ordered
by happens-before.  But that's okay, because that's in the nature of
cases where we have to synchronize.

The final state that we should head towards is code that is either
data-race-free or uses proper, well-documented synchronization (e.g.,
atomics).

> The end result is that the initializer may write to stale/freed memory,
> that may even have already been reassigned to some other purpose.
> That's very bad indeed.
> 
> 
> So, although I post the entire backported patch, because that's what I
> tested, and the comment I refer to above is in a part of the patch that,
> if applied, would restore the described race, I immediately withdraw it,
> and propose instead to revert the changes to elf/dl-reloc.c and
> nptl/allocatestack.c that I'd installed earlier this week to the master
> branch.  They're not necessary for use with the regular __tls_get_addr,
> and they won't affect the alternate __tls_get_addr used in static
> programs, because the Static TLS block for the main program (the only
> module it cares about) and its DTV entry are always set up by
> __libc_setup_tls (main thread) or _dl_allocate_tls_init (other threads):
> it will never be the case that we will dlopen the main program after the
> program starts and have to initialize its static TLS blocks (and DTV
> entry) for all running threads, because we never dlopen the main
> program.

ISTM that all this should become a code comment, and not just be in an
email on the list.

> For the master branch, I'll post a separate patch momentarily, reverting
> the incorrect portion of the recent change, and expanding the comments
> about the race condition, so that neither myself nor anyone else ever
> mistakes another (harmless?) race for that one: that one thread
> initializes other threads' TLS blocks while holding the rtld lock, but
> later on the other threads access the initialized TLS blocks without
> taking any locks or any form of memory synchronization that would
> clearly establish, memory-wise, the implied happens-before.

IIUC, that's where the assumption about dlopen vs. usage comes in that
you included in the other patch.  If so, this is not a data race (nor a
race condition).  Or are you referring to something else?

BTW, there are no harmless data races ;)

> I'll retest and repost the patches for master and branches early next
> week, if not over the weekend.

Please also remember that with the tools we have, it's pretty unlikely
that we can test the absence of concurrency bugs.  Therefore, it is very
important that our reasoning about concurrency is thorough.  And we need
to do that collectively, which means:
* Shared terminology
* Proper, detailed documentation
Otherwise, we can't communicate properly about it, which means we can't
review it, and that we're not decreasing bus factors.
  
Alexandre Oliva Sept. 25, 2016, 10:09 p.m. UTC | #2
On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
>> (if it decides to update it without
>> resizing, we also have a race, but both threads end up writing the same
>> final value, which AFAIK is not so much of a problem)

> Can this still happen after your patches?

After removing (again) one of the possibly-concurrent writes, I don't
see how it could still happen, since there's only one writer to a
thread's DTV at a time: at thread creation there's one, and while the
thread is active there's another.  But that's not the result of a
thorough global analysis from scratch, it's just my intuition based on
my limited understanding of the whole TLS architecture.  If you want a
thorough analysis, I'm afraid you'll have to resort to someone else,
probably someone who speaks the same language as you, and that reasons
at the same abstraction layer as you.  I'm not that person, as our many
attempts at conversations have shown.

> ISTM that all this should become a code comment, and not just be in an
> email on the list.

It's there in the patch.

> IIUC, that's where the assumption about dlopen vs. usage comes in that
> you included in the other patch.

Yeah, that too.

> If so, this is not a data race (nor a race condition).

It is a potential data race, yes, but I'll sustain it is probably
harmless, although I know you'll take that as heresy ;-)  Please look
for the details in the other thread.
  
Torvald Riegel Sept. 29, 2016, 11:27 a.m. UTC | #3
On Sun, 2016-09-25 at 19:09 -0300, Alexandre Oliva wrote:
> On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
> >> (if it decides to update it without
> >> resizing, we also have a race, but both threads end up writing the same
> >> final value, which AFAIK is not so much of a problem)
> 
> > Can this still happen after your patches?
> 
> After removing (again) one of the possibly-concurrent writes, I don't
> see how it could still happen, since there's only one writer to a
> thread's DTV at a time: at thread creation there's one, and while the
> thread is active there's another.  But that's not the result of a
> thorough global analysis from scratch, it's just my intuition based on
> my limited understanding of the whole TLS architecture.  If you want a
> thorough analysis, I'm afraid you'll have to resort to someone else,
> probably someone who speaks the same language as you, and that reasons
> at the same abstraction layer as you.  I'm not that person, as our many
> attempts at conversations have shown.
> 
> > ISTM that all this should become a code comment, and not just be in an
> > email on the list.
> 
> It's there in the patch.

Apparently, I thought it wasn't.

At the very least please include this sentence of yours (see above) in
the code comments:
"But that's not the result of a thorough global analysis from scratch,
it's just my intuition based on my limited understanding of the whole
TLS architecture."

What I want to avoid is that somebody else tries to work on the code and
believes in your comment as-is, when you already know that you haven't
checked this completely.

> > IIUC, that's where the assumption about dlopen vs. usage comes in that
> > you included in the other patch.
> 
> Yeah, that too.
> 
> > If so, this is not a data race (nor a race condition).
> 
> It is a potential data race, yes,

I'm saying that when making the assumption, which is essentially a
precondition we (should) specify, this isn't a data race.  If violating
the precondition, then that's already undefined behavior, but a fault of
the caller.
  
Alexandre Oliva Sept. 29, 2016, 10:07 p.m. UTC | #4
On Sep 29, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> On Sun, 2016-09-25 at 19:09 -0300, Alexandre Oliva wrote:
>> On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:
>> 
>> > On Fri, 2016-09-23 at 23:52 -0300, Alexandre Oliva wrote:
>> >> (if it decides to update it without
>> >> resizing, we also have a race, but both threads end up writing the same
>> >> final value, which AFAIK is not so much of a problem)
>> 
>> > Can this still happen after your patches?
>> 
>> After removing (again) one of the possibly-concurrent writes, I don't
>> see how it could still happen, since there's only one writer to a
>> thread's DTV at a time: at thread creation there's one, and while the
>> thread is active there's another.  But that's not the result of a
>> thorough global analysis from scratch, it's just my intuition based on
>> my limited understanding of the whole TLS architecture.  If you want a
>> thorough analysis, I'm afraid you'll have to resort to someone else,
>> probably someone who speaks the same language as you, and that reasons
>> at the same abstraction layer as you.  I'm not that person, as our many
>> attempts at conversations have shown.
>> 
>> > ISTM that all this should become a code comment, and not just be in an
>> > email on the list.
>> 
>> It's there in the patch.

> Apparently, I thought it wasn't.

> At the very least please include this sentence of yours (see above) in
> the code comments:
> "But that's not the result of a thorough global analysis from scratch,
> it's just my intuition based on my limited understanding of the whole
> TLS architecture."

> What I want to avoid is that somebody else tries to work on the code and
> believes in your comment as-is, when you already know that you haven't
> checked this completely.

I checked thoroughly what I wrote in the comment, namely, that there'd
be a DTV race if we did what the removed code used to do before.

What I meant is that I cannot affirm that there aren't any other DTV- or
TLS-related races, before or after the fix, because I haven't ever done
a thorough review of the TLS implementation, and I surely haven't done
one for purposes of posting the patch.

I just know enough about the code in general, and about concurrent
programming in general, to know that if I remove that code (which was
unnecessary, and which caused the race I had detected a year or so ago),
that specific race will be gone, and the remaining code will work the
way it did before the patch I installed last week.

What I know about these topics would allow me to make numerous other
claims, of course.  But what would the point be of my spelling any of
them out?  You wouldn't believe them anyway!  Even if I said I'd made a
thorough analysis, would you trust it?


>> > IIUC, that's where the assumption about dlopen vs. usage comes in that
>> > you included in the other patch.
>> 
>> Yeah, that too.
>> 
>> > If so, this is not a data race (nor a race condition).
>> 
>> It is a potential data race, yes,

> I'm saying that when making the assumption, which is essentially a
> precondition we (should) specify, this isn't a data race.  If violating
> the precondition, then that's already undefined behavior, but a fault of
> the caller.

Are you saying it ceases to be a data race just because some earlier
precondition failed to be observed?  I don't think I agree with that.
But I don't think I'm interested in being part of such a discussion with
you either.
  
Torvald Riegel Sept. 30, 2016, 9:22 a.m. UTC | #5
On Thu, 2016-09-29 at 19:07 -0300, Alexandre Oliva wrote:
> What I know about these topics would allow me to make numerous other
> claims, of course.  But what would the point be of my spelling any of
> them out?  You wouldn't believe them anyway!  Even if I said I'd made a
> thorough analysis, would you trust it?

What I asked for is that you make it clear if you are not confident in a
comment you add.  Of course, I'd trust such a comment less than one that
you are confident in.

> 
> >> > IIUC, that's where the assumption about dlopen vs. usage comes in that
> >> > you included in the other patch.
> >> 
> >> Yeah, that too.
> >> 
> >> > If so, this is not a data race (nor a race condition).
> >> 
> >> It is a potential data race, yes,
> 
> > I'm saying that when making the assumption, which is essentially a
> > precondition we (should) specify, this isn't a data race.  If violating
> > the precondition, then that's already undefined behavior, but a fault of
> > the caller.
> 
> Are you saying it ceases to be a data race just because some earlier
> precondition failed to be observed?  I don't think I agree with that.

I would not classify it as a data race in the code because we make
statements about the code based on what it's supposed to do.  If the
caller violates the preconditions, everything can happen.
Nonetheless, when speculating about what would happen if the caller
were allowed to call in a way that violates the precondition, we would
say that there's a data race there.  However, speculating in code
comments about behavior that violates preconditions is just confusing
IMO, unless it's explicitly stated that this is essentially a thought
experiment.
  

Patch

diff --git a/elf/dl-reloc.c b/elf/dl-reloc.c
index 42bddc1..64e3aea 100644
--- a/elf/dl-reloc.c
+++ b/elf/dl-reloc.c
@@ -137,6 +137,12 @@  _dl_nothread_init_static_tls (struct link_map *map)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
 
+  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
+  dtv_t *dtv = THREAD_DTV ();
+  assert (map->l_tls_modid <= dtv[-1].counter);
+  dtv[map->l_tls_modid].pointer.is_static = true;
+  dtv[map->l_tls_modid].pointer.val = dest;
+
   /* Initialize the memory.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..0add960 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -511,6 +511,11 @@  _dl_allocate_tls_init (void *result)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
 
+	  /* Set up the DTV entry.  The simplified __tls_get_addr that
+	     some platforms use in static programs requires it.  */
+	  dtv[map->l_tls_modid].pointer.is_static = true;
+	  dtv[map->l_tls_modid].pointer.val = dest;
+
 	  /* Copy the initialization image and clear the BSS part.  */
 	  memset (__mempcpy (dest, map->l_tls_initimage,
 			     map->l_tls_initimage_size), '\0',
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 6b42b11..99002b2 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -1209,9 +1209,12 @@  init_one_static_tls (struct pthread *curp, struct link_map *map)
 #  error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 # endif
 
-  /* We cannot delay the initialization of the Static TLS area, since
-     it can be accessed with LE or IE, but since the DTV is only used
-     by GD and LD, we can delay its update to avoid a race.  */
+  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
+  dtv_t *dtv = GET_DTV (TLS_TPADJ (curp));
+  dtv[map->l_tls_modid].pointer.is_static = true;
+  dtv[map->l_tls_modid].pointer.val = dest;
+
+  /* Initialize the memory.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
 }