[PR19826] fix non-LE TLS in static programs

Message ID orvaxmklme.fsf@livre.home

Commit Message

Alexandre Oliva Sept. 24, 2016, 3:44 a.m. UTC
  On Sep 21, 2016, Andreas Schwab <schwab@suse.de> wrote:

> On Sep 21 2016, Alexandre Oliva <aoliva@redhat.com> wrote:
>> [BZ #19826]
>> * elf/dl-tls.c (_dl_allocate_tls_init): Restore DTV early
>> initialization of static TLS entries.
>> * elf/dl-reloc.c (_dl_nothread_init_static_tls): Likewise.
>> * nptl/allocatestack.c (init_one_static_tls): Likewise.

> Ok.

No, it's not!  :-)

:-(


We'll need something along these lines, to fix the above, and to avoid
making the same mistake again.

I've yet to test it myself, but I understand you'd already tested on ARM
the combination of the previous patch and this reversal, minus the
comments added below, i.e., the elf/dl-tls.c changes above.  Is that so?

Once I'm done with the testing, is this ok to install on top of the
previous patch?


[PR19826] revert needlessly reintroduced race, explain it

Along with the proper fix for the problem, a couple of unnecessary DTV
initializations were reintroduced, but they weren't just unnecessary:
one of them introduced a race condition that was alluded to in the
comments.

In this patch, I add a long comment explaining why we can't modify
another thread's DTV while initializing its static TLS block during
dlopen.

I also revert the unnecessary parts of the previous change, removing
the DTV-initializing code from the above, to fix the race, and, for
uniformity, also from the corresponding code that runs in
single-threaded programs, which was harmless as far as I can tell.
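To make the ownership rule concrete, here is an illustrative sketch (not glibc code; all names are made up) of a miniature DTV with the discipline the patch restores: the array is only ever resized and written by its owning thread, so a resize can never race with a write from dlopen in another thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only -- not glibc code.  A thread's DTV is a
   heap-allocated array that the owning thread may realloc() when it
   grows.  If dlopen, running in another thread, wrote into that array,
   it could race with the realloc() and scribble on freed memory.  The
   rule sketched here: only the owner writes its DTV; dlopen only
   initializes the static TLS block, and the DTV entry is filled in
   lazily by the owner on first access. */

struct mini_dtv {
  void **slots;
  size_t len;
};

/* Owner-only operation: grow the DTV.  Safe precisely because no other
   thread may hold a pointer into slots[]. */
void dtv_resize(struct mini_dtv *dtv, size_t newlen) {
  void **p = realloc(dtv->slots, newlen * sizeof *p);
  if (!p)
    abort();
  /* zero the newly added entries so they read as "not yet set" */
  memset(p + dtv->len, 0, (newlen - dtv->len) * sizeof *p);
  dtv->slots = p;
  dtv->len = newlen;
}

/* Owner-only operation: lazily fill in the entry for a module, the way
   a __tls_get_addr-style accessor would on first use. */
void *dtv_get(struct mini_dtv *dtv, size_t modid, void *tls_block) {
  if (modid >= dtv->len)
    dtv_resize(dtv, modid + 1);
  if (dtv->slots[modid] == NULL)
    dtv->slots[modid] = tls_block;   /* deferred initialization */
  return dtv->slots[modid];
}
```

Since every call above runs in the owning thread, there is no cross-thread write and therefore no race with the realloc().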


for ChangeLog

	[BZ #19826]
	* nptl/allocatestack.c (init_one_static_tls): Revert the
	previous change and explain the race condition it would
	introduce.
	* elf/dl-reloc.c (_dl_nothread_init_static_tls): Revert
	previous change for uniformity.
---
 elf/dl-reloc.c       |    6 ------
 nptl/allocatestack.c |   36 ++++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 12 deletions(-)
  

Comments

Torvald Riegel Sept. 24, 2016, 10:18 a.m. UTC | #1
On Sat, 2016-09-24 at 00:44 -0300, Alexandre Oliva wrote:
> In this patch, I add a long comment explaining why we can't modify
> another thread's DTV while initializing its static TLS block during
> dlopen.

I appreciate that you added a comment.  Given that you seemed to have
paged in all of this currently, it would be even better if you could try
to further improve the documentation of all of this.  (I suppose missing
the issue you fix here would have been less likely if proper
documentation existed before.)  Having the patch being reviewed by
someone who's not deeply familiar with the TLS implementation would be
good, as it shows whether this really makes the code more accessible
(and thus reducing the bus factor...).

Are you aware of any synchronization in the TLS stuff that is not based
on locks?  That is, where there are data races as defined by C11?
If so, it would be great if you could transform them to using the
new-style C11 atomics.  I've reviewed nscd concurrency recently and
converted it to use C11 atomics, and this process revealed many
synchronization bugs such as missing barriers.

> +     (**) If we were to write to other threads' DTV entries here, we
> +     might race with their DTV resizing.  We might then write to
> +     stale, possibly already free()d and reallocated memory.  Or we
> +     could write past the end of the still-active DTV.  So, don't do
> +     that!

It's okay to point out why data races are bad, but you probably don't
need it.  I think we have consensus that every data race is bad and
should be fixed (with very few exceptions, which then need lengthy
explanations).  The general understanding should be that there's no such
thing as a benign data race.

> +     Another race, not to be mistaken for the above, has to do with
> +     the TLS block initialized below.  Although we perform this
> +     initialization while holding the rtld lock, once dlopen
> +     completes, other threads might run code that accesses the
> +     variables in these TLS blocks, without taking the rtld lock or
> +     any other explicit memory acquire.

What is a "memory acquire"?

(If you should be assuming that acquire operations synchronize-with
every other release operation such as a lock release, then this is
wrong; remember that it is pairs of release/acquire actions that are
linked by a reads-from, so both release and acquire must target the same
memory location.)

> +     Our implementation appears to assume that, in order for the other
> +     threads to access those variables, there will be some form of
> +     synchronization between the end of dlopen() in one thread and the
> +     use of the just-loaded modules in others, otherwise the very use
> +     of the loaded code could be undefined.  */
> 

I'd reverse that: I find it misleading that you start by saying that
there is a race condition, when in fact this seems not to be the case
because of
the assumption.  Therefore, I'd mention the assumption first, and then
derive from that there is no data-race and that uses of the TLS will
find it initialized.
  
Alexandre Oliva Sept. 25, 2016, 10:18 p.m. UTC | #2
On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> On Sat, 2016-09-24 at 00:44 -0300, Alexandre Oliva wrote:
>> In this patch, I add a long comment explaining why we can't modify
>> another thread's DTV while initializing its static TLS block during
>> dlopen.

> I appreciate that you added a comment.  Given that you seemed to have
> paged in all of this currently, it would be even better if you could try
> to further improve the documentation of all of this.

As your feedback to the patch shows, it's pointless for me to even try.
My way of reasoning about this stuff doesn't seem to make sense to you.
So why bother?

> (I suppose missing the issue you fix here would have been less likely
> if proper documentation existed before.)

*nod*.  That's why I added the new, longer comments: the comments I'd
added before were not enough to stop even myself from reintroducing the
error, because there was a shallow, more obvious (apparent?  harmless?)
race, and a deeper, less obvious one that's very harmful when it hits.

> Are you aware of any synchronization in the TLS stuff that is not based
> on locks?

Yes, plenty.  Off the top of my head, there's static TLS block
initialization at dlopen vs use by other threads, there was DTV
initialization vs use and resize (fixed by my 1.5yo-patch, broken by my
patch from last week, and fixed again once the patch that reverts the
breakage goes in, because we only mess with a thread's DTV at its
start-up and then by the thread itself), but IIRC there are
self-concurrency issues if DTV resizing is interrupted by a signal that
uses GD and starts further resizing, and IIRC there is plenty of
suspicious stuff in slotinfo manipulation vs use (the data structure is
only modified while holding the rtld lock, but it's read while updating
a DTV without any explicit synchronization; I never looked into it
enough to tell what the assumptions were that make it safe at some
point, to tell whether there's any chance they still are); I have a
hunch that there are probably issues with TLS descriptors as well,
because although they're created while holding the rtld lock, their use
is not explicitly guarded either, nor are the GOT entries relocated
so as to point to them.

IMHO these are all internal implementations, so the C11 memory model for
applications doesn't quite apply to them: it's ok if they make stronger
or weaker assumptions than what the language is supposed to expose to
users of the language.  Just like the *implementation* of mutexes,
locks, condition variables and other such concepts won't necessarily
abide by the rules imposed on users of the language, as long as they
deliver the intended semantics, including synchronization and memory
model properties.

> If so, it would be great if you could transform them to using the
> new-style C11 atomics.

I guess it would, wouldn't it?  Too bad I'm probably not.  I just can't
stand even the thought of the never-ending conversations that would
ensue.  So, I'm out of here; the only reason you're even hearing from
me is that
I'm responsible enough to at least try to fix the breakage I introduced,
once it was brought to my attention (at a time in which I actually
noticed it; too bad I missed it the first time).

> I've reviewed nscd concurrency recently and converted it to use C11
> atomics, and this process revealed many synchronization bugs such as
> missing barriers.

That shouldn't be surprising.  A lot of that code was written before C11
was even in the foreseeable future.

>> +     Another race, not to be mistaken for the above, has to do with
>> +     the TLS block initialized below.  Although we perform this
>> +     initialization while holding the rtld lock, once dlopen
>> +     completes, other threads might run code that accesses the
>> +     variables in these TLS blocks, without taking the rtld lock or
>> +     any other explicit memory acquire.

> What is a "memory acquire"?

That's a term from long before C standardized atomics, from the 20 years
or so in which people like myself have been doing concurrent programming
on multi-CPU computers without explicit language support, and even on
single-CPU multi-tasked systems.  In dusted books from that age, you may
be able to find tons of information about such things as memory barriers
or fences, some with acquire and release semantics, as well as the
implied acquire and release of all memory that occurs at mutex acquire
and release time, respectively.

There are still some remnants from those days in e.g. Linux docs:
https://www.kernel.org/doc/Documentation/memory-barriers.txt

but a lot of what can be found on the Internet now seems to be more
recent than C and C++ atomics.  Still, the notion of memory acquire and
release from the ancient scrolls is still there, even if a bit in
disguise.  There's no reason for us to limit our language and throw away
all the old books just because the term was reused in a new, narrower
context, is there?

>> +     Our implementation appears to assume that, in order for the other
>> +     threads to access those variables, there will be some form of
>> +     synchronization between the end of dlopen() in one thread and the
>> +     use of the just-loaded modules in others, otherwise the very use
>> +     of the loaded code could be undefined.  */

> I'd reverse that: I find it misleading that you start by saying that
> there is a race condition, when in fact this seems not to be the case
> because of
> the assumption.  Therefore, I'd mention the assumption first, and then
> derive from that there is no data-race and that uses of the TLS will
> find it initialized.

Unless the assumption fails to meet the definition of data race in some
possibly irrelevant fashion.  The point is that the implementation
doesn't guard itself from certain races, if someone's determined to
create one, but I posit they don't matter.  Let me give you some
examples:

- there's the obvious case of the program with a race: the dlopen caller
sets a global non-atomic variable when dlopen completes, and another
thread busy-waits for the variable to change, and then uses its value to
call a function that accesses the newly-loaded TLS variable.  The race
on the global variable enables the race on the TLS variable, so
undefined behavior has already been invoked by the time the TLS
implementation incurs its own race.
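For contrast with that first, racy example, here is a hedged sketch (hypothetical names, not glibc code) of what the well-defined version of such a handoff looks like under C11: the "dlopen caller" publishes with a release store to an atomic flag, and the waiter spins on an acquire load, which creates the synchronizes-with edge that the plain non-atomic variable lacks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical sketch: a well-defined version of the busy-wait handoff.
   Spinning on a plain non-atomic flag, as in the example above, is a
   data race; an atomic flag with release/acquire ordering is not.  The
   acquire load that reads the release-stored value synchronizes-with
   the store, so the waiter also sees everything written before it. */

static _Atomic int ready;
static int payload;              /* stands in for the dlopen'd state */

static void *publisher(void *arg) {
  (void)arg;
  payload = 42;                                  /* "dlopen completes" */
  atomic_store_explicit(&ready, 1, memory_order_release);
  return NULL;
}

int wait_and_read(void) {
  pthread_t t;
  if (pthread_create(&t, NULL, publisher, NULL) != 0)
    return -1;
  /* still a busy-wait, but on an atomic with acquire ordering */
  while (!atomic_load_explicit(&ready, memory_order_acquire))
    ;
  pthread_join(t, NULL);
  return payload;       /* guaranteed to observe the publisher's write */
}
```

Compile with -pthread; the point of the sketch is only the release/acquire pairing on the same atomic object.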

- here's a variant that doesn't involve any other race: the dlopen
caller writes to a pipe a pointer to a function that accesses the TLS
variable, and another thread reads the pointer from the pipe and calls
it -> race?  there's clearly a happens-before order implied by the
write() and read(), but that doesn't imply memory synchronization in the
memory model AFAICT.

- another variant in which there's a clear happens-before order, but no
memory synchronization, so there's possibly a race: the dlopen caller
installs a signal handler that accesses the TLS variable, and then gets
the signal sent to the process itself; another thread is chosen to
handle the signal and accesses the TLS variable -> race?

- generalizing, use any out-of-(memory-)band information-carrying side
effect that one thread could use to communicate to another thread how to
call a function that accesses the TLS variable, both function and
variable brought in by dlopen, and you'll have, technically, a data
race, but IMO a harmless data race.
  
Andreas Schwab Sept. 26, 2016, 8:29 a.m. UTC | #3
If you have to modify it anyway, would you mind adding the testcase from
the bug?

Andreas.
  
Alexandre Oliva Sept. 26, 2016, 11:55 p.m. UTC | #4
On Sep 26, 2016, Andreas Schwab <schwab@suse.de> wrote:

> If you have to modify it anyway, would you mind adding the testcase from
> the bug?

The reason I didn't add the testcase was that I understood it was
redundant with another existing test.  Reviewing the earlier report, I
got the idea that the regression introduced by last year's patch had
already been detected by existing tests on the affected platforms, it
just so happened that I missed the report.

Ideally, we'd have a test that detects such a regression on x86_64 too,
but I don't see a way to do that that's not too fragile.  Like, we could
introduce a TLS variable in the main program (so that it's hopefully
assigned to the beginning of the TLS block), include the custom
implementation of tls_get_addr in the test, and call it (with module id
1 and offset 0) to ensure it returns the same address that LE computes
for the variable.

There might be issues with different calling conventions for
tls_get_addr, but if we introduced the custom function using a different
name and a standardized calling convention (even if different from the
one used by the TLS implementation), maybe we'd be able to catch such
regressions.  But we'd be taking too different a path from the normal
use of __tls_get_addr to make the test reliable, not to mention the
undeserved assumption that, just because a TLS variable is defined in
the first object file to be linked in, its dynamic TLS offset is going
to be zero.

Do you all think it makes sense to introduce such a test, in spite of all
these caveats?
  
Torvald Riegel Sept. 29, 2016, 11:19 a.m. UTC | #5
On Sun, 2016-09-25 at 19:18 -0300, Alexandre Oliva wrote:
> On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > On Sat, 2016-09-24 at 00:44 -0300, Alexandre Oliva wrote:
> >> In this patch, I add a long comment explaining why we can't modify
> >> another thread's DTV while initializing its static TLS block during
> >> dlopen.
> 
> > I appreciate that you added a comment.  Given that you seemed to have
> > paged in all of this currently, it would be even better if you could try
> > to further improve the documentation of all of this.
> 
> As your feedback to the patch shows, it's pointless for me to even try.

I do not think it is, provided that you are willing to work within the
memory model that we have chosen for glibc (which is basically
equivalent to C11's).

> My way of reasoning about this stuff doesn't seem to make sense to you.
> So why bother?

This isn't about my personal opinion, or taste.  glibc has chosen the
C11 memory model (see above), so I consider this our technical context.
As with any other technical context we pick, we want to build code (and
documentation) that is sound within that context.  Hence, if I think
that a patch can be improved in this regard, I'll speak up.

Consider an exaggerated but similar example:  I want to add a
non-trivial piece of code, and like to reason about it using notes on
paper.  Yet glibc has decided that it wants comments in the code.  What
should I do?  Should I just give up, or try to find a way to do what
glibc wants and how it decided to maintain the code base (eg, finding a
back-and-forth conversion between glibc model of taking notes and mine)?

> > (I suppose missing the issue you fix here would have been less likely
> > if proper documentation existed before.)
> 
> *nod*.  That's why I added the new, longer comments: the comments I'd
> added before were not enough to stop even myself from reintroducing the
> error, because there was a shallow, more obvious (apparent?  harmless?)
> race, and a deeper, less obvious one that's very harmful when it hits.

Right.  But we also need to consider that more than one person will
eventually want to maintain any piece of glibc code.  Thus, we need a
common way of reasoning about it.  This needs to be very well specified,
and that's a job we can't do on our own.  So, picking something that
will be wide-spread, like C11's model, is the right thing to do.  But
then we also need to make use of it.

I've noted a couple of times that of course, the project needs to find
its own way of how to talk about concurrency.  Discussing patches such
as yours is a way of doing that.  I've also always asked reviewers of my
patches that concurrent code to also comment on whether they find the
comments about concurrency in these good (eg, level of detail), or
whether they have suggestions for improvement.
However, we need to be technically correct in how we speak about it, and
that includes sticking to the terminology that defines the memory model
we have picked.  Otherwise, we're deviating from our technical base of
reasoning, and this will just create misunderstandings.

> > Are you aware of any synchronization in the TLS stuff that is not based
> > on locks?
> 
> Yes, plenty.  Off the top of my head, there's static TLS block
> initialization at dlopen vs use by other threads, there was DTV
> initialization vs use and resize (fixed by my 1.5yo-patch, broken by my
> patch from last week, and fixed again once the patch that reverts the
> breakage goes in, because we only mess with a thread's DTV at its
> start-up and then by the thread itself)

Can you fix this to use C11 atomics (or even the old-style atomics if
you're not comfortable with using C11 atomics yet), please?

> but IIRC there are
> self-concurrency issues if DTV resizing is interrupted by a signal that
> uses GD and starts further resizing, and IIRC there is plenty of
> suspicious stuff in slotinfo manipulation vs use (the data structure is
> only modified while holding the rtld lock, but it's read while updating
> a DTV without any explicit synchronization; I never looked into it
> enough to tell what the assumptions were that make it safe at some
> point, to tell whether there's any chance they still are);

That sounds like an incorrect attempt at double-checked locking, which
I've found several cases of so far, so it would be "within the pattern".
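For reference, a hedged sketch (hypothetical names, not the slotinfo code) of what a *correct* double-checked initialization looks like under the C11 model; the suspect pattern differs in that readers check a plain flag with no ordering before reading the guarded data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical sketch: correct double-checked initialization under the
   C11 model.  The fast path is an acquire load; the initializer
   publishes with a release store while holding the lock.  Reading a
   plain flag and then the guarded data -- the broken variant -- gives
   no guarantee that the data reads happen after the initialization. */

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic int initialized;
static int guarded_data;          /* written before initialized = 1 */

int get_guarded_data(void) {
  if (!atomic_load_explicit(&initialized, memory_order_acquire)) {
    pthread_mutex_lock(&init_lock);
    if (!atomic_load_explicit(&initialized, memory_order_relaxed)) {
      guarded_data = 123;                        /* one-time setup */
      atomic_store_explicit(&initialized, 1, memory_order_release);
    }
    pthread_mutex_unlock(&init_lock);
  }
  /* the acquire pairs with the release store on the same atomic,
     so guarded_data is visible here */
  return guarded_data;
}
```

The second check under the lock may be relaxed because the lock itself orders competing initializers; only the lock-free fast path needs acquire.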

> I have a
> hunch that there are probably issues with TLS descriptors as well,
> because although they're created while holding the rtld lock, their use
> is not explicitly guarded either, nor are the GOT entries relocated
> so as to point to them.
> 
> IMHO these are all internal implementations, so the C11 memory model for
> applications doesn't quite apply to them:

The C11 model is the memory model that glibc uses.  See
https://sourceware.org/glibc/wiki/Concurrency

> it's ok if they make stronger
> or weaker assumptions than what the language is supposed to expose to
> users of the language.  Just like the *implementation* of mutexes,
> locks, condition variables and other such concepts won't necessarily
> abide by the rules imposed on users of the language, as long as they
> deliver the intended semantics, including synchronization and memory
> model properties.

See above.  While I agree that we could use something else, we chose C11
for glibc's code too.  There are very few cases where we'd need more
than C11 atomics to implement glibc's concurrent code.

> > If so, it would be great if you could transform them to using the
> > new-style C11 atomics.
> 
> I guess it would, wouldn't it?  Too bad I'm probably not.  I just can't
> stand even the thought of the never-ending conversations that would
> ensue.

If your patches would follow the C11 model and the comments would be
fine, you'd just get an "OK" in my review.

> So, I'm out of here; the only reason you're even hearing from me is that
> I'm responsible enough to at least try to fix the breakage I introduced,
> once it was brought to my attention (at a time in which I actually
> noticed it; too bad I missed it the first time).
> 
> > I've reviewed nscd concurrency recently and converted it to use C11
> > atomics, and this process revealed many synchronization bugs such as
> > missing barriers.
> 
> That shouldn't be surprising.  A lot of that code was written before C11
> was even in the foreseeable future.

It was fine to assume TSO when glibc was started, and the role of
compilers in concurrency was less well understood than it is now.
However, glibc has claimed to support ARM and PowerPC correctly for
quite a while now, and that is not the case for concurrent code in many
places (eg, because it assumes something like TSO).

Also note that even if assuming a strong memory model, some of the nscd
synchronization would be broken.

Furthermore, you reviewed nscd-related code and marked it as
thread-safe.  This wasn't that long ago, and happened at a time at which
weak memory models such as ARM's, and the role of the compiler were well
understood.  My patch shows that this wasn't thread-safe.  Thus, I find
it surprising that you think it wasn't surprising that the
synchronization in nscd is buggy.

> >> +     Another race, not to be mistaken for the above, has to do with
> >> +     the TLS block initialized below.  Although we perform this
> >> +     initialization while holding the rtld lock, once dlopen
> >> +     completes, other threads might run code that accesses the
> >> +     variables in these TLS blocks, without taking the rtld lock or
> >> +     any other explicit memory acquire.
> 
> > What is a "memory acquire"?
> 
> That's a term from long before C standardized atomics, from the 20 years
> or so in which people like myself have been doing concurrent programming
> on multi-CPU computers without explicit language support, and even on
> single-CPU multi-tasked systems.  In dusted books from that age, you may
> be able to find tons of information about such things as memory barriers
> or fences, some with acquire and release semantics, as well as the
> implied acquire and release of all memory that occurs at mutex acquire
> and release time, respectively.
> 
> There are still some remnants from those days in e.g. Linux docs:
> https://www.kernel.org/doc/Documentation/memory-barriers.txt
> 
> but a lot of what can be found on the Internet now seems to be more
> recent than C and C++ atomics.  Still, the notion of memory acquire and
> release from the ancient scrolls is still there, even if a bit in
> disguise.  There's no reason for us to limit our language and throw away
> all the old books just because the term was reused in a new, narrower
> context, is there?

My point is that you're not specific.  There's no such thing as The
Acquire Semantics, so you should be specific about which semantics you
are referring to.  Referring to a group of semantics isn't good enough
if the members of this group differ significantly.
One such difference is, for example, whether you assume that all
acquire/release actions are global, or always "tied" to accesses to a
specific memory location.  You seem to think the former, but this isn't
true in weak HW memory models we have today and that glibc claims to
support.

The language we use in glibc needs to be sufficiently precise for what
we want to do in glibc.  What you (or anyone else for that matter)
thinks might have been popular terminology X years ago or on some
architectures of the past is frankly irrelevant; sure, it can help us
get to proper terminology, but it's not sufficient in glibc code /
comments.

> >> +     Our implementation appears to assume that, in order for the other
> >> +     threads to access those variables, there will be some form of
> >> +     synchronization between the end of dlopen() in one thread and the
> >> +     use of the just-loaded modules in others, otherwise the very use
> >> +     of the loaded code could be undefined.  */
> 
> > I'd reverse that: I find it misleading that you start by saying that
> > there is a race condition, when in fact this seems not to be the
> > case because of
> > the assumption.  Therefore, I'd mention the assumption first, and then
> > derive from that there is no data-race and that uses of the TLS will
> > find it initialized.
> 
> Unless the assumption fails to meet the definition of data race in some
> possibly irrelevant fashion.

This assumption is orthogonal to how a data race is defined.

> The point is that the implementation
> doesn't guard itself from certain races, if someone's determined to
> create one,

That's why I suggested to clearly state the assumption first.  You could
also call it a precondition for this functionality in glibc.  (I'm not
aware of any "official" requirement / precondition, but if it exists, we
should cite it.)

> but I posit they don't matter.  Let me give you some
> examples:
> 
> - there's the obvious case of the program with a race: the dlopen caller
> sets a global non-atomic variable when dlopen completes, and another
> thread busy-waits for the variable to change, and then uses its value to
> call a function that accesses the newly-loaded TLS variable.  The race
> on the global variable enables the race on the TLS variable, so
> undefined behavior has already been invoked by the time the TLS
> implementation incurs its own race.
> 
> - here's a variant that doesn't involve any other race: the dlopen
> caller writes to a pipe a pointer to a function that accesses the TLS
> variable, and another thread reads the pointer from the pipe and calls
> it -> race?  there's clearly a happens-before order implied by the
> write() and read(), but that doesn't imply memory synchronization in the
> memory model AFAICT.

I would think that the preferable semantics would be that a read()
reading data from a write() call would be considered to create a
synchronizes-with edge between these.

If it doesn't, then it's the same as if using relaxed-MO atomics to
communicate the pointer, which doesn't create a synchronizes-with.  In
this case, which is not a data race, the precondition of dlopen would
apply; if the user violates it, then it's the user's fault.  The comment
I suggest would then remind glibc developers that there is this
precondition, and thus glibc is correct wrt its specification (which
includes the precondition).

> - another variant in which there's a clear happens-before order, but no
> memory synchronization, so there's possibly a race: the dlopen caller
> installs a signal handler that accesses the TLS variable, and then gets
> the signal sent to the process itself; another thread is chosen to
> handle the signal and accesses the TLS variable -> race?

Violates the precondition, so undefined behavior.  Whether that might or
might not lead to a data race in the execution of the implementation is
an implementation detail.

> - generalizing, use any out-of-(memory-)band information-carrying side
> effect that one thread could use to communicate to another thread how to
> call a function that accesses the TLS variable, both function and
> variable brought in by dlopen, and you'll have, technically, a data
> race, but IMO a harmless data race.

I'm missing some detail in the description of the case, so I can't say
whether it would or would not be a data race in how the variable is
communicated.  But irrespective of whether the
  
Alexandre Oliva Sept. 29, 2016, 9:51 p.m. UTC | #6
On Sep 29, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> Consider an exaggerated but similar example:  I want to add a
> non-trivial piece of code, and like to reason about it using notes on
> paper.  Yet glibc has decided that it wants comments in the code.  What
> should I do?  Should I just give up, or try to find a way to do what
> glibc wants and how it decided to maintain the code base (eg, finding a
> back-and-forth conversion between glibc model of taking notes and mine)?

IMHO the important question is not what *you* should do in this
scenario, but what should the GNU Libc community do, when an occasional
contributor comes up with a fix for an existing bug.  Should you boss
people around, playing language police, and demand further unrelated
changes, so
as to push the contributor away, or should you welcome and thank the
contribution so that the contributor feels welcome and inclined to come
back?

> Right.  But we also need to consider that more than one person will
> eventually want to maintain any piece of glibc code.

We also have to consider that by the time some of these people do, the
standards will have changed again, and the language in the comments will
again no longer be in line with the newer consensus of the community.
That's already happened.  As much as you want to portray things this
way, some key concepts have not changed, nor have they been ruled out of
the vocabulary.

> Can you fix this to use C11 atomics (or even the old-style atomics if
> you're not comfortable with using C11 atomics yet), please?

Of course not.  I have a lot of other work to do, and absolutely zero
interest in going through the pain of trying to get anything vaguely
concurrency-related into glibc, as much as I have studied and been
interested in this area.  That's unfortunate for glibc, that I'm still
somewhat responsible for, but I just can't stand this pain.

>> but IIRC there are
>> self-concurrency issues if DTV resizing is interrupted by a signal that
>> uses GD and starts further resizing) and IIRC there is plenty of
>> suspicious stuff in slotinfo manipulation vs use (the data structure is
>> only modified while holding the rtld lock, but it's read while updating
>> a DTV without any explicit synchronization; I never looked into it
>> enough to tell what the assumptions were that make it safe at some
>> point, to tell whether there's any chance they still are);

> That sounds like an incorrect attempt at double-checked locking, which
> I've found several cases of so far, so it would be "within the pattern".

I don't see how you could possibly have drawn that conclusion, since the
readers do NOT take locks, before or after any test.

> Also note that even if assuming a strong memory model, some of the nscd
> synchronization would be broken.

> Furthermore, you reviewed nscd-related code and marked it as
> thread-safe.

The review was limited to functions provided by glibc and documented in
the glibc manual.  nscd provides none.  So your claim that I reviewed
them may very well be the result of a misunderstanding leading to an
incorrect assumption.

Regardless, if you found any comments about nscd code in my
thread-safety annotations (and presumably corrected them), it would have
been polite (and appreciated) for you to copy me, so that I could have
been informed and educated about the changes.  I have no recollection
of having been copied on any such messages.

> I find it surprising that you think it wasn't surprising that the
> synchronization in nscd is buggy.

I vaguely recall finding a few synchronization problems in nss
plugins years before the thread-safety review and documentation project
started.  That's why I wasn't surprised by the existence of
synchronization problems in that general area, even though nss plugins
were not part of the review.

> I'm missing some detail in the description of the case, so I can't say
> whether it would or would not be a data race in how the variable is
> communicated.  But irrespective of whether the 

(-:  If this message didn't end at this comma,
  
Torvald Riegel Sept. 30, 2016, 9:58 a.m. UTC | #7
On Thu, 2016-09-29 at 18:51 -0300, Alexandre Oliva wrote:
> On Sep 29, 2016, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > Consider an exaggerated but similar example:  I want to add a
> > non-trivial piece of code, and like to reason about it using notes on
> > paper.  Yet glibc has decided that it wants comments in the code.  What
> > should I do?  Should I just give up, or try to find a way to do what
> > glibc wants and how it decided to maintain the code base (eg, finding a
> > back-and-forth conversion between glibc model of taking notes and mine)?
> 
> IMHO the important question is not what *you* should do in this
> scenario, but what should the GNU Libc community do, when an occasional
> contributor comes up with a fix for an existing bug.

First of all, you are not a hobbyist contributor AFAIK.

> Should you boss
> around playing language police

This is part of patch review.  Being consistent re how we document and
reason about concurrency, and ensuring that such comments are
technically correct and match the code, is important.

I've offered my help plenty of times, and the standard for new
concurrent code we set is not unreasonable (eg, glibc doesn't use a
custom memory model or atomics, but something that all modern C and C++
code will use).  So asking patch contributors to adhere to that is fine,
I think.  We're also expecting code to be proper C in all other aspects,
so why should we make an exception for concurrency aspects?

> and demand further unrelated changes, so
> as to push the contributor away,

I asked whether you could look at related parts of the code, because you
mentioned that you know this larger part of the code base.

> > Right.  But we also need to consider that more than one person will
> > eventually want to maintain any piece of glibc code.
> 
> We also have to consider that by the time some of these people do, the
> standards will have changed again, and the language in the comments will
> again no longer be in line with the newer consensus of the community.

When do you think we will get a new ISO C/C++ memory model?  You seem to
think that this will happen in just a few years...

> >> but IIRC there are
> >> self-concurrency issues if DTV resizing is interrupted by a signal that
> >> uses GD and starts further resizing) and IIRC there is plenty of
> >> suspicious stuff in slotinfo manipulation vs use (the data structure is
> >> only modified while holding the rtld lock, but it's read while updating
> >> a DTV without any explicit synchronization; I never looked into it
> >> enough to tell what the assumptions were that make it safe at some
> >> point, to tell whether there's any chance they still are);
> 
> > That sounds like an incorrect attempt at double-checked locking, which
> > I've found several cases of so far, so it would be "within the pattern".
> 
> I don't see how you could possibly have drawn that conclusion, since the
> readers do NOT take locks, before or after any test.

Your description sounded as if it could match, in the sense that readers
try to, for example, detect whether the initialization has been done
(which uses a critical section); in that case, readers take no lock.
Even if it's not initialization-only, the broader point is the same: you
can't publish a result consisting of more than one memory location
without memory barriers, and you can't read from it without barriers.

> > Also note that even if assuming a strong memory model, some of the nscd
> > synchronization would be broken.
> 
> > Furthermore, you reviewed nscd-related code and marked it as
> > thread-safe.
> 
> The review was limited to functions provided by glibc and documented in
> the glibc manual.  nscd provides none.  So your claim that I reviewed
> them may very well be the result of a misunderstanding leading to an
> incorrect assumption.

The manual contains annotations of several functions that start with
"nscd_".  These are in the nscd client code, and are called from
functions provided by glibc.  Was it somebody else that put these
annotations into the manual?

nscd client-server synchronization is broken.  That something is wrong
is easy to see: there clearly is synchronization, and comments that
hint at low-level synchronization -- but there are no atomic operations
(nor sufficient locking), and thus no memory barriers, for example.
Also, volatile-qualified variables are used even though there's no
low-level I/O, which suggests synchronization written before proper
compiler support for it existed.

> Regardless, if you found any comments about nscd code in my
> thread-safety annotations (and presumably corrected them), it would have
> been polite (and appreciated) for you to copy me, so that I could have
> been informed and educated about the changes.  I don't have recollection
> of having been copied on any such messages.

I have already sent the draft patch to a list you have access to.  I
haven't checked all the annotations again yet, but I believe that my
patch now makes the annotations correct (considering the thread-safety
aspect here).
  

Patch

diff --git a/elf/dl-reloc.c b/elf/dl-reloc.c
index dcab666..42bddc1 100644
--- a/elf/dl-reloc.c
+++ b/elf/dl-reloc.c
@@ -137,12 +137,6 @@  _dl_nothread_init_static_tls (struct link_map *map)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
 
-  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
-  dtv_t *dtv = THREAD_DTV ();
-  assert (map->l_tls_modid <= dtv[-1].counter);
-  dtv[map->l_tls_modid].pointer.to_free = NULL;
-  dtv[map->l_tls_modid].pointer.val = dest;
-
   /* Initialize the memory.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 3016a2e..38b6530 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -1207,12 +1207,36 @@  init_one_static_tls (struct pthread *curp, struct link_map *map)
 #  error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 # endif
 
-  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
-  dtv_t *dtv = GET_DTV (TLS_TPADJ (curp));
-  dtv[map->l_tls_modid].pointer.to_free = NULL;
-  dtv[map->l_tls_modid].pointer.val = dest;
-
-  /* Initialize the memory.  */
+  /* We cannot delay the initialization of the Static TLS area, since
+     it can be accessed with LE or IE, but since the DTV is only used
+     by GD and LD (*) we can delay its update to avoid a race (**).
+
+     (*) The main program's DTV of statically-linked programs may be
+     used by a custom __tls_get_addr used on platforms that don't
+     require TLS relaxations, but that one is never initialized by us,
+     since the main program is never dlopened; the main thread's DTV
+     entry for the main program gets initialized in __libc_setup_tls,
+     whereas other threads' corresponding DTV entries are initialized
+     by _dl_allocate_tls_init.
+
+     (**) If we were to write to other threads' DTV entries here, we
+     might race with their DTV resizing.  We might then write to
+     stale, possibly already free()d and reallocated memory.  Or we
+     could write past the end of the still-active DTV.  So, don't do
+     that!
+
+     Another race, not to be mistaken for the above, has to do with
+     the TLS block initialized below.  Although we perform this
+     initialization while holding the rtld lock, once dlopen
+     completes, other threads might run code that accesses the
+     variables in these TLS blocks, without taking the rtld lock or
+     any other explicit memory acquire.
+
+     Our implementation appears to assume that, in order for the other
+     threads to access those variables, there will be some form of
+     synchronization between the end of dlopen() in one thread and the
+     use of the just-loaded modules in others, otherwise the very use
+     of the loaded code could be undefined.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
 }