Fixing the distribution problems with TLS and DTV_SURPLUS slots.

Message ID 5432EFF9.5020602@redhat.com
State Dropped

Commit Message

Carlos O'Donell Oct. 6, 2014, 7:39 p.m. UTC
  Adam,

I'm sitting on this patch in Fedora, and you asked me to send it
upstream. Unfortunately I don't think it is the right solution for
upstream.

Firstly, please don't respond with "But DSOs using TLS IE accesses
are not allowed." It's allowed because the compiler and linker let
people use it, and we should have either prevented it or spent more
time educating our users. Either way there are valid uses of it, and
glibc itself, along with other core libraries, wants the speed that it
offers. In the future we want to switch them to TLS descriptors, which
give you the same speed, but the momentum is there and we'd have to
patch every project like MESA to get rid of it, so we'll be done 10
years down the road. Please see the "WARNING:" below about TLS
descriptors and AArch64 (and likely other TLS descriptor targets).

The patch in Fedora is this:
~~~
#
# This is an experimental patch that should go into rawhide and
# Fedora 21 to fix failures where python applications fail to 
# load graphics applications because of the slot usages for TLS.
# This should eventually go upstream.
#
# - Carlos O'Donell
#
~~~

The error users are seeing is this:
"dlopen: cannot load any more object with static TLS"

This is triggered by this code:

elf/dl-open.c:

  /* We need a second pass for static tls data, because _dl_update_slotinfo
     must not be run while calls to _dl_add_to_slotinfo are still pending.  */
  for (unsigned int i = first_static_tls; i < new->l_searchlist.r_nlist; ++i)
    {
      struct link_map *imap = new->l_searchlist.r_list[i];

      if (imap->l_need_tls_init
          && ! imap->l_init_called
          && imap->l_tls_blocksize > 0)
        {
          /* For static TLS we have to allocate the memory here and
             now.  This includes allocating memory in the DTV.  But we
             cannot change any DTV other than our own.  So, if we
             cannot guarantee that there is room in the DTV we don't
             even try it and fail the load.

             XXX We could track the minimum DTV slots allocated in
             all threads.  */
          if (! RTLD_SINGLE_THREAD_P && imap->l_tls_modid > DTV_SURPLUS)
            _dl_signal_error (0, "dlopen", NULL, N_("\
cannot load any more object with static TLS"));

          imap->l_need_tls_init = 0;
#ifdef SHARED
          /* Update the slot information data for at least the
             generation of the DSO we are allocating data for.  */
          _dl_update_slotinfo (imap->l_tls_modid);
#endif

          GL(dl_init_static_tls) (imap);
          assert (imap->l_need_tls_init == 0);
        }
    }

This code is a *heuristic*, it basically fails the load if there
are no DTV slots left, even though we can still do the following:

(a) Grow the DTV dynamically as many times as we want, with the
    generation counter causing other threads to update.

and

(b) Allocate from the static TLS image surplus until it is exhausted.

The heuristic avoids doing (a) and (b) if all the surplus slots
were taken.
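
To make the effect of the check concrete, here is a minimal sketch that models only the slot-count test. The function name and the `dtv_surplus` parameter are illustrative assumptions, not glibc code; the values 14 and 32 are the upstream and Fedora limits mentioned in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the check in elf/dl-open.c.  In a multi-threaded
   process a static-TLS module whose module ID exceeds the surplus is
   rejected, regardless of how much static TLS space is still free.  */
static bool
heuristic_allows_load (bool single_threaded, size_t tls_modid,
                       size_t dtv_surplus)
{
  assert (tls_modid > 0);   /* Module IDs start at 1 in this model.  */
  return single_threaded || tls_modid <= dtv_surplus;
}
```

With a surplus of 14, the 15th static-TLS module fails in a threaded process even if its TLS block is only a few bytes; with 32 it loads. Note that the free static TLS space never enters the decision.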

A better solution would be:
- Keep the use of DTV_SURPLUS to avoid immediately having to reallocate
  the DTV when you dlopen a couple of modules.
- Remove the check above, allowing the code to grow the DTV as large
  as it wants for as many STATIC_TLS modules as it wants.
- Restrict only on the size of static TLS image space and error when
  that is exhausted.
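
A rough sketch of what "restrict only on the size of static TLS image space" could look like follows. The structure and function names are hypothetical and the bookkeeping is much simpler than anything glibc would need; it only shows that the admission test depends on remaining bytes, not on slot counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the static TLS block; the field names
   are illustrative and are not glibc's.  */
struct static_tls_state
{
  size_t size;   /* Total bytes reserved, surplus included.  */
  size_t used;   /* Bytes already handed out.  */
};

/* Try to carve BLOCKSIZE bytes with alignment ALIGN (a power of two)
   out of the static TLS block.  Returns true and stores the offset on
   success; returns false only when the space is really exhausted.  */
static bool
try_static_tls (struct static_tls_state *st, size_t blocksize,
                size_t align, size_t *offset)
{
  assert (align != 0 && (align & (align - 1)) == 0);

  size_t start = (st->used + align - 1) & ~(align - 1);
  if (start > st->size || blocksize > st->size - start)
    return false;

  *offset = start;
  st->used = start + blocksize;
  return true;
}
```

Under this model the loader would report an error only when `try_static_tls` returns false, however many modules have been loaded before.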

The most common application framework to trigger this is
Python. There are more than 14 libraries in Fedora using TLS, 
in fact there are ~40, which is why I raised the DTV_SURPLUS
limit to 32 in Fedora (several can't be loaded simultaneously).

This raising of the DTV_SURPLUS limit is a bandaid, with the
added effect of optimizing performance for Python, since it avoids
the DTV realloc, at the cost of 18 extra dtv_t entries
(18 * sizeof(dtv_t) bytes) per thread.
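
As a back-of-the-envelope check of that cost, the sketch below computes the extra per-thread bytes for going from 14 to 32 surplus slots. `dtv_model_t` is an assumed, simplified model of a DTV entry (a counter overlaid with a pointer plus a flag), not a copy of glibc's definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a DTV entry (an assumption for illustration):
   either a generation counter or a pointer plus a flag.  */
typedef union
{
  size_t counter;
  struct
  {
    void *val;
    bool is_static;
  } pointer;
} dtv_model_t;

/* Extra bytes each thread pays when the surplus grows.  */
static size_t
extra_dtv_bytes_per_thread (size_t old_surplus, size_t new_surplus)
{
  assert (new_surplus >= old_surplus);
  return (new_surplus - old_surplus) * sizeof (dtv_model_t);
}
```

On a typical LP64 target this model gives 16 bytes per entry, so `extra_dtv_bytes_per_thread (14, 32)` comes to a few hundred bytes per thread.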

I'm not going to have time right now to implement the better
solution. What I'm looking for is expert advice on what to do
here.

The better solution requires considerably more testing, because
now we're doing something we've never done before: allocating
up to the limit of the surplus static TLS image.

Do we grow the DTV_SURPLUS knowing it's a bandaid?

WARNING: On AArch64 or any architecture that uses the generic-ish
code for TLS descriptors, you will have further problems. There
the descriptors consume static TLS image greedily, which means
you may find that there is zero static TLS image space when you
go to dlopen an application. We need to further subdivide the
static TLS image space into "reserved for general use" and
"reserved for DSO load uses." With the TLS descriptors allocating
from the general use space only. On Fedora for AArch64 this
caused no end of headaches attempting to load TLS IE using DSOs
only to find it was literally impossible because so much of the
implementation used TLS descriptors that the surplus static TLS
image space was gone, and while descriptors can be allocated 
dynamically, the DSOs can't. In Fedora we disallow greedy
consumption of TLS descriptors on any targets that have TLS
descriptors on by default. Which leads me to the last point. 
We need to turn on TLS descriptors by default on x86_64 such
that we can get the benefits there, and start moving DSOs away
from TLS IE.

Comments?

Cheers,
Carlos.
  

Comments

Rich Felker Oct. 6, 2014, 8:35 p.m. UTC | #1
On Mon, Oct 06, 2014 at 03:39:37PM -0400, Carlos O'Donell wrote:
> Adam,
> 
> I'm sitting on this patch in Fedora, and you asked me to send it
> upstream. Unfortunately I don't think it is the right solution for
> upstream.
> 
> Firstly, please don't respond with  "But DSOs using TLS IE accesses
> are not allowed."

That's what I wanted to say, but I'll refrain.

> Do we grow the DTV_SURPLUS knowing it's a bandaid?

Considering that the growth is not that much memory, I have no
objection, but I hope this will be accompanied by a commitment to push
for moving everything to TLSDESC. Ideally this would include making ld
produce a warning (to be converted to an error somewhere down the
line) when producing a .so with TLS IE model.

> WARNING: On AArch64 or any architecture that uses the generic-ish
> code for TLS descriptors, you will have further problems. There
> the descriptors consume static TLS image greedily, which means
> you may find that there is zero static TLS image space when you
> go to dlopen an application.

dlopen an application? I'm not understanding the whole issue here, but
it's probably not that important when considering the change you want
to make, anyway, is it?

Rich
  
Carlos O'Donell Oct. 6, 2014, 8:51 p.m. UTC | #2
On 10/06/2014 04:35 PM, Rich Felker wrote:
>> WARNING: On AArch64 or any architecture that uses the generic-ish
>> code for TLS descriptors, you will have further problems. There
>> the descriptors consume static TLS image greedily, which means
>> you may find that there is zero static TLS image space when you
>> go to dlopen an application.
> 
> dlopen an application? I'm not understanding the whole issue here, but
> it's probably not that important when considering the change you want
> to make, anyway, is it?

Sorry, I meant to write "a DSO that uses static TLS."

Cheers,
Carlos.
  
Alexandre Oliva Oct. 7, 2014, 6:15 a.m. UTC | #3
On Oct  6, 2014, "Carlos O'Donell" <carlos@redhat.com> wrote:

> This code is a *heuristic*, it basically fails the load if there
> are no DTV slots left, even though we can still do the following:

> (a) Grow the DTV dynamically as many times as we want, with the
>     generation counter causing other threads to update.

or

  (a)' Stop wasting DTV entries with modules assigned to static TLS.
       There's no reason whatsoever to do so.

       This optimization is even described in the GCC Summit article in
       which I first proposed TLS Descriptors.  Unfortunately, I never
       got around to implementing it.

> and

> (b) Allocate from the static TLS image surplus until it is exhausted.


> - Remove the check above, allowing the code to grow the DTV as large
>   as it wants for as many STATIC_TLS modules as it wants.

We don't really need to grow the DTV right away.  If we have static TLS,
we could just leave the DTV alone.  No code will ever access the
corresponding DTV entry.  If any code needs to update the DTV, because
of some module assigned to dynamic TLS, then, and only then, should the
DTV grow.
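
A toy model of this lazy approach is sketched below. All names are hypothetical, and real DTV handling involves slotinfo lists and locking that are omitted here; the point is only that loading static-TLS modules bumps a generation counter without touching any thread's DTV, and a thread resizes its own DTV when it first needs a dynamic entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy per-thread DTV: a generation stamp and a resizable slot array.  */
struct toy_dtv
{
  size_t generation;   /* Generation this thread last synchronised to.  */
  size_t nslots;
  void **slots;
};

/* Global state a loader would maintain (hypothetical names).  */
struct toy_loader
{
  size_t generation;   /* Bumped on every load that adds TLS.  */
  size_t max_modid;    /* Highest module ID assigned so far.  */
};

/* Loading a static-TLS module assigns an ID and bumps the generation,
   but does not touch any thread's DTV.  */
static size_t
toy_load_static_tls_module (struct toy_loader *ld)
{
  ld->generation++;
  return ++ld->max_modid;
}

/* Called only on the dynamic-TLS slow path.  Returns 0 on success and
   -1 if memory runs out.  */
static int
toy_dtv_sync (struct toy_dtv *dtv, const struct toy_loader *ld)
{
  if (dtv->generation == ld->generation)
    return 0;

  size_t wanted = ld->max_modid + 1;
  if (dtv->nslots < wanted)
    {
      void **p = realloc (dtv->slots, wanted * sizeof *p);
      if (p == NULL)
        return -1;
      for (size_t i = dtv->nslots; i < wanted; i++)
        p[i] = NULL;
      dtv->slots = p;
      dtv->nslots = wanted;
    }

  dtv->generation = ld->generation;
  assert (dtv->nslots > ld->max_modid);
  return 0;
}
```

In this model forty static-TLS loads cost a thread nothing until it actually takes the dynamic path, at which point a single `realloc` catches it up.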

> WARNING: On AArch64 or any architecture that uses the generic-ish
> code for TLS descriptors, you will have further problems. There
> the descriptors consume static TLS image greedily, which means
> you may find that there is zero static TLS image space when you
> go to dlopen an application.

That's perfectly valid behavior, that exposes the bug in libraries that
are expected to be loadable after a program starts (say, by dlopen) when
relocations indicate they had to be brought in by Initial Exec.

That they worked was not by design; it was pretty much by accident,
because glibc led by (bad) example instead of coming up with a real
solution, and others followed suit, breaking glibc's own assumption that
only a very small amount of static TLS space would ever be used after
the program started, and that the consumer of that space would be glibc
itself.

> We need to further subdivide the static TLS image space into "reserved
> for general use" and "reserved for DSO load uses."  With the TLS
> descriptors allocating from the general use space only.

?!?

Static TLS space grows as much as needed to fit all IE DSOs.  Some
excess is reserved (and this should be configurable), but if we don't
use it for modules that could benefit from it, what should we use it
for?

> On Fedora for AArch64 this
> caused no end of headaches attempting to load TLS IE using DSOs
> only to find it was literally impossible because so much of the
> implementation used TLS descriptors that the surplus static TLS
> image space was gone, and while descriptors can be allocated 
> dynamically, the DSOs can't.

Err...  I get a feeling I have no idea of what you refer to as DSO.
From the description, it's not Dynamically-loaded Shared Object.  What
is it, then?

I suppose you may be speaking of modules that assume IE is usable to
access TLS of some module (itself, or any other), even though the
assumption is not warranted.

So assume you load a module A defining a TLS section, and conservatively
assign it to dynamic TLS, for whatever reason.  Then you load a module B
that expects A to be in static TLS, because it uses IE to reference its
TLS symbols.  Kaboom.  The “conservative” approach just broke what would
have worked if you hadn't gratuitously taken it out of TLS.

Now, of course when you load A you don't know whether module B is going
to be loaded, and whether it will require A to use static TLS or not, or
whether module C would fail to load afterwards because there's not
enough static TLS space for its own TLS section, and it uses IE even
though it's NOT being loaded as a dependency of the IE.

So not saving static TLS space for later use may expose breakage in
subsequently loaded modules, whereas saving it may equally expose
breakage in subsequently loaded modules, but waste static TLS space and
*significantly* impact performance of TLS Descriptor-using modules that
could have got IE-like performance.  That sounds like a losing strategy
to me.

Greedy allocation doesn't guarantee optimal results, but it won't break
anything that isn't already broken, and if and when such breakage is
exposed, switching the broken modules to TLS Descriptors will get them
nearly identical performance for TLS references that happen to land in
static TLS, but that will NOT cause the library to fail to load
otherwise: it will just get GD-like performance.
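
The fallback behaviour described here can be illustrated with a toy descriptor model. Everything below is a hypothetical simplification (real TLS descriptors return an offset from the thread pointer and are resolved by the dynamic linker); it only shows the idea that binding picks a fast resolver when static space is available and a slower one otherwise, instead of failing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { TOY_STATIC_SIZE = 32, TOY_DYNAMIC_SIZE = 64 };

struct toy_thread
{
  char static_block[TOY_STATIC_SIZE];   /* Fixed at thread creation.  */
  char *dynamic_block;                  /* Allocated on first use.  */
};

struct toy_tlsdesc
{
  char *(*resolve) (const struct toy_tlsdesc *, struct toy_thread *);
  size_t offset;
};

struct toy_loader
{
  size_t static_used;
  size_t dynamic_used;
};

/* Fast path: one addition, comparable to an IE access.  */
static char *
resolve_static (const struct toy_tlsdesc *d, struct toy_thread *t)
{
  return t->static_block + d->offset;
}

/* Slow path: may allocate, comparable to a GD access.  */
static char *
resolve_dynamic (const struct toy_tlsdesc *d, struct toy_thread *t)
{
  if (t->dynamic_block == NULL)
    t->dynamic_block = calloc (1, TOY_DYNAMIC_SIZE);
  return t->dynamic_block != NULL ? t->dynamic_block + d->offset : NULL;
}

/* Greedy binding: use static space while it lasts, then fall back.
   Returns -1 only when the toy's dynamic area is also full.  */
static int
toy_bind (struct toy_loader *ld, struct toy_tlsdesc *d, size_t size)
{
  assert (size > 0);
  if (size <= TOY_STATIC_SIZE - ld->static_used)
    {
      d->resolve = resolve_static;
      d->offset = ld->static_used;
      ld->static_used += size;
      return 0;
    }
  if (size <= TOY_DYNAMIC_SIZE - ld->dynamic_used)
    {
      d->resolve = resolve_dynamic;
      d->offset = ld->dynamic_used;
      ld->dynamic_used += size;
      return 0;
    }
  return -1;
}
```

A module bound after the static area fills up still loads; its accesses simply go through `resolve_dynamic`. The same greedy policy is also what leaves no static space for a later IE-only module.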

So, in addition to stopping wasting DTV entries with static TLS
segments, I suggest not papering over the problem in glibc, but rather
proactively converting dlopenable libraries that use IE to access TLS
symbols that are not guaranteed to have been loaded along with the IE to
use TLS Descriptors in General Dynamic mode.

In order to ease this sort of transition, I've been thinking of
introducing an alias access model in GCC, that would map to GD if TLS
Descriptors are enabled, or to the failure-prone IE with old-style TLS.
Then those who incorrectly use IE today can switch to that; it will be a
no-op on arches that don't have TLSDesc, or that don't use it by
default, but it will make the fix automatic as each platform switches to
the superior TLS design.

> In Fedora we disallow greedy consumption of TLS descriptors on any
> targets that have TLS descriptors on by default.

Oh, wow, this is such a great move that it makes TLS Descriptors'
performance the *worst* of all existing access models.  If we want to
artificially force them into their worst case, we might as well get rid
of them altogether!

Whom should I thank for making my work appear to suck?  :-(

:-P :-)

> We need to turn on TLS descriptors by default on x86_64 such
> that we can get the benefits there, and start moving DSOs away
> from TLS IE.

Hallelujah! :-)
  
Carlos O'Donell Oct. 9, 2014, 2:48 p.m. UTC | #4
On 10/07/2014 02:15 AM, Alexandre Oliva wrote:
> On Oct  6, 2014, "Carlos O'Donell" <carlos@redhat.com> wrote:
> 
>> This code is a *heuristic*, it basically fails the load if there
>> are no DTV slots left, even though we can still do the following:
> 
>> (a) Grow the DTV dynamically as many times as we want, with the
>>     generation counter causing other threads to update.
> 
> or
> 
>   (a)' Stop wasting DTV entries with modules assigned to static TLS.
>        There's no reason whatsoever to do so.
> 
>        This optimization is even described in the GCC Summit article in
>        which I first proposed TLS Descriptors.  Unfortunately, I never
>        got around to implementing it.

I was not aware of this, but if possible it is a great solution.

>> and
> 
>> (b) Allocate from the static TLS image surplus until it is exhausted.
> 
> 
>> - Remove the check above, allowing the code to grow the DTV as large
>>   as it wants for as many STATIC_TLS modules as it wants.
> 
> We don't really need to grow the DTV right away.  If we have static TLS,
> we could just leave the DTV alone.  No code will ever access the
> corresponding DTV entry.  If any code needs to update the DTV, because
> of some module assigned to dynamic TLS, then, and only then, should the
> DTV grow.

I had not considered this optimization, but I guess it would work.

>> WARNING: On AArch64 or any architecture that uses the generic-ish
>> code for TLS descriptors, you will have further problems. There
>> the descriptors consume static TLS image greedily, which means
>> you may find that there is zero static TLS image space when you
>> go to dlopen an application.
> 
> That's perfectly valid behavior, that exposes the bug in libraries that
> are expected to be loadable after a program starts (say, by dlopen) when
> relocations indicate they had to be brought in by Initial Exec.

I did not argue that it was invalid behaviour. I only wished to warn
the reader that the situation at present will result in broken applications.
We, the tools authors, allowed this situation to get out of hand, and now we
own both pieces when it breaks, and must do our level best to ensure
things continue to work while providing a way out of the situation.
 
> That they worked was not by design; it was pretty much by accident,
> because glibc led by (bad) example instead of coming up with a real
> solution, and others followed suit, breaking glibc's own assumption that
> only a very small amount of static TLS space would ever be used after
> the program started, and that the consumer of that space would be glibc
> itself.

I agree.

>> We need to further subdivide the static TLS image space into "reserved
>> for general use" and "reserved for DSO load uses."  With the TLS
>> descriptors allocating from the general use space only.
> 
> ?!?
> 
> Static TLS space grows as much as needed to fit all IE DSOs.  Some
> excess is reserved (and this should be configurable), but if we don't
> use it for modules that could benefit from it, what should we use it
> for?

My apologies, let me clarify. The static TLS space that is allocated
is only for DSOs that are known a priori to the static linker. They
must have been specified on the command line. Unfortunately, in programs
written in interpreted languages like Python, everything is a dlopen'd
DSO. When you use dlopen with IE you run into the problem that
we see today with TLS descriptors. You have a desire to keep the
application working with the existing set of ~40 DSOs on the system
that use IE, and we have a desire to keep TLS descriptors optimal.
If we keep TLS descriptors optimal, they may consume all of the static
TLS image and result in an application crash if a dlopen'd DSO uses
IE, and I wish to avoid that crash.

>> On Fedora for AArch64 this
>> caused no end of headaches attempting to load TLS IE using DSOs
>> only to find it was literally impossible because so much of the
>> implementation used TLS descriptors that the surplus static TLS
>> image space was gone, and while descriptors can be allocated 
>> dynamically, the DSOs can't.
> 
> Err...  I get a feeling I have no idea of what you refer to as DSO.
> From the description, it's not Dynamically-loaded Shared Object.  What
> is it, then?

My apologies again. Given that DSOs known at link time to use IE will
have static TLS image space allocated, I have stopped talking about
those since we know they work correctly. When I speak about DSOs I
speak solely about those loaded via dlopen.
 
> I suppose you may be speaking of modules that assume IE is usable to
> access TLS of some module (itself, or any other), even though the
> assumption is not warranted.

Yes. We have libraries in the OS using GCC constructs to force IE for
certain __thread variables. We need to move them away from those uses,
but we need to ensure a good migration path e.g. same speed, continues
to work until we migrate all DSOs etc.
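
For readers who have not seen these constructs, the sketch below shows GCC's per-variable `tls_model` attribute. The variable and function names are made up for illustration; only the attribute and the model strings are GCC's.

```c
#include <assert.h>

/* Forced to Initial Exec.  In a dlopen'd shared object this variable
   needs room in the static TLS block or the load fails.  */
__thread int fast_counter __attribute__ ((tls_model ("initial-exec")));

/* General Dynamic: always loadable, at the cost of a slower access
   path when the linker cannot relax it.  */
__thread int safe_counter __attribute__ ((tls_model ("global-dynamic")));

int
bump_fast (int step)
{
  assert (step > 0);
  fast_counter += step;
  return fast_counter;
}

int
bump_safe (int step)
{
  assert (step > 0);
  safe_counter += step;
  return safe_counter;
}
```

Because the attribute is attached to individual declarations, a library can mix models, which is why a migration could in principle proceed one variable at a time.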

> So assume you load a module A defining a TLS section, and conservatively
> assign it to dynamic TLS, for whatever reason.  Then you load a module B
> that expects A to be in static TLS, because it uses IE to reference its
> TLS symbols.  Kaboom.  The “conservative” approach just broke what would
> have worked if you hadn't gratuitously taken it out of TLS.

I don't think this scenario is supported by the present tools.

The only use I have ever seen for IE in a DSO is optimal access of local
thread variables.

If the static linker could see B accessing A's TLS using IE (this requires
B to be listed as a dependency or in the link list) then both A and B
would have to use static TLS, and that forces both into the static TLS
image. It would then be wrong for the dynamic loader to load A as dynamic TLS.

If you do think it can happen, please start a distinct thread and we can
talk about it and look into the source.

> Now, of course when you load A you don't know whether module B is going
> to be loaded, and whether it will require A to use static TLS or not, or
> whether module C would fail to load afterwards because there's not
> enough static TLS space for its own TLS section, and it uses IE even
> though it's NOT being loaded as a dependency of the IE.
> 
> So not saving static TLS space for later use may expose breakage in
> subsequently loaded modules, whereas saving it may equally expose
> breakage in subsequently loaded modules, but waste static TLS space and
> *significantly* impact performance of TLS Descriptor-using modules that
> could have got IE-like performance.  That sounds like a losing strategy
> to me.

The only valid sequences I know of are:

(a) Module uses static TLS and is known by the static linker and has
    static TLS image space allocated.

(b) Module uses static TLS and is not known to the static linker, accesses
    only its own variables with IE, and has no static TLS image space
    reserved for it.

The optimizing use of static TLS by TLS descriptors breaks (b).

> Greedy allocation doesn't guarantee optimal results, but it won't break
> anything that isn't already broken, and if and when such breakage is
> exposed, switching the broken modules to TLS Descriptors will get them
> nearly identical performance for TLS references that happen to land in
> static TLS, but that will NOT cause the library to fail to load
> otherwise: it will just get GD-like performance.

What if the module author can never tolerate GD-like performance and
would rather it fail than load and run slowly e.g. MESA/OpenGL?

Remember, and keep in mind our users, we do this for them, and some
of them have strict performance requirements. We should not lightly
tell them what they want is wrong.

For example, our work on tunables to allow users to tweak up the size
of the static TLS image surplus is one potential solution to this problem.

It might also be possible to try to make the static TLS image a single
mapping that we might possibly be able to grow with kernel help?

> So, in addition to stopping wasting DTV entries with static TLS
> segments, I suggest not papering over the problem in glibc, but rather
> proactively convert dlopenable libraries that use IE to access TLS
> symbols that are not guaranteed to have been loaded along with the IE to
> use TLS Descriptors in General Dynamic mode.

I agree that this is the correct solution, but *today* we have problems
loading user applications. I see no options but to follow a staggered
strategy:

(a) Immediately increase DTV surplus size.

	- Distribution patches are doing this already to keep applications working.

(b) Implement static TLS support without needing a DTV increase.

	- Reduces memory usage of DTV. Small optimization.

    and

    Remove faulty heuristics around not wanting to increase DTV size.

(c) Approach upstream projects with patches to convert to TLS descriptors.

When we do (c), can it be done on a per-variable basis?

Can I convert one variable at a time to be a TLS descriptor?

As is done currently with the gcc attributes for TLS mode?

> In order to ease this sort of transition, I've been thinking of
> introducing an alias access model in GCC, that would map to GD if TLS
> Descriptors are enabled, or to the failure-prone IE with old-style TLS.
> Then those who incorrectly use IE today can switch to that; it will be a
> no-op on arches that don't have TLSDesc, or that don't use it by
> default, but it will make the fix automatic as each platform switches to
> the superior TLS design.

Oh. Right. If upstream can't use TLS descriptors everywhere, then it
may find itself failing to compile on certain targets that don't support
descriptors.

The alias access model in GCC would be something like:
`__attribute__((tls_model("go-fast")))`?
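
Until such a model exists in GCC, a project could approximate the idea with a macro. This is only a sketch: `PROJECT_HAVE_TLSDESC` and `TLS_GO_FAST` are hypothetical names that a build system would have to define (for example when it knows the toolchain emits TLS descriptors for General Dynamic accesses), and "go-fast" itself is not a real `tls_model` value.

```c
#include <assert.h>

/* PROJECT_HAVE_TLSDESC is a hypothetical configure-time macro, not
   something GCC defines.  With descriptors, General Dynamic gets
   near-IE speed when static space is available and still loads when
   it is not.  Without them, fall back to the failure-prone IE.  */
#if defined (PROJECT_HAVE_TLSDESC)
# define TLS_GO_FAST __attribute__ ((tls_model ("global-dynamic")))
#else
# define TLS_GO_FAST __attribute__ ((tls_model ("initial-exec")))
#endif

__thread int current_ctx_id TLS_GO_FAST;

/* Install a new context ID for this thread and return the old one.  */
int
set_ctx (int id)
{
  assert (id >= 0);   /* Negative IDs are unused in this sketch.  */
  int old = current_ctx_id;
  current_ctx_id = id;
  return old;
}
```

The source then stays unchanged as each platform switches its default, which is the property the proposed alias model is meant to provide.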

>> In Fedora we disallow greedy consumption of TLS descriptors on any
>> targets that have TLS descriptors on by default.
> 
> Oh, wow, this is such a great move that it makes TLS Descriptors'
> performance the *worst* of all existing access models.  If we want to
> artificially force them into their worst case, we might as well get rid
> of them altogether!

If it doesn't work and causes applications to stop working
I'll disable it, and I did :-)

> Whom should I thank for making my work appear to suck?  :-(

Me. I didn't do it because I think it was the right solution.
I did it because users need working applications to do the tasks
they chose Fedora for.

It is not sufficient for me to say: "Wait a few months while I
fix the fundamental flaws in the education of users and the usage
of our tools." :-}
 
> :-P :-)
> 
>> We need to turn on TLS descriptors by default on x86_64 such
>> that we can get the benefits there, and start moving DSOs away
>> from TLS IE.
> 
> Hallelujah! :-)

You know I know what the right answer is, but we have to get there
one step at a time with working applications the whole way.

In summary, it looks like we need:

(a) Immediately increase DTV surplus size.
(b) Implement static TLS support without needing a DTV increase.
(c) Remove faulty heuristics around not wanting to increase DTV size.
(d) Add __attribute__((tls_model("go-fast"))) to gcc that defaults to
    IE if TLS Desc is not present.
(e) Approach upstream projects with patches to convert to TLS descriptors
    using go-fast model.

Does this plan make sense?

Cheers,
Carlos.
  
Rich Felker Oct. 10, 2014, 12:15 a.m. UTC | #5
On Thu, Oct 09, 2014 at 10:48:31AM -0400, Carlos O'Donell wrote:
> On 10/07/2014 02:15 AM, Alexandre Oliva wrote:
> > On Oct  6, 2014, "Carlos O'Donell" <carlos@redhat.com> wrote:
> > 
> >> This code is a *heuristic*, it basically fails the load if there
> >> are no DTV slots left, even though we can still do the following:
> > 
> >> (a) Grow the DTV dynamically as many times as we want, with the
> >>     generation counter causing other threads to update.
> > 
> > or
> > 
> >   (a)' Stop wasting DTV entries with modules assigned to static TLS.
> >        There's no reason whatsoever to do so.
> > 
> >        This optimization is even described in the GCC Summit article in
> >        which I first proposed TLS Descriptors.  Unfortunately, I never
> >        got around to implementing it.
> 
> I was not aware of this, but if possible it is a great solution.
> 
> >> and
> > 
> >> (b) Allocate from the static TLS image surplus until it is exhausted.
> > 
> > 
> >> - Remove the check above, allowing the code to grow the DTV as large
> >>   as it wants for as many STATIC_TLS modules as it wants.
> > 
> > We don't really need to grow the DTV right away.  If we have static TLS,
> > we could just leave the DTV alone.  No code will ever access the
> > corresponding DTV entry.  If any code needs to update the DTV, because
> > of some module assigned to dynamic TLS, then, and only then, should the
> > DTV grow.
> 
> I had not considered this optimization, but I guess it would work.
> 
> >> WARNING: On AArch64 or any architecture that uses the generic-ish
> >> code for TLS descriptors, you will have further problems. There
> >> the descriptors consume static TLS image greedily, which means
> >> you may find that there is zero static TLS image space when you
> >> go to dlopen an application.
> > 
> > That's perfectly valid behavior, that exposes the bug in libraries that
> > are expected to be loadable after a program starts (say, by dlopen) when
> > relocations indicate they had to be brought in by Initial Exec.
> 
> I did not argue that it was invalid behaviour. I only wished to warn
> the reader that the situation at present will result in broken applications.
> We, the tools authors, allowed this situation to get out of hand, and now we
> own both pieces when it breaks, and must do our level best to ensure
> things continue to work while providing a way out of the situation.
>  
> > That they worked was not by design; it was pretty much by accident,
> > because glibc led by (bad) example instead of coming up with a real
> > solution, and others followed suit, breaking glibc's own assumption that
> > only a very small amount of static TLS space would ever be used after
> > the program started, and that the consumer of that space would be glibc
> > itself.
> 
> I agree.
> 
> >> We need to further subdivide the static TLS image space into "reserved
> >> for general use" and "reserved for DSO load uses."  With the TLS
> >> descriptors allocating from the general use space only.
> > 
> > ?!?
> > 
> > Static TLS space grows as much as needed to fit all IE DSOs.  Some
> > excess is reserved (and this should be configurable), but if we don't
> > use it for modules that could benefit from it, what should we use it
> > for?
> 
> My apologies, let me clarify. The static TLS space that is allocated
> is only for DSOs that are known a priori to the static linker. They
> must have been specified on the command line. Unfortunately, in programs
> written in interpreted languages like Python, everything is a dlopen'd
> DSO. When you use dlopen with IE you run into the problem that
> we see today with TLS descriptors. You have a desire to keep the
> application working with the existing set of ~40 DSOs on the system
> that use IE, and we have a desire to keep TLS descriptors optimal.
> If we keep TLS descriptors optimal, they may consume all of the static
> TLS image and result in an application crash if a dlopen'd DSO uses
> IE, and I wish to avoid that crash.
> 
> >> On Fedora for AArch64 this
> >> caused no end of headaches attempting to load TLS IE using DSOs
> >> only to find it was literally impossible because so much of the
> >> implementation used TLS descriptors that the surplus static TLS
> >> image space was gone, and while descriptors can be allocated 
> >> dynamically, the DSOs can't.
> > 
> > Err...  I get a feeling I have no idea of what you refer to as DSO.
> > From the description, it's not Dynamically-loaded Shared Object.  What
> > is it, then?
> 
> My apologies again. Given that DSOs known at link time to use IE will
> have static TLS image space allocated, I have stopped talking about
> those since we know they work correctly. When I speak about DSOs I
> speak solely about those loaded via dlopen.
>  
> > I suppose you may be speaking of modules that assume IE is usable to
> > access TLS of some module (itself, or any other), even though the
> > assumption is not warranted.
> 
> Yes. We have libraries in the OS using GCC constructs to force IE for
> certain __thread variables. We need to move them away from those uses,
> but we need to ensure a good migration path e.g. same speed, continues
> to work until we migrate all DSOs etc.
> 
> > So assume you load a module A defining a TLS section, and conservatively
> > assign it to dynamic TLS, for whatever reason.  Then you load a module B
> > that expects A to be in static TLS, because it uses IE to reference its
> > TLS symbols.  Kaboom.  The “conservative” approach just broke what would
> > have worked if you hadn't gratuitously taken it out of TLS.
> 
> I don't think this scenario is supported by the present tools.
> 
> The only use I have ever seen for IE in a DSO is optimal access of local
> thread variables.
> 
> If the static linker could see B accessing A's TLS using IE (this requires
> B to be listed as a dependency or in the link list) then both A and B
> would have to use static TLS, and that forces both into the static TLS
> image. It would then be wrong for the dynamic loader to load A as dynamic TLS.
> 
> If you do think it can happen, please start a distinct thread and we can
> talk about it and look into the source.
> 
> > Now, of course when you load A you don't know whether module B is going
> > to be loaded, and whether it will require A to use static TLS or not, or
> > whether module C would fail to load afterwards because there's not
> > enough static TLS space for its own TLS section, and it uses IE even
> > though it's NOT being loaded as a dependency of the IE.
> > 
> > So not saving static TLS space for later use may expose breakage in
> > subsequently loaded modules, whereas saving it may equally expose
> > breakage in subsequently loaded modules, but waste static TLS space and
> > *significantly* impact performance of TLS Descriptor-using modules that
> > could have got IE-like performance.  That sounds like a losing strategy
> > to me.
> 
> The only valid sequences I know of are:
> 
> (a) Module uses static TLS and is known by the static linker and has
>     static TLS image space allocated.
> 
> (b) Module uses static TLS and is not known to the static linker, accesses
>     only its own variables with IE, and has no static TLS image space
>     reserved for it.
> 
> The optimizing use of static TLS by TLS descriptors breaks (b).
> 
> > Greedy allocation doesn't guarantee optimal results, but it won't break
> > anything that isn't already broken, and if and when such breakage is
> > exposed, switching the broken modules to TLS Descriptors will get them
> > nearly identical performance for TLS references that happen to land in
> > static TLS, but that will NOT cause the library to fail to load
> > otherwise: it will just get GD-like performance.
> 
> What if the module author can never tolerate GD-like performance and
> would rather it fail than load and run slowly e.g. MESA/OpenGL?

This is not the module author's decision to make. If the user wants to
run, the user should be able to run. And the performance difference is
not measurable anyway except in artificial benchmarks that do nothing
but hammer TLS accesses without even using the data they read.

> Remember, and keep in mind our users, we do this for them, and some
> of them have strict performance requirements. We should not lightly
> tell them what they want is wrong.

Then like I said, you should not give the user an error just because
a hardware/driver vendor doesn't want to look bad (slow) and wrongly
thinks the dynamic model will make the driver look slow.

> For example our work on tunables to allow users to tweak up the size
> of static TLS image surplus is one potential solution to this problem.
> 
> It might also be possible to try to make the static TLS image size a single
> mapping that we might possibly be able to grow with kernel help?

The only way to make it growable is to reserve space to begin with. In
any case it's not practical for the dynamic linker to "stop the
world" and probe whether each thread would have space to grow its
static TLS mapping in-place.

> 
> > So, in addition to stopping wasting DTV entries with static TLS
> > segments, I suggest not papering over the problem in glibc, but rather
> > proactively convert dlopenable libraries that use IE to access TLS
> > symbols that are not guaranteed to have been loaded along with the IE to
> > use TLS Descriptors in General Dynamic mode.
> 
> I agree that this is the correct solution, but *today* we have problems
> loading user applications. I see no options but to follow a staggered
> strategy:
> 
> (a) Immediately increase DTV surplus size.
> 
> 	- Distribution patches are doing this already to keep applications working.

No objection.

> (b) Implement static TLS support without needing a DTV increase.
> 
> 	- Reduces memory usage of DTV. Small optimization.
> 
>     and
> 
>     Remove faulty heuristics around not wanting to increase DTV size.

Seems okay; I would defer to Alexandre's opinion.

> (c) Approach upstream projects with patches to convert to TLS descriptors.
> 
> When we do (c), can it be done on a per-variable basis?
> 
> Can I convert one variable at a time to be a TLS descriptor?
> 
> As is done currently with the gcc attributes for TLS mode?

Why would you want to? If I understand correctly, your idea is that
current libraries are using GD for most TLS, and IE for specific
variables, and they'd want to keep using the same (non-TLSDESC) GD for
most TLS but TLSDESC for specific variables. This makes no sense.
Simply using TLSDESC (which technically is GD model) for everything is
an improvement with no drawbacks.

> > In order to ease this sort of transition, I've been thinking of
> > introducing an alias access model in GCC, that would map to GD if TLS
> > Descriptors are enabled, or to the failure-prone IE with old-style TLS.
> > Then those who incorrectly use IE today can switch to that; it will be a
> > no-op on arches that don't have TLSDesc, or that don't use it by
> > default, but it will make the fix automatic as each platform switches to
> > the superior TLS design.
> 
> Oh. Right. If upstream can't use TLS descriptors everywhere, then it
> may find itself failing to compile on certain targets that don't support
> descriptors.

TLSDESC is supported for the archs where it's likely to matter. On
some of the ones where it's not, even reading the thread pointer is
normally a trap to kernelspace, so whatever userspace overhead there
is in getting the offset for a TLS variable is going to be utterly
irrelevant (dominated by the trap).

> >> In Fedora we disallow greedy consumption of TLS descriptors on any
> >> targets that have TLS descriptors on by default.
> > 
> > Oh, wow, this is such a great move that it makes TLS Descriptors's
> > performance the *worst* of all existing access models.  If we want to
> > artificially force them into their worst case, we might as well get rid
> > of them altogether!
> 
> If it doesn't work and causes applications to stop working
> I'll disable it, and I did :-)
> 
> > Whom should I thank for making my work appear to suck?  :-(
> 
> Me. I didn't do it because I think it was the right solution.
> I did it because users need working applications to do the tasks
> they chose Fedora for.
> 
> It is not sufficient for me to say: "Wait a few months while I
> fix the fundamental flaws in the education of users and the usage
> of our tools." :-}

I'm with Alexandre on this. In this case it seems like your "quick
fix" may not just be neutral but actually _discouraging_ people from
switching to the better system.

> > :-P :-)
> > 
> >> We need to turn on TLS descriptors by default on x86_64 such
> >> that we can get the benefits there, and start moving DSOs away
> >> from TLS IE.
> > 
> > Hallelujah! :-)
> 
> You know I know what the right answer is, but we have to get there
> one step at a time with working applications the whole way.
> 
> In summary looks like we need:
> 
> (a) Immediately increase DTV surplus size.
> (b) Implement static TLS support without needing a DTV increase.
> (c) Remove faulty heuristics around not wanting to increase DTV size.
> (d) Add __attribute__((tls_model("go-fast"))) to gcc that defaults to
>     IE if TLS Desc is not present.
> (e) Approach upstream projects with patches to convert to TLS descriptors
>     using go-fast model.
> 
> Does this plan make sense?

I think (d) should be omitted, and a step (f) should be added: patch
binutils to disallow the creation of .so files with IE TLS.

Rich
  
Alexandre Oliva Oct. 10, 2014, 12:22 p.m. UTC | #6
On Oct  9, 2014, "Carlos O'Donell" <carlos@redhat.com> wrote:

> On 10/07/2014 02:15 AM, Alexandre Oliva wrote:
>> On Oct  6, 2014, "Carlos O'Donell" <carlos@redhat.com> wrote:
>> 
>>> This code is a *heuristic*, it basically fails the load if there
>>> are no DTV slots left, even though we can still do the following:
>> 
>>> (a) Grow the DTV dynamically as many times as we want, with the
>>> generation counter causing other threads to update.
>> 
>> or
>> 
>> (a)' Stop wasting DTV entries with modules assigned to static TLS.
>> There's no reason whatsoever to do so.
>> 
>> This optimization is even described in the GCC Summit article in
>> which I first proposed TLS Descriptors.  Unfortunately, I never
>> got around to implementing it.

> I was not aware of this, but if possible is a great solution.

It might seem like a solution for the glibc bug that arbitrarily limits
the number of DTV entries for modules assigned to Static TLS, yes.  But
that's just glibc behaving silly.

If we didn't have such a blatant bug, it would be just an optimization.

>>> and
>> 
>>> (b) Allocate from the static TLS image surplus until it is exhausted.
>> 
>> 
>>> - Remove the check above, allowing the code to grow the DTV as large
>>> as it wants for as many STATIC_TLS modules as it wants.
>> 
>> We don't really need to grow the DTV right away.  If we have static TLS,
>> we could just leave the DTV alone.  No code will ever access the
>> corresponding DTV entry.  If any code needs to update the DTV, because
>> of some module assigned to dynamic TLS, then, and only then, should the
>> DTV grow.

> I had not considered this optimization, but I guess it would work.

That's another optimization that helps work around a bug.  I'd rather we
fixed the bug and stop limiting the number of DTV entries for Static
TLS.
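
To make the lazy-growth idea concrete, here is a toy model in C. It is
only a sketch under stated assumptions: none of the names below
(toy_dtv, toy_module, toy_dtv_reserve, toy_tls_get) are glibc's, and
generation counters and locking are left out. It shows a DTV that grows
only when a thread first reaches a module through the dynamic path, so a
module placed in static TLS never forces growth, however large its
module ID is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: these types and names are hypothetical, not glibc's. */
struct toy_dtv {
    size_t capacity;   /* number of slots currently allocated */
    void **slots;      /* per-module dynamic TLS block pointers */
};

struct toy_module {
    size_t modid;      /* module ID, used as the DTV index */
    bool static_tls;   /* true if the block lives in the static TLS area */
    size_t tls_size;   /* size of the module's TLS block in bytes */
};

/* Grow the DTV so that index modid is valid.  Returns false on OOM. */
bool toy_dtv_reserve(struct toy_dtv *dtv, size_t modid)
{
    if (modid < dtv->capacity)
        return true;
    size_t new_cap = modid + 1;
    void **p = realloc(dtv->slots, new_cap * sizeof *p);
    if (p == NULL)
        return false;
    for (size_t i = dtv->capacity; i < new_cap; i++)
        p[i] = NULL;   /* new slots start out unallocated */
    dtv->slots = p;
    dtv->capacity = new_cap;
    return true;
}

/* Dynamic-path lookup.  Static TLS modules are reached through a fixed
   offset from the thread pointer instead, so they never get here and
   never cause the DTV to grow. */
void *toy_tls_get(struct toy_dtv *dtv, const struct toy_module *m)
{
    assert(!m->static_tls);
    if (!toy_dtv_reserve(dtv, m->modid))
        return NULL;
    if (dtv->slots[m->modid] == NULL)
        dtv->slots[m->modid] = calloc(1, m->tls_size);
    return dtv->slots[m->modid];
}
```

In this model, loading any number of static TLS modules leaves
toy_dtv.capacity unchanged; only the first dynamic access pays for
growth.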


>>> WARNING: On AArch64 or any architecture that uses the generic-ish
>>> code for TLS descriptors, you will have further problems. There
>>> the descriptors consume static TLS image greedily, which means
>>> you may find that there is zero static TLS image space when you
>>> go to dlopen an application.

>> That's perfectly valid behavior, that exposes the bug in libraries that
>> are expected to be loadable after a program starts (say, by dlopen) when
>> relocations indicate they had to be brought in by Initial Exec.

> I did not argue that it was invalid behaviour. I only wished to warn
> the reader that the situation at present will result in broken applications.

The DTV arbitrary limit bug, yes, it's a bug that needs fixing right
away.

As for the applications, if they dlopen libs that use IE TLS to access
variables in non-IE modules, they are broken already.  There's nothing
we can do in glibc to unbreak them.  Any limit we set on the static TLS
area can be exceeded if you load enough of these libraries.  Using
static TLS when it's not strictly necessary just makes their bug more
visible.  But it's their bug.  dlopen of IE is oxymoronic.

> We, the tool authors, allowed this situation to get out of hand, and now we
> have both pieces when it breaks, and must do our level best to ensure
> things continue to work while providing a way out of the situation.
 
So, we tell libs that abuse IE to switch to TLS Desc GD on platforms
where TLS Descriptors are implemented.  On other platforms, they might
get lucky with IE, but they're still broken and asking for trouble.

> When you use dlopen with IE you run into the problem that
> we see today with TLS descriptors.

Make it “problems”.  There's the arbitrary limit on DTV size, that's a
bug in glibc, and there's the unwarranted assumption that the Static TLS
area size is infinite, that's a bug in libs using IE to access dlopened
TLS.

> You have a desire to keep the application working with the existing
> set of ~40 DSOs on the system that use IE,

I don't, really.  They're buggy, and their fix is trivial: switching to
TLS Descriptors.

> and we have a desire to keep TLS descriptors optimal.

Once they fix their bug, that would follow naturally.

> If we keep TLS descriptors optimal, they may consume all static TLS
> image and result in an application crash if a dlopen'd DSO uses
> IE, and I wish to avoid that crash.

Sorry, you can't, unless you come up with a way to make the static TLS
area infinite.

Making it configurable would help work around the bug in the libs, but
if the libs can't fall back to dynamic TLS, like TLS Descriptors do, load
enough of them and you'll run into the error.  It's not fixable.
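
The argument can be sketched with a toy allocator in C. This is an
illustration under stated assumptions, not the loader's actual logic:
the names (toy_static_area, toy_place, TOY_IE, TOY_TLSDESC) are
hypothetical and alignment is ignored. A finite static TLS surplus is
handed out first come, first served; a TLS Descriptor module that does
not fit falls back to dynamic TLS, while an IE module that does not fit
cannot be loaded at all.

```c
#include <stddef.h>

/* Toy model only: hypothetical names, alignment ignored. */
enum toy_access { TOY_IE, TOY_TLSDESC };
enum toy_result { TOY_STATIC, TOY_DYNAMIC, TOY_LOAD_FAILED };

struct toy_static_area {
    size_t surplus;    /* bytes of static TLS still unassigned */
};

/* Try to place a late-loaded module's TLS block of the given size. */
enum toy_result toy_place(struct toy_static_area *a, size_t size,
                          enum toy_access how)
{
    if (size <= a->surplus) {
        a->surplus -= size;    /* greedy: take static space if it fits */
        return TOY_STATIC;
    }
    /* No room left: descriptors can fall back, IE cannot. */
    return how == TOY_TLSDESC ? TOY_DYNAMIC : TOY_LOAD_FAILED;
}
```

Raising the initial surplus in this model only moves the point at which
the first TOY_LOAD_FAILED appears; it never removes it.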

> When I speak about DSOs I speak singularly about those loaded via
> dlopen.
 
Let's call them non-IE modules, dlopened modules, or late-loaded
modules?  LLDSO?  non-IE DSO?

> Yes. We have libraries in the OS using GCC constructs to force IE for
> certain __thread variables. We need to move them away from those uses,
> but we need to ensure a good migration path e.g. same speed, continues
> to work until we migrate all DSOs etc.

I don't understand the bit about migration path.  One can drop the
initial exec buggy annotation and use -mtls-dialect=gnu2 on x86 or
x86_64, and the problem of running out of static TLS space will go away.
I'm not sure this is enough to fix our bug of arbitrarily limiting the
DTV size used by static TLS from non-IE modules.

>> So assume you load a module A defining a TLS section, and conservatively
>> assign it to dynamic TLS, for whatever reason.  Then you load a module B
>> that expects A to be in static TLS, because it uses IE to reference its
>> TLS symbols.  Kaboom.  The “conservative” approach just broke what would
>> have worked if you hadn't gratuitously taken it out of TLS.

> I don't think this scenario is supported by the present tools.

Why not?

There's nothing that stops a libA from using the IE access model to
access symbols not defined in itself or any of its dependencies.

And there's nothing that stops another “unrelated” libB from defining
those symbols.

Without TLS Descriptors' greedy use of static TLS, you have to arrange for
libB to be initially-loaded for it to get static TLS, otherwise dlopen
will fail because the IE model can't be satisfied.  But TLS Descriptors'
greedy use of TLS made this more flexible, and it's been around for
almost a decade.  'cept now, if your patch made it, it's broken.  People
who switched to GNU2 TLS in order for this to work now get a failure.

> The only uses I have ever seen for IE in a DSO is optimal access of
> local thread variables.

I wouldn't be surprised if they exist anyway.  Consider a primary
library that defines the TLS variables, and a separately-loadable plugin
that accesses them.  Even if it were to list the primary library as a
direct dependency, if you load the primary library first, the loader
makes a decision of where to place its TLS segment at that point.  It
doesn't wait to see whether you load a subsequent plugin that demands IE
to access the primary lib's TLS vars.

> If you do think it can happen please start a distinct thread and we talk
> about it and look into the source.

Uhh...  Why use a distinct thread for the same topic?  It's not like
this is a departure to a different topic, it's just proof that the
heuristics proposed to alleviate or fix the problem are broken.  It
might make some cases work, but at the expense of breaking others that
have worked for a long time.

> What if the module author can never tolerate GD-like performance and
> would rather it fail than load and run slowly e.g. MESA/OpenGL?

Then they use IE and make the library an IE dep.  If they don't, and it
fails because static TLS was exhausted, they get the failure they asked
for.

> For example our work on tunables to allow users to tweak up the size
> of static TLS image surplus is one potential solution to this problem.

It's a workaround, not a solution.  Unbounded static TLS would be a
solution, but that's not possible.

> It might also be possible to try to make the static TLS image size a single
> mapping that we might possibly be able to grow with kernel help?

We'd still have to reserve a limited amount of unmapped VM next to each
thread's static TLS area.  This would enable some growth without using
more memory pages, but it would still be limited in size, because we
can't move it: we could only grow it into the area reserved for its
growth.
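
A minimal sketch of that reserve-then-grow scheme, using plain mmap and
mprotect on Linux. The names (toy_region, toy_region_init,
toy_region_grow) are hypothetical and this is not how glibc lays out
static TLS; it only shows that growth works in place up to the reserved
size and then fails.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Toy model only: hypothetical names, one region per thread assumed. */
struct toy_region {
    unsigned char *base;   /* start of the reserved address range */
    size_t reserved;       /* bytes of address space set aside */
    size_t usable;         /* bytes currently readable and writable */
};

/* Reserve address space without making any of it accessible yet. */
bool toy_region_init(struct toy_region *r, size_t reserve_bytes)
{
    void *p = mmap(NULL, reserve_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return false;
    r->base = p;
    r->reserved = reserve_bytes;
    r->usable = 0;
    return true;
}

/* Grow the usable prefix in place.  The region cannot move, so a
   request beyond the reservation has to fail. */
bool toy_region_grow(struct toy_region *r, size_t new_usable)
{
    assert(r->base != NULL);
    if (new_usable > r->reserved)
        return false;
    if (new_usable <= r->usable)
        return true;
    if (mprotect(r->base, new_usable, PROT_READ | PROT_WRITE) != 0)
        return false;
    r->usable = new_usable;
    return true;
}
```

The reservation costs address space rather than memory pages, but the
bound is still fixed at the time the thread is created.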

>> So, in addition to stopping wasting DTV entries with static TLS
>> segments, I suggest not papering over the problem in glibc, but rather
>> proactively convert dlopenable libraries that use IE to access TLS
>> symbols that are not guaranteed to have been loaded along with the IE to
>> use TLS Descriptors in General Dynamic mode.

> I agree that this is the correct solution, but *today* we have problems
> loading user applications.

But why can't the broken libraries be fixed right away?

> Can I convert one variable at a time to be a TLS descriptor?

Moo.  The question doesn't make sense.  The variable is unrelated to the
access model used to access it.  It could even be defined in a separate
module.

You could in theory specify the access model to use on a per-use basis.

Currently, however, because the annotation is placed on the variable
declaration, and there can only be one declaration of each variable per
translation unit, you can only choose the access model to use for a
variable on a per-translation-unit basis.

However, GNU2 is not an access model, it's an alternate set of access
models.  You can't currently specify “I want to use TLS Descriptor-based
Global Dynamic for this variable in this translation unit” unless you
switch to the GNU2 TLS dialect, and if you do, you don't have to specify
anything.

However, if you use stricter access models such as IE in one translation
unit, the linker will relax less-strict access models in other units to
the stricter one, as it links them into a single SO.
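
For reference, this is roughly what the per-declaration annotation looks
like with GCC. It is a sketch: the variable and function names are made
up for illustration. The first variable gets whatever model the compiler
flags and dialect select; the second is pinned to Initial Exec for every
access in this translation unit.

```c
/* Hypothetical names; meant to be built into a shared object, e.g.
   with -fPIC -shared, optionally with -mtls-dialect=gnu2 on x86.  */

/* No annotation: the access model follows the compiler flags, which
   for PIC code normally means Global Dynamic (descriptor-based under
   the GNU2 dialect). */
__thread int counter_default;

/* Annotated: every access in this translation unit uses Initial Exec. */
__thread int counter_ie __attribute__((tls_model("initial-exec")));

int bump_default(void)
{
    return ++counter_default;
}

int bump_ie(void)
{
    return ++counter_ie;
}
```

Dropping the attribute from counter_ie and building with
-mtls-dialect=gnu2 is the change being recommended to library authors.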

>> In order to ease this sort of transition, I've been thinking of
>> introducing an alias access model in GCC, that would map to GD if TLS
>> Descriptors are enabled, or to the failure-prone IE with old-style TLS.
>> Then those who incorrectly use IE today can switch to that; it will be a
>> no-op on arches that don't have TLSDesc, or that don't use it by
>> default, but it will make the fix automatic as each platform switches to
>> the superior TLS design.

> Oh. Right. If upstream can't use TLS descriptors everywhere, then it
> may find itself failing to compile on certain targets that don't support
> descriptors.

> The alias access model in GCC would be something like:
> `__attribute__((tls_model("go-fast")))`?

Yeah.  I meant to ask for suggestions on the spelling, but I forgot.

go-fast is not a good one, though; anyone familiar with TLS access
models would assume it means LE, since that's the fastest access model.
But then, unless both the variable and the access end up in the main
executable, the linker will error out.

Maybe "desc_or_initial"?
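
Until such an alias exists in GCC, a project could approximate it with
the preprocessor. The sketch below is an assumption-laden illustration:
HAVE_TLSDESC_DEFAULT is a hypothetical macro that the project's build
system would have to define itself, and TLS_DESC_OR_INITIAL and the
variable are made-up names.

```c
/* HAVE_TLSDESC_DEFAULT is hypothetical: the build system would define
   it when the target uses TLS Descriptors (for example when building
   with -mtls-dialect=gnu2 on x86).  */
#if defined(HAVE_TLSDESC_DEFAULT)
/* Leave the default model alone: GD through descriptors. */
# define TLS_DESC_OR_INITIAL
#else
/* Old-style TLS: keep the failure-prone IE annotation. */
# define TLS_DESC_OR_INITIAL __attribute__((tls_model("initial-exec")))
#endif

__thread int dispatch_depth TLS_DESC_OR_INITIAL;

int enter_dispatch(void)
{
    return ++dispatch_depth;
}

int leave_dispatch(void)
{
    return --dispatch_depth;
}
```

Unlike a real compiler-level alias, this relies on the build system
knowing which dialect is in use.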

>>> In Fedora we disallow greedy consumption of TLS descriptors on any
>>> targets that have TLS descriptors on by default.

>> Oh, wow, this is such a great move that it makes TLS Descriptors's
>> performance the *worst* of all existing access models.  If we want to
>> artificially force them into their worst case, we might as well get rid
>> of them altogether!

> If it doesn't work and causes applications to stop working
> I'll disable it, and I did :-)

Are you really speaking of the same thing?

I mean, there are two different related problems here: exhausting the
DTV, and exhausting the static TLS space.  AFAIK, all you did was work
around the former, by growing the DTV surplus.  Did you ALSO disable
TLSDesc's dynamic relaxation of GD to IE, by preventing greedy use of
the TLS area?  I recently saw patches proposed to that end, months ago,
but I didn't notice any approved patches to that end.

>> Whom should I thank for making my work appear to suck?  :-(

> Me. I didn't do it because I think it was the right solution.
> I did it because users need working applications to do the tasks
> they chose Fedora for.

If you did that, you broke long-existing glibc features so that they
didn't have to fix the bugs in their own libs.  You traded a failure
caused by an application bug for a failure in a bugless application.

It's not the right solution.  It's not even the wrong solution.  It's
not a solution at all.

> It is not sufficient for me to say: "Wait a few months while I
> fix the fundamental flaws in the education of users and the usage
> of our tools." :-}
 
How about “lib authors, use -mtls-dialect=gnu2 and drop the unwarranted
initial_exec tls model selection”?

> (a) Immediately increase DTV surplus size.
> (b) Implement static TLS support without needing a DTV increase.
> (c) Remove faulty heuristics around not wanting to increase DTV size.
> (d) Add __attribute__((tls_model("go-fast"))) to gcc that defaults to
>     IE if TLS Desc is not present.
> (e) Approach upstream projects with patches to convert to TLS descriptors
>     using go-fast model.

Heh.  I guess my suggestion is that we go backwards in your list.  IE
abuse is not our bug.

(b) requires little more than dropping some incorrect asserts.  And once
we get to that (remember, going backwards), (a) is completely
unnecessary: why waste per-thread memory for everyone if it's not
needed?
  
Alexandre Oliva Oct. 10, 2014, 12:37 p.m. UTC | #7
On Oct  9, 2014, Rich Felker <dalias@libc.org> wrote:

> On Thu, Oct 09, 2014 at 10:48:31AM -0400, Carlos O'Donell wrote:

>> What if the module author can never tolerate GD-like performance and
>> would rather it fail than load and run slowly e.g. MESA/OpenGL?

> This is not the module author's decision to make. If the user wants to
> run, the user should be able to run.

Granted: user can always LD_PRELOAD the module that defines the variable
to ensure it gets to IE.

> I think (d) should be omitted, and a step (f) should be added: patch
> binutils to disallow the creation of .so files with IE TLS.

I wouldn't go as far as disallowing it, since there are perfectly
legitimate cases of IE accesses in dynamic libs.  The most obvious
example is glibc plugins that access libdl.so or libc.so symbols, that
are always IE, but any app that provides symbols in IE modules and wants
to use IE to access them even from plugins should have no problem doing
so.  It's not like TLSDesc GD is as efficient as IE; it's just pretty
close if the variable is in static TLS.  If someone wishes to structure
their app to squeeze a little bit of extra performance by making
legitimate uses of IE, why not let them?
  
Rich Felker Oct. 10, 2014, 2:33 p.m. UTC | #8
On Fri, Oct 10, 2014 at 09:37:22AM -0300, Alexandre Oliva wrote:
> On Oct  9, 2014, Rich Felker <dalias@libc.org> wrote:
> 
> > On Thu, Oct 09, 2014 at 10:48:31AM -0400, Carlos O'Donell wrote:
> 
> >> What if the module author can never tolerate GD-like performance and
> >> would rather it fail than load and run slowly e.g. MESA/OpenGL?
> 
> > This is not the module author's decision to make. If the user wants to
> > run, the user should be able to run.
> 
> Granted: user can always LD_PRELOAD the module that defines the variable
> to ensure it gets to IE.

Yes, but that actually doesn't work right when the library's symbols
are never intended to reach the global namespace because it comes in
as a dependency of an RTLD_LOCAL dlopen. For this usage to be
reliable, it would be necessary to have something like LD_PRELOAD
("LD_PREREQ"?) but that opens the library and allocates static TLS for
it at startup time, but defers ctor execution and visibility of the
symbols until something explicitly loads/needs it.

> > I think (d) should be omitted, and a step (f) should be added: patch
> > binutils to disallow the creation of .so files with IE TLS.
> 
> I wouldn't go as far as disallowing it, since there are perfectly
> legitimate cases of IE accesses in dynamic libs.  The most obvious
> example is glibc plugins that access libdl.so or libc.so symbols, that
> are always IE, but any app that provides symbols in IE modules and wants
> to use IE to access them even from plugins should have no problem doing
> so.

I agree with this claim, but question whether there's any practical
benefit of doing so.

> It's not like TLSDesc GD is as efficient as IE; it's just pretty
> close if the variable is in static TLS.  If someone wishes to structure
> their app to squeeze a little bit of extra performance by making
> legitimate uses of IE, why not let them?

If you're actually doing anything with the data rather than just
hammering TLS accesses with no intervening code, I think the
difference will be so small that it's borderline on not even being
statistically significant without a huge number of runs.

Rich
  
Carlos O'Donell Oct. 10, 2014, 5:41 p.m. UTC | #9
On 10/10/2014 08:22 AM, Alexandre Oliva wrote:
>> (a) Immediately increase DTV surplus size.
>> (b) Implement static TLS support without needing a DTV increase.
>> (c) Remove faulty heuristics around not wanting to increase DTV size.
>> (d) Add __attribute__((tls_model("go-fast"))) to gcc that defaults to
>>     IE if TLS Desc is not present.
>> (e) Approach upstream projects with patches to convert to TLS descriptors
>>     using go-fast model.
> 
> Heh.  I guess my suggestion is that we go backwards in your list.  IE
> abuse is not our bug.

:-)
 
> (b) requires little more than dropping some incorrect asserts.  And once
> we get to that (remember, going backwards), (a) is completely
> unnecessary: why waste per-thread memory for everyone if it's not
> needed?
 
I've outlined a slightly different path in the other email to Rich.

See if you like that set of steps better.

Cheers,
Carlos.
  

Patch

diff -urN glibc-2.19-886-gdd763fd/sysdeps/generic/ldsodefs.h glibc-2.19-886-gdd763fd.mod/sysdeps/generic/ldsodefs.h
--- glibc-2.19-886-gdd763fd/sysdeps/generic/ldsodefs.h  2014-08-21 01:00:55.000000000 -0400
+++ glibc-2.19-886-gdd763fd.mod/sysdeps/generic/ldsodefs.h      2014-09-04 19:29:42.929692810 -0400
@@ -388,8 +388,18 @@ 
    have to iterate beyond the first element in the slotinfo list.  */
 #define TLS_SLOTINFO_SURPLUS (62)

-/* Number of additional slots in the dtv allocated.  */
-#define DTV_SURPLUS    (14)
+/* Number of additional allocated dtv slots.  This was initially
+   14, but problems with python, MESA, and X11's uses of static TLS meant
+   that most distributions were very close to this limit when they loaded
+   dynamically interpreted languages that used graphics. The simplest
+   solution was to roughly double the number of slots. The actual static
+   image space usage was relatively small, for example in MESA you
+   had only two dispatch pointers for a total of 16 bytes.  If we hit up
+   against this limit again we should start a campaign with the
+   distributions to coordinate the usage of static TLS.  Any user of this
+   resource is effectively coordinating a global resource since this
+   surplus is allocated for each thread at startup.  */
+#define DTV_SURPLUS    (32)

   /* Initial dtv of the main thread, not allocated with normal malloc.  */
   EXTERN void *_dl_initial_dtv;