Fixing the distribution problems with TLS and DTV_SURPLUS slots.

  Adam,

I'm sitting on this patch in Fedora, and you asked me to send it
upstream. Unfortunately I don't think it is the right solution for
upstream.

Firstly, please don't respond with "But DSOs using TLS IE accesses
are not allowed." It's allowed because the compiler and linker let
people use it, and we should have either prevented it or spent more
time educating our users. Either way there are valid uses of it, and
glibc itself along with other core libraries want the speed that it
offers. In the future we want to switch them to TLS descriptors, which
give you the same speed, but the momentum is there and we'd have to
patch every MESA to get rid of it, so we'd be done 10 years down the
road. Please see the "WARNING:" below about TLS descriptors and
AArch64 (and likely other TLS descriptor targets).

The patch in Fedora is this:
~~~
#
# This is an experimental patch that should go into rawhide and
# Fedora 21 to fix failures where python applications fail to 
# load graphics applications because of the slot usages for TLS.
# This should eventually go upstream.
#
# - Carlos O'Donell
#
~~~

The error users are seeing is this:
"dlopen: cannot load any more object with static TLS"

This is triggered by this code:

elf/dl-open.c:

  /* We need a second pass for static tls data, because _dl_update_slotinfo
     must not be run while calls to _dl_add_to_slotinfo are still pending.  */
  for (unsigned int i = first_static_tls; i < new->l_searchlist.r_nlist; ++i)
    {
      struct link_map *imap = new->l_searchlist.r_list[i];

      if (imap->l_need_tls_init
          && ! imap->l_init_called
          && imap->l_tls_blocksize > 0)
        {
          /* For static TLS we have to allocate the memory here and
             now.  This includes allocating memory in the DTV.  But we
             cannot change any DTV other than our own.  So, if we
             cannot guarantee that there is room in the DTV we don't
             even try it and fail the load.

             XXX We could track the minimum DTV slots allocated in
             all threads.  */
          if (! RTLD_SINGLE_THREAD_P && imap->l_tls_modid > DTV_SURPLUS)
            _dl_signal_error (0, "dlopen", NULL, N_("\
cannot load any more object with static TLS"));

          imap->l_need_tls_init = 0;
#ifdef SHARED
          /* Update the slot information data for at least the
             generation of the DSO we are allocating data for.  */
          _dl_update_slotinfo (imap->l_tls_modid);
#endif

          GL(dl_init_static_tls) (imap);
          assert (imap->l_need_tls_init == 0);
        }
    }

This code is a *heuristic*: it fails the load if there are no
surplus DTV slots left, even though we can still do the following:

(a) Grow the DTV dynamically as many times as we want, with the
    generation counter causing other threads to update.

and

(b) Allocate from the static TLS image surplus until it is exhausted.

The heuristic prevents both (a) and (b) once all the surplus slots
are taken.

A better solution would be:
- Keep the use of DTV_SURPLUS to avoid immediately having to reallocate
  the DTV when you dlopen a couple of modules.
- Remove the check above, allowing the code to grow the DTV as large
  as it wants for as many STATIC_TLS modules as it wants.
- Restrict only on the size of static TLS image space and error when
  that is exhausted.
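
To make the difference concrete, here is a toy model of the two
policies. It is only a sketch: every name in it (model_thread,
model_load_current, model_load_proposed, MODEL_DTV_SURPLUS) is
hypothetical and does not exist in glibc, the DTV is reduced to a
slot count, and the multi-threaded case is assumed throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in glibc.  */
#define MODEL_DTV_SURPLUS 14    /* Spare DTV slots allocated up front.  */

struct model_thread
{
  size_t dtv_slots;             /* Slots currently allocated in the DTV.  */
  size_t static_tls_free;       /* Bytes of surplus static TLS left.  */
};

/* Current heuristic: refuse the load as soon as the module ID exceeds
   the surplus, no matter how much static TLS space is still free.  */
static bool
model_load_current (const struct model_thread *t, size_t modid,
                    size_t blocksize)
{
  assert (t != NULL);
  (void) blocksize;
  return modid <= MODEL_DTV_SURPLUS;
}

/* Proposed policy: the only hard limit is the static TLS space.  If
   the DTV is too small it is grown, with a fresh surplus so that the
   next few loads do not need to grow it again.  */
static bool
model_load_proposed (struct model_thread *t, size_t modid,
                     size_t blocksize)
{
  assert (t != NULL);
  if (blocksize > t->static_tls_free)
    return false;               /* Static TLS space is exhausted.  */
  if (modid >= t->dtv_slots)
    t->dtv_slots = modid + 1 + MODEL_DTV_SURPLUS;
  t->static_tls_free -= blocksize;
  return true;
}
```

In this model a module with ID 15 is rejected by model_load_current
but accepted by model_load_proposed as long as its block still fits
in the remaining static TLS space.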

The most common application framework to trigger this is
Python. There are more than 14 libraries in Fedora using TLS
(14 being the default DTV_SURPLUS); in fact there are ~40, which
is why I raised the DTV_SURPLUS limit to 32 in Fedora (several of
them can't be loaded simultaneously).

Raising the DTV_SURPLUS limit is a bandaid, with the added
effect of optimizing performance for Python at the cost of
18 * sizeof (dtv_t) bytes of extra DTV entries per thread,
which avoids the DTV realloc.
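
To put a rough number on that per-thread cost, the sketch below
computes the extra bytes for going from a surplus of 14 to 32.
The model_dtv_t union is my assumption of what a DTV entry looks
like (a counter overlaid with a pointer plus a flag); check the
real dtv_t definition for your target before relying on the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed shape of one DTV entry; a hypothetical stand-in for the
   real dtv_t, which may differ between glibc versions and targets.  */
typedef union
{
  size_t counter;
  struct
  {
    void *val;
    bool is_static;
  } pointer;
} model_dtv_t;

/* Extra bytes of DTV each thread pays when the surplus is raised
   from OLD_SURPLUS to NEW_SURPLUS entries.  */
static size_t
extra_dtv_bytes_per_thread (size_t old_surplus, size_t new_surplus)
{
  assert (new_surplus >= old_surplus);
  return (new_surplus - old_surplus) * sizeof (model_dtv_t);
}
```

Calling extra_dtv_bytes_per_thread (14, 32) gives 18 entries' worth
of model_dtv_t, so the cost scales with the entry size and the
number of threads, not with the number of libraries actually loaded.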

I'm not going to have time right now to implement the better
solution. What I'm looking for is expert advice on what to do
here.

The better solution requires considerably more testing, because
now we're doing something we've never done before: allocating
up to the limit of the surplus static TLS image.

Do we grow the DTV_SURPLUS knowing it's a bandaid?

WARNING: On AArch64 or any architecture that uses the generic-ish
code for TLS descriptors, you will have further problems. There
the descriptors consume static TLS image greedily, which means
you may find that there is zero static TLS image space left when
you go to dlopen a DSO. We need to further subdivide the static
TLS image space into "reserved for general use" and "reserved for
DSO load uses", with the TLS descriptors allocating from the
general use space only. On Fedora for AArch64 this caused no end
of headaches: attempts to load DSOs that use TLS IE failed because
so much of the implementation used TLS descriptors that the surplus
static TLS image space was gone, and while descriptors can be
allocated dynamically, the DSOs can't. In Fedora we disallow greedy
consumption of static TLS by TLS descriptors on any targets that
have TLS descriptors on by default.

Which leads me to the last point: we need to turn on TLS
descriptors by default on x86_64 so that we can get the benefits
there, and start moving DSOs away from TLS IE.
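
A rough sketch of the subdivision I have in mind follows. All names
(model_static_tls, model_place_tlsdesc, model_place_ie_dso) are
hypothetical, and having IE DSOs draw only from their reserved pool
is a simplifying assumption of this sketch, not a settled design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical split of the surplus static TLS image space.  */
struct model_static_tls
{
  size_t general_free;  /* Bytes for opportunistic users (TLS descriptors).  */
  size_t dso_free;      /* Bytes reserved for dlopen'ed DSOs using TLS IE.  */
};

/* A TLS descriptor may only take space from the general pool.
   Returns false when the caller has to fall back to dynamic TLS,
   which descriptors are able to do.  */
static bool
model_place_tlsdesc (struct model_static_tls *s, size_t size)
{
  assert (s != NULL);
  if (size > s->general_free)
    return false;
  s->general_free -= size;
  return true;
}

/* A DSO using TLS IE has no dynamic fallback, so it draws from the
   reserved pool.  Returns false when the load has to fail.  */
static bool
model_place_ie_dso (struct model_static_tls *s, size_t size)
{
  assert (s != NULL);
  if (size > s->dso_free)
    return false;
  s->dso_free -= size;
  return true;
}
```

With this split, descriptors can exhaust general_free without ever
touching dso_free, so a later dlopen of a TLS IE DSO still has the
reserved space to work with.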

Comments?

Cheers,
Carlos.
