Fixes TLS performance degradation after dlopen

Message ID 1465309688.1188.19.camel@mailbox.tu-dresden.de
State New, archived
Headers

Commit Message

Philipp Trommler June 7, 2016, 2:28 p.m. UTC
  Fixes a performance degradation of TLS access in shared libraries which
can be observed after another shared library that uses TLS is loaded
via dl_open. In this case __tls_get_addr takes significant more time.
Once that shared library accesses it's TLS, the performance normalizes.

Attached is a minimal example which gives the following output:

time: 0.723744
after dlopen,
time: 1.789016
after tls access in loaded lib,
time: 0.672865

We do have a use-case where this is actually really significant. I
believe this happens for instance if libstdc++ is loaded implicitly,
but TLS features are not actively used.

I strongly suspect this is the same issue as discussed in this post on
the µClibc mailing list:

https://lists.osuosl.org/pipermail/uclibc/2009-December/043375.html

and therefore the patch provided mainly reuses the solution they've
found for µClibc.


Main development platform information (tested on multiple platforms):

* latest glibc git version (pulled at Tue Jun  7 15:45:13 CEST 2016)
* x86_64 (but should be independent)
* Linux TP_Trommler 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28
CEST 2016 x86_64 GNU/Linux
* gcc (GCC) 6.1.1 20160501
* GNU ld (GNU Binutils) 2.26.0.20160501

The patch provided touches code already altered (but not released) by
the work on bug 19329.



2016-06-07  Philipp Trommler  <philipp.trommler@tu-dresden.de>

	[BZ #19924]
	* elf/dl-close.c: Port changes from the µClibc to fix
	performance degradation after dl_open
	* elf/dl-open.c: Likewise
	* elf/dl-tls.c: Likewise

---
 elf/dl-close.c |  2 ++
 elf/dl-open.c  | 10 ++++++++--
 elf/dl-tls.c   | 35 ++++++++++++++++++++++-------------
 3 files changed, 32 insertions(+), 15 deletions(-)

       /* We have to look through the entire dtv slotinfo list.  */
@@ -916,7 +915,7 @@ _dl_add_to_slotinfo (struct link_map *l)
 	 the first slot.  */
       assert (idx == 0);
 
-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct
dtv_slotinfo));
       if (listp == NULL)
@@ -939,9 +938,19 @@ cannot create TLS data structures"));
       listp->next = NULL;
       memset (listp->slotinfo, '\0',
 	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+
+      /* Make sure the new slotinfo element is in memory before
exposing it. */
+      atomic_write_barrier();
+      prevp->next = listp;
     }
 
-  /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
+  /* Add the information into the slotinfo data structure.  Make sure
+     to barrier as we do it, to present a consistent view to other
+     threads that may be calling __tls_get_addr() as we do this.
+     Barrier after the writes so we know everything is ready for an
+     increment of _dl_tls_generation.  */
   listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  atomic_write_barrier();
+  listp->slotinfo[idx].map = l;
+  atomic_write_barrier();
 }
-- 
2.5.0
  

Comments

Carlos O'Donell June 8, 2016, 7:33 p.m. UTC | #1
On 06/07/2016 10:28 AM, Philipp Trommler wrote:
> 2016-06-07  Philipp Trommler  <philipp.trommler@tu-dresden.de>
> 
> 	[BZ #19924]
> 	* elf/dl-close.c: Port changes from the µClibc to fix
> 	performance degradation after dl_open
> 	* elf/dl-open.c: Likewise
> 	* elf/dl-tls.c: Likewise

I really appreciate the effort, but there are some serious copyright
issues going on here.

This code is written by Chris Metcalf, not by yourself, therefore you
can't contribute this to glibc or the FSF.

At the very least we need to have Chris post the patch (if he still
can after the Mellanox acquisition) since last I know he had assignments
in place.

Cheers,
Carlos.
  
Chris Metcalf June 10, 2016, 8:04 p.m. UTC | #2
On 6/7/2016 10:28 AM, Philipp Trommler wrote:
> Fixes a performance degradation of TLS access in shared libraries which
> can be observed after another shared library that uses TLS is loaded
> via dl_open. In this case __tls_get_addr takes significant more time.
> Once that shared library accesses it's TLS, the performance normalizes.
>
> Attached is a minimal example which gives the following output:
>
> time: 0.723744
> after dlopen,
> time: 1.789016
> after tls access in loaded lib,
> time: 0.672865
>
> We do have a use-case where this is actually really significant. I
> believe this happens for instance if libstdc++ is loaded implicitly,
> but TLS features are not actively used.
>
> I strongly suspect this is the same issue as discussed in this post on
> the µClibc mailing list:
>
> https://lists.osuosl.org/pipermail/uclibc/2009-December/043375.html
>
> and therefore the patch provided mainly reuses the solution they've
> found for µClibc.

I'm not sure this patch ever landed in uclibc; Tilera switched to using glibc
around that time and I didn't follow up with uclibc.

In a reply to that same thread on the uclibc mailing list (Oct 13 2010) I wrote:
"In my email quoted below, I found that __tls_get_addr() under the uClibc
version we were running with was always running the slow path.  We were
running an old version from the nptl branch, back at change 22990 on svn.
Since then, I've ported glibc-2.11.2 to tile.  I looked at the performance
of __tls_get_addr() under glibc, and found that the fast path is correctly
taken in that environment."

So it sounds like you may have found a case in glibc where we don't end
up on the slow path.

I admit I never felt like I fully understood what was going on in this code at
the time, and I was hoping my patch would inspire someone to jump on the
issue and either sign off on my patch, or figure out a deeper cause to the problem.
So I don't know that I feel qualified to sign off on your patch, particularly
six years after I last looked at this stuff.

One nitpick (and it's from my original code):

> -  if (dtv[0].counter < listp->slotinfo[idx].gen)
>       {
> -      /* The generation counter for the slot is higher than what the
> -	 current dtv implements.  We have to update the whole dtv but
> -	 only those entries with a generation counter <= the one for
> -	 the entry we need.  */
> -      size_t new_gen = listp->slotinfo[idx].gen;
> +      /* Tilera: leave scope to make diffs against our baseline
> easier. */
>         size_t total = 0;
>   
>         /* We have to look through the entire dtv slotinfo list.  */

You may have intended this to help review by avoiding whitespace diffs,
as I did when I first mailed my patch out, but obviously you'll want to
remove the "Tilera" comment line, the bracket before it, and then re-indent
the whole body.

Carlos raises a concern about copyright.  I'm certainly happy to do whatever
I need to to make this code that I authored available to the FSF, but since
I am not directly involved with the testing and hacking of Philipp's use case,
it probably makes more sense for Philipp to drive the process.  So Carlos,
what is the right way to make this happen?  If this code is more or less
the right answer, and Philipp can test it and confirm that it works, and
someone else is happy to review it and sign off on it, I could resubmit it
under my name, I suppose?  Even if Philipp ends up making some minor
tweaks, that should fall under the usual "trivial work" exemption. And if
it's the wrong approach entirely, none of this matters :-)

By the way, I'm assuming my assignment on file for Tilera and/or EZchip
continues to be in force for Mellanox.  I don't know why it wouldn't,
but IANAL, so...
  
Carlos O'Donell June 13, 2016, 2:39 p.m. UTC | #3
On 06/10/2016 04:04 PM, Chris Metcalf wrote:
> Carlos raises a concern about copyright.  I'm certainly happy to do whatever
> I need to to make this code that I authored available to the FSF, but since
> I am not directly involved with the testing and hacking of Philipp's use case,
> it probably makes more sense for Philipp to drive the process.  So Carlos,
> what is the right way to make this happen?  If this code is more or less
> the right answer, and Philipp can test it and confirm that it works, and
> someone else is happy to review it and sign off on it, I could resubmit it
> under my name, I suppose?  Even if Philipp ends up making some minor
> tweaks, that should fall under the usual "trivial work" exemption. And if
> it's the wrong approach entirely, none of this matters :-)

Chris,

All I want to do is confirm that you have assignments on file with the FSF
for glibc. The recent acquisition by Mellanox puts this into question.
Could you please confirm that the Tilera Corp. assigments on file with the
FSF still cover you?

Philipp,

Do you have copyright assignments with the FSF in place for glibc? This is
required to allow you to contribute code to the glibc project.

Please see the "Contribution Checklist":
https://sourceware.org/glibc/wiki/Contribution%20checklist

Thanks.
  
Chris Metcalf June 13, 2016, 2:41 p.m. UTC | #4
On 6/13/2016 10:39 AM, Carlos O'Donell wrote:
> Chris,
>
> All I want to do is confirm that you have assignments on file with the FSF
> for glibc. The recent acquisition by Mellanox puts this into question.
> Could you please confirm that the Tilera Corp. assigments on file with the
> FSF still cover you?

I've started the process of looking into this.
  
Philipp Trommler June 14, 2016, 8:32 a.m. UTC | #5
Am Montag, den 13.06.2016, 10:39 -0400 schrieb Carlos O'Donell:
> On 06/10/2016 04:04 PM, Chris Metcalf wrote:
> > Carlos raises a concern about copyright.  I'm certainly happy to do whatever
> > I need to to make this code that I authored available to the FSF, but since
> > I am not directly involved with the testing and hacking of Philipp's use case,
> > it probably makes more sense for Philipp to drive the process.  So Carlos,
> > what is the right way to make this happen?  If this code is more or less
> > the right answer, and Philipp can test it and confirm that it works, and
> > someone else is happy to review it and sign off on it, I could resubmit it
> > under my name, I suppose?  Even if Philipp ends up making some minor
> > tweaks, that should fall under the usual "trivial work" exemption. And if
> > it's the wrong approach entirely, none of this matters :-)
> 
> Chris,
> 
> All I want to do is confirm that you have assignments on file with the FSF
> for glibc. The recent acquisition by Mellanox puts this into question.
> Could you please confirm that the Tilera Corp. assigments on file with the
> FSF still cover you?
> 
> Philipp,
> 
> Do you have copyright assignments with the FSF in place for glibc? This is
> required to allow you to contribute code to the glibc project.
> 
> Please see the "Contribution Checklist":
> https://sourceware.org/glibc/wiki/Contribution%20checklist

No, I do not have a copyright assignment with the FSF yet. Am I right that in
order to get one I have to fill out this (http://git.savannah.gnu.org/cgit/gnuli
b.git/plain/doc/Copyright/request-assign.changes) form and send it to
assign@gnu.org?

> Thanks.
>
  
Chris Metcalf June 14, 2016, 3:05 p.m. UTC | #6
On 6/13/2016 10:39 AM, Carlos O'Donell wrote:
> All I want to do is confirm that you have assignments on file with the FSF
> for glibc. The recent acquisition by Mellanox puts this into question.
> Could you please confirm that the Tilera Corp. assigments on file with the
> FSF still cover you?

I heard back from Legal this morning saying "Yes, the process continues to run."  So we are good.
  
Szabolcs Nagy June 14, 2016, 4:18 p.m. UTC | #7
On 07/06/16 15:28, Philipp Trommler wrote:
> Fixes a performance degradation of TLS access in shared libraries which
> can be observed after another shared library that uses TLS is loaded
> via dl_open. In this case __tls_get_addr takes significant more time.
> Once that shared library accesses it's TLS, the performance normalizes.
> 
...
> I strongly suspect this is the same issue as discussed in this post on
> the µClibc mailing list:
> 
> https://lists.osuosl.org/pipermail/uclibc/2009-December/043375.html
> 
...
> The patch provided touches code already altered (but not released) by
> the work on bug 19329.
> 

BZ 19329 is a correctness issue, this is a performance
optimization, but it also addresses some of the
synchronization problems around dl_tls_generation.

my guess would be that it's cleaner to address the
correctness issues first and then do the optimization
by only changing _dl_update_slotinfo later (to always
update dtv up to the current dl_tls_generation) since
adding memory barriers is a mostly independent change.
  
Carlos O'Donell June 14, 2016, 6:59 p.m. UTC | #8
On 06/14/2016 11:05 AM, Chris Metcalf wrote:
> On 6/13/2016 10:39 AM, Carlos O'Donell wrote:
>> All I want to do is confirm that you have assignments on file with the FSF
>> for glibc. The recent acquisition by Mellanox puts this into question.
>> Could you please confirm that the Tilera Corp. assigments on file with the
>> FSF still cover you?
> 
> I heard back from Legal this morning saying "Yes, the process
> continues to run." So we are good.

Awesome. So the code as-is is fine for us to talk about belonging
to Chris Metcalf and being assigned to the FSF for glibc.

It doesn't cover legally significant changes made by Philipp.
  
Carlos O'Donell June 14, 2016, 7:03 p.m. UTC | #9
On 06/14/2016 04:32 AM, Philipp Trommler wrote:
>> Do you have copyright assignments with the FSF in place for glibc? This is
>> required to allow you to contribute code to the glibc project.
>>
>> Please see the "Contribution Checklist":
>> https://sourceware.org/glibc/wiki/Contribution%20checklist
> 
> No, I do not have a copyright assignment with the FSF yet. Am I right that in
> order to get one I have to fill out this (http://git.savannah.gnu.org/cgit/gnuli
> b.git/plain/doc/Copyright/request-assign.changes) form and send it to
> assign@gnu.org?

Yes. That form limits you to just the changes you are making today.
Be aware that you will have to go through the process all over again
for any different future changes.

The alternative is a futures assignment for the project which covers
all present and future contributions to the project e.g.
http://git.savannah.gnu.org/cgit/gnulib.git/plain/doc/Copyright/request-assign.future

Which need be done only once to allow you to contribute to the project
from here on out.
  
Philipp Trommler June 15, 2016, 11:07 a.m. UTC | #10
Zitat von Carlos O'Donell <carlos@redhat.com>:

> On 06/14/2016 04:32 AM, Philipp Trommler wrote:
>>> Do you have copyright assignments with the FSF in place for glibc? This is
>>> required to allow you to contribute code to the glibc project.
>>>
>>> Please see the "Contribution Checklist":
>>> https://sourceware.org/glibc/wiki/Contribution%20checklist
>>
>> No, I do not have a copyright assignment with the FSF yet. Am I  
>> right that in
>> order to get one I have to fill out this  
>> (http://git.savannah.gnu.org/cgit/gnuli
>> b.git/plain/doc/Copyright/request-assign.changes) form and send it to
>> assign@gnu.org?
>
> Yes. That form limits you to just the changes you are making today.
> Be aware that you will have to go through the process all over again
> for any different future changes.
>
> The alternative is a futures assignment for the project which covers
> all present and future contributions to the project e.g.
> http://git.savannah.gnu.org/cgit/gnulib.git/plain/doc/Copyright/request-assign.future
>
> Which need be done only once to allow you to contribute to the project
> from here on out.
>

Thanks, I'll fill that out and give you feedback when I get a response.

> --
> Cheers,
> Carlos.
  
Philipp Trommler June 30, 2016, 1:24 p.m. UTC | #11
Am Dienstag, den 14.06.2016, 10:32 +0200 schrieb Philipp Trommler:
> Am Montag, den 13.06.2016, 10:39 -0400 schrieb Carlos O'Donell:
> > On 06/10/2016 04:04 PM, Chris Metcalf wrote:
> > > Carlos raises a concern about copyright.  I'm certainly happy to do whatever
> > > I need to to make this code that I authored available to the FSF, but since
> > > I am not directly involved with the testing and hacking of Philipp's use case,
> > > it probably makes more sense for Philipp to drive the process.  So Carlos,
> > > what is the right way to make this happen?  If this code is more or less
> > > the right answer, and Philipp can test it and confirm that it works, and
> > > someone else is happy to review it and sign off on it, I could resubmit it
> > > under my name, I suppose?  Even if Philipp ends up making some minor
> > > tweaks, that should fall under the usual "trivial work" exemption. And if
> > > it's the wrong approach entirely, none of this matters :-)
> > 
> > Chris,
> > 
> > All I want to do is confirm that you have assignments on file with the FSF
> > for glibc. The recent acquisition by Mellanox puts this into question.
> > Could you please confirm that the Tilera Corp. assigments on file with the
> > FSF still cover you?
> > 
> > Philipp,
> > 
> > Do you have copyright assignments with the FSF in place for glibc? This is
> > required to allow you to contribute code to the glibc project.
> > 
> > Please see the "Contribution Checklist":
> > https://sourceware.org/glibc/wiki/Contribution%20checklist
> 
> No, I do not have a copyright assignment with the FSF yet. Am I right that in
> order to get one I have to fill out this (http://git.savannah.gnu.org/cgit/gnuli
> b.git/plain/doc/Copyright/request-assign.changes) form and send it to
> assign@gnu.org?
> 

Today I got my copyright assignment signed. What are the next steps?

Thanks,
Philipp


> > Thanks.
  

Patch

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 687d7de..f1d008f 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -32,6 +32,7 @@ 
 #include <sysdep-cancel.h>
 #include <tls.h>
 #include <stap-probe.h>
+#include <atomic.h>
 
 #include <dl-unmap-segments.h>
 
@@ -753,6 +754,7 @@  _dl_close_worker (struct link_map *map, bool force)
   /* If we removed any object which uses TLS bump the generation
counter.  */
   if (any_tls)
     {
+      atomic_write_barrier ();
       if (__glibc_unlikely (++GL(dl_tls_generation) == 0))
 	_dl_fatal_printf ("TLS generation counter wrapped!  Please
report as described in "REPORT_BUGS_TO".\n");
 
diff --git a/elf/dl-open.c b/elf/dl-open.c
index 6f178b3..e696c4e 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,15 @@  dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      atomic_write_barrier ();
+      if (__builtin_expect (++GL(dl_tls_generation) == 0, 0))
+        {
+          _dl_fatal_printf (N_("\
 TLS generation counter wrapped!  Please report this."));
+        }
+    }
 
   /* We need a second pass for static tls data, because
_dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still
pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..7c2c5c7 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -629,12 +629,16 @@  _dl_update_slotinfo (unsigned long int req_modid)
      possible that other threads at the same time dynamically load
      code and therefore add to the slotinfo list.  This is a problem
      since we must not pick up any information about incomplete work.
-     The solution to this is to ignore all dtv slots which were
-     created after the one we are currently interested.  We know that
-     dynamic loading for this module is completed and this is the last
-     load operation we know finished.  */
+     The solution to this is to ensure that we capture the current
+     generation count before walking the list, and ignore any slots
+     which are marked with a higher generation count, since they may
+     still be in the process of being updated.  */
   unsigned long int idx = req_modid;
   struct dtv_slotinfo_list *listp = GL(dl_tls_dtv_slotinfo_list);
+  size_t new_gen = GL(dl_tls_generation);
+
+  /* Make sure we don't read _dl_tls_generation too late. */
+  atomic_read_barrier();
 
   while (idx >= listp->len)
     {
@@ -642,13 +646,8 @@  _dl_update_slotinfo (unsigned long int req_modid)
       listp = listp->next;
     }
 
-  if (dtv[0].counter < listp->slotinfo[idx].gen)
     {
-      /* The generation counter for the slot is higher than what the
-	 current dtv implements.  We have to update the whole dtv but
-	 only those entries with a generation counter <= the one for
-	 the entry we need.  */
-      size_t new_gen = listp->slotinfo[idx].gen;
+      /* Tilera: leave scope to make diffs against our baseline
easier. */
       size_t total = 0;
Â