powerpc __tls_get_addr call optimization

Message ID 20150318061145.GE24573@bubble.grove.modra.org
State Superseded
Delegated to: Joseph Myers

Commit Message

Alan Modra March 18, 2015, 6:11 a.m. UTC
  Now that Alex's fixes for static TLS have gone in, I figure it's worth
revisiting an old patch of mine.
https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html

This patch is glibc support for a PowerPC TLS optimization, inspired
by Alexandre Oliva's TLS optimization for other processors,
http://www.lsd.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt

In essence, this optimization uses a zero module id in the TLS
descriptor to indicate that a TLS variable is allocated space in the
static TLS area.  A special plt call linker stub for __tls_get_addr
checks for such a TLS descriptor and if found, returns the offset
immediately.  The linker communicates the fact that the special
__tls_get_addr stub is present by setting a bit in the dynamic tag
DT_PPC64_OPT/DT_PPC_OPT.
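
As a rough illustration of the check described above, here is a minimal C sketch. The real stub is a few instructions of linker-generated assembly, not C; the type and function names are hypothetical, and the sketch assumes the fast path yields the thread pointer plus the stored offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-word GOT entry handed to __tls_get_addr.  Hypothetical name;
   a module id of zero is the proposed "lives in static TLS" marker.  */
typedef struct
{
  uintptr_t module;   /* module id, or 0 for the optimized case */
  uintptr_t offset;   /* offset in the module, or from the thread pointer */
} tls_index_sketch;

/* Stand-in type for the ordinary __tls_get_addr slow path.  */
typedef void *(*slow_path_fn) (const tls_index_sketch *);

/* Demo slow path so the sketch is self-contained: it pretends the
   dynamically allocated block for every module is this array.  */
static char demo_dynamic_block[16];

static void *
demo_slow_path (const tls_index_sketch *ti)
{
  return demo_dynamic_block + ti->offset;
}

/* What the special stub does, expressed in C: test the module id and
   return early when it is zero, otherwise fall through to the real
   function.  */
static void *
tls_get_addr_stub_sketch (const tls_index_sketch *ti, char *thread_pointer,
                          slow_path_fn slow)
{
  if (ti->module == 0)
    return thread_pointer + ti->offset;   /* fast path, no call */
  return slow (ti);                       /* regular __tls_get_addr */
}
```

The point of the design is that the common case needs no call into ld.so at all, while an entry with a non-zero module id behaves exactly as before.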

tst-tlsmod2.so is built with -Wl,--no-tls-get-addr-optimize for
tst-tls-dlinfo, which otherwise would fail since it tests that no
static tls is allocated.  The ld option --no-tls-get-addr-optimize has
been available since binutils-2.20 so doesn't need a configure test.

Regression tested powerpc-linux and powerpc64-linux.

	* NEWS: Advertise TLS optimization.
	* elf/elf.h (R_PPC_TLSGD, R_PPC_TLSLD, DT_PPC_OPT, PPC_OPT_TLS): Define.
	(DT_PPC_NUM): Increment.
	* elf/dynamic-link.h (HAVE_STATIC_TLS): Define, extracted from..
	(CHECK_STATIC_TLS): ..here.
	* sysdeps/powerpc/powerpc32/dl-machine.h (elf_machine_rela): Optimize
	TLS descriptors.
	* sysdeps/powerpc/powerpc64/dl-machine.h (elf_machine_rela): Likewise.
	* sysdeps/powerpc/dl-tls.c: New file.
	* sysdeps/powerpc/Versions: Add __tls_get_addr_opt.
	* sysdeps/unix/sysv/linux/powerpc/Makefile: Build tst-tlsmod2.so
	with --no-tls-get-addr-optimize.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist: Update.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist: Likewise.
  

Comments

Carlos O'Donell March 18, 2015, 5:07 p.m. UTC | #1
On 03/18/2015 02:11 AM, Alan Modra wrote:
> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> revisiting an old patch of mine.
> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html

I'm not against this patch, but it certainly seems like you would be
better served by just implementing tls descriptors?

Do you have a reference to the binutils patch?
 
> This patch is glibc support for a PowerPC TLS optimization, inspired
> by Alexandre Oliva's TLS optimization for other processors,
> http://www.lsd.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt
> 
> In essence, this optimization uses a zero module id in the TLS
> descriptor to indicate that a TLS variable is allocated space in the
> static TLS area.  A special plt call linker stub for __tls_get_addr
> checks for such a TLS descriptor and if found, returns the offset
> immediately.  The linker communicates the fact that the special
> __tls_get_addr stub is present by setting a bit in the dynamic tag
> DT_PPC64_OPT/DT_PPC_OPT.

I'm confused, you write "TLS descriptor" but power doesn't have TLS DESC
support yet in glibc?

The code in question writes a module id of zero into the GOT entry
associated with the TLS variable, not really the TLS descriptor?

Speaking of which, you wouldn't happen to have a Latex contribution
that describes the Power TLS support so I can add it to and update
tls.pdf? :-)

> tst-tlsmod2.so is built with -Wl,--no-tls-get-addr-optimize for
> tst-tls-dlinfo, which otherwise would fail since it tests that no
> static tls is allocated.  The ld option --no-tls-get-addr-optimize has
> been available since binutils-2.20 so doesn't need a configure test.

OK.

> Regression tested powerpc-linux and powerpc64-linux.
> 
> 	* NEWS: Advertise TLS optimization.
> 	* elf/elf.h (R_PPC_TLSGD, R_PPC_TLSLD, DT_PPC_OPT, PPC_OPT_TLS): Define.
> 	(DT_PPC_NUM): Increment.
> 	* elf/dynamic-link.h (HAVE_STATIC_TLS): Define, extracted from..
> 	(CHECK_STATIC_TLS): ..here.
> 	* sysdeps/powerpc/powerpc32/dl-machine.h (elf_machine_rela): Optimize
> 	TLS descriptors.
> 	* sysdeps/powerpc/powerpc64/dl-machine.h (elf_machine_rela): Likewise.
> 	* sysdeps/powerpc/dl-tls.c: New file.
> 	* sysdeps/powerpc/Versions: Add __tls_get_addr_opt.
> 	* sysdeps/unix/sysv/linux/powerpc/Makefile: Build tst-tlsmod2.so
> 	with --no-tls-get-addr-optimize.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist: Update.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist: Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist: Likewise.

This absolutely needs a new ppc64-specific test case to make sure this is
actually working as intended? If it requires a new binutils, then you'll need
to have the test return 77 (UNSUPPORTED) if the present binutils is not new
enough.
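
A minimal sketch of the exit-status convention mentioned above, assuming the test can find out on its own whether the linker produced the optimized stub. The function name and both parameters are hypothetical; only the value 77 comes from the text.

```c
#include <stdbool.h>

/* Exit status the test harness treats as "unsupported".  */
#define SKETCH_EXIT_UNSUPPORTED 77

/* Decide the exit status of a hypothetical TLS optimization test.
   linker_supports_opt: whether the binutils in use emitted the stub.
   fast_path_observed: whether the (0, offset) entries were seen.  */
static int
tls_opt_test_status (bool linker_supports_opt, bool fast_path_observed)
{
  if (!linker_supports_opt)
    return SKETCH_EXIT_UNSUPPORTED;   /* skip rather than fail */
  return fast_path_observed ? 0 : 1;  /* 0 = pass, 1 = fail */
}
```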

The rest looks fine.

Cheers,
Carlos.
  
Joseph Myers March 18, 2015, 5:14 p.m. UTC | #2
On Wed, 18 Mar 2015, Alan Modra wrote:

> diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
> index d71611f..052f311 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
> +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
> @@ -1,3 +1,6 @@
> +GLIBC_2.22
> + GLIBC_2.22 A
> + __tls_get_addr_opt F
>  GLIBC_2.0
>   GLIBC_2.0 A
>   __libc_memalign F

That positioning looks odd - I thought these files were alphabetical (so 
it would go between GLIBC_2.1 and GLIBC_2.3)?
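
To illustrate the ordering in question, a small sketch that sorts version names with plain strcmp, assuming "alphabetical" here means byte-wise string order (as with sort in the C locale). The helper names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings, byte-wise order.  */
static int
compare_names (const void *a, const void *b)
{
  return strcmp (*(const char *const *) a, *(const char *const *) b);
}

/* Sort symbol version names the way a byte-wise sort would.  */
static void
sort_version_names (const char **names, size_t count)
{
  qsort (names, count, sizeof names[0], compare_names);
}
```

Under that assumption "GLIBC_2.22" compares greater than "GLIBC_2.1" and less than "GLIBC_2.3", because the comparison is decided at the first differing character rather than by numeric value.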
  
Alan Modra March 19, 2015, 2:56 a.m. UTC | #3
On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> On 03/18/2015 02:11 AM, Alan Modra wrote:
> > Now that Alex's fixes for static TLS have gone in, I figure it's worth
> > revisiting an old patch of mine.
> > https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> 
> I'm not against this patch, but it certainly seems like you would be
> better served by just implementing tls descriptors?

I think this is one better than tls descriptors, because powerpc
avoids the indirect function call used by tls descriptors.

> Do you have a reference to the binutils patch?

https://sourceware.org/ml/binutils/2009-03/msg00498.html

> > This patch is glibc support for a PowerPC TLS optimization, inspired
> > by Alexandre Oliva's TLS optimization for other processors,
> > http://www.lsd.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt
> > 
> > In essence, this optimization uses a zero module id in the TLS
> > descriptor to indicate that a TLS variable is allocated space in the
> > static TLS area.  A special plt call linker stub for __tls_get_addr
> > checks for such a TLS descriptor and if found, returns the offset
> > immediately.  The linker communicates the fact that the special
> > __tls_get_addr stub is present by setting a bit in the dynamic tag
> > DT_PPC64_OPT/DT_PPC_OPT.
> 
> I'm confused, you write "TLS descriptor" but power doesn't have TLS DESC
> support yet in glibc?

Oops, I meant a tls_index object as defined in Drepper's tls.pdf.
The binutils reference above makes the same error..

> The code in question writes a module id of zero into the GOT entry
> associated with the TLS variable, not really the TLS descriptor?

Right.

> Speaking of which, you wouldn't happen to have a Latex contribution
> that describes the Power TLS support so I can add it to and update
> tls.pdf? :-)

No, sorry, I wrote the original powerpc tls abi as plain text.

> This absolutely needs a new ppc64-specific test case to make sure this is
> actually working as intended? If it requires a new binutils, then you'll need
> to have the test return 77 (UNSUPPORTED) if the present binutils is not new
> enough.

We actually get the support turned on automatically, no gcc or ld
options needed, so the existing tls tests run using the optimized
__tls_get_addr support.

Hmm, your comment reminded me that I need to check older binutils,
because I renamed DT_PPC64_TLSOPT to DT_PPC64_OPT and changed the tag
to a bitfield.  On looking at that, it seems you'll need binutils-2.24
to build executables and shared libraries that work with the patch as
is (well, they'll work but glibc won't provide the (0,offset)
tls_index objects).  glibc itself doesn't need the newer binutils to
build, but you're right, I should mention this in NEWS and there
should be a test that the new support is working for those that don't
read NEWS.
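
For reference, a hedged sketch of what "the tag is a bitfield" means for whoever reads it: walk the dynamic section, find the OPT tag, and test one bit. The struct and macro names are hypothetical, and the numeric values are what I believe elf.h uses for powerpc64, so treat them as assumptions. This is not the loader code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, mirroring DT_NULL, DT_PPC64_OPT (DT_LOPROC + 3)
   and the TLS bit within it.  */
#define SKETCH_DT_NULL        0
#define SKETCH_DT_PPC64_OPT   0x70000003
#define SKETCH_PPC64_OPT_TLS  1

/* Simplified stand-in for an Elf64_Dyn entry.  */
typedef struct
{
  int64_t tag;
  uint64_t val;
} dyn_entry_sketch;

/* Return non-zero if the object advertises the special
   __tls_get_addr stub.  Other bits in the tag are ignored.  */
static int
has_tls_get_addr_opt (const dyn_entry_sketch *dyn)
{
  for (; dyn->tag != SKETCH_DT_NULL; ++dyn)
    if (dyn->tag == SKETCH_DT_PPC64_OPT)
      return (dyn->val & SKETCH_PPC64_OPT_TLS) != 0;
  return 0;   /* tag absent: object was linked without the stub */
}
```

An object without the tag or without the bit set simply keeps getting ordinary (module, offset) entries, which matches the "they'll work" remark above.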
  
Alan Modra March 19, 2015, 2:59 a.m. UTC | #4
On Wed, Mar 18, 2015 at 05:14:19PM +0000, Joseph Myers wrote:
> On Wed, 18 Mar 2015, Alan Modra wrote:
> 
> > diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
> > index d71611f..052f311 100644
> > --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
> > +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
> > @@ -1,3 +1,6 @@
> > +GLIBC_2.22
> > + GLIBC_2.22 A
> > + __tls_get_addr_opt F
> >  GLIBC_2.0
> >   GLIBC_2.0 A
> >   __libc_memalign F
> 
> That positioning looks odd - I thought these files were alphabetical (so 
> it would go between GLIBC_2.1 and GLIBC_2.3)?

I'd better test that again.  I did have the location wrong first time
I changed the file, so may have a discrepancy between what I posted
here and the source actually tested.
  
Carlos O'Donell March 20, 2015, 3:33 a.m. UTC | #5
On 03/18/2015 10:56 PM, Alan Modra wrote:
> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>>> revisiting an old patch of mine.
>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>>
>> I'm not against this patch, but it certainly seems like you would be
>> better served by just implementing tls descriptors?
> 
> I think this is one better than tls descriptors, because powerpc
> avoids the indirect function call used by tls descriptors.

You mean to say it is "faster" than tls descriptors, but at the same
time "harder" to maintain because it's a custom implementation that
anyone debugging glibc has to learn about. That's not a bad thing,
I just want us all to acknowledge the tradeoff.

The present goal for glibc and the toolchain in general has been
to move to TLS descriptors, and thus provide a way for the dozen or
so packages in the distribution to stop doing this:

mesa (src/mapi/u_current.h):

extern __thread struct mapi_table *u_current_table
    __attribute__((tls_model("initial-exec")));

They would instead use TLS descriptors, and the above markings would
be removed and the access would be as fast as possible without needing
to specify the IE model.
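
To make the contrast concrete, a small sketch with two thread-local counters, one carrying the marking and one without it. The variable and function names are hypothetical and are not taken from mesa.

```c
/* Marked variable: the compiler is asked for the initial-exec model,
   which in a shared library depends on static TLS space being
   available when the library is loaded.  */
static __thread int counter_initial_exec
  __attribute__ ((tls_model ("initial-exec")));

/* Unmarked variable: the compiler picks the model itself.  For code
   built as a shared library that is normally a dynamic model, which
   is the case descriptors (or the powerpc stub) try to make cheap.
   In a non-PIC executable the compiler may use a cheaper model for
   both variables.  */
static __thread int counter_default;

int
bump_initial_exec (void)
{
  return ++counter_initial_exec;
}

int
bump_default (void)
{
  return ++counter_default;
}
```

Both functions behave identically from the caller's point of view; only the generated access sequence and the static TLS requirement differ.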

These packages are sometimes linked with applications, and sometimes
arbitrarily dlopened.

Would this present optimization you propose for power support this
use case?

Would it use static TLS for the above access if it could and fall
back gracefully if it can't?

What I want to make sure is that Power isn't left behind when we
eventually transition everyone else to TLS Descriptors and remove
the above markings from source programs.

>> Do you have a reference to the binutils patch?
> 
> https://sourceware.org/ml/binutils/2009-03/msg00498.html

Excellent, that makes it much easier to review the glibc pieces since
I can see what the static linker is going to do and review the stub
itself.

>>> In essence, this optimization uses a zero module id in the TLS
>>> descriptor to indicate that a TLS variable is allocated space in the
>>> static TLS area.  A special plt call linker stub for __tls_get_addr
>>> checks for such a TLS descriptor and if found, returns the offset
>>> immediately.  The linker communicates the fact that the special
>>> __tls_get_addr stub is present by setting a bit in the dynamic tag
>>> DT_PPC64_OPT/DT_PPC_OPT.
>>
>> I'm confused, you write "TLS descriptor" but power doesn't have TLS DESC
>> support yet in glibc?
> 
> Oops, I meant a tls_index object as defined in Drepper's tls.pdf.
> The binutils reference above makes the same error..

No problem. Thanks for clarifying. This is part of the problem with
having an alternate implementation.

>> The code in question writes a module id of zero into the GOT entry
>> associated with the TLS variable, not really the TLS descriptor?
> 
> Right.

OK.

>> Speaking of which, you wouldn't happen to have a Latex contribution
>> that describes the Power TLS support so I can add it to and update
>> tls.pdf? :-)
> 
> No, sorry, I wrote the original powerpc tls abi as plain text.

Could you mail that to me privately please? I'd like a copy for my
own reference.

>> This absolutely needs a new ppc64-specific test case to make sure this is
>> actually working as intended? If it requires a new binutils, then you'll need
>> to have the test return 77 (UNSUPPORTED) if the present binutils is not new
>> enough.
> 
> We actually get the support turned on automatically, no gcc or ld
> options needed, so the existing tls tests run using the optimized
> __tls_get_addr support.

OK, as long as there are binutils tests that make sure the stub is in
place and being used, and static tls is allocated for the entries,
then I'm fine.

> Hmm, your comment reminded me that I need to check older binutils,
> because I renamed DT_PPC64_TLSOPT to DT_PPC64_OPT and changed the tag
> to a bitfield.  On looking at that, it seems you'll need binutils-2.24
> to build executables and shared libraries that work with the patch as
> is (well, they'll work but glibc won't provide the (0,offset)
> tls_index objects).  glibc itself doesn't need the newer binutils to
> build, but you're right, I should mention this in NEWS and there
> should be a test that the new support is working for those that don't
> read NEWS.

OK.

Cheers,
Carlos.
  
Alan Modra March 20, 2015, 7:55 a.m. UTC | #6
On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> On 03/18/2015 10:56 PM, Alan Modra wrote:
> > On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> >> On 03/18/2015 02:11 AM, Alan Modra wrote:
> >>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> >>> revisiting an old patch of mine.
> >>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> >>
> >> I'm not against this patch, but it certainly seems like you would be
> >> better served by just implementing tls descriptors?
> > 
> > I think this is one better than tls descriptors, because powerpc
> > avoids the indirect function call used by tls descriptors.
> 
> You mean to say it is "faster" than tls descriptors, but at the same

To be honest, there isn't much difference in the optimized case where
static TLS is available.  It boils down to an indirect call to a
function that loads one value vs. a direct call to a stub that loads
two values and compares one against zero.  I think what I've
implemented is slightly better for PowerPC, but whether that would
carry over to other architectures is debatable.
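
The two shapes being compared can be sketched in C as follows. This is only a model of the control flow, with hypothetical names; the real sequences are assembly, and the sketch returns NULL where the real stub would go on to call __tls_get_addr.

```c
#include <stddef.h>
#include <stdint.h>

/* Descriptor shape: a function pointer plus an argument.  */
typedef struct tls_desc_sketch
{
  intptr_t (*entry) (const struct tls_desc_sketch *);
  intptr_t arg;
} tls_desc_sketch;

/* Resolver for the static TLS case: loads one value and returns it.  */
static intptr_t
desc_return_static (const tls_desc_sketch *d)
{
  return d->arg;
}

/* Descriptor access: one indirect call through the descriptor.  */
static char *
via_descriptor (const tls_desc_sketch *d, char *thread_pointer)
{
  return thread_pointer + d->entry (d);
}

/* tls_index shape used by the powerpc scheme.  */
typedef struct
{
  uintptr_t module;
  uintptr_t offset;
} tls_index_pair;

/* Stub access: load two values, compare one against zero.  */
static char *
via_stub (const tls_index_pair *ti, char *thread_pointer)
{
  uintptr_t module = ti->module;   /* first load */
  uintptr_t offset = ti->offset;   /* second load */
  if (module == 0)
    return thread_pointer + offset;
  return NULL;   /* sketch only: real stub continues to __tls_get_addr */
}
```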

> time "harder" to maintain because it's a custom implementation that
> anyone debugging glibc has to learn about. That's not a bad thing,
> I just want us all to acknowledge the tradeoff.

Well, yes, but the PowerPC implementation is all in dl-machine.h, and
looks very similar to x86_64 in use of CHECK_STATIC_TLS,
TRY_STATIC_TLS and modification of the tls_index entry.  PowerPC
doesn't have the complication and potential failure of allocating
extended descriptors.  We also don't need to pass extra flags to gcc
to enable the optimization.
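
A hedged sketch of the relocation-time decision, to show how little state is involved. This is not the code in dl-machine.h: the struct, its fields and the function name are hypothetical, and the real code goes through the CHECK_STATIC_TLS and TRY_STATIC_TLS macros rather than a boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two-word GOT entry being relocated.  */
typedef struct
{
  uintptr_t module;
  uintptr_t offset;
} tls_index_slot;

/* Hypothetical summary of what the loader knows about a module.  */
typedef struct
{
  uintptr_t module_id;        /* non-zero id assigned at load time */
  bool has_static_tls;        /* static TLS space was obtained */
  uintptr_t static_tls_base;  /* thread-pointer-relative start of the
                                 module's block; valid only if
                                 has_static_tls is true */
} module_info_sketch;

/* Fill in one tls_index entry.  stub_present says whether the object
   advertised the special __tls_get_addr stub.  */
static void
relocate_tls_index (tls_index_slot *slot, const module_info_sketch *mod,
                    uintptr_t sym_offset, bool stub_present)
{
  if (stub_present && mod->has_static_tls)
    {
      slot->module = 0;   /* marker the stub tests for */
      slot->offset = mod->static_tls_base + sym_offset;
    }
  else
    {
      slot->module = mod->module_id;   /* ordinary dynamic entry */
      slot->offset = sym_offset;
    }
}
```

Nothing is allocated in either branch, which is the sense in which there is no equivalent of a failing descriptor allocation.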

> The present goal for glibc and the toolchain in general has been
> to move to TLS descriptors, and thus provide a way for the dozen or
> so packages in the distribution to stop doing this:
> 
> mesa (src/mapi/u_current.h):
> 
> extern __thread struct mapi_table *u_current_table
>     __attribute__((tls_model("initial-exec")));
> 
> They would instead use TLS descriptors, and the above markings would
> be removed and the access would be as fast as possible without needing
> to specify the IE model.
> 
> These packages are sometimes linked with applications, and sometimes
> arbitrarily dlopened.
> 
> Would this present optimization you propose for power support this
> use case?

Sure.  This is exactly the use case the powerpc optimization tackles,
shared libraries using general dynamic or local dynamic TLS access.
Like TLS descriptors, it can also handle general dynamic or local
dynamic TLS access in an executable, but these will normally be
optimized to IE or LE by GNU ld.

> Would it use static TLS for the above access if it could and fall
> back gracefully if it can't?

Yes.

> What I want to make sure is that Power isn't left behind when we
> eventually transition everyone else to TLS Descriptors and remove
> the above markings from source programs.

Other architectures left behind by the PowerPC implementation might
like to transition from TLS descriptors.  Just kidding.  :)
  
Carlos O'Donell March 20, 2015, 1:54 p.m. UTC | #7
On 03/20/2015 03:55 AM, Alan Modra wrote:
> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
>> On 03/18/2015 10:56 PM, Alan Modra wrote:
>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>>>>> revisiting an old patch of mine.
>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>>>>
>>>> I'm not against this patch, but it certainly seems like you would be
>>>> better served by just implementing tls descriptors?
>>>
>>> I think this is one better than tls descriptors, because powerpc
>>> avoids the indirect function call used by tls descriptors.
>>
>> You mean to say it is "faster" than tls descriptors, but at the same
> 
> To be honest, there isn't much difference in the optimized case where
> static TLS is available.  It boils down to an indirect call to a
> function that loads one value vs. a direct call to a stub that loads
> two values and compares one against zero.  I think what I've
> implemented is slightly better for PowerPC, but whether that would
> carry over to other architectures is debatable.

I agree that what you have implemented is faster for power.

>> time "harder" to maintain because it's a custom implementation that
>> anyone debugging glibc has to learn about. That's not a bad thing,
>> I just want us all to acknowledge the tradeoff.
> 
> Well, yes, but the PowerPC implementation is all in dl-machine.h, and
> looks very similar to x86_64 in use of CHECK_STATIC_TLS,
> TRY_STATIC_TLS and modification of the tls_index entry.  PowerPC
> doesn't have the complication and potential failure of allocating
> extended descriptors.  We also don't need to pass extra flags to gcc
> to enable the optimization.

I also agree that your present implementation mirrors TLS DESC in
the implementation and reuse of CHECK_STATIC_TLS/TRY_STATIC_TLS,
and I like that aspect of the change.

>> The present goal for glibc and the toolchain in general has been
>> to move to TLS descriptors, and thus provide a way for the dozen or
>> so packages in the distribution to stop doing this:
>>
>> mesa (src/mapi/u_current.h):
>>
>> extern __thread struct mapi_table *u_current_table
>>     __attribute__((tls_model("initial-exec")));
>>
>> They would instead use TLS descriptors, and the above markings would
>> be removed and the access would be as fast as possible without needing
>> to specify the IE model.
>>
>> These packages are sometimes linked with applications, and sometimes
>> arbitrarily dlopened.
>>
>> Would this present optimization you propose for power support this
>> use case?
> 
> Sure.  This is exactly the use case the powerpc optimization tackles,
> shared libraries using general dynamic or local dynamic TLS access.
> Like TLS descriptors, it can also handle general dynamic or local
> dynamic TLS access in an executable, but these will normally be
> optimized to IE or LE by GNU ld.

Perfect, just making sure we were on the same page. I figured, after
reading the binutils patch, that this mostly operates like TLS DESC, but
slightly optimized for power.

>> Would it use static TLS for the above access if it could and fall
>> back gracefully if it can't?
> 
> Yes.

Good. I expected that it would simply degenerate to a call to
__tls_get_addr if it can't get static tls space.

>> What I want to make sure is that Power isn't left behind when we
>> eventually transition everyone else to TLS Descriptors and remove
>> the above markings from source programs.
> 
> Other architectures left behind by the PowerPC implementation might
> like to transition from TLS descriptors.  Just kidding.  :)

Given your answers above I'm happy to see this go into glibc.

The patch itself looks fine to me, the real magic is in binutils
with yet another super-secret stub that has no debug information
and must be recognized from memory by the person doing the debugging :}

Cheers,
Carlos.
  
Rich Felker March 20, 2015, 3:27 p.m. UTC | #8
On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> > On 03/18/2015 10:56 PM, Alan Modra wrote:
> > > On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> > >> On 03/18/2015 02:11 AM, Alan Modra wrote:
> > >>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> > >>> revisiting an old patch of mine.
> > >>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> > >>
> > >> I'm not against this patch, but it certainly seems like you would be
> > >> better served by just implementing tls descriptors?
> > > 
> > > I think this is one better than tls descriptors, because powerpc
> > > avoids the indirect function call used by tls descriptors.
> > 
> > You mean to say it is "faster" than tls descriptors, but at the same
> 
> To be honest, there isn't much difference in the optimized case where
> static TLS is available.  It boils down to an indirect call to a
> function that loads one value vs. a direct call to a stub that loads
> two values and compares one against zero.  I think what I've
> implemented is slightly better for PowerPC, but whether that would
> carry over to other architectures is debatable.

If the performance difference isn't measurable in real-world
applications, I would think uniformity between targets would be a lot
more valuable.

I also don't see how your approach is a "direct call". The function
being called is in a different DSO so it has to go through a pointer
in the GOT or similar, in which case it's just as "indirect" as the
TLSDESC call would be.

Rich
  
Carlos O'Donell March 20, 2015, 3:48 p.m. UTC | #9
On 03/20/2015 11:27 AM, Rich Felker wrote:
> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
>> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
>>> On 03/18/2015 10:56 PM, Alan Modra wrote:
>>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>>>>>> revisiting an old patch of mine.
>>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>>>>>
>>>>> I'm not against this patch, but it certainly seems like you would be
>>>>> better served by just implementing tls descriptors?
>>>>
>>>> I think this is one better than tls descriptors, because powerpc
>>>> avoids the indirect function call used by tls descriptors.
>>>
>>> You mean to say it is "faster" than tls descriptors, but at the same
>>
>> To be honest, there isn't much difference in the optimized case where
>> static TLS is available.  It boils down to an indirect call to a
>> function that loads one value vs. a direct call to a stub that loads
>> two values and compares one against zero.  I think what I've
>> implemented is slightly better for PowerPC, but whether that would
>> carry over to other architectures is debatable.
> 
> If the performance difference isn't measurable in real-world
> applications, I would think uniformity between targets would be a lot
> more valuable.
> 
> I also don't see how your approach is a "direct call". The function
> being called is in a different DSO so it has to go through a pointer
> in the GOT or similar, in which case it's just as "indirect" as the
> TLSDESC call would be.

I agree. And this was my initial inclination, but I'm not against what
Alan has implemented. As a machine maintainer he should be allowed some
leeway to argue that this implementation is "N instructions fewer" and
therefore must be faster, and that while such a speedup is hard to show in
a microbenchmark, it would on average result in, say, less CPU usage over
billions of cycles.

IBM has to accept that the downside to all of this is that breakage in
this area may take longer to fix, and get fewer fixes than those arches
already using TLS DESC.

Cheers,
Carlos.
  
H.J. Lu March 20, 2015, 3:51 p.m. UTC | #10
On Fri, Mar 20, 2015 at 8:48 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 03/20/2015 11:27 AM, Rich Felker wrote:
>> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
>>> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
>>>> On 03/18/2015 10:56 PM, Alan Modra wrote:
>>>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>>>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>>>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>>>>>>> revisiting an old patch of mine.
>>>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>>>>>>
>>>>>> I'm not against this patch, but it certainly seems like you would be
>>>>>> better served by just implementing tls descriptors?
>>>>>
>>>>> I think this is one better than tls descriptors, because powerpc
>>>>> avoids the indirect function call used by tls descriptors.
>>>>
>>>> You mean to say it is "faster" than tls descriptors, but at the same
>>>
>>> To be honest, there isn't much difference in the optimized case where
>>> static TLS is available.  It boils down to an indirect call to a
>>> function that loads one value vs. a direct call to a stub that loads
>>> two values and compares one against zero.  I think what I've
>>> implemented is slightly better for PowerPC, but whether that would
>>> carry over to other architectures is debatable.
>>
>> If the performance difference isn't measurable in real-world
>> applications, I would think uniformity between targets would be a lot
>> more valuable.
>>
>> I also don't see how your approach is a "direct call". The function
>> being called is in a different DSO so it has to go through a pointer
>> in the GOT or similar, in which case it's just as "indirect" as the
>> TLSDESC call would be.
>
> I agree. And this was my initial inclination, but I'm not against what
> Alan has implemented. As a machine maintainer he should be allowed some
> leeway to argue this implementation is "N instructions less" and therefore
> must be faster, but that such speed is harder to show in a microbenchmark,
> it would in the mean result in say less CPU usage over billions of cycles.
>
> IBM has to accept that the downside to all of this is that breakage in
> this area may take longer to fix, and get less fixes than those arches
> already using TLS DESC.

Speaking of TLS DESC, are there any tests for TLS DESC in
glibc?  I never implemented TLS DESC for x32 since I didn't
find any run-time tests for TLS DESC in GCC nor glibc.
  
Rich Felker March 20, 2015, 4:14 p.m. UTC | #11
On Fri, Mar 20, 2015 at 08:51:39AM -0700, H.J. Lu wrote:
> On Fri, Mar 20, 2015 at 8:48 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 03/20/2015 11:27 AM, Rich Felker wrote:
> >> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
> >>> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> >>>> On 03/18/2015 10:56 PM, Alan Modra wrote:
> >>>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> >>>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
> >>>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> >>>>>>> revisiting an old patch of mine.
> >>>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> >>>>>>
> >>>>>> I'm not against this patch, but it certainly seems like you would be
> >>>>>> better served by just implementing tls descriptors?
> >>>>>
> >>>>> I think this is one better than tls descriptors, because powerpc
> >>>>> avoids the indirect function call used by tls descriptors.
> >>>>
> >>>> You mean to say it is "faster" than tls descriptors, but at the same
> >>>
> >>> To be honest, there isn't much difference in the optimized case where
> >>> static TLS is available.  It boils down to an indirect call to a
> >>> function that loads one value vs. a direct call to a stub that loads
> >>> two values and compares one against zero.  I think what I've
> >>> implemented is slightly better for PowerPC, but whether that would
> >>> carry over to other architectures is debatable.
> >>
> >> If the performance difference isn't measurable in real-world
> >> applications, I would think uniformity between targets would be a lot
> >> more valuable.
> >>
> >> I also don't see how your approach is a "direct call". The function
> >> being called is in a different DSO so it has to go through a pointer
> >> in the GOT or similar, in which case it's just as "indirect" as the
> >> TLSDESC call would be.
> >
> > I agree. And this was my initial inclination, but I'm not against what
> > Alan has implemented. As a machine maintainer he should be allowed some
> > leeway to argue this implementation is "N instructions less" and therefore
> > must be faster, but that such speed is harder to show in a microbenchmark,
> > it would in the mean result in say less CPU usage over billions of cycles.
> >
> > IBM has to accept that the downside to all of this is that breakage in
> > this area may take longer to fix, and get less fixes than those arches
> > already using TLS DESC.
> 
> Speaking of TLS DESC, are there any tests for TLS DESC in
> glibc?  I never implemented TLS DESC for x32 since I didn't
> find any run-time tests for TLS DESC in GCC nor glibc.

Not that I know of. i386 TLSDESC was broken in binutils for several
years and only recently fixed... Until a couple months ago nobody
noticed. :-(

This situation really should be set right (with proper tests and
timeline for changing the default to TLSDESC) so we can put an end to
the invalid use of IE-model in shared libraries.

Rich
  
H.J. Lu March 20, 2015, 4:19 p.m. UTC | #12
On Fri, Mar 20, 2015 at 9:14 AM, Rich Felker <dalias@libc.org> wrote:
> On Fri, Mar 20, 2015 at 08:51:39AM -0700, H.J. Lu wrote:
>> On Fri, Mar 20, 2015 at 8:48 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> > On 03/20/2015 11:27 AM, Rich Felker wrote:
>> >> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
>> >>> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
>> >>>> On 03/18/2015 10:56 PM, Alan Modra wrote:
>> >>>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>> >>>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>> >>>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>> >>>>>>> revisiting an old patch of mine.
>> >>>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>> >>>>>>
>> >>>>>> I'm not against this patch, but it certainly seems like you would be
>> >>>>>> better served by just implementing tls descriptors?
>> >>>>>
>> >>>>> I think this is one better than tls descriptors, because powerpc
>> >>>>> avoids the indirect function call used by tls descriptors.
>> >>>>
>> >>>> You mean to say it is "faster" than tls descriptors, but at the same
>> >>>
>> >>> To be honest, there isn't much difference in the optimized case where
>> >>> static TLS is available.  It boils down to an indirect call to a
>> >>> function that loads one value vs. a direct call to a stub that loads
>> >>> two values and compares one against zero.  I think what I've
>> >>> implemented is slightly better for PowerPC, but whether that would
>> >>> carry over to other architectures is debatable.
>> >>
>> >> If the performance difference isn't measurable in real-world
>> >> applications, I would think uniformity between targets would be a lot
>> >> more valuable.
>> >>
>> >> I also don't see how your approach is a "direct call". The function
>> >> being called is in a different DSO so it has to go through a pointer
>> >> in the GOT or similar, in which case it's just as "indirect" as the
>> >> TLSDESC call would be.
>> >
>> > I agree. And this was my initial inclination, but I'm not against what
>> > Alan has implemented. As a machine maintainer he should be allowed some
>> > leeway to argue this implementation is "N instructions less" and therefore
>> > must be faster, but that such speed is harder to show in a microbenchmark,
>> > it would in the mean result in say less CPU usage over billions of cycles.
>> >
>> > IBM has to accept that the downside to all of this is that breakage in
>> > this area may take longer to fix, and get less fixes than those arches
>> > already using TLS DESC.
>>
>> Speaking of TLS DESC, are there any tests for TLS DESC in
>> glibc?  I never implemented TLS DESC for x32 since I didn't
>> find any run-time tests for TLS DESC in GCC nor glibc.
>
> Not that I know of. i386 TLSDESC was broken in binutils for several
> years and only recently fixed... Until a couple months ago nobody
> noticed. :-(
>
> This situation really should be set right (with proper tests and
> timeline for changing the default to TLSDESC) so we can put an end to
> the invalid use of IE-model in shared libraries.

Another thing,  x86 and x86-64 TLS DESC spec should be
in x86 and x86-64 psABIs, not a URL.
  
Carlos O'Donell March 20, 2015, 4:21 p.m. UTC | #13
On 03/20/2015 12:19 PM, H.J. Lu wrote:
> On Fri, Mar 20, 2015 at 9:14 AM, Rich Felker <dalias@libc.org> wrote:
>> On Fri, Mar 20, 2015 at 08:51:39AM -0700, H.J. Lu wrote:
>>> On Fri, Mar 20, 2015 at 8:48 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>>>> On 03/20/2015 11:27 AM, Rich Felker wrote:
>>>>> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
>>>>>> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
>>>>>>> On 03/18/2015 10:56 PM, Alan Modra wrote:
>>>>>>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>>>>>>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>>>>>>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>>>>>>>>>> revisiting an old patch of mine.
>>>>>>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>>>>>>>>>
>>>>>>>>> I'm not against this patch, but it certainly seems like you would be
>>>>>>>>> better served by just implementing tls descriptors?
>>>>>>>>
>>>>>>>> I think this is one better than tls descriptors, because powerpc
>>>>>>>> avoids the indirect function call used by tls descriptors.
>>>>>>>
>>>>>>> You mean to say it is "faster" than tls descriptors, but at the same
>>>>>>
>>>>>> To be honest, there isn't much difference in the optimized case where
>>>>>> static TLS is available.  It boils down to an indirect call to a
>>>>>> function that loads one value vs. a direct call to a stub that loads
>>>>>> two values and compares one against zero.  I think what I've
>>>>>> implemented is slightly better for PowerPC, but whether that would
>>>>>> carry over to other architectures is debatable.
>>>>>
>>>>> If the performance difference isn't measurable in real-world
>>>>> applications, I would think uniformity between targets would be a lot
>>>>> more valuable.
>>>>>
>>>>> I also don't see how your approach is a "direct call". The function
>>>>> being called is in a different DSO so it has to go through a pointer
>>>>> in the GOT or similar, in which case it's just as "indirect" as the
>>>>> TLSDESC call would be.
>>>>
>>>> I agree. And this was my initial inclination, but I'm not against what
>>>> Alan has implemented. As a machine maintainer he should be allowed some
>>>> leeway to argue this implementation is "N instructions less" and therefore
>>>> must be faster, but that such speed is harder to show in a microbenchmark,
>>>> it would in the mean result in say less CPU usage over billions of cycles.
>>>>
>>>> IBM has to accept that the downside to all of this is that breakage in
>>>> this area may take longer to fix, and get less fixes than those arches
>>>> already using TLS DESC.
>>>
>>> Speaking of TLS DESC, are there any tests for TLS DESC in
>>> glibc?  I never implemented TLS DESC for x32 since I didn't
>>> find any run-time tests for TLS DESC in GCC nor glibc.
>>
>> Not that I know of. i386 TLSDESC was broken in binutils for several
>> years and only recently fixed... Until a couple months ago nobody
>> noticed. :-(
>>
>> This situation really should be set right (with proper tests and
>> timeline for changing the default to TLSDESC) so we can put an end to
>> the invalid use of IE-model in shared libraries.
> 
> Another thing,  x86 and x86-64 TLS DESC spec should be
> in x86 and x86-64 psABIs, not a URL.

Agreed. As should the TLS specification instead of a URL reference to
tls.pdf which is going to get out of date.

Cheers,
Carlos.
  
H.J. Lu March 20, 2015, 4:24 p.m. UTC | #14
On Fri, Mar 20, 2015 at 9:21 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 03/20/2015 12:19 PM, H.J. Lu wrote:
>> On Fri, Mar 20, 2015 at 9:14 AM, Rich Felker <dalias@libc.org> wrote:
>>> On Fri, Mar 20, 2015 at 08:51:39AM -0700, H.J. Lu wrote:
>>>> On Fri, Mar 20, 2015 at 8:48 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>>>>> On 03/20/2015 11:27 AM, Rich Felker wrote:
>>>>>> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
>>>>>>> On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
>>>>>>>> On 03/18/2015 10:56 PM, Alan Modra wrote:
>>>>>>>>> On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
>>>>>>>>>> On 03/18/2015 02:11 AM, Alan Modra wrote:
>>>>>>>>>>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
>>>>>>>>>>> revisiting an old patch of mine.
>>>>>>>>>>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
>>>>>>>>>>
>>>>>>>>>> I'm not against this patch, but it certainly seems like you would be
>>>>>>>>>> better served by just implementing tls descriptors?
>>>>>>>>>
>>>>>>>>> I think this is one better than tls descriptors, because powerpc
>>>>>>>>> avoids the indirect function call used by tls descriptors.
>>>>>>>>
>>>>>>>> You mean to say it is "faster" than tls descriptors, but at the same
>>>>>>>
>>>>>>> To be honest, there isn't much difference in the optimized case where
>>>>>>> static TLS is available.  It boils down to an indirect call to a
>>>>>>> function that loads one value vs. a direct call to a stub that loads
>>>>>>> two values and compares one against zero.  I think what I've
>>>>>>> implemented is slightly better for PowerPC, but whether that would
>>>>>>> carry over to other architectures is debatable.
>>>>>>
>>>>>> If the performance difference isn't measurable in real-world
>>>>>> applications, I would think uniformity between targets would be a lot
>>>>>> more valuable.
>>>>>>
>>>>>> I also don't see how your approach is a "direct call". The function
>>>>>> being called is in a different DSO so it has to go through a pointer
>>>>>> in the GOT or similar, in which case it's just as "indirect" as the
>>>>>> TLSDESC call would be.
>>>>>
>>>>> I agree. And this was my initial inclination, but I'm not against what
>>>>> Alan has implemented. As a machine maintainer he should be allowed some
>>>>> leeway to argue this implementation is "N instructions less" and therefore
>>>>> must be faster, but that such speed is harder to show in a microbenchmark,
>>>>> it would in the mean result in say less CPU usage over billions of cycles.
>>>>>
>>>>> IBM has to accept that the downside to all of this is that breakage in
>>>>> this area may take longer to fix, and get less fixes than those arches
>>>>> already using TLS DESC.
>>>>
>>>> Speaking of TLS DESC, are there any tests for TLS DESC in
>>>> glibc?  I never implemented TLS DESC for x32 since I didn't
>>>> find any run-time tests for TLS DESC in GCC nor glibc.
>>>
>>> Not that I know of. i386 TLSDESC was broken in binutils for several
>>> years and only recently fixed... Until a couple months ago nobody
>>> noticed. :-(
>>>
>>> This situation really should be set right (with proper tests and
>>> timeline for changing the default to TLSDESC) so we can put an end to
>>> the invalid use of IE-model in shared libraries.
>>
>> Another thing,  x86 and x86-64 TLS DESC spec should be
>> in x86 and x86-64 psABIs, not a URL.
>
> Agreed. As should the TLS specification instead of a URL reference to
> tls.pdf which is going to get out of date.

TLS spec is too big to be included in x86 psABIs unless
Ulrich contributed patches for tex source to x86 psABIs.
  
Carlos O'Donell March 20, 2015, 5:34 p.m. UTC | #15
On 03/20/2015 12:24 PM, H.J. Lu wrote:
>>> Another thing,  x86 and x86-64 TLS DESC spec should be
>>> in x86 and x86-64 psABIs, not a URL.
>>
>> Agreed. As should the TLS specification instead of a URL reference to
>> tls.pdf which is going to get out of date.
> 
> TLS spec is too big to be included in x86 psABIs unless
> Ulrich contributed patches for tex source to x86 psABIs.

Yes, the whole spec is too big, the psABI would have only
the portion for x86.

Does that make sense?

Cheers,
Carlos.
  
H.J. Lu March 20, 2015, 5:37 p.m. UTC | #16
On Fri, Mar 20, 2015 at 10:34 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 03/20/2015 12:24 PM, H.J. Lu wrote:
>>>> Another thing,  x86 and x86-64 TLS DESC spec should be
>>>> in x86 and x86-64 psABIs, not a URL.
>>>
>>> Agreed. As should the TLS specification instead of a URL reference to
>>> tls.pdf which is going to get out of date.
>>
>> TLS spec is too big to be included in x86 psABIs unless
>> Ulrich contributed patches for tex source to x86 psABIs.
>
> Yes, the whole spec is too big, the psABI would have only
> the portion for x86.
>
> Does that make sense?

Sure.  Patches are welcome.
  
Rich Felker March 20, 2015, 6:01 p.m. UTC | #17
On Fri, Mar 20, 2015 at 09:24:21AM -0700, H.J. Lu wrote:
> >>> Not that I know of. i386 TLSDESC was broken in binutils for several
> >>> years and only recently fixed... Until a couple months ago nobody
> >>> noticed. :-(
> >>>
> >>> This situation really should be set right (with proper tests and
> >>> timeline for changing the default to TLSDESC) so we can put an end to
> >>> the invalid use of IE-model in shared libraries.
> >>
> >> Another thing,  x86 and x86-64 TLS DESC spec should be
> >> in x86 and x86-64 psABIs, not a URL.
> >
> > Agreed. As should the TLS specification instead of a URL reference to
> > tls.pdf which is going to get out of date.
> 
> TLS spec is too big to be included in x86 psABIs unless
> Ulrich contributed patches for tex source to x86 psABIs.

Are you sure? His TLS docs contain a lot of informative content that
should not be taken as spec. There's no reason for a psABI to document
optimizations a linker can make. Simply documenting the semantics of
the relocation types and the actual ABI constraints they impose
(mainly, location of static TLS relative to the thread-pointer) should
be possible in a fairly compact text suitable for inclusion in the
psABI. Of course actually writing that is a bit of work...

Rich
  
Carlos O'Donell March 20, 2015, 6:04 p.m. UTC | #18
On 03/20/2015 01:37 PM, H.J. Lu wrote:
> On Fri, Mar 20, 2015 at 10:34 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 03/20/2015 12:24 PM, H.J. Lu wrote:
>>>>> Another thing,  x86 and x86-64 TLS DESC spec should be
>>>>> in x86 and x86-64 psABIs, not a URL.
>>>>
>>>> Agreed. As should the TLS specification instead of a URL reference to
>>>> tls.pdf which is going to get out of date.
>>>
>>> TLS spec is too big to be included in x86 psABIs unless
>>> Ulrich contributed patches for tex source to x86 psABIs.
>>
>> Yes, the whole spec is too big, the psABI would have only
>> the portion for x86.
>>
>> Does that make sense?
> 
> Sure.  Patches are welcome.

Thanks, just making sure you didn't object.

c.
  
H.J. Lu March 20, 2015, 6:09 p.m. UTC | #19
On Fri, Mar 20, 2015 at 11:04 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 03/20/2015 01:37 PM, H.J. Lu wrote:
>> On Fri, Mar 20, 2015 at 10:34 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>>> On 03/20/2015 12:24 PM, H.J. Lu wrote:
>>>>>> Another thing,  x86 and x86-64 TLS DESC spec should be
>>>>>> in x86 and x86-64 psABIs, not a URL.
>>>>>
>>>>> Agreed. As should the TLS specification instead of a URL reference to
>>>>> tls.pdf which is going to get out of date.
>>>>
>>>> TLS spec is too big to be included in x86 psABIs unless
>>>> Ulrich contributed patches for tex source to x86 psABIs.
>>>
>>> Yes, the whole spec is too big, the psABI would have only
>>> the portion for x86.
>>>
>>> Does that make sense?
>>
>> Sure.  Patches are welcome.
>
> Thanks, just making sure you didn't object.

I have been trying to keep x86 psABIs up to date.  Since x86-32
psABI is based on x86-64 psABI,  changes like this should go
into x86-64 psABI first.  X86-64 psABI patches should be sent to

https://groups.google.com/forum/#!forum/x86-64-abi

Thanks.
  
Alan Modra March 21, 2015, 3:07 a.m. UTC | #20
On Fri, Mar 20, 2015 at 11:27:12AM -0400, Rich Felker wrote:
> On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
> > On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> > > On 03/18/2015 10:56 PM, Alan Modra wrote:
> > > > On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> > > >> On 03/18/2015 02:11 AM, Alan Modra wrote:
> > > >>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> > > >>> revisiting an old patch of mine.
> > > >>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> > > >>
> > > >> I'm not against this patch, but it certainly seems like you would be
> > > >> better served by just implementing tls descriptors?
> > > > 
> > > > I think this is one better than tls descriptors, because powerpc
> > > > avoids the indirect function call used by tls descriptors.
> > > 
> > > You mean to say it is "faster" than tls descriptors, but at the same
> > 
> > To be honest, there isn't much difference in the optimized case where
> > static TLS is available.  It boils down to an indirect call to a
> > function that loads one value vs. a direct call to a stub that loads
> > two values and compares one against zero.  I think what I've
> > implemented is slightly better for PowerPC, but whether that would
> > carry over to other architectures is debatable.
> 
> If the performance difference isn't measurable in real-world
> applications, I would think uniformity between targets would be a lot
> more valuable.

Think of my design as "TLS descriptors version 2".  I take the best
features of TLS descriptors and add one trick, the special linker
stub, that allows you to omit many of the nasty details of the current
TLS descriptor design.  A target that currently has TLS support but no
TLS descriptor support and follows the powerpc design:
1) won't need to implement gcc changes for tls descriptors,
2) won't need to define new relocations,
3) won't need to implement linker support for tls descriptors, quite a
   large effort, and
4) won't need to implement dl-tlsdesc.S and tlsdesc.c in glibc, also
   not a simple task.
Another benefit in terms of reliability (and repeatable user timing!)
is that extended TLS descriptors are not needed, so the locking and
mallocing in tlsdeschtab.h is avoided.

Admittedly, part of the reason a port is so much easier is due to
omitting lazy TLS resolution.  Lazy TLS is complex.  What's more, the
per-target support code is non-trivial.  All of tlsdesc.c and half of
dl-tlsdesc.S is lazy TLS support.  I question whether the added
complexity provides commensurate benefit in real-world applications,
apart from the degenerate case of loading a shared library that is
never used.  (And even then, you'd need a lot of __thread variables to
make it worthwhile.)

In fact, I wouldn't be surprised to find lazy TLS has a net negative
benefit in real-world applications!
/me dons asbestos suit.  :)

> I also don't see how your approach is a "direct call". The function
> being called is in a different DSO so it has to go through a pointer
> in the GOT or similar, in which case it's just as "indirect" as the
> TLSDESC call would be.

It is a direct call to the linker provided stub, which will return
after a few instructions in the optimized case when static TLS is
available.

Control is passed to __tls_get_addr_opt only when no static TLS was
available for the shared library at the time the library was
dynamically relocated, ie. it was dlopen'ed and not enough spare
static TLS was free.

Note that __tls_get_addr_opt is currently an alias for
__tls_get_addr.  I believe it could be implemented as a different
function with a few more bells and whistles to provide lazy TLS
resolution, but I haven't proven that.
  
Rich Felker March 21, 2015, 4:36 a.m. UTC | #21
On Sat, Mar 21, 2015 at 01:37:02PM +1030, Alan Modra wrote:
> On Fri, Mar 20, 2015 at 11:27:12AM -0400, Rich Felker wrote:
> > On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
> > > On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> > > > On 03/18/2015 10:56 PM, Alan Modra wrote:
> > > > > On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> > > > >> On 03/18/2015 02:11 AM, Alan Modra wrote:
> > > > >>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> > > > >>> revisiting an old patch of mine.
> > > > >>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> > > > >>
> > > > >> I'm not against this patch, but it certainly seems like you would be
> > > > >> better served by just implementing tls descriptors?
> > > > > 
> > > > > I think this is one better than tls descriptors, because powerpc
> > > > > avoids the indirect function call used by tls descriptors.
> > > > 
> > > > You mean to say it is "faster" than tls descriptors, but at the same
> > > 
> > > To be honest, there isn't much difference in the optimized case where
> > > static TLS is available.  It boils down to an indirect call to a
> > > function that loads one value vs. a direct call to a stub that loads
> > > two values and compares one against zero.  I think what I've
> > > implemented is slightly better for PowerPC, but whether that would
> > > carry over to other architectures is debatable.
> > 
> > If the performance difference isn't measurable in real-world
> > applications, I would think uniformity between targets would be a lot
> > more valuable.
> 
> Think of my design as "TLS descriptors version 2".  I take the best
> features of TLS descriptors and add one trick, the special linker
> stub, that allows you to omit many of the nasty details of the current
> TLS descriptor design.  A target that currently has TLS support but no
> TLS descriptor support and follows the powerpc design:
> 1) won't need to implement gcc changes for tls descriptors,
> 2) won't need to define new relocations,
> 3) won't need to implement linker support for tls descriptors, quite a
>    large effort, and
> 4) won't need to implement dl-tlsdesc.S and tlsdesc.c in glibc, also
>    not a simple task.
> Another benefit in terms of reliability (and repeatable user timing!)
> is that extended TLS descriptors are not needed, so the locking and
> mallocing in tlsdeschtab.h is avoided.

If the lazy allocation stuff is removed (which it should be; it breaks
AS-safety and other things), the last issue would go away.

> Admittedly, part of the reason a port is so much easier is due to
> omitting lazy TLS resolution.  Lazy TLS is complex.  What's more, the
> per-target support code is non-trivial.  All of tlsdesc.c and half of
> dl-tlsdesc.S is lazy TLS support.  I question whether the added
> complexity provides commensurate benefit in real-world applications,
> apart from the degenerate case of loading a shared library that is
> never used.  (And even then, you'd need a lot of __thread variables to
> make it worthwhile.)
> 
> In fact, I wouldn't be surprised to find lazy TLS has a net negative
> benefit in real-world applications!
> /me dons asbestos suit.  :)

I completely agree. I want to see it removed.

> > I also don't see how your approach is a "direct call". The function
> > being called is in a different DSO so it has to go through a pointer
> > in the GOT or similar, in which case it's just as "indirect" as the
> > TLSDESC call would be.
> 
> It is a direct call to the linker provided stub, which will return
> after a few instructions in the optimized case when static TLS is
> available.

That linker-provided stub address is loaded from a "GOT slot" of some
sort, just like the tlsdesc function would be. Either way you have a
PC/GP-relative load followed by a jump to the loaded address. There's
actually one additional level of indirection to load this pointer for
TLSDESC, but for static TLS, the callee returns instantly after
performing a single load.

With non-TLSDESC dynamic TLS on the other hand, there's an additional
PC/GP-relative address computation (for the module/offset structure's
address to pass) in the caller, which should equal out with the cost
of the extra indirection for TLSDESC. But then there's a fair bit of
additional work to be done in the callee.

> Control is passed to __tls_get_addr_opt only when no static TLS was
> available for the shared library at the time the library was
> dynamically relocated, ie. it was dlopen'ed and not enough spare
> static TLS was free.

Where is control passed if static TLS was used? Maybe I'm
misunderstanding your design? How would the dynamic linker resolve
some calls to __tls_get_addr to different places than other calls,
when there's only a single GOT entry for it?

Rich
  
Alan Modra March 21, 2015, 7:33 a.m. UTC | #22
On Sat, Mar 21, 2015 at 12:36:30AM -0400, Rich Felker wrote:
> On Sat, Mar 21, 2015 at 01:37:02PM +1030, Alan Modra wrote:
> > On Fri, Mar 20, 2015 at 11:27:12AM -0400, Rich Felker wrote:
> > > I also don't see how your approach is a "direct call". The function
> > > being called is in a different DSO so it has to go through a pointer
> > > in the GOT or similar, in which case it's just as "indirect" as the
> > > TLSDESC call would be.
> > 
> > It is a direct call to the linker provided stub, which will return
> > after a few instructions in the optimized case when static TLS is
> > available.
> 
> That linker-provided stub address is loaded from a "GOT slot" of some
> sort,

No, it really is a direct call.  The linker provided stub is local.

This ppc64 elfv2 GD sequence in a relocatable object file

 addi r3,r2,x@got@tlsgd
 bl __tls_get_addr(x@tlsgd)
 nop

results in shared library code of

 addi r3,r2,x@got@tlsgd     # r3 -> tls_index entry in GOT
 bl __tls_get_addr_opt_stub # direct call
 nop
.
.

__tls_get_addr_opt_stub:
 ld r11,0(r3)	# tls_index->ti_module
 ld r12,8(r3)	# tls_index->ti_offset
 mr r0,r3
 cmpdi r11,0
 add r3,r12,r13 # r13 == thread pointer
 beqlr		# return if static TLS allocated

 mr r3,r0
 mflr r11
 std r11, 8(r1)
 std r2,24(r1)
 addis r12,r2,__tls_get_addr_opt@plt@ha
 ld r12, __tls_get_addr_opt@plt@l(r12)
 mtctr r12
 bctrl		# call __tls_get_addr_opt
 ld r2,24(r1)
 ld r11,8(r1)
 mtlr r11
 blr
  

Patch

diff --git a/NEWS b/NEWS
index 86394b8..a6a8b6d 100644
--- a/NEWS
+++ b/NEWS
@@ -17,6 +17,9 @@  Version 2.22
   18042, 18043, 18046, 18047, 18068, 18080, 18093, 18104, 18110, 18111,
   18128.
 
+* A powerpc and powerpc64 optimization for TLS, similar to TLS descriptors
+  for LD and GD on x86 and x86-64, has been implemented.
+
 * Character encoding and ctype tables were updated to Unicode 7.0.0, using
   new generator scripts contributed by Pravin Satpute and Mike FABIAN (Red
   Hat).  These updates cause user visible changes, such as the fix for bug
diff --git a/elf/dynamic-link.h b/elf/dynamic-link.h
index 6f4a773..8d428e2 100644
--- a/elf/dynamic-link.h
+++ b/elf/dynamic-link.h
@@ -25,11 +25,14 @@ 
    an attempt to allocate it in surplus space on the fly.  If that
    can't be done, we fall back to the error that DF_STATIC_TLS is
    intended to produce.  */
+#define HAVE_STATIC_TLS(map, sym_map)					\
+    (__builtin_expect ((sym_map)->l_tls_offset != NO_TLS_OFFSET		\
+		       && ((sym_map)->l_tls_offset			\
+			   != FORCED_DYNAMIC_TLS_OFFSET), 1))
+
 #define CHECK_STATIC_TLS(map, sym_map)					\
     do {								\
-      if (__builtin_expect ((sym_map)->l_tls_offset == NO_TLS_OFFSET	\
-			    || ((sym_map)->l_tls_offset			\
-				== FORCED_DYNAMIC_TLS_OFFSET), 0))	\
+      if (!HAVE_STATIC_TLS (map, sym_map))				\
 	_dl_allocate_static_tls (sym_map);				\
     } while (0)
 
diff --git a/elf/elf.h b/elf/elf.h
index 496f08d..71492a2 100644
--- a/elf/elf.h
+++ b/elf/elf.h
@@ -2194,6 +2194,8 @@  enum
 #define R_PPC_GOT_DTPREL16_LO	92 /* half16*	(sym+add)@got@dtprel@l */
 #define R_PPC_GOT_DTPREL16_HI	93 /* half16*	(sym+add)@got@dtprel@h */
 #define R_PPC_GOT_DTPREL16_HA	94 /* half16*	(sym+add)@got@dtprel@ha */
+#define R_PPC_TLSGD		95 /* none	(sym+add)@tlsgd */
+#define R_PPC_TLSLD		96 /* none	(sym+add)@tlsld */
 
 /* The remaining relocs are from the Embedded ELF ABI, and are not
    in the SVR4 ELF ABI.  */
@@ -2237,7 +2239,11 @@  enum
 
 /* PowerPC specific values for the Dyn d_tag field.  */
 #define DT_PPC_GOT		(DT_LOPROC + 0)
-#define DT_PPC_NUM		1
+#define DT_PPC_OPT		(DT_LOPROC + 1)
+#define DT_PPC_NUM		2
+
+/* PowerPC specific values for the DT_PPC_OPT Dyn entry.  */
+#define PPC_OPT_TLS		1
 
 /* PowerPC64 relocations defined by the ABIs */
 #define R_PPC64_NONE		R_PPC_NONE
diff --git a/sysdeps/powerpc/Versions b/sysdeps/powerpc/Versions
index 47c2c3e..2aebf7c 100644
--- a/sysdeps/powerpc/Versions
+++ b/sysdeps/powerpc/Versions
@@ -15,3 +15,9 @@  libc {
     __vmx__libc_longjmp; __vmx__libc_siglongjmp;
   }
 }
+
+ld {
+  GLIBC_2.22 {
+    __tls_get_addr_opt;
+  }
+}
diff --git a/sysdeps/powerpc/dl-tls.c b/sysdeps/powerpc/dl-tls.c
new file mode 100644
index 0000000..a18b23e
--- /dev/null
+++ b/sysdeps/powerpc/dl-tls.c
@@ -0,0 +1,24 @@ 
+/* Thread-local storage handling in the ELF dynamic linker.  PowerPC version.
+   Copyright (C) 2009-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+#include "elf/dl-tls.c"
+
+#ifdef SHARED
+strong_alias (__tls_get_addr, __tls_get_addr_opt)
+#endif
diff --git a/sysdeps/powerpc/powerpc32/dl-machine.h b/sysdeps/powerpc/powerpc32/dl-machine.h
index c94674f..8b0c067 100644
--- a/sysdeps/powerpc/powerpc32/dl-machine.h
+++ b/sysdeps/powerpc/powerpc32/dl-machine.h
@@ -333,6 +333,32 @@  elf_machine_rela (struct link_map *map, const Elf32_Rela *reloc,
 # endif
 
     case R_PPC_DTPMOD32:
+      if (map->l_info[DT_PPC(OPT)]
+	  && (map->l_info[DT_PPC(OPT)]->d_un.d_val & PPC_OPT_TLS))
+	{
+	  if (!NOT_BOOTSTRAP)
+	    {
+	      reloc_addr[0] = 0;
+	      reloc_addr[1] = (sym_map->l_tls_offset - TLS_TP_OFFSET
+			       + TLS_DTV_OFFSET);
+	      break;
+	    }
+	  else if (sym_map != NULL)
+	    {
+# ifndef SHARED
+	      CHECK_STATIC_TLS (map, sym_map);
+# else
+	      if (TRY_STATIC_TLS (map, sym_map))
+# endif
+		{
+		  reloc_addr[0] = 0;
+		  /* Set up for local dynamic.  */
+		  reloc_addr[1] = (sym_map->l_tls_offset - TLS_TP_OFFSET
+				   + TLS_DTV_OFFSET);
+		  break;
+		}
+	    }
+	}
       if (!NOT_BOOTSTRAP)
 	/* During startup the dynamic linker is always index 1.  */
 	*reloc_addr = 1;
@@ -342,6 +368,28 @@  elf_machine_rela (struct link_map *map, const Elf32_Rela *reloc,
 	*reloc_addr = sym_map->l_tls_modid;
       break;
     case R_PPC_DTPREL32:
+      if (map->l_info[DT_PPC(OPT)]
+	  && (map->l_info[DT_PPC(OPT)]->d_un.d_val & PPC_OPT_TLS))
+	{
+	  if (!NOT_BOOTSTRAP)
+	    {
+	      *reloc_addr = TLS_TPREL_VALUE (sym_map, sym, reloc);
+	      break;
+	    }
+	  else if (sym_map != NULL)
+	    {
+	      /* This reloc is always preceded by R_PPC_DTPMOD32.  */
+# ifndef SHARED
+	      assert (HAVE_STATIC_TLS (map, sym_map));
+# else
+	      if (HAVE_STATIC_TLS (map, sym_map))
+# endif
+		{
+		  *reloc_addr = TLS_TPREL_VALUE (sym_map, sym, reloc);
+		  break;
+		}
+	    }
+	}
       /* During relocation all TLS symbols are defined and used.
 	 Therefore the offset is already correct.  */
       if (NOT_BOOTSTRAP && sym_map != NULL)
diff --git a/sysdeps/powerpc/powerpc64/dl-machine.h b/sysdeps/powerpc/powerpc64/dl-machine.h
index 5cb0087..55ac736 100644
--- a/sysdeps/powerpc/powerpc64/dl-machine.h
+++ b/sysdeps/powerpc/powerpc64/dl-machine.h
@@ -701,6 +701,32 @@  elf_machine_rela (struct link_map *map,
       return;
 
     case R_PPC64_DTPMOD64:
+      if (map->l_info[DT_PPC64(OPT)]
+	  && (map->l_info[DT_PPC64(OPT)]->d_un.d_val & PPC64_OPT_TLS))
+	{
+#ifdef RTLD_BOOTSTRAP
+	  reloc_addr[0] = 0;
+	  reloc_addr[1] = (sym_map->l_tls_offset - TLS_TP_OFFSET
+			   + TLS_DTV_OFFSET);
+	  return;
+#else
+	  if (sym_map != NULL)
+	    {
+# ifndef SHARED
+	      CHECK_STATIC_TLS (map, sym_map);
+# else
+	      if (TRY_STATIC_TLS (map, sym_map))
+# endif
+		{
+		  reloc_addr[0] = 0;
+		  /* Set up for local dynamic.  */
+		  reloc_addr[1] = (sym_map->l_tls_offset - TLS_TP_OFFSET
+				   + TLS_DTV_OFFSET);
+		  return;
+		}
+	    }
+#endif
+	}
 #ifdef RTLD_BOOTSTRAP
       /* During startup the dynamic linker is always index 1.  */
       *reloc_addr = 1;
@@ -713,6 +739,28 @@  elf_machine_rela (struct link_map *map,
       return;
 
     case R_PPC64_DTPREL64:
+      if (map->l_info[DT_PPC64(OPT)]
+	  && (map->l_info[DT_PPC64(OPT)]->d_un.d_val & PPC64_OPT_TLS))
+	{
+#ifdef RTLD_BOOTSTRAP
+	  *reloc_addr = TLS_TPREL_VALUE (sym_map, sym, reloc);
+	  return;
+#else
+	  if (sym_map != NULL)
+	    {
+	      /* This reloc is always preceded by R_PPC64_DTPMOD64.  */
+# ifndef SHARED
+	      assert (HAVE_STATIC_TLS (map, sym_map));
+# else
+	      if (HAVE_STATIC_TLS (map, sym_map))
+# endif
+		{
+		  *reloc_addr = TLS_TPREL_VALUE (sym_map, sym, reloc);
+		  return;
+		}
+	    }
+#endif
+	}
       /* During relocation all TLS symbols are defined and used.
 	 Therefore the offset is already correct.  */
 #ifndef RTLD_BOOTSTRAP
diff --git a/sysdeps/unix/sysv/linux/powerpc/Makefile b/sysdeps/unix/sysv/linux/powerpc/Makefile
index fcf3bb5..c89ed9e 100644
--- a/sysdeps/unix/sysv/linux/powerpc/Makefile
+++ b/sysdeps/unix/sysv/linux/powerpc/Makefile
@@ -20,6 +20,8 @@  ifeq ($(build-shared),yes)
 # This is needed for DSO loading from static binaries.
 sysdep-dl-routines += dl-static
 endif
+# Otherwise tst-tls-dlinfo fails due to tst-tlsmod2.so using static tls.
+LDFLAGS-tst-tlsmod2.so += -Wl,--no-tls-get-addr-optimize
 endif
 
 ifeq ($(subdir),misc)
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
index d71611f..052f311 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist
@@ -1,3 +1,6 @@ 
+GLIBC_2.22
+ GLIBC_2.22 A
+ __tls_get_addr_opt F
 GLIBC_2.0
  GLIBC_2.0 A
  __libc_memalign F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist
index 3530fb4..3174e21 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist
@@ -9,3 +9,6 @@  GLIBC_2.17
  free F
  malloc F
  realloc F
+GLIBC_2.22
+ GLIBC_2.22 A
+ __tls_get_addr_opt F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist
index 899360e..d8c4201 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist
@@ -1,3 +1,6 @@ 
+GLIBC_2.22
+ GLIBC_2.22 A
+ __tls_get_addr_opt F
 GLIBC_2.3
  GLIBC_2.3 A
  __libc_memalign F