x86-64: Align the stack in __tls_get_addr [BZ #21609]

Message ID CAMe9rOrOvwrCr5stJ+TkNFczmFYP6+yQLRwmTjjEdQCOguXyXg@mail.gmail.com
State New, archived
Headers

Commit Message

H.J. Lu July 4, 2017, 8:28 p.m. UTC
  On Tue, Jul 4, 2017 at 1:16 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 07/04/2017 06:25 PM, H.J. Lu wrote:
>> Something like this.
>
> Sorry, I lost track of the patches.
>
> Would you please post the two patches you propose to apply?
>

Here are 2 patches.
  

Comments

Florian Weimer July 6, 2017, 11:34 a.m. UTC | #1
On 07/04/2017 10:28 PM, H.J. Lu wrote:
> 2017-07-04  Florian Weimer  <fweimer@redhat.com>
> 	    H.J. Lu  <hongjiu.lu@intel.com>
> 
> 	[BZ #21609]
> 	* sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr.
> 	(gen-as-const-headers): Add rtld-offsets.sym.
> 	* sysdeps/x86_64/dl-tls.c: New file.
> 	* sysdeps/x86_64/rtld-offsets.sym: Likwise.
> 	* sysdeps/x86_64/tls_get_addr.S: Likewise.
> 	* sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards.
> 	* sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New.
> 	(TI_OFFSET_OFFSET): Likwise.

I verified that the addition of __tls_get_addr_slow does not affect the
inlining decisions made by GCC 6, so I think this patch is okay.  Would
you please commit it?

I would like to backport it to older glibc branches as well (at least
2.25 and 2.24).

I'm not sure if the second patch is actually needed, and if the new
symbol is the right way to do it. It penalizes those who use modern and
fixed compilers by making their lives more complicated, while we should
do it the other way round (those who use unfixed compilers should
suffer, at least if the choice is between the two outcomes).

Thanks,
Florian
  
Carlos O'Donell July 6, 2017, 2:30 p.m. UTC | #2
On 07/06/2017 07:34 AM, Florian Weimer wrote:
> On 07/04/2017 10:28 PM, H.J. Lu wrote:
>> 2017-07-04  Florian Weimer  <fweimer@redhat.com>
>> 	    H.J. Lu  <hongjiu.lu@intel.com>
>>
>> 	[BZ #21609]
>> 	* sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr.
>> 	(gen-as-const-headers): Add rtld-offsets.sym.
>> 	* sysdeps/x86_64/dl-tls.c: New file.
>> 	* sysdeps/x86_64/rtld-offsets.sym: Likwise.
>> 	* sysdeps/x86_64/tls_get_addr.S: Likewise.
>> 	* sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards.
>> 	* sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New.
>> 	(TI_OFFSET_OFFSET): Likwise.
> 
> I verified that the addition of __tls_get_addr_slow does not affect the
> inlining decisions made by GCC 6, so I think this patch is okay.  Would
> you please commit it?
> 
> I would like to backport it to older glibc branches as well (at least
> 2.25 and 2.24).
> 
> I'm not sure if the second patch is actually needed, and if the new
> symbol is the right way to do it. It penalizes those who use modern and
> fixed compilers by making their lives more complicated, while we should
> do it the other way round (those who use unfixed compilers should
> suffer, at least if the choice is between the two outcomes).

Isn't the "suffering" just that you cannot build a compiler that is both
compatible with the old glibc (performance loss) and new glibc (fastest
performance)?

You have to choose one.
  
Florian Weimer July 6, 2017, 2:35 p.m. UTC | #3
On 07/06/2017 04:30 PM, Carlos O'Donell wrote:

>> I'm not sure if the second patch is actually needed, and if the new
>> symbol is the right way to do it. It penalizes those who use modern and
>> fixed compilers by making their lives more complicated, while we should
>> do it the other way round (those who use unfixed compilers should
>> suffer, at least if the choice is between the two outcomes).
> 
> Isn't the "suffering" just that you cannot build a compiler that is both
> compatible with the old glibc (performance loss) and new glibc (fastest
> performance)?

No, there is also the programmer inconvenience that they'd have to deal
with yet another -m flag.

If it's unconditional, you wouldn't be able to use the new GCC with the
alternate symbol enabled for glibc development (stable branch
maintenance, to be specific).

It's that kind of complexity I'm objecting to.

Thanks,
Florian
  
Carlos O'Donell July 6, 2017, 2:52 p.m. UTC | #4
On 07/06/2017 10:35 AM, Florian Weimer wrote:
> On 07/06/2017 04:30 PM, Carlos O'Donell wrote:
> 
>>> I'm not sure if the second patch is actually needed, and if the new
>>> symbol is the right way to do it. It penalizes those who use modern and
>>> fixed compilers by making their lives more complicated, while we should
>>> do it the other way round (those who use unfixed compilers should
>>> suffer, at least if the choice is between the two outcomes).
>>
>> Isn't the "suffering" just that you cannot build a compiler that is both
>> compatible with the old glibc (performance loss) and new glibc (fastest
>> performance)?
> 
> No, there is also the programmer inconvenience that they'd have to deal
> with yet another -m flag.
> 
> If it's unconditional, you wouldn't be able to use the new GCC with the
> alternate symbol enabled for glibc development (stable branch
> maintenance, to be specific).
> 
> It's that kind of complexity I'm objecting to.

The complexity is a consequence of the bug and our desire to attain optimal
performance for __tls_get_addr calls, which are integral to __thread usage.

Objecting abstractly to the complexity does not take into account the entire
set of consequences surrounding this problem. The objections must be balanced
against the gains being sought here, and the performance our users want.

As far as I understand it there are two paths we could take:

(a) __tls_get_addr aligns the stack for you if required, with a fast and
    slow path, everyone pays the small cost to check fir misalignment, and
    only the broken old binaries are forced to realign. This change works
    for old and new binaries without any need to coordinate between the
    compiler and the runtime.

    Pro: No gcc/glibc coordination (no additional flags).
    Con: All applications pay a small cost to check.

(b) We add a ___tls_get_addr which assumes alignment of the stack and a
    newly configured gcc that notices the system glibc provides this symbol
    automatically defaults to building with this option enabled.
    One of:
    (b.1) A configure option to disable the use of the new ___tls_get_addr.
    or
    (b.2) A new machine option -m<disable tls align>/-m<enable tls align>
    option is used to control this feature.
    Both b.1 and b.2 change the ABI.

    Pro: Fastest solution. Only old applications pay the check+align cost.
    Con: All application developers must ensure gcc/glibc coordination.
    When building a new gcc with a new glibc, the resulting compiler defaults
    to the new ABI, and if you need to build an old glibc you must add to
    your CFLAGS the right -m<disable tls align>.

So the question becomes:

How many users build their gcc/glibc and need to handle the coordination?

Are we asking all future users to pay a cost to simplify developer workflow?

As much as I don't like the complexity, I see the inherent value to our users
in what HJ is proposing.
  
Florian Weimer July 6, 2017, 3:07 p.m. UTC | #5
On 07/06/2017 04:52 PM, Carlos O'Donell wrote:

> As far as I understand it there are two paths we could take:
> 
> (a) __tls_get_addr aligns the stack for you if required, with a fast and
>     slow path, everyone pays the small cost to check fir misalignment, and
>     only the broken old binaries are forced to realign. This change works
>     for old and new binaries without any need to coordinate between the
>     compiler and the runtime.
> 
>     Pro: No gcc/glibc coordination (no additional flags).
>     Con: All applications pay a small cost to check.
> 
> (b) We add a ___tls_get_addr which assumes alignment of the stack and a
>     newly configured gcc that notices the system glibc provides this symbol
>     automatically defaults to building with this option enabled.
>     One of:
>     (b.1) A configure option to disable the use of the new ___tls_get_addr.
>     or
>     (b.2) A new machine option -m<disable tls align>/-m<enable tls align>
>     option is used to control this feature.
>     Both b.1 and b.2 change the ABI.
> 
>     Pro: Fastest solution. Only old applications pay the check+align cost.
>     Con: All application developers must ensure gcc/glibc coordination.
>     When building a new gcc with a new glibc, the resulting compiler defaults
>     to the new ABI, and if you need to build an old glibc you must add to
>     your CFLAGS the right -m<disable tls align>.

There's also this option:

(c) glibc provides __tls_get_addr (two leading underscores)
    under a new symbol version.

    Pro: No GCC changes required for upstream maintained versions.
    No problems for glibc development, no new -m flag.
    As fast as (b) after relinking against a newer glibc.
    Breakage with older/unfixed GCC is detectable during development
    (unlike the reason for the workaround, which would have needed
    clairvoyance on the developer's part).

    Con: Relinking code compiled with the old, unfixed GCC against the
    new glibc still produces broken binaries.

I think (c) is better than (b).  But (c) vs (a) is a tough call.

Thanks,
Florian
  
H.J. Lu July 6, 2017, 3:11 p.m. UTC | #6
On Thu, Jul 6, 2017 at 8:07 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 07/06/2017 04:52 PM, Carlos O'Donell wrote:
>
>> As far as I understand it there are two paths we could take:
>>
>> (a) __tls_get_addr aligns the stack for you if required, with a fast and
>>     slow path, everyone pays the small cost to check fir misalignment, and
>>     only the broken old binaries are forced to realign. This change works
>>     for old and new binaries without any need to coordinate between the
>>     compiler and the runtime.
>>
>>     Pro: No gcc/glibc coordination (no additional flags).
>>     Con: All applications pay a small cost to check.
>>
>> (b) We add a ___tls_get_addr which assumes alignment of the stack and a
>>     newly configured gcc that notices the system glibc provides this symbol
>>     automatically defaults to building with this option enabled.
>>     One of:
>>     (b.1) A configure option to disable the use of the new ___tls_get_addr.
>>     or
>>     (b.2) A new machine option -m<disable tls align>/-m<enable tls align>
>>     option is used to control this feature.
>>     Both b.1 and b.2 change the ABI.
>>
>>     Pro: Fastest solution. Only old applications pay the check+align cost.
>>     Con: All application developers must ensure gcc/glibc coordination.
>>     When building a new gcc with a new glibc, the resulting compiler defaults
>>     to the new ABI, and if you need to build an old glibc you must add to
>>     your CFLAGS the right -m<disable tls align>.
>
> There's also this option:
>
> (c) glibc provides __tls_get_addr (two leading underscores)
>     under a new symbol version.
>
>     Pro: No GCC changes required for upstream maintained versions.
>     No problems for glibc development, no new -m flag.
>     As fast as (b) after relinking against a newer glibc.
>     Breakage with older/unfixed GCC is detectable during development
>     (unlike the reason for the workaround, which would have needed
>     clairvoyance on the developer's part).
>
>     Con: Relinking code compiled with the old, unfixed GCC against the
>     new glibc still produces broken binaries.
>

(c) doesn't fix the problem for unfixed GCC at all.  As far as unfixed GCC
is concerned, nothing is changed.  For fixed GCC, it is the same as without
the bug for [BZ #21609].  I think (c) is the least suitable option.
  
Florian Weimer July 6, 2017, 3:15 p.m. UTC | #7
On 07/06/2017 05:11 PM, H.J. Lu wrote:

>> There's also this option:
>>
>> (c) glibc provides __tls_get_addr (two leading underscores)
>>     under a new symbol version.
>>
>>     Pro: No GCC changes required for upstream maintained versions.
>>     No problems for glibc development, no new -m flag.
>>     As fast as (b) after relinking against a newer glibc.
>>     Breakage with older/unfixed GCC is detectable during development
>>     (unlike the reason for the workaround, which would have needed
>>     clairvoyance on the developer's part).
>>
>>     Con: Relinking code compiled with the old, unfixed GCC against the
>>     new glibc still produces broken binaries.
>>
> 
> (c) doesn't fix the problem for unfixed GCC at all.  As far as unfixed GCC
> is concerned, nothing is changed.

It fixes existing binaries by addressing the implied ABI breakage by the
glibc update.

It's my strong belief that compiler bugs should be fixed in the
compiler, not in glibc.

Thanks,
Florian
  
Alexander Monakov July 6, 2017, 3:22 p.m. UTC | #8
On Thu, 6 Jul 2017, Carlos O'Donell wrote:
> (a) __tls_get_addr aligns the stack for you if required, with a fast and
>     slow path, everyone pays the small cost to check fir misalignment, and
>     only the broken old binaries are forced to realign. This change works
>     for old and new binaries without any need to coordinate between the
>     compiler and the runtime.
> 
>     Pro: No gcc/glibc coordination (no additional flags).
>     Con: All applications pay a small cost to check.

No, this is not entirely accurate.

Only *on first access to a given TLS object in a given thread* you go
via the slow path and need to realign the stack.

In which case the work to realign the stack is _tiny_ compared to the
amount of work Glibc already does on initial TLS access.

Repeated accesses to TLS data can go via the fast path and *not even
check stack alignment*.  There's no additional cost there.

There is no need to introduce new symbols nor symbol versions. It's
sufficient to implement amd64 __tls_get_attr in assembly, rename generic
__tls_get_attr, and call the renamed generic implementation from the slow
path after realigning the stack.

(it's might be even possible to avoid "in assembly" part by relying on
optimization and GCC extensions to achieve the same)

Alexander
  
Alexander Monakov July 6, 2017, 3:32 p.m. UTC | #9
On Thu, 6 Jul 2017, Alexander Monakov wrote:
> There is no need to introduce new symbols nor symbol versions. It's
> sufficient to implement amd64 __tls_get_attr in assembly, rename generic
> __tls_get_attr, and call the renamed generic implementation from the slow
> path after realigning the stack.

Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need
aligned stack, and its callees update_get_addr and tls_get_addr_tail are
already marked noinline, so isn't it sufficient to additionally mark those
with __attribute__((force_align_arg_pointer)) on x86-64?

Alexander
  
H.J. Lu July 6, 2017, 3:34 p.m. UTC | #10
On Thu, Jul 6, 2017 at 8:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Thu, 6 Jul 2017, Alexander Monakov wrote:
>> There is no need to introduce new symbols nor symbol versions. It's
>> sufficient to implement amd64 __tls_get_attr in assembly, rename generic
>> __tls_get_attr, and call the renamed generic implementation from the slow
>> path after realigning the stack.
>
> Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need
> aligned stack, and its callees update_get_addr and tls_get_addr_tail are
> already marked noinline, so isn't it sufficient to additionally mark those
> with __attribute__((force_align_arg_pointer)) on x86-64?
>

Please take a look at the current master branch.  I have pushed a fix
today:

https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5
  
Alexander Monakov July 6, 2017, 3:44 p.m. UTC | #11
On Thu, 6 Jul 2017, H.J. Lu wrote:

> On Thu, Jul 6, 2017 at 8:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> > On Thu, 6 Jul 2017, Alexander Monakov wrote:
> > Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need
> > aligned stack, and its callees update_get_addr and tls_get_addr_tail are
> > already marked noinline, so isn't it sufficient to additionally mark those
> > with __attribute__((force_align_arg_pointer)) on x86-64?
> >
> 
> Please take a look at the current master branch.  I have pushed a fix
> today:
> 
> https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5

Right, but my point was that by simply adding the attribute to two existing
noinline functions all that new complexity could have been avoided.

What's the point of the proposed new ___tls_get_addr? Avoiding realigning
the stack in cases when Glibc has to do rather complex work anyway? Is it
even possible to measure the difference?

Alexander
  
Carlos O'Donell July 6, 2017, 4:08 p.m. UTC | #12
On 07/06/2017 11:15 AM, Florian Weimer wrote:
> On 07/06/2017 05:11 PM, H.J. Lu wrote:
> 
>>> There's also this option:
>>>
>>> (c) glibc provides __tls_get_addr (two leading underscores)
>>>     under a new symbol version.
>>>
>>>     Pro: No GCC changes required for upstream maintained versions.
>>>     No problems for glibc development, no new -m flag.
>>>     As fast as (b) after relinking against a newer glibc.
>>>     Breakage with older/unfixed GCC is detectable during development
>>>     (unlike the reason for the workaround, which would have needed
>>>     clairvoyance on the developer's part).
>>>
>>>     Con: Relinking code compiled with the old, unfixed GCC against the
>>>     new glibc still produces broken binaries.
>>>
>>
>> (c) doesn't fix the problem for unfixed GCC at all.  As far as unfixed GCC
>> is concerned, nothing is changed.
> 
> It fixes existing binaries by addressing the implied ABI breakage by the
> glibc update.
> 
> It's my strong belief that compiler bugs should be fixed in the
> compiler, not in glibc.

The GNU C Library is part of the GNU Toolchain and the implementation.

We should work together to offer the best solution to our users.

In the case of (a), and (b) (assuming you don't downgrade your glibc, which
is dangerous for other reasons) it works even if the packages are rebuilt,
and that's an important "Pro" benefit.

I'd say choosing between (a) and (b) is what's really up for discussion.

Yes (c) is a simple option, but in the end, given how distributions operate,
symbol versioning should be our last choice (one which we don't have to make
in this case).

I think we've pretty much agreed that (b) is going to be the solution for
glibc 2.26, and HJ has checked in changes to that effect. Future optimizations
about ___tls_get_addr can be discussed later.
  
Carlos O'Donell July 6, 2017, 4:10 p.m. UTC | #13
On 07/06/2017 11:44 AM, Alexander Monakov wrote:
> On Thu, 6 Jul 2017, H.J. Lu wrote:
> 
>> On Thu, Jul 6, 2017 at 8:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
>>> On Thu, 6 Jul 2017, Alexander Monakov wrote:
>>> Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need
>>> aligned stack, and its callees update_get_addr and tls_get_addr_tail are
>>> already marked noinline, so isn't it sufficient to additionally mark those
>>> with __attribute__((force_align_arg_pointer)) on x86-64?
>>>
>>
>> Please take a look at the current master branch.  I have pushed a fix
>> today:
>>
>> https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5
> 
> Right, but my point was that by simply adding the attribute to two existing
> noinline functions all that new complexity could have been avoided.
> 
> What's the point of the proposed new ___tls_get_addr? Avoiding realigning
> the stack in cases when Glibc has to do rather complex work anyway? Is it
> even possible to measure the difference?

Option (b) has been chosen for this release, so that's what we're doing for
the release, but we can continue to discuss the optimization here.

(1) Cost.

No matter what you do it's extra instructions on the hot path forever.

(2) Small changes add up.

Measuring small changes like this is hard, and what about the sum of the
impact on a whole system, or systems? Power? CPU cycles? All of those have
costs for something as fundamental as __tls_get_addr. The extra instructions
do have i-cache pressure (which might be worked around by using an 'unlikely'
section far away to store code, which complicates jumping to it).

(3) Better benchmarks.

It would be nice to add a TLS microbenchmark into the suite and see if
it's measurable in a longer running test. Objective data would help settle
the decision on if the optimization is worth the developer complexity.
  
Alexander Monakov July 6, 2017, 4:23 p.m. UTC | #14
On Thu, 6 Jul 2017, Carlos O'Donell wrote:
> Option (b) has been chosen for this release, so that's what we're doing for
> the release, but we can continue to discuss the optimization here.
> 
> (1) Cost.
> 
> No matter what you do it's extra instructions on the hot path forever.

What extra instructions on the hot path?  I don't see any.  There's no
need to check stack alignment on the hot path -- only on the slow path.

Alexadner
  
Florian Weimer July 6, 2017, 8:15 p.m. UTC | #15
On 07/06/2017 05:32 PM, Alexander Monakov wrote:
> On Thu, 6 Jul 2017, Alexander Monakov wrote:
>> There is no need to introduce new symbols nor symbol versions. It's
>> sufficient to implement amd64 __tls_get_attr in assembly, rename generic
>> __tls_get_attr, and call the renamed generic implementation from the slow
>> path after realigning the stack.
> 
> Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need
> aligned stack, and its callees update_get_addr and tls_get_addr_tail are
> already marked noinline, so isn't it sufficient to additionally mark those
> with __attribute__((force_align_arg_pointer)) on x86-64?

I'm not sure if GCC will honor force_align_arg_pointer on a static
function which provably called with a properly aligned stack only.  We'd
need another attribute on __tls_get_addr to tell GCC that it is called
with a misaligned stack pointer.

Thanks,
Florian
  
Florian Weimer July 6, 2017, 8:17 p.m. UTC | #16
On 07/06/2017 05:44 PM, Alexander Monakov wrote:
> On Thu, 6 Jul 2017, H.J. Lu wrote:

>> Please take a look at the current master branch.  I have pushed a fix
>> today:
>>
>> https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5

> Right, but my point was that by simply adding the attribute to two existing
> noinline functions all that new complexity could have been avoided.

At least the complexity is restricted to the x86-64 port.  Injecting the
x86-specific function attribute into generic code is not exactly
trivial, either.

Thanks,
Florian
  
Florian Weimer July 6, 2017, 8:17 p.m. UTC | #17
On 07/06/2017 06:23 PM, Alexander Monakov wrote:
> On Thu, 6 Jul 2017, Carlos O'Donell wrote:
>> Option (b) has been chosen for this release, so that's what we're doing for
>> the release, but we can continue to discuss the optimization here.
>>
>> (1) Cost.
>>
>> No matter what you do it's extra instructions on the hot path forever.
> 
> What extra instructions on the hot path?  I don't see any.  There's no
> need to check stack alignment on the hot path -- only on the slow path.

Indeed, the x86-64 fast path is identical to the GCC-compiled version in
current master.

Florian
  
Jakub Jelinek July 6, 2017, 8:19 p.m. UTC | #18
On Thu, Jul 06, 2017 at 10:15:36PM +0200, Florian Weimer wrote:
> On 07/06/2017 05:32 PM, Alexander Monakov wrote:
> > On Thu, 6 Jul 2017, Alexander Monakov wrote:
> >> There is no need to introduce new symbols nor symbol versions. It's
> >> sufficient to implement amd64 __tls_get_attr in assembly, rename generic
> >> __tls_get_attr, and call the renamed generic implementation from the slow
> >> path after realigning the stack.
> > 
> > Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need
> > aligned stack, and its callees update_get_addr and tls_get_addr_tail are
> > already marked noinline, so isn't it sufficient to additionally mark those
> > with __attribute__((force_align_arg_pointer)) on x86-64?
> 
> I'm not sure if GCC will honor force_align_arg_pointer on a static
> function which provably called with a properly aligned stack only.  We'd
> need another attribute on __tls_get_addr to tell GCC that it is called
> with a misaligned stack pointer.

If it is not inlined, it will honor the attribute.
But if you want to be double-sure, add used attribute to the function,
that means it may be called in means not known to the compiler.

	Jakub
  
Florian Weimer July 6, 2017, 8:26 p.m. UTC | #19
On 07/06/2017 06:08 PM, Carlos O'Donell wrote:
> On 07/06/2017 11:15 AM, Florian Weimer wrote:
>> On 07/06/2017 05:11 PM, H.J. Lu wrote:
>>
>>>> There's also this option:
>>>>
>>>> (c) glibc provides __tls_get_addr (two leading underscores)
>>>>     under a new symbol version.
>>>>
>>>>     Pro: No GCC changes required for upstream maintained versions.
>>>>     No problems for glibc development, no new -m flag.
>>>>     As fast as (b) after relinking against a newer glibc.
>>>>     Breakage with older/unfixed GCC is detectable during development
>>>>     (unlike the reason for the workaround, which would have needed
>>>>     clairvoyance on the developer's part).
>>>>
>>>>     Con: Relinking code compiled with the old, unfixed GCC against the
>>>>     new glibc still produces broken binaries.
>>>>
>>>
>>> (c) doesn't fix the problem for unfixed GCC at all.  As far as unfixed GCC
>>> is concerned, nothing is changed.
>>
>> It fixes existing binaries by addressing the implied ABI breakage by the
>> glibc update.
>>
>> It's my strong belief that compiler bugs should be fixed in the
>> compiler, not in glibc.
> 
> The GNU C Library is part of the GNU Toolchain and the implementation.
> 
> We should work together to offer the best solution to our users.

Easier said than done when the guilty party has moved on and no longer
supports the buggy releases.

> In the case of (a), and (b) (assuming you don't downgrade your glibc, which
> is dangerous for other reasons) it works even if the packages are rebuilt,
> and that's an important "Pro" benefit.

But that's *never* been a goal with the way we maintain glibc.  We are
quite eager to break this in many other contexts.

> I'd say choosing between (a) and (b) is what's really up for discussion.

I still think (c) is superior to the ___tls_get_addr (three _) solution.
 The three _s probably have to be made known to the sanitizers and to
valgrind, too.

> Yes (c) is a simple option, but in the end, given how distributions operate,
> symbol versioning should be our last choice (one which we don't have to make
> in this case).

Would you please expand on that?  For RPM-based distributions, the new
___tls_get_addr (3 _) symbol would also add a GLIBC_2.26 or GLIBC_2.27
dependency, causing the same issue as any other versioned symbol.

> I think we've pretty much agreed that (b) is going to be the solution for
> glibc 2.26, and HJ has checked in changes to that effect. Future optimizations
> about ___tls_get_addr can be discussed later.

Surely you mean (a) here?  No new symbol for 2.26?

Thanks,
Florian
  
Carlos O'Donell July 7, 2017, 3:52 p.m. UTC | #20
On 07/06/2017 04:17 PM, Florian Weimer wrote:
> On 07/06/2017 06:23 PM, Alexander Monakov wrote:
>> On Thu, 6 Jul 2017, Carlos O'Donell wrote:
>>> Option (b) has been chosen for this release, so that's what we're doing for
>>> the release, but we can continue to discuss the optimization here.
>>>
>>> (1) Cost.
>>>
>>> No matter what you do it's extra instructions on the hot path forever.
>>
>> What extra instructions on the hot path?  I don't see any.  There's no
>> need to check stack alignment on the hot path -- only on the slow path.
> 
> Indeed, the x86-64 fast path is identical to the GCC-compiled version in
> current master.

My apologies, when I wrote 'hot path' I was thinking of process startup
where the first call to __tls_get_addr (for a given dtv entry) always goes
through the slow path.

The reason I still want to highlight this is that there is a non-zero
cost paid, and it adds up over time with other decisions we make.

So while profiling is important, eventually we don't see any particular
peak in the profile, just a lot of costly noise which is hard to speed
up without a deeper understanding of the infrastructure and how we want
it to work.
  
Carlos O'Donell July 7, 2017, 4:19 p.m. UTC | #21
On 07/06/2017 04:26 PM, Florian Weimer wrote:
> On 07/06/2017 06:08 PM, Carlos O'Donell wrote:
>> In the case of (a), and (b) (assuming you don't downgrade your glibc, which
>> is dangerous for other reasons) it works even if the packages are rebuilt,
>> and that's an important "Pro" benefit.
> 
> But that's *never* been a goal with the way we maintain glibc.  We are
> quite eager to break this in many other contexts.

I do not mean to set precedent for other technical discussions, and I am
talking specifically about the problem of __tls_get_addr.

I think that symbol versioning is a good solution to this problem, but it's
the simplest solution, and if we can do better, then we should.
 
>> I'd say choosing between (a) and (b) is what's really up for discussion.
> 
> I still think (c) is superior to the ___tls_get_addr (three _) solution.
>  The three _s probably have to be made known to the sanitizers and to
> valgrind, too.

That is a good 'Con' against (b). Is it true? We should have someone like
Mark Wielaard comment on this?

>> Yes (c) is a simple option, but in the end, given how distributions operate,
>> symbol versioning should be our last choice (one which we don't have to make
>> in this case).
> 
> Would you please expand on that?  For RPM-based distributions, the new
> ___tls_get_addr (3 _) symbol would also add a GLIBC_2.26 or GLIBC_2.27
> dependency, causing the same issue as any other versioned symbol.

My comment above was only to say that distributions rebuild packages for many
reasons, and after that rebuild is complete the package may fail to compile,
or operate correctly at runtime because of a semantic change we made using
symbol versioning. That is OK though, we allowed existing users to continue
to user their packages for as long as they wanted, and that is the ABI guarantee
we give. I do not want to remove value from symbol versioning, or header changes,
all things which we are allowed to do.

I want us to do better than symbol versioning, if we can, and if the cost/reward
is in favour of doing that.

When comparing (b) to (c) from a distro perspective we see:

(b) New ___tls_get_addr (3x_) API.
    Pro: Link failures if you get the coordination wrong.
    Con: Need to coordinate a glibc/gcc release in the distro.

(c) New symbol version.
    Pro: No need to coordinate a glibc/gcc release in the distro.
    Con: Can create broken packages depending on which glibc/gcc
         combination is in the buildroot, or was used by the user.

>> I think we've pretty much agreed that (b) is going to be the solution for
>> glibc 2.26, and HJ has checked in changes to that effect. Future optimizations
>> about ___tls_get_addr can be discussed later.
> 
> Surely you mean (a) here?  No new symbol for 2.26?

Sorry, yes (a). No new symbol for 2.26, which is what HJ checked in.

In closing I would like to point out that it is not as cut-and-dry as we make
it out to be and that choosing between (a), (b), and (c) should be done with
due consideration.

Thankfully we can always revisit an optimization given our choice with (a).

For example we should probably look at TLSDESC again for x86_64.
  
Mark Wielaard July 7, 2017, 4:36 p.m. UTC | #22
On Fri, 2017-07-07 at 12:19 -0400, Carlos O'Donell wrote:
> On 07/06/2017 04:26 PM, Florian Weimer wrote:
> > I still think (c) is superior to the ___tls_get_addr (three _) solution.
> >  The three _s probably have to be made known to the sanitizers and to
> > valgrind, too.
> 
> That is a good 'Con' against (b). Is it true? We should have someone like
> Mark Wielaard comment on this?

valgrind doesn't treat __tls_get_addr special. I haven't checked what
__tls_get_addr does precisely. But apparently it operates on well
defined memory locations, so doesn't need any special treatment.
  
Alexander Monakov July 7, 2017, 5:13 p.m. UTC | #23
On Fri, 7 Jul 2017, Carlos O'Donell wrote:
> My apologies, when I wrote 'hot path' I was thinking of process startup
> where the first call to __tls_get_addr (for a given dtv entry) always goes
> through the slow path.
> 
> The reason I still want to highlight this is that there is a non-zero
> cost paid, and it adds up over time with other decisions we make.

In this case you're talking about literally 5 instructions on paths that
involve syscalls and thousands of other instructions.

Consider whether bloating the dynamic table by 1 entry and causing
additional work for the dynamic linker (for all programs by the way,
not only those using __tls_get_addr) outweighs the cost of those 5
instructions.

Alexander
  
Jakub Jelinek July 7, 2017, 5:24 p.m. UTC | #24
On Fri, Jul 07, 2017 at 08:13:49PM +0300, Alexander Monakov wrote:
> On Fri, 7 Jul 2017, Carlos O'Donell wrote:
> > My apologies, when I wrote 'hot path' I was thinking of process startup
> > where the first call to __tls_get_addr (for a given dtv entry) always goes
> > through the slow path.
> > 
> > The reason I still want to highlight this is that there is a non-zero
> > cost paid, and it adds up over time with other decisions we make.
> 
> In this case you're talking about literally 5 instructions on paths that
> involve syscalls and thousands of other instructions.

And it could be even 2 (have the fast path done for when __tls_get_addr
has been called on that TLS object already first, then
  testb $15, %spl
  je __tls_get_addr_slow
! Now do the actual stack realignment and __tls_get_addr_slow call

(or testl $15, %esp; whatever is faster).

	Jakub
  
Carlos O'Donell July 7, 2017, 7:56 p.m. UTC | #25
On 07/07/2017 01:13 PM, Alexander Monakov wrote:
> On Fri, 7 Jul 2017, Carlos O'Donell wrote:
>> My apologies, when I wrote 'hot path' I was thinking of process startup
>> where the first call to __tls_get_addr (for a given dtv entry) always goes
>> through the slow path.
>>
>> The reason I still want to highlight this is that there is a non-zero
>> cost paid, and it adds up over time with other decisions we make.
> 
> In this case you're talking about literally 5 instructions on paths that
> involve syscalls and thousands of other instructions.
> 
> Consider whether bloating the dynamic table by 1 entry and causing
> additional work for the dynamic linker (for all programs by the way,
> not only those using __tls_get_addr) outweighs the cost of those 5
> instructions.

The answer is not black and white.

One must pay the slow path cost for every dtv entry that needs
to be initialized, but one only pays the symbol lookup cost once for
the call to ___tls_get_addr.

One always has to lookup a version of __tls_get_addr, so it might or
might not, cost a little more to have an extra symbol (depends on the
symbol order choices made by the static linker).

The extra .dynamic entry for the triple underscore or versioned symbol
is added to a hash bucket for lookup, and so should be as easy to find
as the original symbol if the new version sorts first.

I think the final decision (a), to keep the original symbol and alter
the slow path to realign the stack (as designed by Florian) is a good
solution. It is a better solution than (c) symbol versioning, but only
because it lacks the downside that recompiling bumps you forward to the
new version which breaks. However, I don't throw out HJ's suggestion
for a new API without giving it due consideration, and it has attractive
benefits. Though I don't see the additional symbol in the .dynsym as a
significant "Con" for that solution, but then again I should try to
experiment with some microbenchmarks.
  
Florian Weimer July 8, 2017, 9 p.m. UTC | #26
* Jakub Jelinek:

> On Fri, Jul 07, 2017 at 08:13:49PM +0300, Alexander Monakov wrote:
>> On Fri, 7 Jul 2017, Carlos O'Donell wrote:
>> > My apologies, when I wrote 'hot path' I was thinking of process startup
>> > where the first call to __tls_get_addr (for a given dtv entry) always goes
>> > through the slow path.
>> > 
>> > The reason I still want to highlight this is that there is a non-zero
>> > cost paid, and it adds up over time with other decisions we make.
>> 
>> In this case you're talking about literally 5 instructions on paths that
>> involve syscalls and thousands of other instructions.
>
> And it could be even 2 (have the fast path done for when __tls_get_addr
> has been called on that TLS object already first, then
>   testb $15, %spl
>   je __tls_get_addr_slow
> ! Now do the actual stack realignment and __tls_get_addr_slow call
>
> (or testl $15, %esp; whatever is faster).

I think the condition needs to be reversed, at the very least.

I don't know the exact nature of the GCC bug.  If it only causes
misalignment by 8 bytes, this could work:

  testl $8, %esp
  /* Jump if the stack is already properly aligned for the function entry.  */
  jne __tls_get_addr_slow
  /* Otherwise, __tls_get_addr was called with a misaligned stack, but this
     means the stack is properly aligned for a call instruction.  */
  call __tls_get_addr_slow
  ret

But it really assumes that the stack alignment is either 0 or 8
(mod 16) at the entry of the __tls_get_addr function.
  

Patch

From b59d4b240c18673a9018e338702e51536d6f25d2 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Tue, 4 Jul 2017 09:24:09 -0700
Subject: [PATCH 2/2] x86-64: Export ___tls_get_addr

On x86-64, __tls_get_addr is changed to realign stack to support GCC
older than GCC 4.9.4:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066

This patch exports ___tls_get_addr as the alternative x86-64 runtime
interface to TLS with aligned stack.

This requires linker support for TLS optimization with ___tls_get_addr.
The GCC 4.9.4 and newer can generate call to ___tls_get_addr, instead of
__tls_get_addr.  We can use the weakref assembly directive to redirect
__tls_get_addr to ___tls_get_addr if linker supports ___tls_get_addr and
GCC doesn't.

	* sysdeps/unix/sysv/linux/x86_64/64/ld.abilist: Add GLIBC_2.26
	and ___tls_get_addr.
	* sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist: Likewise.
	* sysdeps/x86_64/Versions: Export ___tls_get_addr in GLIBC_2.26.
---
 sysdeps/unix/sysv/linux/x86_64/64/ld.abilist  | 2 ++
 sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist | 2 ++
 sysdeps/x86_64/Versions                       | 6 ++++++
 3 files changed, 10 insertions(+)

diff --git a/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist b/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist
index 07cab4b..884e52c 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist
@@ -6,6 +6,8 @@  GLIBC_2.2.5 calloc F
 GLIBC_2.2.5 free F
 GLIBC_2.2.5 malloc F
 GLIBC_2.2.5 realloc F
+GLIBC_2.26 GLIBC_2.26 A
+GLIBC_2.26 ___tls_get_addr F
 GLIBC_2.3 GLIBC_2.3 A
 GLIBC_2.3 __tls_get_addr F
 GLIBC_2.4 GLIBC_2.4 A
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist
index 236357b..d42a7a4 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist
@@ -7,3 +7,5 @@  GLIBC_2.16 calloc F
 GLIBC_2.16 free F
 GLIBC_2.16 malloc F
 GLIBC_2.16 realloc F
+GLIBC_2.26 GLIBC_2.26 A
+GLIBC_2.26 ___tls_get_addr F
diff --git a/sysdeps/x86_64/Versions b/sysdeps/x86_64/Versions
index a437f85..b8c25c9 100644
--- a/sysdeps/x86_64/Versions
+++ b/sysdeps/x86_64/Versions
@@ -10,3 +10,9 @@  libm {
     exp2l;
   }
 }
+ld {
+  GLIBC_2.26 {
+    # The alternative x86-64 runtime interface to TLS with aligned stack.
+    ___tls_get_addr;
+  }
+}
-- 
2.9.4