powerpc: New feature - HWCAP/HWCAP2 bits in the TCB

Message ID 55760314.6070601@linux.vnet.ibm.com
State Superseded
Delegated to: Carlos O'Donell
Headers

Commit Message

Carlos Eduardo Seo June 8, 2015, 9:03 p.m. UTC
  The proposed patch adds a new feature for powerpc. In order to get faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. This enables users to write versioned code based on the HWCAP bits without going through the overhead of reading them from the auxiliary vector.

A new API is published in ppc.h for get/set the bits in the aforementioned memory area (mainly for gcc to use to create builtins).

Testcases for the API functions were also created.

Tested on ppc32, ppc64 and ppc64le.

Okay to commit?

Thanks,
  

Comments

Joseph Myers June 8, 2015, 9:06 p.m. UTC | #1
This patch is missing documentation updates to platform.texi.
  
Adhemerval Zanella Netto June 9, 2015, 2:22 p.m. UTC | #2
Hi

On 08-06-2015 18:03, Carlos Eduardo Seo wrote:
> 
> The proposed patch adds a new feature for powerpc. In order to get faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. This enables users to write versioned code based on the HWCAP bits without going through the overhead of reading them from the auxiliary vector.
> 
> A new API is published in ppc.h for get/set the bits in the aforementioned memory area (mainly for gcc to use to create builtins).
> 
> Testcases for the API functions were also created.
> 
> Tested on ppc32, ppc64 and ppc64le.
> 
> Okay to commit?
> 
> Thanks,
> 

Besides the documentation missing pointed by Joseph, some comments below.

> @@ -203,6 +214,32 @@ register void *__thread_register __asm__
>  # define THREAD_SET_TM_CAPABLE(value) \
>      (THREAD_GET_TM_CAPABLE () = (value))
>  
> +/* hwcap & hwcap2 fields in TCB head.  */
> +# define THREAD_GET_HWCAP() \
> +    (((tcbhead_t *) ((char *) __thread_register				      \
> +		     - TLS_TCB_OFFSET))[-1].hwcap)
> +# define THREAD_SET_HWCAP(value) \
> +    if (value & PPC_FEATURE_ARCH_2_06)					      \
> +      value |= PPC_FEATURE_ARCH_2_05 |					      \
> +	       PPC_FEATURE_POWER5_PLUS |				      \
> +	       PPC_FEATURE_POWER5 |					      \
> +	       PPC_FEATURE_POWER4;					      \
> +    else if (value & PPC_FEATURE_ARCH_2_05)				      \
> +      value |= PPC_FEATURE_POWER5_PLUS |				      \
> +             PPC_FEATURE_POWER5 |					      \
> +             PPC_FEATURE_POWER4;					      \
> +    else if (value & PPC_FEATURE_POWER5_PLUS)				      \
> +      value |= PPC_FEATURE_POWER5 |					      \
> +             PPC_FEATURE_POWER4;					      \
> +    else if (value & PPC_FEATURE_POWER5)				      \
> +      value |= PPC_FEATURE_POWER4;					      \

This very logic is already presented at other powerpc32 sysdep file [1].
Instead of duplicate the logic, I think it is better to move it in a common
file.

[1] sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h

> Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h
> +++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> @@ -23,6 +23,86 @@
>  #include <stdint.h>
>  #include <bits/ppc.h>
>  
> +
> +/* Get the hwcap/hwcap2 information from the TCB. Offsets taken
> +   from tcb-offsets.h.  */
> +static inline uint32_t
> +__ppc_get_hwcap (void)
> +{
> +
> +  uint32_t __tcb_hwcap;
> +

> Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h
> +++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> @@ -23,6 +23,86 @@
>  #include <stdint.h>
>  #include <bits/ppc.h>
>  
> +
> +/* Get the hwcap/hwcap2 information from the TCB. Offsets taken
> +   from tcb-offsets.h.  */
> +static inline uint32_t
> +__ppc_get_hwcap (void)
> +{
> +
> +  uint32_t __tcb_hwcap;
> +
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("lwz %0,-28772(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("lwz %0,-28724(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#endif
> +
> +  return __tcb_hwcap;
> +}

There is no need to use underline names inside inline functions.  I would also
change to something more simple like:

#ifdef __powerpc64__
# define __TPREG     "r13"
# define __HWCAP1OFF -28772
#else
# define __TPREG     "r2"
# define __HWCAP1OFF -28724
#else

static inline uint32_t
__ppc_get_hwcap (void)
{
  uint32_t tcb_hwcap;
  register unsigned long tp __asm__ (__TPREG);
  __asm__ ("lwz %0, %1(%2)\n"
           : "=r" (tcb_hwcap)
           : "i" (__HWCAPOFF), "r" (tp));
  return tcp_hwcap;
}

I also think the volatile in asm is not required (there is no need to refrain
compiler to possible optimize this load inside the inline function itself).

> Index: glibc-working/sysdeps/powerpc/test-get_hwcap.c
> ===================================================================
> --- /dev/null
> +++ glibc-working/sysdeps/powerpc/test-get_hwcap.c

The test are not wrong, but you could make only on test for this functionality,
instead of splitting the set and get in different ones.
  
Joseph Myers June 9, 2015, 2:26 p.m. UTC | #3
On Tue, 9 Jun 2015, Adhemerval Zanella wrote:

> There is no need to use underline names inside inline functions.  I would also

Yes there is, when in installed headers - installed headers should only 
take a non-reserved name from the namespace of macros the user might 
define before including the header if that name is actually intended to be 
part of the API for that header.
  
Szabolcs Nagy June 9, 2015, 2:47 p.m. UTC | #4
On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> The proposed patch adds a new feature for powerpc. In order to get
> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> This enables users to write versioned code based on the HWCAP bits
> without going through the overhead of reading them from the auxiliary
> vector.

i assume this is for multi-versioning.

i dont see how the compiler can generate code to access the
hwcap bits currently (without making assumptions about libc
interfaces).

> A new API is published in ppc.h for get/set the bits in the
> aforementioned memory area (mainly for gcc to use to create builtins).

how can the compiler use ppc.h? will it replicate the
offset logic instead?

if hwcap is useful abi between compiler and libc
then why is this done in a powerpc specific way?
  
Carlos Eduardo Seo June 9, 2015, 2:56 p.m. UTC | #5
On 06/09/2015 11:22 AM, Adhemerval Zanella wrote:
> 
>> @@ -203,6 +214,32 @@ register void *__thread_register __asm__
>>  # define THREAD_SET_TM_CAPABLE(value) \
>>      (THREAD_GET_TM_CAPABLE () = (value))
>>  
>> +/* hwcap & hwcap2 fields in TCB head.  */
>> +# define THREAD_GET_HWCAP() \
>> +    (((tcbhead_t *) ((char *) __thread_register				      \
>> +		     - TLS_TCB_OFFSET))[-1].hwcap)
>> +# define THREAD_SET_HWCAP(value) \
>> +    if (value & PPC_FEATURE_ARCH_2_06)					      \
>> +      value |= PPC_FEATURE_ARCH_2_05 |					      \
>> +	       PPC_FEATURE_POWER5_PLUS |				      \
>> +	       PPC_FEATURE_POWER5 |					      \
>> +	       PPC_FEATURE_POWER4;					      \
>> +    else if (value & PPC_FEATURE_ARCH_2_05)				      \
>> +      value |= PPC_FEATURE_POWER5_PLUS |				      \
>> +             PPC_FEATURE_POWER5 |					      \
>> +             PPC_FEATURE_POWER4;					      \
>> +    else if (value & PPC_FEATURE_POWER5_PLUS)				      \
>> +      value |= PPC_FEATURE_POWER5 |					      \
>> +             PPC_FEATURE_POWER4;					      \
>> +    else if (value & PPC_FEATURE_POWER5)				      \
>> +      value |= PPC_FEATURE_POWER4;					      \
> 
> This very logic is already presented at other powerpc32 sysdep file [1].
> Instead of duplicate the logic, I think it is better to move it in a common
> file.
> 
> [1] sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h
> 

So, do you suggest a cleanup patch first to move this to a common file, then a rewrite of this patch on top of that? If so, in which header should I put that?

Thanks,

Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
cseo@linux.vnet.ibm.com
  
Steven Munroe June 9, 2015, 3:06 p.m. UTC | #6
On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> 
> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > The proposed patch adds a new feature for powerpc. In order to get
> > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > This enables users to write versioned code based on the HWCAP bits
> > without going through the overhead of reading them from the auxiliary
> > vector.

> i assume this is for multi-versioning.

The intent is for the compiler to implement the equivalent of
__builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
is RISC so we use the HWCAP. The trick to access the HWCAP[2]
efficiently as getauxv and scanning the auxv is too slow for inline
optimizations.

> i dont see how the compiler can generate code to access the
> hwcap bits currently (without making assumptions about libc
> interfaces).
> 
These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.

The TCB offsets are already fixed and can not change from release to
release.


> > A new API is published in ppc.h for get/set the bits in the
> > aforementioned memory area (mainly for gcc to use to create builtins).
> 
> how can the compiler use ppc.h? will it replicate the
> offset logic instead?
> 
See above

> if hwcap is useful abi between compiler and libc
> then why is this done in a powerpc specific way?
> 

Other platform are free use this technique.
  
Adhemerval Zanella Netto June 9, 2015, 3:22 p.m. UTC | #7
On 09-06-2015 11:26, Joseph Myers wrote:
> On Tue, 9 Jun 2015, Adhemerval Zanella wrote:
> 
>> There is no need to use underline names inside inline functions.  I would also
> 
> Yes there is, when in installed headers - installed headers should only 
> take a non-reserved name from the namespace of macros the user might 
> define before including the header if that name is actually intended to be 
> part of the API for that header.
> 

Does this also apply for the the variable defined inside the function?
My example still uses '__' for the defines used across header.
  
Joseph Myers June 9, 2015, 3:25 p.m. UTC | #8
On Tue, 9 Jun 2015, Adhemerval Zanella wrote:

> 
> 
> On 09-06-2015 11:26, Joseph Myers wrote:
> > On Tue, 9 Jun 2015, Adhemerval Zanella wrote:
> > 
> >> There is no need to use underline names inside inline functions.  I would also
> > 
> > Yes there is, when in installed headers - installed headers should only 
> > take a non-reserved name from the namespace of macros the user might 
> > define before including the header if that name is actually intended to be 
> > part of the API for that header.
> > 
> 
> Does this also apply for the the variable defined inside the function?

Yes.  Users should be able to define macros called "tp" or "tcb_hwcap" 
before including the header, without those macros having any effect on the 
header, unless those names are documented interfaces.
  
Ondrej Bilka June 9, 2015, 3:42 p.m. UTC | #9
On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > 
> > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > The proposed patch adds a new feature for powerpc. In order to get
> > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > This enables users to write versioned code based on the HWCAP bits
> > > without going through the overhead of reading them from the auxiliary
> > > vector.
> 
> > i assume this is for multi-versioning.
> 
> The intent is for the compiler to implement the equivalent of
> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> efficiently as getauxv and scanning the auxv is too slow for inline
> optimizations.
> 
> > i dont see how the compiler can generate code to access the
> > hwcap bits currently (without making assumptions about libc
> > interfaces).
> > 
> These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> 
> The TCB offsets are already fixed and can not change from release to
> release.
> 
I don't have problem with this but why do you add tls, how can different
threads have different ones when kernel could move them between cores.

So instead we just add to libc api following two variables below. These would
be initialized by linker as we will probably use them internally.

extern int __hwcap, __hwcap2;
  
Szabolcs Nagy June 9, 2015, 3:48 p.m. UTC | #10
On 09/06/15 16:06, Steven Munroe wrote:
> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>> i assume this is for multi-versioning.
> 
> The intent is for the compiler to implement the equivalent of
> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> efficiently as getauxv and scanning the auxv is too slow for inline
> optimizations.

i think getauxv is not usable by the compiler anyway,
it's not a standard api.

>> i dont see how the compiler can generate code to access the
>> hwcap bits currently (without making assumptions about libc
>> interfaces).
>>
> These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> 
> The TCB offsets are already fixed and can not change from release to
> release.

hard coded arch specific tcb offsets make sure that
targets need different tcb layout which means more
target specific maintainance instead of common c code.

>> if hwcap is useful abi between compiler and libc
>> then why is this done in a powerpc specific way?
> 
> Other platform are free use this technique.

i think this is not a sustainable approach for
compiler abi extensions.

(it means juggling with magic offsets on the order
of compilers * libcs * targets).

unfortunately accessing the ssp canary is already
broken this way, i'm not sure what's a better abi,
but it's probably worth thinking about one before
the tcb code gets too messy.
  
Steven Munroe June 9, 2015, 4:01 p.m. UTC | #11
On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > 
> > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > This enables users to write versioned code based on the HWCAP bits
> > > > without going through the overhead of reading them from the auxiliary
> > > > vector.
> > 
> > > i assume this is for multi-versioning.
> > 
> > The intent is for the compiler to implement the equivalent of
> > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > efficiently as getauxv and scanning the auxv is too slow for inline
> > optimizations.
> > 
> > > i dont see how the compiler can generate code to access the
> > > hwcap bits currently (without making assumptions about libc
> > > interfaces).
> > > 
> > These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> > 
> > The TCB offsets are already fixed and can not change from release to
> > release.
> > 
> I don't have problem with this but why do you add tls, how can different
> threads have different ones when kernel could move them between cores.
> 
> So instead we just add to libc api following two variables below. These would
> be initialized by linker as we will probably use them internally.
> 
> extern int __hwcap, __hwcap2;
> 
The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This
guarantees one instruction load from TCB.

A Static variable would require a an indirect load via the TOC/GOT
(which can be megabytes for a large program/library). I really really
want the avoid that.

The point is to make fast decisions about which code the execute.
STT_GNU_IFUNC is just too complication for most application programmers
to use.

Now if the GLIBC community wants to provide a durable API for static
access to the HWCAP. I have not problem with that, but it does not solve
this problem.
  
Steven Munroe June 9, 2015, 4:04 p.m. UTC | #12
On Tue, 2015-06-09 at 16:48 +0100, Szabolcs Nagy wrote:
> 
> On 09/06/15 16:06, Steven Munroe wrote:
> > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> >> i assume this is for multi-versioning.
> > 
> > The intent is for the compiler to implement the equivalent of
> > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > efficiently as getauxv and scanning the auxv is too slow for inline
> > optimizations.
> 
> i think getauxv is not usable by the compiler anyway,
> it's not a standard api.
> 
> >> i dont see how the compiler can generate code to access the
> >> hwcap bits currently (without making assumptions about libc
> >> interfaces).
> >>
> > These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> > 
> > The TCB offsets are already fixed and can not change from release to
> > release.
> 
> hard coded arch specific tcb offsets make sure that
> targets need different tcb layout which means more
> target specific maintainance instead of common c code.
> 
> >> if hwcap is useful abi between compiler and libc
> >> then why is this done in a powerpc specific way?
> > 
> > Other platform are free use this technique.
> 
> i think this is not a sustainable approach for
> compiler abi extensions.
> 
> (it means juggling with magic offsets on the order
> of compilers * libcs * targets).
> 
> unfortunately accessing the ssp canary is already
> broken this way, i'm not sure what's a better abi,
> but it's probably worth thinking about one before
> the tcb code gets too messy.
> 

I have thought about it.

Based on my detailed knowledge of the PowerISA and PowerPC ABIs this the
simplest and fastest solution.
  
Rich Felker June 9, 2015, 4:38 p.m. UTC | #13
On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> 
> The proposed patch adds a new feature for powerpc. In order to get
> faster access to the HWCAP/HWCAP2 bits, we now store them in the
> TCB. This enables users to write versioned code based on the HWCAP
> bits without going through the overhead of reading them from the
> auxiliary vector.
> 
> A new API is published in ppc.h for get/set the bits in the
> aforementioned memory area (mainly for gcc to use to create
> builtins).

Do you have any justification (actual performance figures for a
real-world usage case) for adding ABI constraints like this? This is
not something that should be done lightly. My understanding is that
hwcap bits are normally used in initializing functions pointers (or
equivalent things like ifunc resolvers), not again and again at
runtime, so I'm having a hard time seeing how this could help even if
it does make the individual hwcap accesses measurably faster.

It would also be nice to see some justification for the magic number
offsets. Will they be stable under changes to the TCB structure or
will preserving them require tip-toeing around them?

Rich
  
Rich Felker June 9, 2015, 4:45 p.m. UTC | #14
On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > 
> > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > without going through the overhead of reading them from the auxiliary
> > > > > vector.
> > > 
> > > > i assume this is for multi-versioning.
> > > 
> > > The intent is for the compiler to implement the equivalent of
> > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > efficiently as getauxv and scanning the auxv is too slow for inline
> > > optimizations.
> > > 
> > > > i dont see how the compiler can generate code to access the
> > > > hwcap bits currently (without making assumptions about libc
> > > > interfaces).
> > > > 
> > > These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> > > 
> > > The TCB offsets are already fixed and can not change from release to
> > > release.
> > > 
> > I don't have problem with this but why do you add tls, how can different
> > threads have different ones when kernel could move them between cores.
> > 
> > So instead we just add to libc api following two variables below. These would
> > be initialized by linker as we will probably use them internally.
> > 
> > extern int __hwcap, __hwcap2;
> > 
> The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This
> guarantees one instruction load from TCB.
> 
> A Static variable would require a an indirect load via the TOC/GOT

I do not see this as a justification. There are a lot more pressing
things with respect to performance that could be micro-optimized by
adding TCB ABI for them, but it's not done because it's the wrong
solution.

> (which can be megabytes for a large program/library). I really really
> want the avoid that.

The size of the GOT is utterly irrelevant to the performance reading
an element from it, so I don't see why you brought this up.

Rich
  
Rich Felker June 9, 2015, 4:50 p.m. UTC | #15
On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> >> if hwcap is useful abi between compiler and libc
> >> then why is this done in a powerpc specific way?
> > 
> > Other platform are free use this technique.
> 
> i think this is not a sustainable approach for
> compiler abi extensions.
> 
> (it means juggling with magic offsets on the order
> of compilers * libcs * targets).
> 
> unfortunately accessing the ssp canary is already
> broken this way, i'm not sure what's a better abi,
> but it's probably worth thinking about one before
> the tcb code gets too messy.

For the canary I think it makes sense, even though it's ugly -- the
compiler has to generate a reference in every single function (for
'all' mode, or just most non-trivial functions in 'strong' mode).
That's much different from a feature (hwcap) that should only be used
at init-time and where, even if programmers did abuse it and use it
over and over at runtime, it's only going to be a small constant
overhead in a presumably medium to large sized function, and the cost
is only the need to setup the GOT register and load from the GOT,
anyway.

Rich
  
Steven Munroe June 9, 2015, 5:37 p.m. UTC | #16
On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> > >> if hwcap is useful abi between compiler and libc
> > >> then why is this done in a powerpc specific way?
> > > 
> > > Other platform are free use this technique.
> > 
> > i think this is not a sustainable approach for
> > compiler abi extensions.
> > 
> > (it means juggling with magic offsets on the order
> > of compilers * libcs * targets).
> > 
> > unfortunately accessing the ssp canary is already
> > broken this way, i'm not sure what's a better abi,
> > but it's probably worth thinking about one before
> > the tcb code gets too messy.
> 
> For the canary I think it makes sense, even though it's ugly -- the
> compiler has to generate a reference in every single function (for
> 'all' mode, or just most non-trivial functions in 'strong' mode).
> That's much different from a feature (hwcap) that should only be used
> at init-time and where, even if programmers did abuse it and use it
> over and over at runtime, it's only going to be a small constant
> overhead in a presumably medium to large sized function, and the cost
> is only the need to setup the GOT register and load from the GOT,
> anyway.

You are entitled to you own opinion but you are not accounting for the
aggressive out of order execution the POWER processors and specifics of
the PowerISA. In the time it take to load indirect via the TOC (4 cycles
minimum) compare/branch we could have executed 12-16 useful
instructions. 

Any indirection exposes the sequences to hazards (like cache miss) which
only make things worse.

As stated before I have thought about this and understand the options in
the context of the PowerISA, POWER micro-architecture, and the PowerPC
ABIs. This information is publicly available (if a little hard to find)
but I doubt you have taken the time to study it in detail, if at all.

I suspect you base your opinion on other architectures and hardware
implementations that do not apply to this situation.
  
Rich Felker June 9, 2015, 5:42 p.m. UTC | #17
On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> > On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> > > >> if hwcap is useful abi between compiler and libc
> > > >> then why is this done in a powerpc specific way?
> > > > 
> > > > Other platform are free use this technique.
> > > 
> > > i think this is not a sustainable approach for
> > > compiler abi extensions.
> > > 
> > > (it means juggling with magic offsets on the order
> > > of compilers * libcs * targets).
> > > 
> > > unfortunately accessing the ssp canary is already
> > > broken this way, i'm not sure what's a better abi,
> > > but it's probably worth thinking about one before
> > > the tcb code gets too messy.
> > 
> > For the canary I think it makes sense, even though it's ugly -- the
> > compiler has to generate a reference in every single function (for
> > 'all' mode, or just most non-trivial functions in 'strong' mode).
> > That's much different from a feature (hwcap) that should only be used
> > at init-time and where, even if programmers did abuse it and use it
> > over and over at runtime, it's only going to be a small constant
> > overhead in a presumably medium to large sized function, and the cost
> > is only the need to setup the GOT register and load from the GOT,
> > anyway.
> 
> You are entitled to you own opinion but you are not accounting for the
> aggressive out of order execution the POWER processors and specifics of
> the PowerISA. In the time it take to load indirect via the TOC (4 cycles
> minimum) compare/branch we could have executed 12-16 useful
> instructions. 
> 
> Any indirection exposes the sequences to hazards (like cache miss) which
> only make things worse.
> 
> As stated before I have thought about this and understand the options in
> the context of the PowerISA, POWER micro-architecture, and the PowerPC
> ABIs. This information is publicly available (if a little hard to find)
> but I doubt you have taken the time to study it in detail, if at all.
> 
> I suspect you base your opinion on other architectures and hardware
> implementations that do not apply to this situation. 

That's nice but all theoretical. I've seen countless such theoretical
claims from people who are coming from a standpoint of the vendor
manuals for the ISA they're working with, and more often than not,
they don't translate into measurable benefits. (I've been guilty of
this myself too, going to great lengths to tweak x86 codegen or even
write the asm by hand, only to find the resulting code to run the
exact same speed.) Creating a permanent ABI is an extremely high cost,
and unless you can justify the cost with actual measurements and a
reason to believe those measurements have anything to do with
real-world usage needs, I believe it's an unjustified cost.

Rich
  
Adhemerval Zanella Netto June 9, 2015, 5:47 p.m. UTC | #18
On 09-06-2015 13:38, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
>>
>> The proposed patch adds a new feature for powerpc. In order to get
>> faster access to the HWCAP/HWCAP2 bits, we now store them in the
>> TCB. This enables users to write versioned code based on the HWCAP
>> bits without going through the overhead of reading them from the
>> auxiliary vector.
>>
>> A new API is published in ppc.h for get/set the bits in the
>> aforementioned memory area (mainly for gcc to use to create
>> builtins).
> 
> Do you have any justification (actual performance figures for a
> real-world usage case) for adding ABI constraints like this? This is
> not something that should be done lightly. My understanding is that
> hwcap bits are normally used in initializing functions pointers (or
> equivalent things like ifunc resolvers), not again and again at
> runtime, so I'm having a hard time seeing how this could help even if
> it does make the individual hwcap accesses measurably faster.

I believe the idea is to provide a fast way to emulate a functionality
similar to __builtin_cpu_supports for powerpc.  For x86, this builtin
will create 'cpuid' instruction, but since powerpc lacks a similar one
it should rely on hardware capability information provided by kernel.

And using TCB is the fastest way to provide such functionality. By 
exporting the symbol as a normal variable (extern int hwcap), it will 
require a R_PPC64_ADDR64 relocation plus two load accesses and some 
arithmetic (TOC materialization and load plus the variable load) 

> 
> It would also be nice to see some justification for the magic number
> offsets. Will they be stable under changes to the TCB structure or
> will preserving them require tip-toeing around them?

It requires not change TCB fields over releases and adding newer on top
(to not change previous offset).  And it has been done for a while,
since the ssp canary.

> 
> Rich
>
  
Florian Weimer June 9, 2015, 6:21 p.m. UTC | #19
On 06/09/2015 06:01 PM, Steven Munroe wrote:

> A Static variable would require a an indirect load via the TOC/GOT
> (which can be megabytes for a large program/library). I really really
> want the avoid that.

Could you encode the information in the address itself?  Then the
indirection goes away.
  
Rich Felker June 9, 2015, 6:26 p.m. UTC | #20
On Tue, Jun 09, 2015 at 08:21:38PM +0200, Florian Weimer wrote:
> On 06/09/2015 06:01 PM, Steven Munroe wrote:
> 
> > A Static variable would require a an indirect load via the TOC/GOT
> > (which can be megabytes for a large program/library). I really really
> > want the avoid that.
> 
> Could you encode the information in the address itself?  Then the
> indirection goes away.

You mean using (unsigned long)&__hwcap_hack or similar as the hwcap
bits? I don't see how you could make that work for static linking,
where the linker is going to put the GOT in the read-only text
segment. Otherwise it's a neat idea.

Rich
  
Roland McGrath June 9, 2015, 6:33 p.m. UTC | #21
> I believe the idea is to provide a fast way to emulate a functionality
> similar to __builtin_cpu_supports for powerpc.  For x86, this builtin
> will create 'cpuid' instruction, but since powerpc lacks a similar one
> it should rely on hardware capability information provided by kernel.

On x86 using cpuid is quite slow as instruction-level overheads go.
It's certainly nowhere near as fast as doing a direct load from memory.
So this analogue does not suggest anything like justification for the
kind of microoptimization being discussed.
  
Steven Munroe June 9, 2015, 6:43 p.m. UTC | #22
On Tue, 2015-06-09 at 13:42 -0400, Rich Felker wrote:
> On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote:
> > On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> > > On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> > > > >> if hwcap is useful abi between compiler and libc
> > > > >> then why is this done in a powerpc specific way?
> > > > > 
> > > > > Other platform are free use this technique.
> > > > 
> > > > i think this is not a sustainable approach for
> > > > compiler abi extensions.
> > > > 
> > > > (it means juggling with magic offsets on the order
> > > > of compilers * libcs * targets).
> > > > 
> > > > unfortunately accessing the ssp canary is already
> > > > broken this way, i'm not sure what's a better abi,
> > > > but it's probably worth thinking about one before
> > > > the tcb code gets too messy.
> > > 
> > > For the canary I think it makes sense, even though it's ugly -- the
> > > compiler has to generate a reference in every single function (for
> > > 'all' mode, or just most non-trivial functions in 'strong' mode).
> > > That's much different from a feature (hwcap) that should only be used
> > > at init-time and where, even if programmers did abuse it and use it
> > > over and over at runtime, it's only going to be a small constant
> > > overhead in a presumably medium to large sized function, and the cost
> > > is only the need to setup the GOT register and load from the GOT,
> > > anyway.
> > 
> > You are entitled to you own opinion but you are not accounting for the
> > aggressive out of order execution the POWER processors and specifics of
> > the PowerISA. In the time it take to load indirect via the TOC (4 cycles
> > minimum) compare/branch we could have executed 12-16 useful
> > instructions. 
> > 
> > Any indirection exposes the sequences to hazards (like cache miss) which
> > only make things worse.
> > 
> > As stated before I have thought about this and understand the options in
> > the context of the PowerISA, POWER micro-architecture, and the PowerPC
> > ABIs. This information is publicly available (if a little hard to find)
> > but I doubt you have taken the time to study it in detail, if at all.
> > 
> > I suspect you base your opinion on other architectures and hardware
> > implementations that do not apply to this situation. 
> 
> That's nice but all theoretical. I've seen countless such theoretical
> claims from people who are coming from a standpoint of the vendor
> manuals for the ISA they're working with, and more often than not,
> they don't translate into measurable benefits. (I've been guilty of
> this myself too, going to great lengths to tweak x86 codegen or even
> write the asm by hand, only to find the resulting code to run the
> exact same speed.) Creating a permanent ABI is an extremely high cost,
> and unless you can justify the cost with actual measurements and a
> reason to believe those measurements have anything to do with
> real-world usage needs, I believe it's an unjustified cost.
> 

This is not theory, I am thinking at the level of pipeline cycle timing
for P7/P8. I have been at this so long I can do this in my head.

Now experience does tell me that adding an indirection and the
associated exposure to cache miss hazard can mean the the performance
optimization gets lost in the hazard when it is measured.

I have been to this movie, I don't need to see it again.
  
Steven Munroe June 9, 2015, 6:51 p.m. UTC | #23
On Tue, 2015-06-09 at 11:33 -0700, Roland McGrath wrote:
> > I believe the idea is to provide a fast way to emulate a functionality
> > similar to __builtin_cpu_supports for powerpc.  For x86, this builtin
> > will create 'cpuid' instruction, but since powerpc lacks a similar one
> > it should rely on hardware capability information provided by kernel.
> 
> On x86 using cpuid is quite slow as instruction-level overheads go.
> It's certainly nowhere near as fast as doing a direct load from memory.
> So this analogue does not suggest anything like justification for the
> kind of microoptimization being discussed.

In the X86 implementation the cpuid is cached by __builtin_cpu_init(). I
suspect the result is saved in static or TLS. 

That said the x86/x86_64 ISA and micro arch are different from POWER
with different tradeoffs.

It would inappropriate to impose these assumptions on other platforms

Our proposal is appropriate for the reality of POWER and using the
HWCAP.
  
Rich Felker June 9, 2015, 6:57 p.m. UTC | #24
On Tue, Jun 09, 2015 at 01:43:09PM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 13:42 -0400, Rich Felker wrote:
> > On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote:
> > > On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> > > > On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> > > > > >> if hwcap is useful abi between compiler and libc
> > > > > >> then why is this done in a powerpc specific way?
> > > > > > 
> > > > > > Other platform are free use this technique.
> > > > > 
> > > > > i think this is not a sustainable approach for
> > > > > compiler abi extensions.
> > > > > 
> > > > > (it means juggling with magic offsets on the order
> > > > > of compilers * libcs * targets).
> > > > > 
> > > > > unfortunately accessing the ssp canary is already
> > > > > broken this way, i'm not sure what's a better abi,
> > > > > but it's probably worth thinking about one before
> > > > > the tcb code gets too messy.
> > > > 
> > > > For the canary I think it makes sense, even though it's ugly -- the
> > > > compiler has to generate a reference in every single function (for
> > > > 'all' mode, or just most non-trivial functions in 'strong' mode).
> > > > That's much different from a feature (hwcap) that should only be used
> > > > at init-time and where, even if programmers did abuse it and use it
> > > > over and over at runtime, it's only going to be a small constant
> > > > overhead in a presumably medium to large sized function, and the cost
> > > > is only the need to setup the GOT register and load from the GOT,
> > > > anyway.
> > > 
> > > You are entitled to you own opinion but you are not accounting for the
> > > aggressive out of order execution the POWER processors and specifics of
> > > the PowerISA. In the time it take to load indirect via the TOC (4 cycles
> > > minimum) compare/branch we could have executed 12-16 useful
> > > instructions. 
> > > 
> > > Any indirection exposes the sequences to hazards (like cache miss) which
> > > only make things worse.
> > > 
> > > As stated before I have thought about this and understand the options in
> > > the context of the PowerISA, POWER micro-architecture, and the PowerPC
> > > ABIs. This information is publicly available (if a little hard to find)
> > > but I doubt you have taken the time to study it in detail, if at all.
> > > 
> > > I suspect you base your opinion on other architectures and hardware
> > > implementations that do not apply to this situation. 
> > 
> > That's nice but all theoretical. I've seen countless such theoretical
> > claims from people who are coming from a standpoint of the vendor
> > manuals for the ISA they're working with, and more often than not,
> > they don't translate into measurable benefits. (I've been guilty of
> > this myself too, going to great lengths to tweak x86 codegen or even
> > write the asm by hand, only to find the resulting code to run the
> > exact same speed.) Creating a permanent ABI is an extremely high cost,
> > and unless you can justify the cost with actual measurements and a
> > reason to believe those measurements have anything to do with
> > real-world usage needs, I believe it's an unjustified cost.
> 
> This is not theory, I am thinking at the level of pipeline cycle timing
> for P7/P8. I have been at this so long I can do this in my head.
> 
> Now experience does tell me that adding an indirection and the
> associated exposure to cache miss hazard can mean the the performance
> optimization gets lost in the hazard when it is measured.
> 
> I have been to this movie, I don't need to see it again.

Doing this in your head is EXACTLY what I mean by theoretical.
Non-theoretical would be having a test program that demonstrates the
timing difference, i.e. empirical.

Rich
  
Adhemerval Zanella Netto June 9, 2015, 7:17 p.m. UTC | #25
On 09-06-2015 15:51, Steven Munroe wrote:
> On Tue, 2015-06-09 at 11:33 -0700, Roland McGrath wrote:
>>> I believe the idea is to provide a fast way to emulate a functionality
>>> similar to __builtin_cpu_supports for powerpc.  For x86, this builtin
>>> will create 'cpuid' instruction, but since powerpc lacks a similar one
>>> it should rely on hardware capability information provided by kernel.
>>
>> On x86 using cpuid is quite slow as instruction-level overheads go.
>> It's certainly nowhere near as fast as doing a direct load from memory.
>> So this analogue does not suggest anything like justification for the
>> kind of microoptimization being discussed.
> 
> In the X86 implementation the cpuid is cached by __builtin_cpu_init(). I
> suspect the result is saved in static or TLS. 
> 
> That said the x86/x86_64 ISA and micro arch are different from POWER
> with different tradeoffs.
> 
> It would inappropriate to impose these assumptions on other platforms
> 
> Our proposal is appropriate for the reality of POWER and using the
> HWCAP.
> 

In fact the __builtin_cpu_supports generate for x86_64 a read from
a static struct defined in libgcc: 

* libgcc/config/i386/cpuinfo.c:

struct __processor_model
{
  unsigned int __cpu_vendor;
  unsigned int __cpu_type;
  unsigned int __cpu_subtype;
  unsigned int __cpu_features[1];
} __cpu_model = { };

And it is initialized in constructor (__cpu_indicator_init) using the
cpuid. Either way, for powerpc even using the same mechanism will
incur in a static GOT relocation as it is defined in a dynamic library
(with the different it won't have a dynamic relocation).
  
Florian Weimer June 10, 2015, 9:28 a.m. UTC | #26
On 06/09/2015 08:26 PM, Rich Felker wrote:
> On Tue, Jun 09, 2015 at 08:21:38PM +0200, Florian Weimer wrote:
>> On 06/09/2015 06:01 PM, Steven Munroe wrote:
>>
>>> A Static variable would require a an indirect load via the TOC/GOT
>>> (which can be megabytes for a large program/library). I really really
>>> want the avoid that.
>>
>> Could you encode the information in the address itself?  Then the
>> indirection goes away.
> 
> You mean using (unsigned long)&__hwcap_hack or similar as the hwcap
> bits?

Exactly.

> I don't see how you could make that work for static linking,
> where the linker is going to put the GOT in the read-only text
> segment.

Oh.  Is this optimization relevant to statically-linked binaries?

I suppose the static linking case could be addressed with a new
relocation for the static linker, as long as it is possible to reach a
writable page from the GOT base using an offset determined at linked
time.  Whether all this is worth the effort, I do not know.  The entire
mechanism might turn out generally useful for mostly-read global
variables without strong consistency requirements.
  
Ondrej Bilka June 10, 2015, 12:50 p.m. UTC | #27
On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > 
> > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > without going through the overhead of reading them from the auxiliary
> > > > > vector.
> > > 
> > > > i assume this is for multi-versioning.
> > > 
> > > The intent is for the compiler to implement the equivalent of
> > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > efficiently as getauxv and scanning the auxv is too slow for inline
> > > optimizations.
> > > 
> > > > i dont see how the compiler can generate code to access the
> > > > hwcap bits currently (without making assumptions about libc
> > > > interfaces).
> > > > 
> > > These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> > > 
> > > The TCB offsets are already fixed and can not change from release to
> > > release.
> > > 
> > I don't have problem with this but why do you add tls, how can different
> > threads have different ones when kernel could move them between cores.
> > 
> > So instead we just add to libc api following two variables below. These would
> > be initialized by linker as we will probably use them internally.
> > 
> > extern int __hwcap, __hwcap2;
> > 
> The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This
> guarantees one instruction load from TCB.
> 
> A Static variable would require a an indirect load via the TOC/GOT
> (which can be megabytes for a large program/library). I really really
> want the avoid that.
> 
> The point is to make fast decisions about which code the execute.
> STT_GNU_IFUNC is just too complication for most application programmers
> to use.
> 
> Now if the GLIBC community wants to provide a durable API for static
> access to the HWCAP. I have not problem with that, but it does not solve
> this problem.
> 
Thats completely false and outright dangerous advice.

First that if ifuncs are too much complication to use they shouldn't
touch hwcap at first place. Ifuncs are relatively easy to read if you
take optimizing for specific cpu seriously and are aware of precautions
you could take.

If you let other programmers touch hwcap you would get disaster. You
need to compile each variant separately with appropriate gcc flags.
Otherwise if you just do decision inline then compiler is free to insert
newer instructions to generic code. That could lead to unexpected
crashes caused just by compiling with different gcc than original
programmer used.

So you need to have different file for each enabled capability and
compile these separately. (Or use assembly but most programmers don't
qualify.) Or you could try to add pragmas to tell gcc which part of file
should be optimized with which optimizations but thats even worse that
ifunc.

So you read hwcap register and need to call function. That indirection
already costs you more than GOT access you tried to save. 

Also even if you could handle previous problems with assembly functions
you lose more cycles than save as you couldn't compile file with
-march=native. Best solution I found would be distributions package
gentoo model, have variant of package for each cpu that would package
manager fetch based on your cpu and a script on startup that checks if
cpu changed and if so then he would relink all packages to generic
versions.

That would allow programmers use #ifdef _HAS_SSE4 for code thats easier
to maintain.

Finally while Florian solution works your argument is suspect. First it
costs tls so it needs to be frequently used. That makes address always
be in L1 cache which makes GOT size irrelevant. And if you have problems
with hwcap not being in cache duplicating it ten times if you have ten
threads would make situation worse, not better.
  
Adhemerval Zanella Netto June 10, 2015, 1:35 p.m. UTC | #28
On 10-06-2015 09:50, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
>> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
>>> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
>>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>>>>
>>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>>>>> The proposed patch adds a new feature for powerpc. In order to get
>>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>>>>> This enables users to write versioned code based on the HWCAP bits
>>>>>> without going through the overhead of reading them from the auxiliary
>>>>>> vector.
>>>>
>>>>> i assume this is for multi-versioning.
>>>>
>>>> The intent is for the compiler to implement the equivalent of
>>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
>>>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
>>>> efficiently as getauxv and scanning the auxv is too slow for inline
>>>> optimizations.
>>>>
>>>>> i dont see how the compiler can generate code to access the
>>>>> hwcap bits currently (without making assumptions about libc
>>>>> interfaces).
>>>>>
>>>> These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
>>>>
>>>> The TCB offsets are already fixed and can not change from release to
>>>> release.
>>>>
>>> I don't have problem with this but why do you add tls, how can different
>>> threads have different ones when kernel could move them between cores.
>>>
>>> So instead we just add to libc api following two variables below. These would
>>> be initialized by linker as we will probably use them internally.
>>>
>>> extern int __hwcap, __hwcap2;
>>>
>> The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This
>> guarantees one instruction load from TCB.
>>
>> A Static variable would require a an indirect load via the TOC/GOT
>> (which can be megabytes for a large program/library). I really really
>> want the avoid that.
>>
>> The point is to make fast decisions about which code the execute.
>> STT_GNU_IFUNC is just too complication for most application programmers
>> to use.
>>
>> Now if the GLIBC community wants to provide a durable API for static
>> access to the HWCAP. I have not problem with that, but it does not solve
>> this problem.
>>
> Thats completely false and outright dangerous advice.
> 
> First that if ifuncs are too much complication to use they shouldn't
> touch hwcap at first place. Ifuncs are relatively easy to read if you
> take optimizing for specific cpu seriously and are aware of precautions
> you could take.
> 
> If you let other programmers touch hwcap you would get disaster. You
> need to compile each variant separately with appropriate gcc flags.
> Otherwise if you just do decision inline then compiler is free to insert
> newer instructions to generic code. That could lead to unexpected
> crashes caused just by compiling with different gcc than original
> programmer used.
> 
> So you need to have different file for each enabled capability and
> compile these separately. (Or use assembly but most programmers don't
> qualify.) Or you could try to add pragmas to tell gcc which part of file
> should be optimized with which optimizations but thats even worse that
> ifunc.
> 
> So you read hwcap register and need to call function. That indirection
> already costs you more than GOT access you tried to save. 

I agree that adding an API to modify the current hwcap is not a good
approach. However the cost you are assuming here are *very* x86 biased,
where you have only on instruction (movl <variable>(%rip), %<destiny>) 
to load an external variable defined in a shared library, where for
powerpc it is more costly:

extern int foo;

int bar ()
{
  return foo;
}

	.type	bar, @function
bar:
0:	addis 2,12,.TOC.-0b@ha
	addi 2,2,.TOC.-0b@l
	.localentry	bar,.-bar
	addis 9,2,.LC0@toc@ha		# gpr load fusion, type long
	ld 9,.LC0@toc@l(9)
	lwa 3,0(9)
	blr


So you need a 2 arithmetic instruction to materialize the TOC, plus 
an addis+ld to load the load and then another load to load the external
variable (you have a optimization where the symbol call is local, where
you do not need to materialize the TOC). That is the *exactly* the cost 
Steven is trying to avoid.

> 
> Also even if you could handle previous problems with assembly functions
> you lose more cycles than save as you couldn't compile file with
> -march=native. Best solution I found would be distributions package
> gentoo model, have variant of package for each cpu that would package
> manager fetch based on your cpu and a script on startup that checks if
> cpu changed and if so then he would relink all packages to generic
> versions.
> 
> That would allow programmers use #ifdef _HAS_SSE4 for code thats easier
> to maintain.
> 

The relink strategy seems reasonable, but still the provider of
packages should build all the pre-compiled objects for each CPU variant.
This is what usual powerpc distro have done for some time: CPU variant
libc/libm/etc that are selects during runtime using hwcap. And the ifunc
idea is exactly to avoid such different CPU DSO variants.

> Finally while Florian solution works your argument is suspect. First it
> costs tls so it needs to be frequently used. That makes address always
> be in L1 cache which makes GOT size irrelevant. And if you have problems
> with hwcap not being in cache duplicating it ten times if you have ten
> threads would make situation worse, not better.

Again you are being x86 biased: the idea is a tradeoff between hwcap size
for each thread against its access speed using TLS. Steve is advocating
that he prefer to have the latency.
  
Szabolcs Nagy June 10, 2015, 2:16 p.m. UTC | #29
On 10/06/15 14:35, Adhemerval Zanella wrote:
> I agree that adding an API to modify the current hwcap is not a good
> approach. However the cost you are assuming here are *very* x86 biased,
> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> to load an external variable defined in a shared library, where for
> powerpc it is more costly:

debian codesearch found 4 references to __builtin_cpu_supports
all seem to avoid using it repeatedly.

multiversioning dispatch only happens at startup (for a small
number of functions according to existing practice).

so why is hwcap expected to be used in hot loops?
  
Adhemerval Zanella Netto June 10, 2015, 2:21 p.m. UTC | #30
On 10-06-2015 11:16, Szabolcs Nagy wrote:
> On 10/06/15 14:35, Adhemerval Zanella wrote:
>> I agree that adding an API to modify the current hwcap is not a good
>> approach. However the cost you are assuming here are *very* x86 biased,
>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
>> to load an external variable defined in a shared library, where for
>> powerpc it is more costly:
> 
> debian codesearch found 4 references to __builtin_cpu_supports
> all seem to avoid using it repeatedly.
> 
> multiversioning dispatch only happens at startup (for a small
> number of functions according to existing practice).
> 
> so why is hwcap expected to be used in hot loops?
> 

Good question, I do not know and I believe Steve could answer this
better than me.  I am only advocating here that assuming x86 costs
for powerpc is not the way to evaluate this patch.
  
Ondrej Bilka June 10, 2015, 3:09 p.m. UTC | #31
On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > On 10/06/15 14:35, Adhemerval Zanella wrote:
> >> I agree that adding an API to modify the current hwcap is not a good
> >> approach. However the cost you are assuming here are *very* x86 biased,
> >> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> >> to load an external variable defined in a shared library, where for
> >> powerpc it is more costly:
> > 
> > debian codesearch found 4 references to __builtin_cpu_supports
> > all seem to avoid using it repeatedly.
> > 
> > multiversioning dispatch only happens at startup (for a small
> > number of functions according to existing practice).
> > 
> > so why is hwcap expected to be used in hot loops?
> > 
> 
> Good question, I do not know and I believe Steve could answer this
> better than me.  I am only advocating here that assuming x86 costs
> for powerpc is not the way to evaluate this patch.

Sorry but your details don't matter when underlying idea is just bad.
Even if getting hwcap took 20 cycles otherwise it would still be bad
idea. As you need to use hwcap only once at initialization bringing cost
is completely irrelevant.

First as I explained major flaw of Steve approach how exactly do you
ensure that gcc won't insert newer instruction that would lead to crash
on older platform?

Second is that it makes no sense. If you are at situation where hwcap
access gets noticable on profile a checking is also noticable on
profile. So use ifunc which will save you that additional cycles on
checking hwcap bits.

A programmer that uses hwcap in hot loop is just incompetent. Its stays
constant on application. So he should make more copies of loop, each
with appropriate options.

Then even if compiler still handled these issues correctly you will
probaly lose more on missed compiler optimizations that your supposed
gain. Compiler can select suboptimal patch as he doesn't want to expand
function too much due size concerns.

That quite easy, for example in following would get magnitude slower
with hwcap than ifuncs. Reason is that even gcc-5.1 doesn't split it
into two branches each doing shift. Instead it emits div instruction
which takes forever.

int hwcap;
unsigned int foo(unsigned int i)
{
  int d = 8;
  if (hwcap & 42)
    d = 4;
  return i / d;
}
  
Andrew Pinski June 10, 2015, 3:12 p.m. UTC | #32
> On Jun 10, 2015, at 11:09 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> 
>> On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
>> 
>> 
>>> On 10-06-2015 11:16, Szabolcs Nagy wrote:
>>>> On 10/06/15 14:35, Adhemerval Zanella wrote:
>>>> I agree that adding an API to modify the current hwcap is not a good
>>>> approach. However the cost you are assuming here are *very* x86 biased,
>>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
>>>> to load an external variable defined in a shared library, where for
>>>> powerpc it is more costly:
>>> 
>>> debian codesearch found 4 references to __builtin_cpu_supports
>>> all seem to avoid using it repeatedly.
>>> 
>>> multiversioning dispatch only happens at startup (for a small
>>> number of functions according to existing practice).
>>> 
>>> so why is hwcap expected to be used in hot loops?
>>> 
>> 
>> Good question, I do not know and I believe Steve could answer this
>> better than me.  I am only advocating here that assuming x86 costs
>> for powerpc is not the way to evaluate this patch.
> 
> Sorry but your details don't matter when underlying idea is just bad.
> Even if getting hwcap took 20 cycles otherwise it would still be bad
> idea. As you need to use hwcap only once at initialization bringing cost
> is completely irrelevant.
> 
> First as I explained major flaw of Steve approach how exactly do you
> ensure that gcc won't insert newer instruction that would lead to crash
> on older platform?
> 
> Second is that it makes no sense. If you are at situation where hwcap
> access gets noticable on profile a checking is also noticable on
> profile. So use ifunc which will save you that additional cycles on
> checking hwcap bits.
> 
> A programmer that uses hwcap in hot loop is just incompetent. Its stays
> constant on application. So he should make more copies of loop, each
> with appropriate options.
> 
> Then even if compiler still handled these issues correctly you will
> probaly lose more on missed compiler optimizations that your supposed
> gain. Compiler can select suboptimal patch as he doesn't want to expand
> function too much due size concerns.
> 
> That quite easy, for example in following would get magnitude slower
> with hwcap than ifuncs. Reason is that even gcc-5.1 doesn't split it
> into two branches each doing shift. Instead it emits div instruction
> which takes forever.
> 
> int hwcap;
> unsigned int foo(unsigned int i)
> {
>  int d = 8;
>  if (hwcap & 42)
>    d = 4;
>  return i / d;
> }
>
  
Adhemerval Zanella Netto June 10, 2015, 3:23 p.m. UTC | #33
On 10-06-2015 12:09, Ondřej Bílka wrote:
> On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 10-06-2015 11:16, Szabolcs Nagy wrote:
>>> On 10/06/15 14:35, Adhemerval Zanella wrote:
>>>> I agree that adding an API to modify the current hwcap is not a good
>>>> approach. However the cost you are assuming here are *very* x86 biased,
>>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
>>>> to load an external variable defined in a shared library, where for
>>>> powerpc it is more costly:
>>>
>>> debian codesearch found 4 references to __builtin_cpu_supports
>>> all seem to avoid using it repeatedly.
>>>
>>> multiversioning dispatch only happens at startup (for a small
>>> number of functions according to existing practice).
>>>
>>> so why is hwcap expected to be used in hot loops?
>>>
>>
>> Good question, I do not know and I believe Steve could answer this
>> better than me.  I am only advocating here that assuming x86 costs
>> for powerpc is not the way to evaluate this patch.
> 
> Sorry but your details don't matter when underlying idea is just bad.
> Even if getting hwcap took 20 cycles otherwise it would still be bad
> idea. As you need to use hwcap only once at initialization bringing cost
> is completely irrelevant.
> 
> First as I explained major flaw of Steve approach how exactly do you
> ensure that gcc won't insert newer instruction that would lead to crash
> on older platform?
> 
> Second is that it makes no sense. If you are at situation where hwcap
> access gets noticable on profile a checking is also noticable on
> profile. So use ifunc which will save you that additional cycles on
> checking hwcap bits.
> 
> A programmer that uses hwcap in hot loop is just incompetent. Its stays
> constant on application. So he should make more copies of loop, each
> with appropriate options.
> 
> Then even if compiler still handled these issues correctly you will
> probaly lose more on missed compiler optimizations that your supposed
> gain. Compiler can select suboptimal patch as he doesn't want to expand
> function too much due size concerns.
> 
> That quite easy, for example in following would get magnitude slower
> with hwcap than ifuncs. Reason is that even gcc-5.1 doesn't split it
> into two branches each doing shift. Instead it emits div instruction
> which takes forever.
> 
> int hwcap;
> unsigned int foo(unsigned int i)
> {
>   int d = 8;
>   if (hwcap & 42)
>     d = 4;
>   return i / d;
> }
> 

And you can use GCC extensions to generate architecture specific instructions
based on architecture specific flags (check testsuite/gcc.target/powerpc/ppc-target-1.c).
And these are architecture specific and just a subset of options are enabled.

And my understanding is to optimize hwcap access to provide a 'better' way
to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
function selection, but it does not exclude that accessing hwcap through
TLS is *faster* than current options. It is up to developer to decide to use
either IFUNC or __builtin_cpu_supports. If the developer will use it in
hot loops or not, it is up to them to profile and use another way.

You can say the same about current x86 __builtin_cpu_supports support: you should
not use in loops, you should use ifunc, whatever.
  
Rich Felker June 10, 2015, 3:32 p.m. UTC | #34
On Wed, Jun 10, 2015 at 11:28:15AM +0200, Florian Weimer wrote:
> On 06/09/2015 08:26 PM, Rich Felker wrote:
> > On Tue, Jun 09, 2015 at 08:21:38PM +0200, Florian Weimer wrote:
> >> On 06/09/2015 06:01 PM, Steven Munroe wrote:
> >>
> >>> A Static variable would require a an indirect load via the TOC/GOT
> >>> (which can be megabytes for a large program/library). I really really
> >>> want the avoid that.
> >>
> >> Could you encode the information in the address itself?  Then the
> >> indirection goes away.
> > 
> > You mean using (unsigned long)&__hwcap_hack or similar as the hwcap
> > bits?
> 
> Exactly.
> 
> > I don't see how you could make that work for static linking,
> > where the linker is going to put the GOT in the read-only text
> > segment.
> 
> Oh.  Is this optimization relevant to statically-linked binaries?

Global data access is mildly expensive even in static binaries for
PPC, I think, because there are no 32-bit immediates. Maybe it could
use two 16-bit immediates and bypass the GOT but I'm not sure if it
does this. I suspect there are a lot of codegen improvements like this
that could be made on MIPS-like RISC targets with poor support for
immediates and data addressing which would be A LOT more worthwhile
than just hacking a few arbitrarily-privileged pieces of data into the
TCB...

> I suppose the static linking case could be addressed with a new
> relocation for the static linker, as long as it is possible to reach a
> writable page from the GOT base using an offset determined at linked
> time.  Whether all this is worth the effort, I do not know.  The entire
> mechanism might turn out generally useful for mostly-read global
> variables without strong consistency requirements.

In the case of huge programs with lots of GOTs that access hwcap from
lots of places, I think you'd have to make lots of pages writable.

In the case of programs that just access hwcap from some cold-path
init code, this whole discussion is pointless.

Rich
  
Ondrej Bilka June 10, 2015, 3:53 p.m. UTC | #35
On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 10-06-2015 12:09, Ondřej Bílka wrote:
> > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
> >>>> I agree that adding an API to modify the current hwcap is not a good
> >>>> approach. However the cost you are assuming here are *very* x86 biased,
> >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> >>>> to load an external variable defined in a shared library, where for
> >>>> powerpc it is more costly:
> >>>
> >>> debian codesearch found 4 references to __builtin_cpu_supports
> >>> all seem to avoid using it repeatedly.
> >>>
> >>> multiversioning dispatch only happens at startup (for a small
> >>> number of functions according to existing practice).
> >>>
> >>> so why is hwcap expected to be used in hot loops?
> >>>
> >>
> >> Good question, I do not know and I believe Steve could answer this
> >> better than me.  I am only advocating here that assuming x86 costs
> >> for powerpc is not the way to evaluate this patch.
> > 
> > Sorry but your details don't matter when underlying idea is just bad.
> > Even if getting hwcap took 20 cycles otherwise it would still be bad
> > idea. As you need to use hwcap only once at initialization bringing cost
> > is completely irrelevant.
> > 
> > First as I explained major flaw of Steve approach how exactly do you
> > ensure that gcc won't insert newer instruction that would lead to crash
> > on older platform?
> > 
> > Second is that it makes no sense. If you are at situation where hwcap
> > access gets noticable on profile a checking is also noticable on
> > profile. So use ifunc which will save you that additional cycles on
> > checking hwcap bits.
> > 
> > A programmer that uses hwcap in hot loop is just incompetent. Its stays
> > constant on application. So he should make more copies of loop, each
> > with appropriate options.
> > 
> > Then even if compiler still handled these issues correctly you will
> > probaly lose more on missed compiler optimizations that your supposed
> > gain. Compiler can select suboptimal patch as he doesn't want to expand
> > function too much due size concerns.
> > 
> > That quite easy, for example in following would get magnitude slower
> > with hwcap than ifuncs. Reason is that even gcc-5.1 doesn't split it
> > into two branches each doing shift. Instead it emits div instruction
> > which takes forever.
> > 
> > int hwcap;
> > unsigned int foo(unsigned int i)
> > {
> >   int d = 8;
> >   if (hwcap & 42)
> >     d = 4;
> >   return i / d;
> > }
> > 
> 
> And you can use GCC extensions to generate architecture specific instructions
> based on architecture specific flags (check testsuite/gcc.target/powerpc/ppc-target-1.c).
> And these are architecture specific and just a subset of options are enabled.
> 
> And my understanding is to optimize hwcap access to provide a 'better' way
> to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
> function selection, but it does not exclude that accessing hwcap through
> TLS is *faster* than current options. It is up to developer to decide to use
> either IFUNC or __builtin_cpu_supports. If the developer will use it in
> hot loops or not, it is up to them to profile and use another way.
> 
> You can say the same about current x86 __builtin_cpu_supports support: you should
> not use in loops, you should use ifunc, whatever.

Sorry but no again. We are talking here on difference between variable
access and tcb access. You forgot to count total cost. That includes
initialization overhead per thread to initialize hwcap, increased
per-thread memory usage, maintainance burden and increased cache misses.
If you access hwcap only rarely as you should then per-thread copies
would introduce cache miss that is more costy than GOT overhead. In GOT
case it could be avoided as combined threads would access it more often.

So if your multithreaded application access hwcap maybe 10 times per run 
you would likely harm performance.

I could from my head tell ten functions that with tcb entry lead to much
bigger performance gains. So if this is applicable I will submit strspn
improvement that keeps 32 bitmask and checks if second argument didn't
changed. That would be better usage of tls than keeping hwcap data.
  
Steven Munroe June 10, 2015, 4:45 p.m. UTC | #36
On Wed, 2015-06-10 at 11:21 -0300, Adhemerval Zanella wrote:
> 
> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > On 10/06/15 14:35, Adhemerval Zanella wrote:
> >> I agree that adding an API to modify the current hwcap is not a good
> >> approach. However the cost you are assuming here are *very* x86 biased,
> >> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> >> to load an external variable defined in a shared library, where for
> >> powerpc it is more costly:
> > 
> > debian codesearch found 4 references to __builtin_cpu_supports
> > all seem to avoid using it repeatedly.
> > 
> > multiversioning dispatch only happens at startup (for a small
> > number of functions according to existing practice).
> > 
> > so why is hwcap expected to be used in hot loops?
> > 
> 
> Good question, I do not know and I believe Steve could answer this
> better than me.  I am only advocating here that assuming x86 costs
> for powerpc is not the way to evaluate this patch.
> 

The trade off is that the dynamic solutions (platform library selection
via AT_PLATFORM) and STT_GNU_IFUNC require a dynamic call which in our
ABI required an indirect branch and link via the CTR. There is also the
overhead of the TOC save/reload.

The net is the trade-offs are different for POWER then for other
platform. I spend a lot of time looking at performance data from
customer applications and see these issues (as measurable additional
path length and forced hazards).

So there is a place for this proposed optimization strategy where we can
avoid the overhead of the dynamic call and substitute the smaller more
predictable latency of the HWCAP; load word, and immediate record, and
branch conditional (3 instructions, low cache hazard, and highly
predictable branch).

The concern about the cache foot print does not apply as these fields
share the cache line with other active TCB fields. This line will be in
L1 for any active thread.
  
Steven Munroe June 10, 2015, 6:58 p.m. UTC | #37
On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
> > 
> > 
> > On 10-06-2015 12:09, Ondřej Bílka wrote:
> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
> > >>
> > >>
> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
> > >>>> I agree that adding an API to modify the current hwcap is not a good
> > >>>> approach. However the cost you are assuming here are *very* x86 biased,
> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> > >>>> to load an external variable defined in a shared library, where for
> > >>>> powerpc it is more costly:
> > >>>
> > >>> debian codesearch found 4 references to __builtin_cpu_supports
> > >>> all seem to avoid using it repeatedly.
> > >>>
> > >>> multiversioning dispatch only happens at startup (for a small
> > >>> number of functions according to existing practice).
> > >>>
> > >>> so why is hwcap expected to be used in hot loops?
> > >>>
> > >>
> snip
> > And my understanding is to optimize hwcap access to provide a 'better' way
> > to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
> > function selection, but it does not exclude that accessing hwcap through
> > TLS is *faster* than current options. It is up to developer to decide to use
> > either IFUNC or __builtin_cpu_supports. If the developer will use it in
> > hot loops or not, it is up to them to profile and use another way.
> > 
> > You can say the same about current x86 __builtin_cpu_supports support: you should
> > not use in loops, you should use ifunc, whatever.
> 
> Sorry but no again. We are talking here on difference between variable
> access and tcb access. You forgot to count total cost. That includes
> initialization overhead per thread to initialize hwcap, increased
> per-thread memory usage, maintainance burden and increased cache misses.
> If you access hwcap only rarely as you should then per-thread copies
> would introduce cache miss that is more costy than GOT overhead. In GOT
> case it could be avoided as combined threads would access it more often.
> 
Actually Adhemerval does have the knowledge, background, and experience
to understand this difference and accurately access the trade-offs.

> So if your multithreaded application access hwcap maybe 10 times per run 
> you would likely harm performance.
> 
Sorry this is not an accurate assessment as the proposed fields are in
the same cache line as other more frequently accessed fields of the TCB.

The proposal will not effectively increase the cache foot-print.

> I could from my head tell ten functions that with tcb entry lead to much
> bigger performance gains. So if this is applicable I will submit strspn
> improvement that keeps 32 bitmask and checks if second argument didn't
> changed. That would be better usage of tls than keeping hwcap data.
>
If you are suggestion saving results across strspn calls then a normal
TLS variable would be an appropriate choice.

This proposal covers a different situation.


/soap box
While I am no expert in all things and try not to comment on things
which I really don't have the expertise (especially other platforms), I
do know a lot about the POWER platform.

I am responsible for the overall delivery of the open source toolchain
for Linux on Power. GLIBC is just one component of many that needs to be
coordinated for delivery. I also get involved directly with Linux
customers and try to respond to issues they identify. As such I am in a
good position to see how all the pieces (hardware, software, ABIs, ...)
fit together and where they can be made better.

With this larger responsibility, I don't have much time to quibble over
the fine point of esoteric design. So I tend to short cut to conclusions
and support my team.

If you do catch me pontificating on some other platform, without basis
in fact, please feel free to call me out.

But lots people seem to want to provide their opinion based on their
experience with other platforms and point out where I might have
strayed. Fine, but I can and do try to point out that their argument
does not apply (to my platform).

But recent comments and responses have gone past the normal give and
take of a healthy community, and into accusations and attacks.

That is going too far should not be tolerated.

\soap box
  
Ondrej Bilka June 10, 2015, 8:56 p.m. UTC | #38
On Wed, Jun 10, 2015 at 01:58:27PM -0500, Steven Munroe wrote:
> On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
> > On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
> > > 
> > > 
> > > On 10-06-2015 12:09, Ondřej Bílka wrote:
> > > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
> > > >>
> > > >>
> > > >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > > >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
> > > >>>> I agree that adding an API to modify the current hwcap is not a good
> > > >>>> approach. However the cost you are assuming here are *very* x86 biased,
> > > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> > > >>>> to load an external variable defined in a shared library, where for
> > > >>>> powerpc it is more costly:
> > > >>>
> > > >>> debian codesearch found 4 references to __builtin_cpu_supports
> > > >>> all seem to avoid using it repeatedly.
> > > >>>
> > > >>> multiversioning dispatch only happens at startup (for a small
> > > >>> number of functions according to existing practice).
> > > >>>
> > > >>> so why is hwcap expected to be used in hot loops?
> > > >>>
> > > >>
> > snip
> > > And my understanding is to optimize hwcap access to provide a 'better' way
> > > to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
> > > function selection, but it does not exclude that accessing hwcap through
> > > TLS is *faster* than current options. It is up to developer to decide to use
> > > either IFUNC or __builtin_cpu_supports. If the developer will use it in
> > > hot loops or not, it is up to them to profile and use another way.
> > > 
> > > You can say the same about current x86 __builtin_cpu_supports support: you should
> > > not use in loops, you should use ifunc, whatever.
> > 
> > Sorry but no again. We are talking here on difference between variable
> > access and tcb access. You forgot to count total cost. That includes
> > initialization overhead per thread to initialize hwcap, increased
> > per-thread memory usage, maintainance burden and increased cache misses.
> > If you access hwcap only rarely as you should then per-thread copies
> > would introduce cache miss that is more costy than GOT overhead. In GOT
> > case it could be avoided as combined threads would access it more often.
> > 
> Actually Adhemerval does have the knowledge, background, and experience
> to understand this difference and accurately access the trade-offs.
>
While he may have background he didn't cover drawbacks. So I needed to
point them out to start discussing cost-benefit analysis instead looking
at them with rose glasses.
 
> > So if your multithreaded application access hwcap maybe 10 times per run 
> > you would likely harm performance.
> > 
> Sorry this is not an accurate assessment as the proposed fields are in
> the same cache line as other more frequently accessed fields of the TCB.
> 
> The proposal will not effectively increase the cache foot-print.
> 
It could by displacement. Whats next field? By adding that you could
shift that to next cache line. When it would be frequently used you are
using two cache lines instead one.


> > I could from my head tell ten functions that with tcb entry lead to much
> > bigger performance gains. So if this is applicable I will submit strspn
> > improvement that keeps 32 bitmask and checks if second argument didn't
> > changed. That would be better usage of tls than keeping hwcap data.
> >
> If you are suggestion saving results across strspn calls then a normal
> TLS variable would be an appropriate choice.
> 
> This proposal covers a different situation.
> 
I am not saying that. I am saying that place at tcb table is resource
that needs to be managed.

I am not convinced about your proposal as it would help only your
application. Remaining applications that won't use hwcap would pay in
increased startup overhead of threads and bit bigger memory comsumption.

For example we could decide to add per-thread 256 byte cache to malloc 
and inline small allocations to use that cache with fast access by tcb.
That would likely benefit everybody and is wise thing to do. Then there
are other use cases and we should set treshold on how big average performance
gain you need to show. 

Thats why you need to calculate cost and you need to show that benefits are bigger. 
It may benefit your application which is one of thousand. Remaining 999 applications
could also find tcb variable that will give them similar
speedup as your application. If we are impartial we should add them all.
That would result in each thread needing additional 8kb tls space per
thread and being slowed down by initialization. So where is your
evidence that gains would be so widespread?

Also I wasn't saying that strspn could benefit from normal tls variable.
I was saying that if you do a cost benefit analysis which one of hwcap
and strspn optimization should use tcb then you should include strspn
and leave hwcap alone. There are many more applications that use strspn
so overall gain would be bigger.


> 
> /soap box
> While I am no expert in all things and try not to comment on things
> which I really don't have the expertise (especially other platforms), I
> do know a lot about the POWER platform.
> 
> I am responsible for the overall delivery of the open source toolchain
> for Linux on Power. GLIBC is just one component of many that needs to be
> coordinated for delivery. I also get involved directly with Linux
> customers and try to respond to issues they identify. As such I am in a
> good position to see how all the pieces (hardware, software, ABIs, ...)
> fit together and where they can be made better.
> 
> With this larger responsibility, I don't have much time to quibble over
> the fine point of esoteric design. So I tend to short cut to conclusions
> and support my team.
>
Thats problem as naturaly these shortcut lead to worse decisions. You
should delegate that responsibility to somebody who knows details.

 
> If you do catch me pontificating on some other platform, without basis
> in fact, please feel free to call me out.
> 
> But lots people seem to want to provide their opinion based on their
> experience with other platforms and point out where I might have
> strayed. Fine, but I can and do try to point out that their argument
> does not apply (to my platform).
> 
> But recent comments and responses have gone past the normal give and
> take of a healthy community, and into accusations and attacks.
> 
> That is going too far should not be tolerated.
> 
> \soap box
> 
> 
> 
> 
> 
>
  
Adhemerval Zanella Netto June 10, 2015, 10:09 p.m. UTC | #39
On 10-06-2015 17:56, Ondřej Bílka wrote:
> On Wed, Jun 10, 2015 at 01:58:27PM -0500, Steven Munroe wrote:
>> On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
>>> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 10-06-2015 12:09, Ondřej Bílka wrote:
>>>>> On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
>>>>>>
>>>>>>
>>>>>> On 10-06-2015 11:16, Szabolcs Nagy wrote:
>>>>>>> On 10/06/15 14:35, Adhemerval Zanella wrote:
>>>>>>>> I agree that adding an API to modify the current hwcap is not a good
>>>>>>>> approach. However the cost you are assuming here are *very* x86 biased,
>>>>>>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
>>>>>>>> to load an external variable defined in a shared library, where for
>>>>>>>> powerpc it is more costly:
>>>>>>>
>>>>>>> debian codesearch found 4 references to __builtin_cpu_supports
>>>>>>> all seem to avoid using it repeatedly.
>>>>>>>
>>>>>>> multiversioning dispatch only happens at startup (for a small
>>>>>>> number of functions according to existing practice).
>>>>>>>
>>>>>>> so why is hwcap expected to be used in hot loops?
>>>>>>>
>>>>>>
>>> snip
>>>> And my understanding is to optimize hwcap access to provide a 'better' way
>>>> to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
>>>> function selection, but it does not exclude that accessing hwcap through
>>>> TLS is *faster* than current options. It is up to developer to decide to use
>>>> either IFUNC or __builtin_cpu_supports. If the developer will use it in
>>>> hot loops or not, it is up to them to profile and use another way.
>>>>
>>>> You can say the same about current x86 __builtin_cpu_supports support: you should
>>>> not use in loops, you should use ifunc, whatever.
>>>
>>> Sorry but no again. We are talking here on difference between variable
>>> access and tcb access. You forgot to count total cost. That includes
>>> initialization overhead per thread to initialize hwcap, increased
>>> per-thread memory usage, maintainance burden and increased cache misses.
>>> If you access hwcap only rarely as you should then per-thread copies
>>> would introduce cache miss that is more costy than GOT overhead. In GOT
>>> case it could be avoided as combined threads would access it more often.
>>>
>> Actually Adhemerval does have the knowledge, background, and experience
>> to understand this difference and accurately access the trade-offs.
>>
> While he may have background he didn't cover drawbacks. So I needed to
> point them out to start discussing cost-benefit analysis instead looking
> at them with rose glasses.
>  

What I did was to pointed out your earlier analysis related to instruction
latency was x86 biased and did not hold out for powerpc TOC cost model.  I
was *not* advocating anything more neither saying this hwcap in TCB is the
best approach.

And I do see the raised points you brought as valid, but IMHO this kind of
discussion will stretch without end mainly because it is based on assumptions
and tradeoffs.

Now, my opinion for powerpc to implement __builtin_cpu_supports is to similar
to x86 by adding it on libgcc and using initial executable TLS variables. It
will create 2 dynamic relocations (R_PPC64_TPREL16_HI and R_PPC64_TPREL16_LO),
but the access will require 2 arithmetic instruction and 1 load.  It will
decouple the implementation from GLIBC and not required any more TCB fields.
  
Ondrej Bilka June 11, 2015, 4:34 a.m. UTC | #40
On Wed, Jun 10, 2015 at 11:45:53AM -0500, Steven Munroe wrote:
> On Wed, 2015-06-10 at 11:21 -0300, Adhemerval Zanella wrote:
> > 
> > On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > > On 10/06/15 14:35, Adhemerval Zanella wrote:
> > >> I agree that adding an API to modify the current hwcap is not a good
> > >> approach. However the cost you are assuming here are *very* x86 biased,
> > >> where you have only on instruction (movl <variable>(%rip), %<destiny>) 
> > >> to load an external variable defined in a shared library, where for
> > >> powerpc it is more costly:
> > > 
> > > debian codesearch found 4 references to __builtin_cpu_supports
> > > all seem to avoid using it repeatedly.
> > > 
> > > multiversioning dispatch only happens at startup (for a small
> > > number of functions according to existing practice).
> > > 
> > > so why is hwcap expected to be used in hot loops?
> > > 
> > 
> > Good question, I do not know and I believe Steve could answer this
> > better than me.  I am only advocating here that assuming x86 costs
> > for powerpc is not the way to evaluate this patch.
> > 
> 
> The trade off is that the dynamic solutions (platform library selection
> via AT_PLATFORM) and STT_GNU_IFUNC require a dynamic call which in our
> ABI required an indirect branch and link via the CTR. There is also the
> overhead of the TOC save/reload.
> 
Wait you are using dynamic libraries anyway which require that already
so it wouldn't make any difference.

Or you are trying to say that you statically link libraries to generic
one instead specialized ones and using simple wrapper script to run per-cpu application like following one?

if [ ! -z `cat /proc/cpuinfo | grep power11` ]
  app_power11 $*
elif [ ! -z `cat /proc/cpuinfo | grep power10` ]
  app_power10 $*
  ...

> The net is the trade-offs are different for POWER then for other
> platform. I spend a lot of time looking at performance data from
> customer applications and see these issues (as measurable additional
> path length and forced hazards).
> 
> So there is a place for this proposed optimization strategy where we can
> avoid the overhead of the dynamic call and substitute the smaller more
> predictable latency of the HWCAP; load word, and immediate record, and
> branch conditional (3 instructions, low cache hazard, and highly
> predictable branch).
> 
But my point is that there shouldn't be no dynamic call nor hwcap
branch. As that function is hot-spot you would gain more by inlining it
and doing decision in callers.


> The concern about the cache foot print does not apply as these fields
> share the cache line with other active TCB fields. This line will be in
> L1 for any active thread.
>
Excellent you have applications. So you could show that there is some
measurable performance benefit of your claims.

So Steven you have several applications from customers that statically
link every library for performance? I assume that as if cost of GOT on
powerpc is so high as you claim it has better cost/benefit ratio of
eliminating them than just plt entry of hwcap.

First report benchmark with unchanged application.
Then report number when you use ifdef to make it constant and compile
application with -mcpu=power7 and report difference versus generic.

When you have this you could try measure difference between plt and
noplt hwcap to see if its real or you are just micromanaging and don't
improve actual performance as you spend time on cold path instead.
  
Andrew Pinski June 11, 2015, 5:30 a.m. UTC | #41
On Thu, Jun 11, 2015 at 2:58 AM, Steven Munroe
<munroesj@linux.vnet.ibm.com> wrote:
> On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
>> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
>> >
>> >
>> > On 10-06-2015 12:09, Ondřej Bílka wrote:
>> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
>> > >>
>> > >>
>> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
>> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
>> > >>>> I agree that adding an API to modify the current hwcap is not a good
>> > >>>> approach. However the cost you are assuming here are *very* x86 biased,
>> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>)
>> > >>>> to load an external variable defined in a shared library, where for
>> > >>>> powerpc it is more costly:
>> > >>>
>> > >>> debian codesearch found 4 references to __builtin_cpu_supports
>> > >>> all seem to avoid using it repeatedly.
>> > >>>
>> > >>> multiversioning dispatch only happens at startup (for a small
>> > >>> number of functions according to existing practice).
>> > >>>
>> > >>> so why is hwcap expected to be used in hot loops?
>> > >>>
>> > >>
>> snip
>> > And my understanding is to optimize hwcap access to provide a 'better' way
>> > to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
>> > function selection, but it does not exclude that accessing hwcap through
>> > TLS is *faster* than current options. It is up to developer to decide to use
>> > either IFUNC or __builtin_cpu_supports. If the developer will use it in
>> > hot loops or not, it is up to them to profile and use another way.
>> >
>> > You can say the same about current x86 __builtin_cpu_supports support: you should
>> > not use in loops, you should use ifunc, whatever.
>>
>> Sorry but no again. We are talking here on difference between variable
>> access and tcb access. You forgot to count total cost. That includes
>> initialization overhead per thread to initialize hwcap, increased
>> per-thread memory usage, maintainance burden and increased cache misses.
>> If you access hwcap only rarely as you should then per-thread copies
>> would introduce cache miss that is more costy than GOT overhead. In GOT
>> case it could be avoided as combined threads would access it more often.
>>
> Actually Adhemerval does have the knowledge, background, and experience
> to understand this difference and accurately access the trade-offs.

Yes and the trade-offs for Power are going to be different than the
trade-offs for AARCH64 and x86_64.  And it gets harder for AARCH64
really as there are many micro-architectures and not controlled by
just one vendor (this is getting off topic).



>
>> So if your multithreaded application access hwcap maybe 10 times per run
>> you would likely harm performance.
>>
> Sorry this is not an accurate assessment as the proposed fields are in
> the same cache line as other more frequently accessed fields of the TCB.
>
> The proposal will not effectively increase the cache foot-print.

very true, it might actually decrease it :).

>
>> I could from my head tell ten functions that with tcb entry lead to much
>> bigger performance gains. So if this is applicable I will submit strspn
>> improvement that keeps 32 bitmask and checks if second argument didn't
>> changed. That would be better usage of tls than keeping hwcap data.
>>
> If you are suggestion saving results across strspn calls then a normal
> TLS variable would be an appropriate choice.
>
> This proposal covers a different situation.
>
>
> /soap box
> While I am no expert in all things and try not to comment on things
> which I really don't have the expertise (especially other platforms), I
> do know a lot about the POWER platform.
>
> I am responsible for the overall delivery of the open source toolchain
> for Linux on Power. GLIBC is just one component of many that needs to be
> coordinated for delivery. I also get involved directly with Linux
> customers and try to respond to issues they identify. As such I am in a
> good position to see how all the pieces (hardware, software, ABIs, ...)
> fit together and where they can be made better.
>
> With this larger responsibility, I don't have much time to quibble over
> the fine point of esoteric design. So I tend to short cut to conclusions
> and support my team.

I know how it feels, I am in the same boat.  Usually my suggestions
are more aimed at getting some free work done for myself :).
But I actually like this proposal and even thinking about it for
AARCH64 with both hwcap and another AUVX varaible.

>
> If you do catch me pontificating on some other platform, without basis
> in fact, please feel free to call me out.
>
> But lots people seem to want to provide their opinion based on their
> experience with other platforms and point out where I might have
> strayed. Fine, but I can and do try to point out that their argument
> does not apply (to my platform).

Totally 100% agree.  Even then there is some micro-architectures
differences even on some architectures which some folks don't
understand that trade-offs need to be taken even for differences in
micro-architectures.

Thanks,
Andrew Pinski

>
> But recent comments and responses have gone past the normal give and
> take of a healthy community, and into accusations and attacks.
>
> That is going too far should not be tolerated.
>
> \soap box
>
>
>
>
>
>
>
>
  
Ondrej Bilka June 11, 2015, 6:52 a.m. UTC | #42
On Thu, Jun 11, 2015 at 01:30:51PM +0800, Andrew Pinski wrote:
> On Thu, Jun 11, 2015 at 2:58 AM, Steven Munroe
> <munroesj@linux.vnet.ibm.com> wrote:
> > On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
> >> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
> >> >
> >> >
> >> > On 10-06-2015 12:09, Ondřej Bílka wrote:
> >> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
> >> > >>
> >> > >>
> >> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> >> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
> >> > >>>> I agree that adding an API to modify the current hwcap is not a good
> >> > >>>> approach. However the cost you are assuming here are *very* x86 biased,
> >> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>)
> >> > >>>> to load an external variable defined in a shared library, where for
> >> > >>>> powerpc it is more costly:
> >> > >>>
> >> > >>> debian codesearch found 4 references to __builtin_cpu_supports
> >> > >>> all seem to avoid using it repeatedly.
> >> > >>>
> >> > >>> multiversioning dispatch only happens at startup (for a small
> >> > >>> number of functions according to existing practice).
> >> > >>>
> >> > >>> so why is hwcap expected to be used in hot loops?
> >> > >>>
> >> > >>
> >> snip
> >> > And my understanding is to optimize hwcap access to provide a 'better' way
> >> > to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
> >> > function selection, but it does not exclude that accessing hwcap through
> >> > TLS is *faster* than current options. It is up to developer to decide to use
> >> > either IFUNC or __builtin_cpu_supports. If the developer will use it in
> >> > hot loops or not, it is up to them to profile and use another way.
> >> >
> >> > You can say the same about current x86 __builtin_cpu_supports support: you should
> >> > not use in loops, you should use ifunc, whatever.
> >>
> >> Sorry but no again. We are talking here on difference between variable
> >> access and tcb access. You forgot to count total cost. That includes
> >> initialization overhead per thread to initialize hwcap, increased
> >> per-thread memory usage, maintainance burden and increased cache misses.
> >> If you access hwcap only rarely as you should then per-thread copies
> >> would introduce cache miss that is more costy than GOT overhead. In GOT
> >> case it could be avoided as combined threads would access it more often.
> >>
> > Actually Adhemerval does have the knowledge, background, and experience
> > to understand this difference and accurately access the trade-offs.
> 
> Yes and the trade-offs for Power are going to be different than the
> trade-offs for AARCH64 and x86_64.  And it gets harder for AARCH64
> really as there are many micro-architectures and not controlled by
> just one vendor (this is getting off topic).
> 
> 
But I was talking about general trade off that you shouldn't do
instruction selection frequently. You should select granularity that
makes overhead of selection itself insignificant. If there is small
function that requires it you should inline it or resolve which variant
to do in caller. That stays true on all platforms.
> 
> >
> >> So if your multithreaded application access hwcap maybe 10 times per run
> >> you would likely harm performance.
> >>
> > Sorry this is not an accurate assessment as the proposed fields are in
> > the same cache line as other more frequently accessed fields of the TCB.
> >
> > The proposal will not effectively increase the cache foot-print.
> 
> very true, it might actually decrease it :).
>
Are you claiming that adding a unused fields to between frequently used
fields of structure decreases cache footprint?

Or are you claiming that at least 10% of applications on powerpc will
frequently access hwcap?

As I said before provide evidence. Naturally if 90% of applications
wouldn't access hwcap then it would probably increase memory footprint
as you add unused field per thread.

I am talking about average impact. I could say about almost anything
that in best case it decreases cache footprint. For example that by
chance adding variable makes frequently used firefox tls structure
aligned to 64 bytes.

 
> >
> >> I could from my head tell ten functions that with tcb entry lead to much
> >> bigger performance gains. So if this is applicable I will submit strspn
> >> improvement that keeps 32 bitmask and checks if second argument didn't
> >> changed. That would be better usage of tls than keeping hwcap data.
> >>
> > If you are suggestion saving results across strspn calls then a normal
> > TLS variable would be an appropriate choice.
> >
> > This proposal covers a different situation.
> >
> >
> > /soap box
> > While I am no expert in all things and try not to comment on things
> > which I really don't have the expertise (especially other platforms), I
> > do know a lot about the POWER platform.
> >
> > I am responsible for the overall delivery of the open source toolchain
> > for Linux on Power. GLIBC is just one component of many that needs to be
> > coordinated for delivery. I also get involved directly with Linux
> > customers and try to respond to issues they identify. As such I am in a
> > good position to see how all the pieces (hardware, software, ABIs, ...)
> > fit together and where they can be made better.
> >
> > With this larger responsibility, I don't have much time to quibble over
> > the fine point of esoteric design. So I tend to short cut to conclusions
> > and support my team.
> 
> I know how it feels, I am in the same boat.  Usually my suggestions
> are more aimed at getting some free work done for myself :).
> But I actually like this proposal and even thinking about it for
> AARCH64 with both hwcap and another AUVX varaible.
>
Which ones and why not reparse parse entire AUXV to translate each
getauxval(x) to have static offset for each.

if (__builtin_constant_p(x) && x == foo)
&(auxval_hack_foo)

That would provide faster getgid and geteuid. If you do this with
Florian's hack it could help.
  
Andrew Pinski June 11, 2015, 7:08 a.m. UTC | #43
On Thu, Jun 11, 2015 at 2:52 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Jun 11, 2015 at 01:30:51PM +0800, Andrew Pinski wrote:
>> On Thu, Jun 11, 2015 at 2:58 AM, Steven Munroe
>> <munroesj@linux.vnet.ibm.com> wrote:
>> > On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
>> >> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
>> >> >
>> >> >
>> >> > On 10-06-2015 12:09, Ondřej Bílka wrote:
>> >> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
>> >> > >>
>> >> > >>
>> >> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
>> >> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
>> >> > >>>> I agree that adding an API to modify the current hwcap is not a good
>> >> > >>>> approach. However the cost you are assuming here are *very* x86 biased,
>> >> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>)
>> >> > >>>> to load an external variable defined in a shared library, where for
>> >> > >>>> powerpc it is more costly:
>> >> > >>>
>> >> > >>> debian codesearch found 4 references to __builtin_cpu_supports
>> >> > >>> all seem to avoid using it repeatedly.
>> >> > >>>
>> >> > >>> multiversioning dispatch only happens at startup (for a small
>> >> > >>> number of functions according to existing practice).
>> >> > >>>
>> >> > >>> so why is hwcap expected to be used in hot loops?
>> >> > >>>
>> >> > >>
>> >> snip
>> >> > And my understanding is to optimize hwcap access to provide a 'better' way
>> >> > to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
>> >> > function selection, but it does not exclude that accessing hwcap through
>> >> > TLS is *faster* than current options. It is up to developer to decide to use
>> >> > either IFUNC or __builtin_cpu_supports. If the developer will use it in
>> >> > hot loops or not, it is up to them to profile and use another way.
>> >> >
>> >> > You can say the same about current x86 __builtin_cpu_supports support: you should
>> >> > not use in loops, you should use ifunc, whatever.
>> >>
>> >> Sorry but no again. We are talking here on difference between variable
>> >> access and tcb access. You forgot to count total cost. That includes
>> >> initialization overhead per thread to initialize hwcap, increased
>> >> per-thread memory usage, maintainance burden and increased cache misses.
>> >> If you access hwcap only rarely as you should then per-thread copies
>> >> would introduce cache miss that is more costy than GOT overhead. In GOT
>> >> case it could be avoided as combined threads would access it more often.
>> >>
>> > Actually Adhemerval does have the knowledge, background, and experience
>> > to understand this difference and accurately access the trade-offs.
>>
>> Yes and the trade-offs for Power are going to be different than the
>> trade-offs for AARCH64 and x86_64.  And it gets harder for AARCH64
>> really as there are many micro-architectures and not controlled by
>> just one vendor (this is getting off topic).
>>
>>
> But I was talking about general trade off that you shouldn't do
> instruction selection frequently. You should select granularity that
> makes overhead of selection itself insignificant. If there is small
> function that requires it you should inline it or resolve which variant
> to do in caller. That stays true on all platforms.
>>
>> >
>> >> So if your multithreaded application access hwcap maybe 10 times per run
>> >> you would likely harm performance.
>> >>
>> > Sorry this is not an accurate assessment as the proposed fields are in
>> > the same cache line as other more frequently accessed fields of the TCB.
>> >
>> > The proposal will not effectively increase the cache foot-print.
>>
>> very true, it might actually decrease it :).
>>
> Are you claiming that adding a unused fields to between frequently used
> fields of structure decreases cache footprint?
>
> Or are you claiming that at least 10% of applications on powerpc will
> frequently access hwcap?
>
> As I said before provide evidence. Naturally if 90% of applications
> wouldn't access hwcap then it would probably increase memory footprint
> as you add unused field per thread.
>
> I am talking about average impact. I could say about almost anything
> that in best case it decreases cache footprint. For example that by
> chance adding variable makes frequently used firefox tls structure
> aligned to 64 bytes.
>
>
>> >
>> >> I could from my head tell ten functions that with tcb entry lead to much
>> >> bigger performance gains. So if this is applicable I will submit strspn
>> >> improvement that keeps 32 bitmask and checks if second argument didn't
>> >> changed. That would be better usage of tls than keeping hwcap data.
>> >>
>> > If you are suggestion saving results across strspn calls then a normal
>> > TLS variable would be an appropriate choice.
>> >
>> > This proposal covers a different situation.
>> >
>> >
>> > /soap box
>> > While I am no expert in all things and try not to comment on things
>> > which I really don't have the expertise (especially other platforms), I
>> > do know a lot about the POWER platform.
>> >
>> > I am responsible for the overall delivery of the open source toolchain
>> > for Linux on Power. GLIBC is just one component of many that needs to be
>> > coordinated for delivery. I also get involved directly with Linux
>> > customers and try to respond to issues they identify. As such I am in a
>> > good position to see how all the pieces (hardware, software, ABIs, ...)
>> > fit together and where they can be made better.
>> >
>> > With this larger responsibility, I don't have much time to quibble over
>> > the fine point of esoteric design. So I tend to short cut to conclusions
>> > and support my team.
>>
>> I know how it feels, I am in the same boat.  Usually my suggestions
>> are more aimed at getting some free work done for myself :).
>> But I actually like this proposal and even thinking about it for
>> AARCH64 with both hwcap and another AUVX varaible.
>>
> Which ones and why not reparse parse entire AUXV to translate each
> getauxval(x) to have static offset for each.

The one (MIDR) which is equivalent of doing cpuid on x86.  I still
need to submit the kernel patch for this but that will be next week.
HWCAP is not enough in this case as there are going to be many more
micro-architectures and even different passes (major revisions) of the
same micro-architecture might have slightly different behavior (I
already know of one but I can't say anything more than that).

Thanks,
Andrew

>
> if (__builtin_constant_p(x) && x == foo)
> &(auxval_hack_foo)
>
> That would provide faster getgid and geteuid. If you do this with
> Florian's hack it could help.
>
>
  
Steven Munroe June 25, 2015, 3:58 p.m. UTC | #44
On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > > 
> > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > > without going through the overhead of reading them from the auxiliary
> > > > > > vector.
> > > > 
> > > > > i assume this is for multi-versioning.
> > > > 
> > > > The intent is for the compiler to implement the equivalent of
> > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > > efficiently as getauxv and scanning the auxv is too slow for inline
> > > > optimizations.
> > > > 
>Snip

After all was said and done, much more was said then done ....

Sorry I have been on vacation and them catching up on day job from being
on vacation. 

But i think we need to reset the discussion and hopefully eliminate some
misconceptions:

1) This is not about the clever things what this clever things that this
community knows how to do, it is what the average Linux application
developer is willing to learn and use. 

I have tried to get them to use; CPU Platform libraries (library search
based on AT_PLATFORM). the AuxV and HWCAP directly, and use IFUNC. They
will not do this. 

They think this is all silly and too complicated. But we still want them
to exploit features of the latest processor while continuing to run on
existing processors in the field. Processor architectures evolve and we
have to give them a simple mechanism that they will actually use, to
handle this.  __builtin_cpu_supports() seems to be something they will
use.

2) This is not about exposing a private GLIBC resource (TCB) to the the
compiler. The TCB and TLS is part of the Platform ABI and must be known,
used, and understood by the compiler (GCC, LLVM, ...) binutils,
debuggers, etc in addition to GLIBC:

Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for
Linux Supplement: Section 3.7.2 TLS Runtime Handling

This and other useful documents are available from the OpenPOWER
Foundation: http://openpowerfoundation.org/

If you look, you will see that TCB slots have already been allocated to
support other PowerISA specific features like; Event Based Branching,
Dynamic System Optimization, and Target Address Save. Recently we added
split-stack support for the GO language that required a TCB slot. So
adding a double word slot to cache AT_HWCAP and AT_HWCAP2 is no big
deal.

So far this all fits nicely in a single 128 byte cache-line. The TLS ABI
(which I defined back in back in 2004) reserved a full 4KB for the TCB
and extensions.

This all was not done lightly and was discussed extensively with the
appropriate developers in the corresponding projects. You all may not
have seen this because GLIBC not directly involved except as the owner
of ./sysdeps/powerpc/nptl/tls.h

The only reason we raised this discussion here because we wanted to
publish a platform specific API
in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make is easier for
the compilers to access it. And we felt it would be rude not discuss
this with the community.

3) I would think that the platform maintainers would have the standing
to implement their own platform ABI? Perhaps the project maintainers
would like to weigh in on this?

4) I have ask Carlos Seo to develop some micro benchmarks to illuminate
the performance implications of the various alternatives to the direct
TCB access proposal. If necessarily we can provide detail cycle accurate
instruction pipeline timings.
  
Ondrej Bilka June 26, 2015, 4:59 a.m. UTC | #45
On Thu, Jun 25, 2015 at 10:58:46AM -0500, Steven Munroe wrote:
> On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote:
> > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> > > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > > > 
> > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > > > without going through the overhead of reading them from the auxiliary
> > > > > > > vector.
> > > > > 
> > > > > > i assume this is for multi-versioning.
> > > > > 
> > > > > The intent is for the compiler to implement the equivalent of
> > > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > > > efficiently as getauxv and scanning the auxv is too slow for inline
> > > > > optimizations.
> > > > > 
> >Snip
> 
> After all was said and done, much more was said then done ....
> 
> Sorry I have been on vacation and them catching up on day job from being
> on vacation. 
> 
> But i think we need to reset the discussion and hopefully eliminate some
> misconceptions:
> 
> 1) This is not about the clever things what this clever things that this
> community knows how to do, it is what the average Linux application
> developer is willing to learn and use. 
> 
No, discussion is about what will lead to biggest overall performance
gain. Clearly a best solution would be have compiler that automatically
produces best code for each cpu, average application developer doesn't
have to learn anything.

> I have tried to get them to use; CPU Platform libraries (library search
> based on AT_PLATFORM). the AuxV and HWCAP directly, and use IFUNC. They
> will not do this. 
> 
> They think this is all silly and too complicated. But we still want them
> to exploit features of the latest processor while continuing to run on
> existing processors in the field. Processor architectures evolve and we
> have to give them a simple mechanism that they will actually use, to
> handle this.  __builtin_cpu_supports() seems to be something they will
> use.
>
There is error in reasoning: Something needs to be done. X is something. So
X needs to be done.

They are wrong that ifunc, AT_PLATFORM are silly but correct that its
complicated because problem is complicated.

As I said before it could be more harm than good. One example app
programmer uses __builtin_cpu_supports but compiles file with
-mcpu=power8 to get features he want. Then after upgrading gcc
application breaks as gcc inserted unsupported instruction into
nonpower8 branch.

Also its dubious that average programmer could do better than gcc with
correct -mcpu flag. I asked before if you could measure impact of
compiling applications with correct -mcpu and if hwcap could beat it.

For these you need distro maintainers setup compiling with
AT_PLATFORM... and that will also cover libraries where developers don't
care about powerpc niche platform.

If programmers don't use something it means that interface is bad and
you should come with better interface.

A best interface would be tell them to use flags -O3 -mmulticpu 
where -mmulticpu would take care of details by using
AT_PLATFORM/ifuncs...

Or you could tell them to use __attribute__((multicpu)) for hot
functions, below is how to implement that with macro that wraps ifunc,
would they do better than just adding this to each function that shows
more than 1% of total time in profile?

int foo (double x, double y) __attribute__((multicpu))
{
  return x * y;
}

or

multicpu (int, foo, (x, y) (double x, double y))
{
  return x * y;
}

with

#define multicpu(tp, name, arg, tparg) \
tp __##name tparg; \
tp __##name##_power5 tparg __attribute__((__target__("cpu=power5")))\
{ \
  return (tp) __##name arg; \
} \
tp __##name##_power6 tparg __attribute__((__target__("cpu=power6")))\
{ \
  return (tp) __##name arg; \
} \
tp name tparg \
{ \
 /* select ifunc */ \
} \
tp __##name tparg 


Also did you tried to ask application programmers after they used
__builtin_cpu_supports if they tested it on both machines?

Thas pretty basic and it wouldn't be surprise that it would regulary
introduce regressions as feature needs to be used in certain way.

I recalled new pitfall that user needs to ensure gains are more than
savings. How big is typically powerpc branch cache? If user adds
__builtin_cpu_supports checks to less frequent functions it may be
always mispredicted as it isn't in cache and you pay for increased code
size.


If situation is same as at x64 then if cpu supports foo means nothing.
You need to be quite careful how do you use feature to get improvement.

For example take optimizing loop with avx/avx2. You have three choices
1. use 256 bit loads/stores and loop operation
2. use 128 bit loads/stores and merge/split them for loop operation
3. use 128 bit loads/stores and 128bit loop operation.

What you choose depends on if you do unaligned loads/stores or not. As
these are quite expensive on fx10 you need to special case it even that
it supports avx. On ivy bridge splitting/merging gives performance
improvement but penalty is smaller. On haswell a 256bit loads/stores are
faster that splitting/merging.

That was quite simple example. To complicate matters more even with
haswell 256 bit loads/stores have big latency so you need to use them
only in loops.

 
> 2) This is not about exposing a private GLIBC resource (TCB) to the the
> compiler. The TCB and TLS is part of the Platform ABI and must be known,
> used, and understood by the compiler (GCC, LLVM, ...) binutils,
> debuggers, etc in addition to GLIBC:
> 
> Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for
> Linux Supplement: Section 3.7.2 TLS Runtime Handling
> 
> This and other useful documents are available from the OpenPOWER
> Foundation: http://openpowerfoundation.org/
> 
> If you look, you will see that TCB slots have already been allocated to
> support other PowerISA specific features like; Event Based Branching,
> Dynamic System Optimization, and Target Address Save. Recently we added
> split-stack support for the GO language that required a TCB slot. So
> adding a double word slot to cache AT_HWCAP and AT_HWCAP2 is no big
> deal.
> 
> So far this all fits nicely in a single 128 byte cache-line. The TLS ABI
> (which I defined back in back in 2004) reserved a full 4KB for the TCB
> and extensions.
> 
> This all was not done lightly and was discussed extensively with the
> appropriate developers in the corresponding projects. You all may not
> have seen this because GLIBC not directly involved except as the owner
> of ./sysdeps/powerpc/nptl/tls.h
> 
You should say first that it uses reserved memory. 

So it isn't issue now. But if plt is as expensive as you say it will
quickly fill up. Save strcmp address in tcb to improve performance as
strcmp is most called function in libc and you would save several
magnitudes more on plt indirections than rarer hwcap. Then continue with
less called functions until that makes sense.


> The only reason we raised this discussion here because we wanted to
> publish a platform specific API
> in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make is easier for
> the compilers to access it. And we felt it would be rude not discuss
> this with the community.
> 
> 3) I would think that the platform maintainers would have the standing
> to implement their own platform ABI? Perhaps the project maintainers
> would like to weigh in on this?
> 
> 4) I have ask Carlos Seo to develop some micro benchmarks to illuminate
> the performance implications of the various alternatives to the direct
> TCB access proposal. If necessarily we can provide detail cycle accurate
> instruction pipeline timings. 
>
Please benchmarks, microbenchmarks are not very useful, they measure
small constant c in expression c*x - y where positive is improvement. If x is hundred times y then exact value of c doesn't matter.

There is still unknown basic use cases, it doesn't make sense do
detailed measurement only to find that it on average saves hundred
cycles per app but its used by one app in thousand and it costs each which
doesn't use it a cycle. Thats net loss. Also performance will wary
depending how frequent is usage, when its mostly on cold code then you
have problems that hwcap branch is always mispredicted and increased
instruction cache usage so not using hwcap could be better if you do
only small saving.

So get some of these average programers, let them optimize some app with
hwcap and then check result.
  
Steven Munroe June 26, 2015, 4:27 p.m. UTC | #46
On Fri, 2015-06-26 at 06:59 +0200, Ondřej Bílka wrote:
> On Thu, Jun 25, 2015 at 10:58:46AM -0500, Steven Munroe wrote:
> > On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote:
> > > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> > > > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > > > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > > > > 
> > > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > > > > without going through the overhead of reading them from the auxiliary
> > > > > > > > vector.
> > > > > > 
> > > > > > > i assume this is for multi-versioning.
> > > > > > 
> > > > > > The intent is for the compiler to implement the equivalent of
> > > > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > > > > efficiently as getauxv and scanning the auxv is too slow for inline
> > > > > > optimizations.
> > > > > > 
> > >Snip
> > 
> > After all was said and done, much more was said then done ....
> > 
> > Sorry I have been on vacation and them catching up on day job from being
> > on vacation. 
> > 
> > But i think we need to reset the discussion and hopefully eliminate some
> > misconceptions:
> > 
> > 1) This is not about the clever things what this clever things that this
> > community knows how to do, it is what the average Linux application
> > developer is willing to learn and use. 
> > 
> No, discussion is about what will lead to biggest overall performance
> gain. Clearly a best solution would be have compiler that automatically
> produces best code for each cpu, average application developer doesn't
> have to learn anything.
> 
Unfortunately this is not a realistic expectation in the real world.
Nothing is ever as simple are you would dlike.


> > I have tried to get them to use; CPU Platform libraries (library search
> > based on AT_PLATFORM). the AuxV and HWCAP directly, and use IFUNC. They
> > will not do this. 
> > 
> > They think this is all silly and too complicated. But we still want them
> > to exploit features of the latest processor while continuing to run on
> > existing processors in the field. Processor architectures evolve and we
> > have to give them a simple mechanism that they will actually use, to
> > handle this.  __builtin_cpu_supports() seems to be something they will
> > use.
> >
> There is error in reasoning: Something needs to be done. X is something. So
> X needs to be done.
> 
> They are wrong that ifunc, AT_PLATFORM are silly but correct that its
> complicated because problem is complicated.
> 
> As I said before it could be more harm than good. One example app
> programmer uses __builtin_cpu_supports but compiles file with
> -mcpu=power8 to get features he want. Then after upgrading gcc
> application breaks as gcc inserted unsupported instruction into
> nonpower8 branch.
> 
> Also its dubious that average programmer could do better than gcc with
> correct -mcpu flag. I asked before if you could measure impact of
> compiling applications with correct -mcpu and if hwcap could beat it.
> 
> For these you need distro maintainers setup compiling with
> AT_PLATFORM... and that will also cover libraries where developers don't
> care about powerpc niche platform.
> 
> If programmers don't use something it means that interface is bad and
> you should come with better interface.
> 
> A best interface would be tell them to use flags -O3 -mmulticpu 
> where -mmulticpu would take care of details by using
> AT_PLATFORM/ifuncs...
> 
> Or you could tell them to use __attribute__((multicpu)) for hot
> functions, below is how to implement that with macro that wraps ifunc,
> would they do better than just adding this to each function that shows
> more than 1% of total time in profile?
> 
> int foo (double x, double y) __attribute__((multicpu))
> {
>   return x * y;
> }
> 
> or
> 
> multicpu (int, foo, (x, y) (double x, double y))
> {
>   return x * y;
> }
> 
> with
> 
> #define multicpu(tp, name, arg, tparg) \
> tp __##name tparg; \
> tp __##name##_power5 tparg __attribute__((__target__("cpu=power5")))\
> { \
>   return (tp) __##name arg; \
> } \
> tp __##name##_power6 tparg __attribute__((__target__("cpu=power6")))\
> { \
>   return (tp) __##name arg; \
> } \
> tp name tparg \
> { \
>  /* select ifunc */ \
> } \
> tp __##name tparg 
> 
> 
> Also did you tried to ask application programmers after they used
> __builtin_cpu_supports if they tested it on both machines?
> 
> Thas pretty basic and it wouldn't be surprise that it would regulary
> introduce regressions as feature needs to be used in certain way.
> 
> I recalled new pitfall that user needs to ensure gains are more than
> savings. How big is typically powerpc branch cache? If user adds
> __builtin_cpu_supports checks to less frequent functions it may be
> always mispredicted as it isn't in cache and you pay for increased code
> size.
> 
You assume a lot.

You assume my team ans I do not know these techniques. We do.

You assume my team and I do not practice these technique in our own
code. We do.

You assume we do not advise our customers to use these techniques and
provide documentations on this topics. We do:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550


> 
> If situation is same as at x64 then if cpu supports foo means nothing.
> You need to be quite careful how do you use feature to get improvement.
> 
> For example take optimizing loop with avx/avx2. You have three choices
> 1. use 256 bit loads/stores and loop operation
> 2. use 128 bit loads/stores and merge/split them for loop operation
> 3. use 128 bit loads/stores and 128bit loop operation.
> 
> What you choose depends on if you do unaligned loads/stores or not. As
> these are quite expensive on fx10 you need to special case it even that
> it supports avx. On ivy bridge splitting/merging gives performance
> improvement but penalty is smaller. On haswell a 256bit loads/stores are
> faster that splitting/merging.
> 
> That was quite simple example. To complicate matters more even with
> haswell 256 bit loads/stores have big latency so you need to use them
> only in loops.
> 
You assume that my team and I do not about loop unrolling. We do.

You assume that we do not tell our customers this. We do.

However in this discussion, performance characteristics for Intel
processors are irrelevant.

> 
> > 2) This is not about exposing a private GLIBC resource (TCB) to the the
> > compiler. The TCB and TLS is part of the Platform ABI and must be known,
> > used, and understood by the compiler (GCC, LLVM, ...) binutils,
> > debuggers, etc in addition to GLIBC:
> > 
> > Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for
> > Linux Supplement: Section 3.7.2 TLS Runtime Handling
> > 
> > This and other useful documents are available from the OpenPOWER
> > Foundation: http://openpowerfoundation.org/
> > 
> > If you look, you will see that TCB slots have already been allocated to
> > support other PowerISA specific features like; Event Based Branching,
> > Dynamic System Optimization, and Target Address Save. Recently we added
> > split-stack support for the GO language that required a TCB slot. So
> > adding a double word slot to cache AT_HWCAP and AT_HWCAP2 is no big
> > deal.
> > 
> > So far this all fits nicely in a single 128 byte cache-line. The TLS ABI
> > (which I defined back in back in 2004) reserved a full 4KB for the TCB
> > and extensions.
> > 
> > This all was not done lightly and was discussed extensively with the
> > appropriate developers in the corresponding projects. You all may not
> > have seen this because GLIBC not directly involved except as the owner
> > of ./sysdeps/powerpc/nptl/tls.h
> > 
> You should say first that it uses reserved memory. 
> 
> So it isn't issue now. But if plt is as expensive as you say it will
> quickly fill up. Save strcmp address in tcb to improve performance as
> strcmp is most called function in libc and you would save several
> magnitudes more on plt indirections than rarer hwcap. Then continue with
> less called functions until that makes sense.
> 
You assume my team and I do not know the performance characteristics of
our own platform. We do. 

You too could learn more by reading the 'POWER8 Processor User’s Manual
for the Single-Chip Module' Available on OpenPOWER.org


> 
> > The only reason we raised this discussion here because we wanted to
> > publish a platform specific API
> > in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make is easier for
> > the compilers to access it. And we felt it would be rude not discuss
> > this with the community.
> > 
> > 3) I would think that the platform maintainers would have the standing
> > to implement their own platform ABI? Perhaps the project maintainers
> > would like to weigh in on this?
> > 
> > 4) I have ask Carlos Seo to develop some micro benchmarks to illuminate
> > the performance implications of the various alternatives to the direct
> > TCB access proposal. If necessarily we can provide detail cycle accurate
> > instruction pipeline timings. 
> >
> Please benchmarks, microbenchmarks are not very useful, they measure
> small constant c in expression c*x - y where positive is improvement. If x is hundred times y then exact value of c doesn't matter.
> 
You assume that I do not know how to development a benchmarks that are
repeatable and meaningful. I do. How many books have you published on
that topic?

You don't know my platform.

You don't know my customers.

You don't know my team.

You don't know me.

But you assume a lot that is just irrelevant and or not factually true.

At this point you are acting like a troll that just disagrees with
everything said.

> There is still unknown basic use cases, it doesn't make sense do
> detailed measurement only to find that it on average saves hundred
> cycles per app but its used by one app in thousand and it costs each which
> doesn't use it a cycle. Thats net loss. Also performance will wary
> depending how frequent is usage, when its mostly on cold code then you
> have problems that hwcap branch is always mispredicted and increased
> instruction cache usage so not using hwcap could be better if you do
> only small saving.
> 
Again I have to live in the real world and deal with real customers who
are not too interested in my platform problems. They just want a
simple/quick solution that is easy for them to understand.

I am just trying to provide an option for them to use.

> So get some of these average programers, let them optimize some app with
> hwcap and then check result. 
> 

We are done with this discussion.
  
Richard Henderson June 29, 2015, 10:53 a.m. UTC | #47
On 06/09/2015 04:06 PM, Steven Munroe wrote:
> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>
>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>> The proposed patch adds a new feature for powerpc. In order to get
>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>> This enables users to write versioned code based on the HWCAP bits
>>> without going through the overhead of reading them from the auxiliary
>>> vector.
>
>> i assume this is for multi-versioning.
>
> The intent is for the compiler to implement the equivalent of
> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> efficiently as getauxv and scanning the auxv is too slow for inline
> optimizations.
>

There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads 
the variables private to glibc that already contain this information.  That 
ought to be fast enough for the builtin, rather than consuming space in the TCB.



r~
  
Steven Munroe June 29, 2015, 6:37 p.m. UTC | #48
On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> >>
> >> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> >>> The proposed patch adds a new feature for powerpc. In order to get
> >>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> >>> This enables users to write versioned code based on the HWCAP bits
> >>> without going through the overhead of reading them from the auxiliary
> >>> vector.
> >
> >> i assume this is for multi-versioning.
> >
> > The intent is for the compiler to implement the equivalent of
> > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > efficiently as getauxv and scanning the auxv is too slow for inline
> > optimizations.
> >
> 
> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads 
> the variables private to glibc that already contain this information.  That 
> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> 

Richard I do not understand how a 38 instruction function accessed via a
PLT call stub (minimum 4 additional instructions) is equivalent or "as
good as" a single in-line load instruction.

Even with best case path for getauxval HWCAP2 we are at 14 instructions
with exposure to 3 different branch miss predicts. And that is before
the application can execute its own __builtin_cpu_supports() test.

Lets look at a real customer example. The customer wants to use the P8
128-bit add/sub but also wants to be able to unit test code on existing
P7 machines. Which results in something like this:

static inline vui32_t
vec_addcuq (vui32_t a, vui32_t b)
{
        vui32_t t;

                if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
                {
                
                        __asm__(
                            "vaddcuq %0,%1,%2;"
                            : "=v" (t)
                            : "v" (a),
                              "v" (b)
                            : );
                }
                else
                        vui32_t c, c2, co;
                        vui32_t z= {0,0,0,0};
                        __asm__(
                            "vaddcuw %3,%4,%5;\n"
                            "\tvadduwm %0,%4,%5;\n"
                            "\tvsldoi %1,%3,%6,4;\n"
                            "\tvaddcuw %2,%0,%1;\n"
                            "\tvadduwm %0,%0,%1;\n"
                            "\tvor %3,%3,%2;\n"
                            "\tvsldoi %1,%2,%6,4;\n"
                            "\tvaddcuw %2,%0,%1;\n"
                            "\tvadduwm %0,%0,%1;\n"
                            "\tvor %3,%3,%2;\n"
                            "\tvsldoi %1,%2,%6,4;\n"
                            "\tvadduwm %0,%0,%1;\n"
                            : "=&v" (t), /* 0 */
                              "=&v" (c), /* 1 */
                              "=&v" (c2), /* 2 */
                              "=&v" (co) /* 3 */
                            : "v" (a), /* 4 */
                              "v" (b), /* 5 */
                              "v" (z)  /* 6 */
                            : );
                        t = co;
                }
        return (t);
}

So it is clear to me that executing 14+ instruction to decide if I can
optimize to use new single instruction optimization is not a good deal.

One instruction (plus the __builtin_cpu_supports which should be and
immediate,  branch conditional) is a better deal. Inlining so the
compiler can do common sub-expression about larger blocks is an even
better deal.

I just do not understand why there is so much resistance to this simple
platform ABI specific request.
  
Ondrej Bilka June 29, 2015, 9:18 p.m. UTC | #49
On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> Lets look at a real customer example. The customer wants to use the P8
> 128-bit add/sub but also wants to be able to unit test code on existing
> P7 machines. Which results in something like this:
> 
> static inline vui32_t
> vec_addcuq (vui32_t a, vui32_t b)
> {
>         vui32_t t;
> 
>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
>                 {
>                 
>                         __asm__(
>                             "vaddcuq %0,%1,%2;"
>                             : "=v" (t)
>                             : "v" (a),
>                               "v" (b)
>                             : );
>                 }
>                 else
>                         vui32_t c, c2, co;
>                         vui32_t z= {0,0,0,0};
>                         __asm__(
>                             "vaddcuw %3,%4,%5;\n"
>                             "\tvadduwm %0,%4,%5;\n"
>                             "\tvsldoi %1,%3,%6,4;\n"
>                             "\tvaddcuw %2,%0,%1;\n"
>                             "\tvadduwm %0,%0,%1;\n"
>                             "\tvor %3,%3,%2;\n"
>                             "\tvsldoi %1,%2,%6,4;\n"
>                             "\tvaddcuw %2,%0,%1;\n"
>                             "\tvadduwm %0,%0,%1;\n"
>                             "\tvor %3,%3,%2;\n"
>                             "\tvsldoi %1,%2,%6,4;\n"
>                             "\tvadduwm %0,%0,%1;\n"
>                             : "=&v" (t), /* 0 */
>                               "=&v" (c), /* 1 */
>                               "=&v" (c2), /* 2 */
>                               "=&v" (co) /* 3 */
>                             : "v" (a), /* 4 */
>                               "v" (b), /* 5 */
>                               "v" (z)  /* 6 */
>                             : );
>                         t = co;
>                 }
>         return (t);
> }
> 
> So it is clear to me that executing 14+ instruction to decide if I can
> optimize to use new single instruction optimization is not a good deal.
>
No, this is prime example that average programmers shouldn't use hwcap
as that results in moronic code like this.

When you poorly reinvent wheel you will get terrible performance like
fallback here. Gcc already has 128 ints so tell average programmers to
use them instead and don't touch features that they don't understand.

As gcc compiles addition into pair of addc, adde instructions a
performance gain is minimal while code is harder to maintain. Due to
pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
on following example on power8.


int main()
{
  unsigned long i;
  __int128 u = 0;
//long u = 0;
  for (i = 0; i < 1000000000; i++)
    u += i * i;
  return u >> 35;
}

[neleai@gcc2-power8 ~]$ gcc uu.c -O3
[neleai@gcc2-power8 ~]$ time ./a.out 

real	0m0.957s
user	0m0.956s
sys	0m0.001s

[neleai@gcc2-power8 ~]$ vim uu.c 
[neleai@gcc2-power8 ~]$ gcc uu.c -O3
[neleai@gcc2-power8 ~]$ time ./a.out 

real	0m1.040s
user	0m1.039s
sys	0m0.001s


 
> One instruction (plus the __builtin_cpu_supports which should be and
> immediate,  branch conditional) is a better deal. Inlining so the
> compiler can do common sub-expression about larger blocks is an even
> better deal.
> 
That doesn't change fact that its mistake. A code above was bad as it
added check for single instruction that takes a cycle. When difference
between implementations is few cycles then each cycle matter (otherwise
you should just stick to generic one). Then a hwcap check itself causes
slowdown that matters and you should use ifunc to eliminate.

Or hope that its moved out of loop, but when its loop with 100
iterations a __builtin_cpu_supports time becomes imaterial.


> I just do not understand why there is so much resistance to this simple
> platform ABI specific request.
  
Adhemerval Zanella Netto June 29, 2015, 9:48 p.m. UTC | #50
On 29-06-2015 18:18, Ondřej Bílka wrote:
> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>> Lets look at a real customer example. The customer wants to use the P8
>> 128-bit add/sub but also wants to be able to unit test code on existing
>> P7 machines. Which results in something like this:
>>
>> static inline vui32_t
>> vec_addcuq (vui32_t a, vui32_t b)
>> {
>>         vui32_t t;
>>
>>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
>>                 {
>>                 
>>                         __asm__(
>>                             "vaddcuq %0,%1,%2;"
>>                             : "=v" (t)
>>                             : "v" (a),
>>                               "v" (b)
>>                             : );
>>                 }
>>                 else
>>                         vui32_t c, c2, co;
>>                         vui32_t z= {0,0,0,0};
>>                         __asm__(
>>                             "vaddcuw %3,%4,%5;\n"
>>                             "\tvadduwm %0,%4,%5;\n"
>>                             "\tvsldoi %1,%3,%6,4;\n"
>>                             "\tvaddcuw %2,%0,%1;\n"
>>                             "\tvadduwm %0,%0,%1;\n"
>>                             "\tvor %3,%3,%2;\n"
>>                             "\tvsldoi %1,%2,%6,4;\n"
>>                             "\tvaddcuw %2,%0,%1;\n"
>>                             "\tvadduwm %0,%0,%1;\n"
>>                             "\tvor %3,%3,%2;\n"
>>                             "\tvsldoi %1,%2,%6,4;\n"
>>                             "\tvadduwm %0,%0,%1;\n"
>>                             : "=&v" (t), /* 0 */
>>                               "=&v" (c), /* 1 */
>>                               "=&v" (c2), /* 2 */
>>                               "=&v" (co) /* 3 */
>>                             : "v" (a), /* 4 */
>>                               "v" (b), /* 5 */
>>                               "v" (z)  /* 6 */
>>                             : );
>>                         t = co;
>>                 }
>>         return (t);
>> }
>>
>> So it is clear to me that executing 14+ instruction to decide if I can
>> optimize to use new single instruction optimization is not a good deal.
>>
> No, this is prime example that average programmers shouldn't use hwcap
> as that results in moronic code like this.
> 
> When you poorly reinvent wheel you will get terrible performance like
> fallback here. Gcc already has 128 ints so tell average programmers to
> use them instead and don't touch features that they don't understand.

Again your patronizing tone only shows your lack of knowledge in this
subject: the above code aims to use ISA 2.07 *vector* instructions to
multiply 128-bits integer in vector *registers*. It has nothing to do
with uint128_t support on GCC and only recently GCC added support to 
such builtins [1]. And although there is plan to add support to use 
vector instruction for uint128_t, right now they are done in GRP register
in powerpc.

Also, it is up to developers to select the best way to use the CPU
features.  Although I am not very found of providing the hwcap in TCB
(my suggestion was to use local __thread in libgcc instead), the idea
here is to provide *tools*.

[1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html

> 
> As gcc compiles addition into pair of addc, adde instructions a
> performance gain is minimal while code is harder to maintain. Due to
> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> on following example on power8.
> 
> 
> int main()
> {
>   unsigned long i;
>   __int128 u = 0;
> //long u = 0;
>   for (i = 0; i < 1000000000; i++)
>     u += i * i;
>   return u >> 35;
> }
> 
> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> [neleai@gcc2-power8 ~]$ time ./a.out 
> 
> real	0m0.957s
> user	0m0.956s
> sys	0m0.001s
> 
> [neleai@gcc2-power8 ~]$ vim uu.c 
> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> [neleai@gcc2-power8 ~]$ time ./a.out 
> 
> real	0m1.040s
> user	0m1.039s
> sys	0m0.001s

This is due the code is not using any vector instruction, which is the aim of the
code snippet Steven has posted.  Also, it really depends in which mode the CPU is
set, on a POWER split-core mode, where the CPU dispatch groups are shared among
threads in an non-dynamic way the difference is bigger:

[[fedora@glibc-ppc64le ~]$ time ./test

real    0m1.730s
user    0m1.726s
sys     0m0.003s
[fedora@glibc-ppc64le ~]$ time ./test-long

real    0m1.593s
user    0m1.591s
sys     0m0.002s

> 
> 
>  
>> One instruction (plus the __builtin_cpu_supports which should be and
>> immediate,  branch conditional) is a better deal. Inlining so the
>> compiler can do common sub-expression about larger blocks is an even
>> better deal.
>>
> That doesn't change fact that its mistake. A code above was bad as it
> added check for single instruction that takes a cycle. When difference
> between implementations is few cycles then each cycle matter (otherwise
> you should just stick to generic one). Then a hwcap check itself causes
> slowdown that matters and you should use ifunc to eliminate.
> 
> Or hope that its moved out of loop, but when its loop with 100
> iterations a __builtin_cpu_supports time becomes imaterial.
> 
> 
>> I just do not understand why there is so much resistance to this simple
>> platform ABI specific request.
> 
>
  
Ondrej Bilka June 30, 2015, 3:14 a.m. UTC | #51
On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 29-06-2015 18:18, Ondřej Bílka wrote:
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub but also wants to be able to unit test code on existing
> >> P7 machines. Which results in something like this:
> >>
> >> static inline vui32_t
> >> vec_addcuq (vui32_t a, vui32_t b)
> >> {
> >>         vui32_t t;
> >>
> >>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
> >>                 {
> >>                 
> >>                         __asm__(
> >>                             "vaddcuq %0,%1,%2;"
> >>                             : "=v" (t)
> >>                             : "v" (a),
> >>                               "v" (b)
> >>                             : );
> >>                 }
> >>                 else
> >>                         vui32_t c, c2, co;
> >>                         vui32_t z= {0,0,0,0};
> >>                         __asm__(
> >>                             "vaddcuw %3,%4,%5;\n"
> >>                             "\tvadduwm %0,%4,%5;\n"
> >>                             "\tvsldoi %1,%3,%6,4;\n"
> >>                             "\tvaddcuw %2,%0,%1;\n"
> >>                             "\tvadduwm %0,%0,%1;\n"
> >>                             "\tvor %3,%3,%2;\n"
> >>                             "\tvsldoi %1,%2,%6,4;\n"
> >>                             "\tvaddcuw %2,%0,%1;\n"
> >>                             "\tvadduwm %0,%0,%1;\n"
> >>                             "\tvor %3,%3,%2;\n"
> >>                             "\tvsldoi %1,%2,%6,4;\n"
> >>                             "\tvadduwm %0,%0,%1;\n"
> >>                             : "=&v" (t), /* 0 */
> >>                               "=&v" (c), /* 1 */
> >>                               "=&v" (c2), /* 2 */
> >>                               "=&v" (co) /* 3 */
> >>                             : "v" (a), /* 4 */
> >>                               "v" (b), /* 5 */
> >>                               "v" (z)  /* 6 */
> >>                             : );
> >>                         t = co;
> >>                 }
> >>         return (t);
> >> }
> >>
> >> So it is clear to me that executing 14+ instruction to decide if I can
> >> optimize to use new single instruction optimization is not a good deal.
> >>
> > No, this is prime example that average programmers shouldn't use hwcap
> > as that results in moronic code like this.
> > 
> > When you poorly reinvent wheel you will get terrible performance like
> > fallback here. Gcc already has 128 ints so tell average programmers to
> > use them instead and don't touch features that they don't understand.
> 
> Again your patronizing tone only shows your lack of knowledge in this
> subject: the above code aims to use ISA 2.07

Sorry, but could you explain how did you come to conclusion to use ISA
2.07? Only check done is

> >>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))

And vsx is part of ISA 2.06

> *vector* instructions to
> multiply 128-bits integer in vector *registers*.

Your sentence has three problems.
1. From start of mail.
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub
2. Function is named vec_addcuq so accoring to name it does...? I guess
division.

3. Power isa describes these instructions pretty clearly:

Vector Add and Write Carry-Out Unsigned
Word                                        VX-form
vaddcuw      VRT,VRA,VRB
    4        VRT      VRA      VRB          384
0         6        11       16       21                31
do i=0 to 127 by 32
   aop      EXTZ((VRA)i:i+31)
   bop      EXTZ((VRB)i:i+31)
   VRTi:i+31    Chop( ( aop +int bop ) >>ui 32,1)
For each vector element i from 0 to 3, do the following.
    Unsigned-integer word element i in VRA is added
    to unsigned-integer word element i in VRB. The
    carry out of the 32-bit sum is zero-extended to 32
    bits and placed into word element i of VRT.
Special Registers Altered:
     None

If you still believe that it somehow does multiplication just try this
and see that result is all zeroes.

   __vector uint32_t x={3,2,0,3},y={0,0,0,0};
   y = vec_addcuq(x,x);
   printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);

Again your patronizing tone only shows your lack of knowledge of powerpc
assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/


I did mistake that I read to bit fast and seen only add instead of
instruction to get carry. Still thats with gpr two additions with carry,
then add zero with carry to set desired bit. 

> It has nothing to do
> with uint128_t support on GCC and only recently GCC added support to 
> such builtins [1]. And although there is plan to add support to use 
> vector instruction for uint128_t, right now they are done in GRP register
> in powerpc.
> 
Customer just wants to do 128 additions. If a fastest way
is with GPR registers then he should use gpr registers.

My claim was that this leads to slow code on power7. Fallback above
takes 14 cycles on power8 and 128bit addition is similarly slow.

Yes you could craft expressions that exploit vectors by doing ands/ors
with 128bit constants but if you mostly need to sum integers and use 128
bits to prevent overflows then gpr is correct choice due to transfer
cost.

> Also, it is up to developers to select the best way to use the CPU
> features.  Although I am not very found of providing the hwcap in TCB
> (my suggestion was to use local __thread in libgcc instead), the idea
> here is to provide *tools*.
>
If you want to provide tools then you should try to make best tool
possible instead of being satisfied with tool that poorly fits job and
is dangerous to use.

I am telling all time that there are better alternatives where this
doesn't matter.

One example would be write gcc pass that runs after early inlining to
find all functions containing __builtin_cpu_supports, cloning them to
replace it by constant and adding ifunc to automatically select variant.

You would also need to keep list of existing processor features to
remove nonexisting combinations. That easiest way to avoid combinatorial
explosion.



 
> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
> 
> > 
> > As gcc compiles addition into pair of addc, adde instructions a
> > performance gain is minimal while code is harder to maintain. Due to
> > pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> > on following example on power8.
> > 
> > 
> > int main()
> > {
> >   unsigned long i;
> >   __int128 u = 0;
> > //long u = 0;
> >   for (i = 0; i < 1000000000; i++)
> >     u += i * i;
> >   return u >> 35;
> > }
> > 
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out 
> > 
> > real	0m0.957s
> > user	0m0.956s
> > sys	0m0.001s
> > 
> > [neleai@gcc2-power8 ~]$ vim uu.c 
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out 
> > 
> > real	0m1.040s
> > user	0m1.039s
> > sys	0m0.001s
> 
> This is due the code is not using any vector instruction, which is the aim of the
> code snippet Steven has posted.

Wait do you want to have fast code or just show off your elite skills
with vector registers?

A vector 128bit addition is on power7 lot slower than 128bit addition in
gpr. This is valid use case when I produce 64bit integers and want to
compute their sum in 128bit variable. You could construct lot of use
cases where gpr wins, for example summing an array(possibly with applied
arithmetic expression).

Unless you show real world examples how could you prove that vector
registers are better choice?



>  Also, it really depends in which mode the CPU is
> set, on a POWER split-core mode, where the CPU dispatch groups are shared among
> threads in an non-dynamic way the difference is bigger:
> 
> [[fedora@glibc-ppc64le ~]$ time ./test
> 
> real    0m1.730s
> user    0m1.726s
> sys     0m0.003s
> [fedora@glibc-ppc64le ~]$ time ./test-long
> 
> real    0m1.593s
> user    0m1.591s
> sys     0m0.002s
> 
Difference? What difference? Only ratio matters to remove things like
different frequency of processors and that thread sharing slows you down
by constant. When I do math difference between these two ratios is 0.06%

1.593/1.730 = 0.9208092485549133
0.957/1.040 = 0.9201923076923076



> > 
> > 
> >  
> >> One instruction (plus the __builtin_cpu_supports which should be and
> >> immediate,  branch conditional) is a better deal. Inlining so the
> >> compiler can do common sub-expression about larger blocks is an even
> >> better deal.
> >>
> > That doesn't change fact that its mistake. A code above was bad as it
> > added check for single instruction that takes a cycle. When difference
> > between implementations is few cycles then each cycle matter (otherwise
> > you should just stick to generic one). Then a hwcap check itself causes
> > slowdown that matters and you should use ifunc to eliminate.
> > 
> > Or hope that its moved out of loop, but when its loop with 100
> > iterations a __builtin_cpu_supports time becomes imaterial.
> > 
> > 
> >> I just do not understand why there is so much resistance to this simple
> >> platform ABI specific request.
> > 
> >
  
Richard Henderson June 30, 2015, 6:49 a.m. UTC | #52
On 06/29/2015 07:37 PM, Steven Munroe wrote:
> On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
>> On 06/09/2015 04:06 PM, Steven Munroe wrote:
>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>>>
>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>>>> The proposed patch adds a new feature for powerpc. In order to get
>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>>>> This enables users to write versioned code based on the HWCAP bits
>>>>> without going through the overhead of reading them from the auxiliary
>>>>> vector.
>>>
>>>> i assume this is for multi-versioning.
>>>
>>> The intent is for the compiler to implement the equivalent of
>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
>>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
>>> efficiently as getauxv and scanning the auxv is too slow for inline
>>> optimizations.
>>>
>>
>> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
>> the variables private to glibc that already contain this information.  That
>> ought to be fast enough for the builtin, rather than consuming space in the TCB.
>>
>
> Richard I do not understand how a 38 instruction function accessed via a
> PLT call stub (minimum 4 additional instructions) is equivalent or "as
> good as" a single in-line load instruction.
>
> Even with best case path for getauxval HWCAP2 we are at 14 instructions
> with exposure to 3 different branch miss predicts. And that is before
> the application can execute its own __builtin_cpu_supports() test.
>
> Lets look at a real customer example. The customer wants to use the P8
> 128-bit add/sub but also wants to be able to unit test code on existing
> P7 machines. Which results in something like this:
>
> static inline vui32_t
> vec_addcuq (vui32_t a, vui32_t b)
> {
>          vui32_t t;
>
>                  if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
>                  {
>
>                          __asm__(
>                              "vaddcuq %0,%1,%2;"
>                              : "=v" (t)
>                              : "v" (a),
>                                "v" (b)
>                              : );
...
>
> So it is clear to me that executing 14+ instruction to decide if I can
> optimize to use new single instruction optimization is not a good deal.

This is a horrible way to use this builtin. In the same way that using ifunc at 
this level would also be horrible.

Even supposing that this builtin uses a single load, you've at least doubled 
the overhead of using the insn. The user really should be aware of this and 
manually hoist this check much farther up the call chain.  At which point the 
difference between 2 cycles for a load and 40 cycles for a call is immaterial.

And if the user is really concerned about unit tests, surely ifdefs are more 
appropriate for this situation.  At the moment one can only test the P7 path on 
P7 and the P8 path on P8; better if one can also test the P7 path on P8.


r~
  
Adhemerval Zanella Netto June 30, 2015, 2:09 p.m. UTC | #53
On 30-06-2015 00:14, Ondřej Bílka wrote:
> On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 29-06-2015 18:18, Ondřej Bílka wrote:
>>> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>>>> Lets look at a real customer example. The customer wants to use the P8
>>>> 128-bit add/sub but also wants to be able to unit test code on existing
>>>> P7 machines. Which results in something like this:
>>>>
>>>> static inline vui32_t
>>>> vec_addcuq (vui32_t a, vui32_t b)
>>>> {
>>>>         vui32_t t;
>>>>
>>>>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
>>>>                 {
>>>>                 
>>>>                         __asm__(
>>>>                             "vaddcuq %0,%1,%2;"
>>>>                             : "=v" (t)
>>>>                             : "v" (a),
>>>>                               "v" (b)
>>>>                             : );
>>>>                 }
>>>>                 else
>>>>                         vui32_t c, c2, co;
>>>>                         vui32_t z= {0,0,0,0};
>>>>                         __asm__(
>>>>                             "vaddcuw %3,%4,%5;\n"
>>>>                             "\tvadduwm %0,%4,%5;\n"
>>>>                             "\tvsldoi %1,%3,%6,4;\n"
>>>>                             "\tvaddcuw %2,%0,%1;\n"
>>>>                             "\tvadduwm %0,%0,%1;\n"
>>>>                             "\tvor %3,%3,%2;\n"
>>>>                             "\tvsldoi %1,%2,%6,4;\n"
>>>>                             "\tvaddcuw %2,%0,%1;\n"
>>>>                             "\tvadduwm %0,%0,%1;\n"
>>>>                             "\tvor %3,%3,%2;\n"
>>>>                             "\tvsldoi %1,%2,%6,4;\n"
>>>>                             "\tvadduwm %0,%0,%1;\n"
>>>>                             : "=&v" (t), /* 0 */
>>>>                               "=&v" (c), /* 1 */
>>>>                               "=&v" (c2), /* 2 */
>>>>                               "=&v" (co) /* 3 */
>>>>                             : "v" (a), /* 4 */
>>>>                               "v" (b), /* 5 */
>>>>                               "v" (z)  /* 6 */
>>>>                             : );
>>>>                         t = co;
>>>>                 }
>>>>         return (t);
>>>> }
>>>>
>>>> So it is clear to me that executing 14+ instruction to decide if I can
>>>> optimize to use new single instruction optimization is not a good deal.
>>>>
>>> No, this is prime example that average programmers shouldn't use hwcap
>>> as that results in moronic code like this.
>>>
>>> When you poorly reinvent wheel you will get terrible performance like
>>> fallback here. Gcc already has 128 ints so tell average programmers to
>>> use them instead and don't touch features that they don't understand.
>>
>> Again your patronizing tone only shows your lack of knowledge in this
>> subject: the above code aims to use ISA 2.07
> 
> Sorry, but could you explain how did you come to conclusion to use ISA
> 2.07? Only check done is

Because 'vaddcuq' is ISA 2.07 *only* and I think Steve has made a mistake
here, the test should be __builtin_cpu_supports("PPC_FEATURE2_ARCH_2_07").

> 
>>>>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
> 
> And vsx is part of ISA 2.06
> 
>> *vector* instructions to
>> multiply 128-bits integer in vector *registers*.
> 
> Your sentence has three problems.
> 1. From start of mail.
>>> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>>>> Lets look at a real customer example. The customer wants to use the P8
>>>> 128-bit add/sub
> 2. Function is named vec_addcuq so accoring to name it does...? I guess
> division.
> 
> 3. Power isa describes these instructions pretty clearly:
> 
> Vector Add and Write Carry-Out Unsigned
> Word                                        VX-form
> vaddcuw      VRT,VRA,VRB
>     4        VRT      VRA      VRB          384
> 0         6        11       16       21                31
> do i=0 to 127 by 32
>    aop      EXTZ((VRA)i:i+31)
>    bop      EXTZ((VRB)i:i+31)
>    VRTi:i+31    Chop( ( aop +int bop ) >>ui 32,1)
> For each vector element i from 0 to 3, do the following.
>     Unsigned-integer word element i in VRA is added
>     to unsigned-integer word element i in VRB. The
>     carry out of the 32-bit sum is zero-extended to 32
>     bits and placed into word element i of VRT.
> Special Registers Altered:
>      None
> 
> If you still believe that it somehow does multiplication just try this
> and see that result is all zeroes.
> 
>    __vector uint32_t x={3,2,0,3},y={0,0,0,0};
>    y = vec_addcuq(x,x);
>    printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);
> 
> Again your patronizing tone only shows your lack of knowledge of powerpc
> assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/

Seriously, you need to start admitting your lack of knowledge in PowerISA
(I am meant addition instead of multiplication, my mistake).  And repeating
myself to prove a point only makes you childish, I am not competing with
you.

> 
> 
> I did mistake that I read to bit fast and seen only add instead of
> instruction to get carry. Still thats with gpr two additions with carry,
> then add zero with carry to set desired bit. 
> 
>> It has nothing to do
>> with uint128_t support on GCC and only recently GCC added support to 
>> such builtins [1]. And although there is plan to add support to use 
>> vector instruction for uint128_t, right now they are done in GRP register
>> in powerpc.
>>
> Customer just wants to do 128 additions. If a fastest way
> is with GPR registers then he should use gpr registers.
> 
> My claim was that this leads to slow code on power7. Fallback above
> takes 14 cycles on power8 and 128bit addition is similarly slow.
> 
> Yes you could craft expressions that exploit vectors by doing ands/ors
> with 128bit constants but if you mostly need to sum integers and use 128
> bits to prevent overflows then gpr is correct choice due to transfer
> cost.

Again this is something, as Steve has pointed out, you only assume without
knowing the subject in depth: it is operating on *vector* registers and
thus it will be more costly to move to and back GRP than just do in
VSX registers.  And as Steven has pointed out, the idea is to *validate*
on POWER7.

> 
>> Also, it is up to developers to select the best way to use the CPU
>> features.  Although I am not very found of providing the hwcap in TCB
>> (my suggestion was to use local __thread in libgcc instead), the idea
>> here is to provide *tools*.
>>
> If you want to provide tools then you should try to make best tool
> possible instead of being satisfied with tool that poorly fits job and
> is dangerous to use.
> 
> I am telling all time that there are better alternatives where this
> doesn't matter.
> 
> One example would be write gcc pass that runs after early inlining to
> find all functions containing __builtin_cpu_supports, cloning them to
> replace it by constant and adding ifunc to automatically select variant.

Using internal PLT calls to such mechanism is really not the way to handle
performance for powerpc.  

> 
> You would also need to keep list of existing processor features to
> remove nonexisting combinations. That easiest way to avoid combinatorial
> explosion.
> 
> 
> 
>  
>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>>
>>>
>>> As gcc compiles addition into pair of addc, adde instructions a
>>> performance gain is minimal while code is harder to maintain. Due to
>>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
>>> on following example on power8.
>>>
>>>
>>> int main()
>>> {
>>>   unsigned long i;
>>>   __int128 u = 0;
>>> //long u = 0;
>>>   for (i = 0; i < 1000000000; i++)
>>>     u += i * i;
>>>   return u >> 35;
>>> }
>>>
>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>
>>> real	0m0.957s
>>> user	0m0.956s
>>> sys	0m0.001s
>>>
>>> [neleai@gcc2-power8 ~]$ vim uu.c 
>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>
>>> real	0m1.040s
>>> user	0m1.039s
>>> sys	0m0.001s
>>
>> This is due the code is not using any vector instruction, which is the aim of the
>> code snippet Steven has posted.
> 
> Wait do you want to have fast code or just show off your elite skills
> with vector registers?

What does it have to do with vectors? I just saying that in split-core mode
the CPU group dispatches are statically allocated for the eight threads
and thus pipeline gain are lower.  And indeed it was not the case for the
example (I rushed without doing the math, my mistake again).

> 
> A vector 128bit addition is on power7 lot slower than 128bit addition in
> gpr. This is valid use case when I produce 64bit integers and want to
> compute their sum in 128bit variable. You could construct lot of use
> cases where gpr wins, for example summing an array(possibly with applied
> arithmetic expression).
> 
> Unless you show real world examples how could you prove that vector
> registers are better choice?

How said they are better? As Steve has pointed out, *you* assume it, the
idea afaik is only to be able to *validate* the code on a POWER7 machine.

Anyway, I will conclude again because I am not in the mood to get back
at this subject (you can be the big boy and have the final line).
I tend to see the TCB is not the way to accomplish it, but not for
performance reasons.  My main issue is tie compiler code generation ABI
with runtime in a way it should be avoided (for instance implementing it
on libgcc).  And your performance analysis mostly do not hold true for
powerpc.

> 
> 
> 
>>  Also, it really depends in which mode the CPU is
>> set, on a POWER split-core mode, where the CPU dispatch groups are shared among
>> threads in an non-dynamic way the difference is bigger:
>>
>> [[fedora@glibc-ppc64le ~]$ time ./test
>>
>> real    0m1.730s
>> user    0m1.726s
>> sys     0m0.003s
>> [fedora@glibc-ppc64le ~]$ time ./test-long
>>
>> real    0m1.593s
>> user    0m1.591s
>> sys     0m0.002s
>>
> Difference? What difference? Only ratio matters to remove things like
> different frequency of processors and that thread sharing slows you down
> by constant. When I do math difference between these two ratios is 0.06%
> 
> 1.593/1.730 = 0.9208092485549133
> 0.957/1.040 = 0.9201923076923076
> 
> 
> 
>>>
>>>
>>>  
>>>> One instruction (plus the __builtin_cpu_supports which should be and
>>>> immediate,  branch conditional) is a better deal. Inlining so the
>>>> compiler can do common sub-expression about larger blocks is an even
>>>> better deal.
>>>>
>>> That doesn't change fact that its mistake. A code above was bad as it
>>> added check for single instruction that takes a cycle. When difference
>>> between implementations is few cycles then each cycle matter (otherwise
>>> you should just stick to generic one). Then a hwcap check itself causes
>>> slowdown that matters and you should use ifunc to eliminate.
>>>
>>> Or hope that its moved out of loop, but when its loop with 100
>>> iterations a __builtin_cpu_supports time becomes imaterial.
>>>
>>>
>>>> I just do not understand why there is so much resistance to this simple
>>>> platform ABI specific request.
>>>
>>>
  
Steven Munroe June 30, 2015, 3:07 p.m. UTC | #54
On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> >>>>
> >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> >>>>> The proposed patch adds a new feature for powerpc. In order to get
> >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> >>>>> This enables users to write versioned code based on the HWCAP bits
> >>>>> without going through the overhead of reading them from the auxiliary
> >>>>> vector.
> >>>
> >>>> i assume this is for multi-versioning.
> >>>
> >>> The intent is for the compiler to implement the equivalent of
> >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> >>> efficiently as getauxv and scanning the auxv is too slow for inline
> >>> optimizations.
> >>>
> >>
> >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> >> the variables private to glibc that already contain this information.  That
> >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> >>
> >
> > Richard I do not understand how a 38 instruction function accessed via a
> > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > good as" a single in-line load instruction.
> >
> > Even with best case path for getauxval HWCAP2 we are at 14 instructions
> > with exposure to 3 different branch miss predicts. And that is before
> > the application can execute its own __builtin_cpu_supports() test.
> >
> > Lets look at a real customer example. The customer wants to use the P8
> > 128-bit add/sub but also wants to be able to unit test code on existing
> > P7 machines. Which results in something like this:
> >
> > static inline vui32_t
> > vec_addcuq (vui32_t a, vui32_t b)
> > {
> >          vui32_t t;
> >
> >                  if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
> >                  {
> >
> >                          __asm__(
> >                              "vaddcuq %0,%1,%2;"
> >                              : "=v" (t)
> >                              : "v" (a),
> >                                "v" (b)
> >                              : );
> ...
> >
> > So it is clear to me that executing 14+ instruction to decide if I can
> > optimize to use new single instruction optimization is not a good deal.
> 
> This is a horrible way to use this builtin. In the same way that using ifunc at 
> this level would also be horrible.
> 
Yes it is just an example, there are many more that you might find less
objectionable. But this is not about you or me.

> Even supposing that this builtin uses a single load, you've at least doubled 
> the overhead of using the insn. The user really should be aware of this and 
> manually hoist this check much farther up the call chain.  At which point the 
> difference between 2 cycles for a load and 40 cycles for a call is immaterial.
>
> And if the user is really concerned about unit tests, surely ifdefs are more 
> appropriate for this situation.  At the moment one can only test the P7 path on 
> P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> 

Yes I know there are better alternatives.

This is not intended for use within GLIBC or by knowledgeable folks like
yourself and the GLIBC community. 

This about application developers in other communities and users where I
would settle for them to just support my platform with any optimization
that is somewhat sane.

The __Builtin_cpu_supports exist and I see its use. I don't see much use
of the more "complicated" approaches that we use in GLIBC. So it seem
reasonable to enable __builtin_cpu_supports for POWER but define the
implementation to be optimal for the PowerPC platform.

Most the argument against seems to be based on assumed "moral hazard".
Where you think what they are doing is stupid and so you refuse to help
them with any mechanisms that might make what they are doing, and will
continue to do, a little less stupid.

I appreciate the concern, but think this is odd position for a community
that uses phrases like "Free as in Freedom" to describe what they do.

I think it is better to help all communities do things in a less stupid
(more functional and better performance) way. 

> 
> r~
>
  
Torvald Riegel June 30, 2015, 4:01 p.m. UTC | #55
On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote:
> On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> > On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > >>>>
> > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > >>>>> The proposed patch adds a new feature for powerpc. In order to get
> > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > >>>>> This enables users to write versioned code based on the HWCAP bits
> > >>>>> without going through the overhead of reading them from the auxiliary
> > >>>>> vector.
> > >>>
> > >>>> i assume this is for multi-versioning.
> > >>>
> > >>> The intent is for the compiler to implement the equivalent of
> > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > >>> efficiently as getauxv and scanning the auxv is too slow for inline
> > >>> optimizations.
> > >>>
> > >>
> > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> > >> the variables private to glibc that already contain this information.  That
> > >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> > >>
> > >
> > > Richard I do not understand how a 38 instruction function accessed via a
> > > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > > good as" a single in-line load instruction.
> > >
> > > Even with best case path for getauxval HWCAP2 we are at 14 instructions
> > > with exposure to 3 different branch miss predicts. And that is before
> > > the application can execute its own __builtin_cpu_supports() test.
> > >
> > > Lets look at a real customer example. The customer wants to use the P8
> > > 128-bit add/sub but also wants to be able to unit test code on existing
> > > P7 machines. Which results in something like this:
> > >
> > > static inline vui32_t
> > > vec_addcuq (vui32_t a, vui32_t b)
> > > {
> > >          vui32_t t;
> > >
> > >                  if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”))
> > >                  {
> > >
> > >                          __asm__(
> > >                              "vaddcuq %0,%1,%2;"
> > >                              : "=v" (t)
> > >                              : "v" (a),
> > >                                "v" (b)
> > >                              : );
> > ...
> > >
> > > So it is clear to me that executing 14+ instruction to decide if I can
> > > optimize to use new single instruction optimization is not a good deal.
> > 
> > This is a horrible way to use this builtin. In the same way that using ifunc at 
> > this level would also be horrible.
> > 
> Yes it is just an example, there are many more that you might find less
> objectionable.

Could you give more examples that give a clearer picture why you think
that the concern that Richard raised isn't valid?  Especially this here
regarding 2 vs 40 cycles:

> > Even supposing that this builtin uses a single load, you've at least doubled 
> > the overhead of using the insn. The user really should be aware of this and 
> > manually hoist this check much farther up the call chain.  At which point the 
> > difference between 2 cycles for a load and 40 cycles for a call is immaterial.
> >
> > And if the user is really concerned about unit tests, surely ifdefs are more 
> > appropriate for this situation.  At the moment one can only test the P7 path on 
> > P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> > 
> 
> Yes I know there are better alternatives.
> 
> This is not intended for use within GLIBC or by knowledgeable folks like
> yourself and the GLIBC community. 
> 
> This about application developers in other communities and users where I
> would settle for them to just support my platform with any optimization
> that is somewhat sane.
> 
> The __Builtin_cpu_supports exist and I see its use. I don't see much use
> of the more "complicated" approaches that we use in GLIBC. So it seem
> reasonable to enable __builtin_cpu_supports for POWER but define the
> implementation to be optimal for the PowerPC platform.
> 
> Most the argument against seems to be based on assumed "moral hazard".
> Where you think what they are doing is stupid and so you refuse to help
> them with any mechanisms that might make what they are doing, and will
> continue to do, a little less stupid.

I didn't understand Richard's concerns to be about that.  Rather, it
seemed to me he's concerned about supporting use cases that only mean
technical debt for us; if there is a much simpler way on the users' side
to do this right, we have to see whether we get a good balance between
technical debt and benefits for some users.

> I appreciate the concern, but think this is odd position for a community
> that uses phrases like "Free as in Freedom" to describe what they do.

I don't think we promise to do everything for everyone.  That does not
conflict with free software.
  
Richard Henderson June 30, 2015, 6:08 p.m. UTC | #56
On 06/30/2015 04:07 PM, Steven Munroe wrote:
> On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> This is not intended for use within GLIBC or by knowledgeable folks like
> yourself and the GLIBC community.
>
> This about application developers in other communities and users where I
> would settle for them to just support my platform with any optimization
> that is somewhat sane.
>
> The __Builtin_cpu_supports exist and I see its use. I don't see much use
> of the more "complicated" approaches that we use in GLIBC. So it seem
> reasonable to enable __builtin_cpu_supports for POWER but define the
> implementation to be optimal for the PowerPC platform.
>
> Most the argument against seems to be based on assumed "moral hazard".
> Where you think what they are doing is stupid and so you refuse to help
> them with any mechanisms that might make what they are doing, and will
> continue to do, a little less stupid.

No, this is mostly an argument against adding a new dependency between glibc 
and gcc, at a specific glibc version, which cannot be checked via symbol 
versioning.

On the other hand, there is an alternative way to implement what you want that, 
while a factor of 20 slower is not "slow" when the interface is used at the 
"approprate" level.  And further, that the interface *is* handled by symbol 
versioning, and is also present in an older version of glibc and so the gcc 
feature is usable on more systems.

Surely that's a consideration worth a counter-argument?


r~
  
Steven Munroe June 30, 2015, 8:02 p.m. UTC | #57
On Tue, 2015-06-30 at 18:01 +0200, Torvald Riegel wrote:
> On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote:
> > On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> > > On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > > >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > >>>>
> > > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > >>>>> The proposed patch adds a new feature for powerpc. In order to get
> > > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > >>>>> This enables users to write versioned code based on the HWCAP bits
> > > >>>>> without going through the overhead of reading them from the auxiliary
> > > >>>>> vector.
> > > >>>
> > > >>>> i assume this is for multi-versioning.
> > > >>>
> > > >>> The intent is for the compiler to implement the equivalent of
> > > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > >>> efficiently as getauxv and scanning the auxv is too slow for inline
> > > >>> optimizations.
> > > >>>
> > > >>
> > > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> > > >> the variables private to glibc that already contain this information.  That
> > > >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> > > >>
> > > >
> > > > Richard I do not understand how a 38 instruction function accessed via a
> > > > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > > > good as" a single in-line load instruction.
> > > >
snip
> > > This is a horrible way to use this builtin. In the same way that using ifunc at 
> > > this level would also be horrible.
> > > 
> > Yes it is just an example, there are many more that you might find less
> > objectionable.
> 
> Could you give more examples that give a clearer picture why you think
> that the concern that Richard raised isn't valid?  Especially this here
> regarding 2 vs 40 cycles:
> 
I don't see where Richard raise a 2 vs 40 cycle comparison. I think it
was my comment that getauxval was to heavy (38+4 insturctions) for this
and similar cases.

The classic case was Decimal Floating Point when we introduced HW DFP in
POWER6 but needed to support older systems. In many cases we had to
choose between a single DFP instruction and calling software emulation.
The difference are 10-100 to 1 in performance.

DFP was heavily used in DB2, Oracle, SAP, but even though I provided CPU
tuned library (using the AT_PLATFORM dynamic library search capability)
implementation they refused to use it because it was Linux specific.
They did use a cruder version of __builtin_cpu_supports() in the
"portable" implementation they got from IBM research.

There are also enough differences between POWER7 and POWER8 Vector
capabilities (POWER8 added 120 new instructions) to cause the High
Performance Computing Folks fits. While they are more likely to use
#ifdef _ARCH_PWR8 (then most application developers) they still "want
to" provide a "single binary build" supporting multiple machines and
distros. And they don't seem too interested in exotic techniques like
IFUNC.


> > > Even supposing that this builtin uses a single load, you've at least doubled 
> > > the overhead of using the insn. The user really should be aware of this and 
> > > manually hoist this check much farther up the call chain.  At which point the 
> > > difference between 2 cycles for a load and 40 cycles for a call is immaterial.
> > >
> > > And if the user is really concerned about unit tests, surely ifdefs are more 
> > > appropriate for this situation.  At the moment one can only test the P7 path on 
> > > P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> > > 
> > 
> > Yes I know there are better alternatives.

But 4-6 cycles beats 20-40 cycles every time.
> > 
> > This is not intended for use within GLIBC or by knowledgeable folks like
> > yourself and the GLIBC community. 
> > 
> > This about application developers in other communities and users where I
> > would settle for them to just support my platform with any optimization
> > that is somewhat sane.
> > 
> > The __Builtin_cpu_supports exist and I see its use. I don't see much use
> > of the more "complicated" approaches that we use in GLIBC. So it seem
> > reasonable to enable __builtin_cpu_supports for POWER but define the
> > implementation to be optimal for the PowerPC platform.
> > 
> > Most the argument against seems to be based on assumed "moral hazard".
> > Where you think what they are doing is stupid and so you refuse to help
> > them with any mechanisms that might make what they are doing, and will
> > continue to do, a little less stupid.
> 
> I didn't understand Richard's concerns to be about that.  Rather, it
> seemed to me he's concerned about supporting use cases that only mean
> technical debt for us; if there is a much simpler way on the users' side
> to do this right, we have to see whether we get a good balance between
> technical debt and benefits for some users.
> 
I do not understand this.

How is this any different then stack_guard and __private_ss (split
stack) which exist in X86_64 as well. These are non-version accesses to
the TCB from GCC. 

Both have added fields for TM support. And we have already added the EBB
and DSO/TAR support entries specific to PowerISA, while X86_64 contains
a number of TCB fields specific to SSE and AVX extensions.

Nothing new here folks just the Platform extending its own ABI to solve
platform specific problems within the toolchain.

Note I deliberately said ABI and Toolchain (the combined compiler (GCC,
LLVM, ...) linker (Binutils), Dynamic Linker and Posix Runtime (GLIBC))
not just GLIBC.

So the TCB, like the stack layout, register conventions, calling
conventions, etc is full of fixed (non-versioned) offsets; defined,
controlled, and documented by the ABI, known and used as necessarily by
toolchain components to implement the ABI. That is how this stuff works!

Yes TCB field offsets can never change and this was established long ago
in the original ABI and platform supplements.

This is why I argue that this proposal does not imply any "technical
debt" on GLIBC but just part of the ongoing evolution of the ABI as
implemented by the complete toolchain stack.

It is the responsibility of the Platform ABI owner to coordinate
implementation of the ABI to the various tool-chain components and
communities, including GLIBC.

It turns out GLIBC is the best place to get the hwcap fields initialized
for each thread. 


> > I appreciate the concern, but think this is odd position for a community
> > that uses phrases like "Free as in Freedom" to describe what they do.
> 
> I don't think we promise to do everything for everyone.  That does not
> conflict with free software.
>
  
Ondrej Bilka June 30, 2015, 9:15 p.m. UTC | #58
On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 30-06-2015 00:14, Ondřej Bílka wrote:
> > On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
 > 
> > If you still believe that it somehow does multiplication just try this
> > and see that result is all zeroes.
> > 
> >    __vector uint32_t x={3,2,0,3},y={0,0,0,0};
> >    y = vec_addcuq(x,x);
> >    printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);
> > 
> > Again your patronizing tone only shows your lack of knowledge of powerpc
> > assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/
> 
> Seriously, you need to start admitting your lack of knowledge in PowerISA
> (I am meant addition instead of multiplication, my mistake).  And repeating
> myself to prove a point only makes you childish, I am not competing with
> you.
> 
It sound exactly as silly as your critique that was based on lie. Now
you are saying: Oops my mistake. But I was rigth. To see if one is rigth
or wrong is to present evidence. So whats yours?

> > 
> > 
> > I did mistake that I read to bit fast and seen only add instead of
> > instruction to get carry. Still thats with gpr two additions with carry,
> > then add zero with carry to set desired bit. 
> > 
> >> It has nothing to do
> >> with uint128_t support on GCC and only recently GCC added support to 
> >> such builtins [1]. And although there is plan to add support to use 
> >> vector instruction for uint128_t, right now they are done in GRP register
> >> in powerpc.
> >>
> > Customer just wants to do 128 additions. If a fastest way
> > is with GPR registers then he should use gpr registers.
> > 
> > My claim was that this leads to slow code on power7. Fallback above
> > takes 14 cycles on power8 and 128bit addition is similarly slow.
> > 
> > Yes you could craft expressions that exploit vectors by doing ands/ors
> > with 128bit constants but if you mostly need to sum integers and use 128
> > bits to prevent overflows then gpr is correct choice due to transfer
> > cost.
> 
> Again this is something, as Steve has pointed out, you only assume without
> knowing the subject in depth: it is operating on *vector* registers and
> thus it will be more costly to move to and back GRP than just do in
> VSX registers.  And as Steven has pointed out, the idea is to *validate*
> on POWER7.

If that is really case then using hwcap for that makes absolutely no sense.
Just surround these builtins by #ifdef TESTING and you will compile
power7 binary. When you are releasing production version you will
optimize that for power8. A difference from just using correct -mcpu
could dominate speedups that you try to get with these builtins. Slowing
down production application for validation support makes no sense.


Also you didn't answered my question, it works in both ways. 
From that example his uses vector register doesn't follow that 
application should use vector registers. If user does
something like in my example, the cost of gpr -> vector conversion will
harm performance and he should keep these in gpr. 






> > 
> >> Also, it is up to developers to select the best way to use the CPU
> >> features.  Although I am not very found of providing the hwcap in TCB
> >> (my suggestion was to use local __thread in libgcc instead), the idea
> >> here is to provide *tools*.
> >>
> > If you want to provide tools then you should try to make best tool
> > possible instead of being satisfied with tool that poorly fits job and
> > is dangerous to use.
> > 
> > I am telling all time that there are better alternatives where this
> > doesn't matter.
> > 
> > One example would be write gcc pass that runs after early inlining to
> > find all functions containing __builtin_cpu_supports, cloning them to
> > replace it by constant and adding ifunc to automatically select variant.
> 
> Using internal PLT calls to such mechanism is really not the way to handle
> performance for powerpc.  
> 
No you are wrong again. I wrote to introduce ifunc after inlining. You
do inlining to eliminate call overhead. So after inlining effect of
adding plt call is minimal, otherwise gcc should inline that to improve
performance in first place.

Also why are you so sure that its code in main binary and not code in
shared library?

> > 
> > You would also need to keep list of existing processor features to
> > remove nonexisting combinations. That easiest way to avoid combinatorial
> > explosion.
> > 
> > 
> > 
> >  
> >> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
> >>
> >>>
> >>> As gcc compiles addition into pair of addc, adde instructions a
> >>> performance gain is minimal while code is harder to maintain. Due to
> >>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> >>> on following example on power8.
> >>>
> >>>
> >>> int main()
> >>> {
> >>>   unsigned long i;
> >>>   __int128 u = 0;
> >>> //long u = 0;
> >>>   for (i = 0; i < 1000000000; i++)
> >>>     u += i * i;
> >>>   return u >> 35;
> >>> }
> >>>
> >>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> >>> [neleai@gcc2-power8 ~]$ time ./a.out 
> >>>
> >>> real	0m0.957s
> >>> user	0m0.956s
> >>> sys	0m0.001s
> >>>
> >>> [neleai@gcc2-power8 ~]$ vim uu.c 
> >>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> >>> [neleai@gcc2-power8 ~]$ time ./a.out 
> >>>
> >>> real	0m1.040s
> >>> user	0m1.039s
> >>> sys	0m0.001s
> >>
> >> This is due the code is not using any vector instruction, which is the aim of the
> >> code snippet Steven has posted.
> > 
> > Wait do you want to have fast code or just show off your elite skills
> > with vector registers?
> 
> What does it have to do with vectors? I just saying that in split-core mode
> the CPU group dispatches are statically allocated for the eight threads
> and thus pipeline gain are lower.  And indeed it was not the case for the
> example (I rushed without doing the math, my mistake again).
>
And you are telling that in majority of time contested threads would be
problem? Do you have statistic how often that happens?

Then I would be more worried about vector implementation than gpr one.
It goes both ways. A slowdown in gpr code is relatively unlikely for
simple economic reasons: As addition, shifts... are frequent
intstruction one of best performance/silicon tradeoff is add more
execution units that do that until slowdown become unlikely. On other
hand for rarely used instructions that doesn't make sense so I wouldn't
be much surprised that when all threads would do 128bit vector addition it 
would get slow as they contest only one execution unit that could do
that. 



> > 
> > A vector 128bit addition is on power7 lot slower than 128bit addition in
> > gpr. This is valid use case when I produce 64bit integers and want to
> > compute their sum in 128bit variable. You could construct lot of use
> > cases where gpr wins, for example summing an array(possibly with applied
> > arithmetic expression).
> > 
> > Unless you show real world examples how could you prove that vector
> > registers are better choice?
> 
> How said they are better? As Steve has pointed out, *you* assume it, the
> idea afaik is only to be able to *validate* the code on a POWER7 machine.
> 
> Anyway, I will conclude again because I am not in the mood to get back
> at this subject (you can be the big boy and have the final line).
> I tend to see the TCB is not the way to accomplish it, but not for
> performance reasons.  My main issue is tie compiler code generation ABI
> with runtime in a way it should be avoided (for instance implementing it
> on libgcc).  And your performance analysis mostly do not hold true for
> powerpc.
> 
You could repeat it but could you prove it?

> > 
> > 
> > 
> >>  Also, it really depends in which mode the CPU is
> >> set, on a POWER split-core mode, where the CPU dispatch groups are shared among
> >> threads in an non-dynamic way the difference is bigger:
> >>
> >> [[fedora@glibc-ppc64le ~]$ time ./test
> >>
> >> real    0m1.730s
> >> user    0m1.726s
> >> sys     0m0.003s
> >> [fedora@glibc-ppc64le ~]$ time ./test-long
> >>
> >> real    0m1.593s
> >> user    0m1.591s
> >> sys     0m0.002s
> >>
> > Difference? What difference? Only ratio matters to remove things like
> > different frequency of processors and that thread sharing slows you down
> > by constant. When I do math difference between these two ratios is 0.06%
> > 
> > 1.593/1.730 = 0.9208092485549133
> > 0.957/1.040 = 0.9201923076923076
> > 
> > 
> > 
> >>>
> >>>
> >>>  
> >>>> One instruction (plus the __builtin_cpu_supports which should be and
> >>>> immediate,  branch conditional) is a better deal. Inlining so the
> >>>> compiler can do common sub-expression about larger blocks is an even
> >>>> better deal.
> >>>>
> >>> That doesn't change fact that its mistake. A code above was bad as it
> >>> added check for single instruction that takes a cycle. When difference
> >>> between implementations is few cycles then each cycle matter (otherwise
> >>> you should just stick to generic one). Then a hwcap check itself causes
> >>> slowdown that matters and you should use ifunc to eliminate.
> >>>
> >>> Or hope that its moved out of loop, but when its loop with 100
> >>> iterations a __builtin_cpu_supports time becomes imaterial.
> >>>
> >>>
> >>>> I just do not understand why there is so much resistance to this simple
> >>>> platform ABI specific request.
> >>>
> >>>
  
Adhemerval Zanella Netto June 30, 2015, 9:46 p.m. UTC | #59
On 30-06-2015 18:15, Ondřej Bílka wrote:
> On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 30-06-2015 00:14, Ondřej Bílka wrote:
>>> On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
>  > 
>>> If you still believe that it somehow does multiplication just try this
>>> and see that result is all zeroes.
>>>
>>>    __vector uint32_t x={3,2,0,3},y={0,0,0,0};
>>>    y = vec_addcuq(x,x);
>>>    printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);
>>>
>>> Again your patronizing tone only shows your lack of knowledge of powerpc
>>> assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/
>>
>> Seriously, you need to start admitting your lack of knowledge in PowerISA
>> (I am meant addition instead of multiplication, my mistake).  And repeating
>> myself to prove a point only makes you childish, I am not competing with
>> you.
>>
> It sound exactly as silly as your critique that was based on lie. Now
> you are saying: Oops my mistake. But I was rigth. To see if one is rigth
> or wrong is to present evidence. So whats yours?

I really do not want to go further on this path, so I will just dropped it.


> 
>>>
>>>
>>> I did mistake that I read to bit fast and seen only add instead of
>>> instruction to get carry. Still thats with gpr two additions with carry,
>>> then add zero with carry to set desired bit. 
>>>
>>>> It has nothing to do
>>>> with uint128_t support on GCC and only recently GCC added support to 
>>>> such builtins [1]. And although there is plan to add support to use 
>>>> vector instruction for uint128_t, right now they are done in GRP register
>>>> in powerpc.
>>>>
>>> Customer just wants to do 128 additions. If a fastest way
>>> is with GPR registers then he should use gpr registers.
>>>
>>> My claim was that this leads to slow code on power7. Fallback above
>>> takes 14 cycles on power8 and 128bit addition is similarly slow.
>>>
>>> Yes you could craft expressions that exploit vectors by doing ands/ors
>>> with 128bit constants but if you mostly need to sum integers and use 128
>>> bits to prevent overflows then gpr is correct choice due to transfer
>>> cost.
>>
>> Again this is something, as Steve has pointed out, you only assume without
>> knowing the subject in depth: it is operating on *vector* registers and
>> thus it will be more costly to move to and back GRP than just do in
>> VSX registers.  And as Steven has pointed out, the idea is to *validate*
>> on POWER7.
> 
> If that is really case then using hwcap for that makes absolutely no sense.
> Just surround these builtins by #ifdef TESTING and you will compile
> power7 binary. When you are releasing production version you will
> optimize that for power8. A difference from just using correct -mcpu
> could dominate speedups that you try to get with these builtins. Slowing
> down production application for validation support makes no sense.

That is a valid point, but as Steve has pointed out the idea is exactly
to avoid multiple builds.

> 
> 
> Also you didn't answered my question, it works in both ways. 
> From that example his uses vector register doesn't follow that 
> application should use vector registers. If user does
> something like in my example, the cost of gpr -> vector conversion will
> harm performance and he should keep these in gpr. 

And again you make assumptions that you do not know: what if the program
is made with vectors in mind and they want to process it as uint128_t if
it is the case?  You do know that neither the program constraints so
assuming that it would be better to use GPR may not hold true.

> 
> 
> 
> 
> 
> 
>>>
>>>> Also, it is up to developers to select the best way to use the CPU
>>>> features.  Although I am not very found of providing the hwcap in TCB
>>>> (my suggestion was to use local __thread in libgcc instead), the idea
>>>> here is to provide *tools*.
>>>>
>>> If you want to provide tools then you should try to make best tool
>>> possible instead of being satisfied with tool that poorly fits job and
>>> is dangerous to use.
>>>
>>> I am telling all time that there are better alternatives where this
>>> doesn't matter.
>>>
>>> One example would be write gcc pass that runs after early inlining to
>>> find all functions containing __builtin_cpu_supports, cloning them to
>>> replace it by constant and adding ifunc to automatically select variant.
>>
>> Using internal PLT calls to such mechanism is really not the way to handle
>> performance for powerpc.  
>>
> No you are wrong again. I wrote to introduce ifunc after inlining. You
> do inlining to eliminate call overhead. So after inlining effect of
> adding plt call is minimal, otherwise gcc should inline that to improve
> performance in first place.

It is the case if you have the function definition, which might not be
true.  But this is not the case since the code could be in a shared
library.

> 
> Also why are you so sure that its code in main binary and not code in
> shared library?
> 
>>>
>>> You would also need to keep list of existing processor features to
>>> remove nonexisting combinations. That easiest way to avoid combinatorial
>>> explosion.
>>>
>>>
>>>
>>>  
>>>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>>>>
>>>>>
>>>>> As gcc compiles addition into pair of addc, adde instructions a
>>>>> performance gain is minimal while code is harder to maintain. Due to
>>>>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
>>>>> on following example on power8.
>>>>>
>>>>>
>>>>> int main()
>>>>> {
>>>>>   unsigned long i;
>>>>>   __int128 u = 0;
>>>>> //long u = 0;
>>>>>   for (i = 0; i < 1000000000; i++)
>>>>>     u += i * i;
>>>>>   return u >> 35;
>>>>> }
>>>>>
>>>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>>>
>>>>> real	0m0.957s
>>>>> user	0m0.956s
>>>>> sys	0m0.001s
>>>>>
>>>>> [neleai@gcc2-power8 ~]$ vim uu.c 
>>>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>>>
>>>>> real	0m1.040s
>>>>> user	0m1.039s
>>>>> sys	0m0.001s
>>>>
>>>> This is due the code is not using any vector instruction, which is the aim of the
>>>> code snippet Steven has posted.
>>>
>>> Wait do you want to have fast code or just show off your elite skills
>>> with vector registers?
>>
>> What does it have to do with vectors? I just saying that in split-core mode
>> the CPU group dispatches are statically allocated for the eight threads
>> and thus pipeline gain are lower.  And indeed it was not the case for the
>> example (I rushed without doing the math, my mistake again).
>>
> And you are telling that in majority of time contested threads would be
> problem? Do you have statistic how often that happens?
> 
> Then I would be more worried about vector implementation than gpr one.
> It goes both ways. A slowdown in gpr code is relatively unlikely for
> simple economic reasons: As addition, shifts... are frequent
> intstruction one of best performance/silicon tradeoff is add more
> execution units that do that until slowdown become unlikely. On other
> hand for rarely used instructions that doesn't make sense so I wouldn't
> be much surprised that when all threads would do 128bit vector addition it 
> would get slow as they contest only one execution unit that could do
> that. 

Seriously, split-core is not really about contested threads, but rather
a way to set the core specially in KVM mode.  But we digress here, since
the idea is not analyse Steve code snippet if this is faster, better, etc;
but rather if hwcap using TCB access is better way to handle such compiler
builtin.

> 
> 
> 
>>>
>>> A vector 128bit addition is on power7 lot slower than 128bit addition in
>>> gpr. This is valid use case when I produce 64bit integers and want to
>>> compute their sum in 128bit variable. You could construct lot of use
>>> cases where gpr wins, for example summing an array(possibly with applied
>>> arithmetic expression).
>>>
>>> Unless you show real world examples how could you prove that vector
>>> registers are better choice?
>>
>> How said they are better? As Steve has pointed out, *you* assume it, the
>> idea afaik is only to be able to *validate* the code on a POWER7 machine.
>>
>> Anyway, I will conclude again because I am not in the mood to get back
>> at this subject (you can be the big boy and have the final line).
>> I tend to see the TCB is not the way to accomplish it, but not for
>> performance reasons.  My main issue is tie compiler code generation ABI
>> with runtime in a way it should be avoided (for instance implementing it
>> on libgcc).  And your performance analysis mostly do not hold true for
>> powerpc.
>>
> You could repeat it but could you prove it?

Again I do not want to go on this patch ...
  
Torvald Riegel June 30, 2015, 10:13 p.m. UTC | #60
On Tue, 2015-06-30 at 23:15 +0200, Ondřej Bílka wrote:
> On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote:
> > Seriously, you need to start admitting your lack of knowledge in PowerISA
> > (I am meant addition instead of multiplication, my mistake).  And repeating
> > myself to prove a point only makes you childish, I am not competing with
> > you.
> > 
> It sound exactly as silly as your critique that was based on lie. Now
> you are saying: Oops my mistake. But I was rigth. To see if one is rigth
> or wrong is to present evidence. So whats yours?

Please, let's all stick to the technical discussion here.  While we may
disagree on what we think is best for glibc, I think it would help if
we'd just assume that everyone tries to just do the best for glibc.
Thanks.
  
Ondrej Bilka July 1, 2015, 11:55 a.m. UTC | #61
On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:

> >> Again this is something, as Steve has pointed out, you only assume without
> >> knowing the subject in depth: it is operating on *vector* registers and
> >> thus it will be more costly to move to and back GRP than just do in
> >> VSX registers.  And as Steven has pointed out, the idea is to *validate*
> >> on POWER7.
> > 
> > If that is really case then using hwcap for that makes absolutely no sense.
> > Just surround these builtins by #ifdef TESTING and you will compile
> > power7 binary. When you are releasing production version you will
> > optimize that for power8. A difference from just using correct -mcpu
> > could dominate speedups that you try to get with these builtins. Slowing
> > down production application for validation support makes no sense.
> 
> That is a valid point, but as Steve has pointed out the idea is exactly
> to avoid multiple builds.
>
And thats exactly problem that you just ignore solution. Seriously when
having single build is more important than -mcpu that will give you 1%
performance boost do you think that a 1% boost from hwcap selection
matters? I could come with easy suggestions like changing makefile to
create app_power7 and app_power8 in single build. And a app_power7 could
check if it supports power8 instruction and exec app_power8. I really
doubt why you insist on single build when a best practice is separate
testing and production.

Insisting that you need single binary would mean that you should stick
with power7 optimization and don't bother with hwcap instruction

 
> > 
> > 
> > Also you didn't answered my question, it works in both ways. 
> > From that example his uses vector register doesn't follow that 
> > application should use vector registers. If user does
> > something like in my example, the cost of gpr -> vector conversion will
> > harm performance and he should keep these in gpr. 
> 
> And again you make assumptions that you do not know: what if the program
> is made with vectors in mind and they want to process it as uint128_t if
> it is the case?  You do know that neither the program constraints so
> assuming that it would be better to use GPR may not hold true.
> 
I didn't make that assumption. 
I just said that your assumption that one must use vector
registers is wrong again. From my previous mail:


> Customer just wants to do 128 additions. If a fastest way
> is with GPR registers then he should use gpr registers.
>
> My claim was that this leads to slow code on power7. Fallback above
> takes 14 cycles on power8 and 128bit addition is similarly slow.
>
> Yes you could craft expressions that exploit vectors by doing ands/ors
> with 128bit constants but if you mostly need to sum integers and use 128
> bits to prevent overflows then gpr is correct choice due to transfer
> cost.

Yes it isn't known but its more likely that programmers just used that
as counter instead of vector magic. So we need to see use case in more
detail.


>> >>> I am telling all time that there are better alternatives where this
> >>> doesn't matter.
> >>>
> >>> One example would be write gcc pass that runs after early inlining to
> >>> find all functions containing __builtin_cpu_supports, cloning them to
> >>> replace it by constant and adding ifunc to automatically select variant.
> >>
> >> Using internal PLT calls to such mechanism is really not the way to handle
> >> performance for powerpc.  
> >>
> > No you are wrong again. I wrote to introduce ifunc after inlining. You
> > do inlining to eliminate call overhead. So after inlining effect of
> > adding plt call is minimal, otherwise gcc should inline that to improve
> > performance in first place.
> 
> It is the case if you have the function definition, which might not be
> true.  But this is not the case since the code could be in a shared
> library.
> 
Seriously? If its function from shared library then it should use ifunc
and not force every caller to keep hwcap selection in sync with library,
and you need plt indirection anyway.

For function definition again get low-hanging fruit and use --lto. It
is really preexisting problem as you will also gain performance by
fixing it in first place.

Also its bit off topic but you don't need internal plt for ifunc as its
implementation detail. You could do it with any ifunc if we decide that eager
resolution is ok.

If plt situation is as bad on power as you claim then you should write
plt elission. Idea is that loader would generate branch
instructions for all used functions instead plt stubs. For autogenerated ifunc gcc
could prepare page for each processor and runtime could do single mmap
acording to hwcap per process.

> > 
> > Also why are you so sure that its code in main binary and not code in
> > shared library?
> > 
Could you answer that as one should put reusable parts of program in
library?


> >>
> >> What does it have to do with vectors? I just saying that in split-core mode
> >> the CPU group dispatches are statically allocated for the eight threads
> >> and thus pipeline gain are lower.  And indeed it was not the case for the
> >> example (I rushed without doing the math, my mistake again).
> >>
> > And you are telling that in majority of time contested threads would be
> > problem? Do you have statistic how often that happens?
> > 
> > Then I would be more worried about vector implementation than gpr one.
> > It goes both ways. A slowdown in gpr code is relatively unlikely for
> > simple economic reasons: As addition, shifts... are frequent
> > intstruction one of best performance/silicon tradeoff is add more
> > execution units that do that until slowdown become unlikely. On other
> > hand for rarely used instructions that doesn't make sense so I wouldn't
> > be much surprised that when all threads would do 128bit vector addition it 
> > would get slow as they contest only one execution unit that could do
> > that. 
> 
> Seriously, split-core is not really about contested threads, but rather
> a way to set the core specially in KVM mode.

I just tried to understand why your example is relevant. I jumped bit
that a split core is equivalent to contested cpu. If you run other
cpu-intensive three threads then you will get similar cpu dispatches as
when you use split-core.

Also you didn't answer my question if split core is used often or its
just corner case. If its less than 1% then we shouldn't optimize that
corner case and you shouldn't post a irrelevant technical detail in
first place.

>  But we digress here, since
> the idea is not analyse Steve code snippet if this is faster, better, etc;
> but rather if hwcap using TCB access is better way to handle such compiler
> builtin.
> 
It is as main objection was if this helps at all. If you don't want to
show that this is better than current state we could conclude:


As this snipped was invalid no example that one needs to often
access hwcap was offered. Existing applications read hwcap once per run. 
So any proposal to optimize hwcap should be dropped as current code
gives reasonable performance.
  
Steven Munroe July 1, 2015, 7:12 p.m. UTC | #62
On Tue, 2015-06-09 at 12:38 -0400, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> > 
> > The proposed patch adds a new feature for powerpc. In order to get
> > faster access to the HWCAP/HWCAP2 bits, we now store them in the
> > TCB. This enables users to write versioned code based on the HWCAP
> > bits without going through the overhead of reading them from the
> > auxiliary vector.
> > 
> > A new API is published in ppc.h for get/set the bits in the
> > aforementioned memory area (mainly for gcc to use to create
> > builtins).
> 
> Do you have any justification (actual performance figures for a
> real-world usage case) for adding ABI constraints like this? This is
> not something that should be done lightly. My understanding is that
> hwcap bits are normally used in initializing functions pointers (or
> equivalent things like ifunc resolvers), not again and again at
> runtime, so I'm having a hard time seeing how this could help even if
> it does make the individual hwcap accesses measurably faster.
> 
> It would also be nice to see some justification for the magic number
> offsets. Will they be stable under changes to the TCB structure or
> will preserving them require tip-toeing around them?
> 

This discussion has metastasizes into so many side discussions, meta
discussion, personal opinions etc that i would like to start over at the
point where we where still discussing how to implement something
reasonable. 

First a level set on requirements and goals.

The intent is allow application developers to develop new application
for Linux on Power and simply the porting of existing Linux applications
to Power. And encourage then to apply the same level of platform
optimization to Power as they do for other Linux platforms.

While there are a near infinity of options (some of which some members
of this community think are stupid) and I have seen them all being used.
As general rule I find it counter productive to call the customer (All
Linux Application Developers are our customers) stupid their face, so I
try to explain the options and encourage them to use many of the
techniques that this community thinks are not stupid.

But as rule the application developer are busy, don't have much patience
for nonsense like IFUNC and AT_PLATFORM library search strategies. They
tend to use what they already know, apply minimal effort to solve the
immediate problem, and move on!

One of the "things they already know" is the __built_cpu_is()
__built_cpu_supports() GCC builtins for X86. To goal of this simple
proposal is enable that for powerpc powerpc64 and powerpc64le, based on
the existing AT_HWCAP/AT_HWCAP2 mechanisms.

Another observation is that many of these applications are deployed as
shared object libraries and frequently are not linked directly to the
main application but loaded via dl_open() are runtime. So clever
solutions that are only simple and/or fast from a main programs but
difficult and/or slow for dl_open() library are not an option.

They are very firm about a "single binary built" for all supported
distros and all supported hardware generations.

And finally these applications tend to be massive C++ programs composed
of smallish members functions and byzantine layers of templates. I have
not observed wide use of private/hidden and so these libraries tend to
expose every member function as a PLT entry, which resists most
in-lining opportunities.

Net this is a harder problem then it looks.

So lets write down some requirements:

0) Something the average application developer will understand and use.
1) In any user library, in including ones loaded via LD_PRELOAD and
dl_open().
2) Across multiple Distro versions and across Distros (using different
GLIBC versions).

And goals for the Power implementation:

1) As fast as possible accounting for the limits of the ABI, ISA and
Micro-architecture.
1a) Minimal path length to obtain the hwcap bit vector for test
1b) Limited exposure to micro-architecture hazards including
indirection.
2) Simple and reliable initialization of the cached values.
3) And without relying on .text relocation in libraries.

First lets dispose of the obvious. Extern static variables.

This is not horrible for larger grained examples but can be less than
optimal for fine grained C++ examples. As stated above the hwcap will
not be local to the user library. As PowerISA does not have PC-relative
addressing our ABI requires that R2 (AKA the TOC pointer) is set to
address the local (for this libraries) GOT/TOC/PLT before we access any
static variable and extern require an indirect load of the extern hwcap
address from the GOT/TOC.

In addition, since we are potentially changing R2 (AKA the TOC pointer)
we are now obligated to save and restore the R2.

Now the design of POWER assumes that as RISC architecture with lots of
registers and being designed for massive memory bandwidth and
out-of-order execution, the processor core does no optimized for
programs that store to then immediately reload from a memory location.
In a machine with 16-pipelines per core and capable of dispatching up to
8 instructions per cycle, "immediate" has an amazingly broad definition
(many 10s of instructions).

So the store and reload of the TOC pointer can hit the Load-hit-store
hazard (essentially the load got issued (out-of-order) before the store
it depended on was complete or at a stage where a bypass was available)
even across the execution of the called function. While the core detects
and corrects this state, it does so in a heavy handed way (instruction
rejects (11 cycles each) or instruction fetch flush (worse)). Lets just
say this is something to avoid if you can.

So introducing a static variable to C++ functions that would not
normally access static should be avoided. Many C++ member functions are
small enough execute completely within the available (volatile) register
and don't even need a stack-frame. So a __builtin_cpu_supports() design
based on none local extern static would be a unforced error in these
cases.

Of course the TCB based proposal avoids all of this because the TCB
pointer (R13) is constant across all functions in a thread (not
save/restored in the user application).

Now for the next obvious case. Which why not a normal TLS variable.

If you think about the requirements for a while it becomes clear. As the
HWCAP cache would have to be defined and initialized in either libgcc or
libc, accept will be none local from any user library. So all the local
TLC access optimization's are disallowed. Add the requirement to support
dl_open() libraries leaves the general dynamic TLS model as the ONLY
safe option.

This requires a up-call to __tls_get_adder plus accessing a couple of
TLS relocations from to GOT. And the __tls_get_addr, which is in
ld64.so.2, which requires a PLT call stub that saves and restores the
TOC-pointer. Remember the previous discussion about TOC save/restore and
exposure to the load-hit-store hazard?

Now there were a lot of suggestions to just force the HWCAP TLS
variables into initial exec or local exec TLS model with an attribute.
This would resolve to direct TLS offset in some special reserved TLS
space?

How does that work with a library loaded with dl_open()? How does that
work with a library linked with one toolchain / GLIBC on Distro X and
run on a system with a different toolchain and GLIBC on Distro Y? With
different versions of GLIBC? Will HWCAP get the same TLS offset? Do we
end up with .text relocations that we are also trying to avoid?

Again the TCB avoids all of this as it provides a fixed offset defined
by the ABI and does not require any up-calls or indirection. And also
will work in any library without induced hazards. This clearly works
across distros including previous version of GLIBC as the words where
previously reserved by the ABI. Application libraries that need to run
on older distros can add a __built_cpu_init() to their library init or
if threaded to their thread create function.
  
Carlos O'Donell July 3, 2015, 3:21 a.m. UTC | #63
On 07/01/2015 03:12 PM, Steven Munroe wrote:
> This discussion has metastasizes into so many side discussions, meta
> discussion, personal opinions etc that i would like to start over at the
> point where we where still discussing how to implement something
> reasonable. 

I want to make few higher level comments as an FSF steward for the glibc project.

* IBM has consistently provided hardware and developer resources to maintain 
  POWER for glibc. IBM is the POWER maintainer, and the ultimate responsibility
  for the machine rests with IBM.  Today that responsibility is with Steven Munroe
  (IBM) who is the POWER maintainer for glibc. The machine maintainership provides
  Steven with a veto for machine-specific features, ABIs and APIs, much like all
  the other machine port maintainers. Steven is expected to use this veto to
  further the goals of the glibc project, and serve the needs of POWER users, and
  balance the two.

* Consensus need not be agreement, it may be that we discuss the options, find
  no sustained opposition, and move forward with a solution. People can disagree
  bitterly and we can still have consensus. Developers can have strong and polarizing
  opinions about exactly how to use the limited resource of `thread-pointer + offset`
  accessible data, but at the end of the day consensus can be reached.

* Healthy and active discussion, like the discussion on this particular topic,
  are good for the community. Topics surrounding optimizations are rife with
  complex tradeoffs, and require discussion, summaries, and a developer to
  champion a position of consensus. I see nothing wrong with this kind of behaviour.
  The discussions should stay on point, be technical, provide feedback, and
  indicate clearly if the comment amounts to sustained opposition.

Cheers,
Carlos.
  
Carlos O'Donell July 3, 2015, 5:05 a.m. UTC | #64
On 06/08/2015 05:03 PM, Carlos Eduardo Seo wrote:
> The proposed patch adds a new feature for powerpc. In order to get
> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> This enables users to write versioned code based on the HWCAP bits
> without going through the overhead of reading them from the auxiliary
> vector.
> 
> A new API is published in ppc.h for get/set the bits in the
> aforementioned memory area (mainly for gcc to use to create
> builtins).
> 
> Testcases for the API functions were also created.
> 
> Tested on ppc32, ppc64 and ppc64le.
> 
> Okay to commit?

(1) Prevent running new applications against old glibc.

You add a new interface to glibc, but provide no way to prevent
new applications that compile with this support from crashing
or behaving badly when run on systems with an older glibc.

Richard Henderson had suggested to me that you could use a dummy
versioned symbol in the code to create a dependency against
GLIBC_2.22 and thus prevent those new binaries from running
on say GLIBC_2.21. You'd never use the versioned symbol for anything.

This would seem a much better way to prevent what will obviously
be a weird failure mode.

Have you considered this failure mode?

At the end of the day it's up to IBM to make the best use of the
tp+offset data stored in the TCB, but every byte you save is another
byte you can use later for something else.

Comments below.
 
> 2015-06-08  Carlos Eduardo Seo  <cseo@linux.vnet.ibm.com>
> 
> 	This patch adds a new feature for powerpc. In order to get faster
> 	access to the HWCAP/HWCAP2 bits, we now store them in the TCB, so
> 	we don't have to deal with the overhead of reading them via the
> 	auxiliary vector. A new API is published in ppc.h for get/set the
> 	bits.

Did you test this for 32-bit, 64-bit, and 64-bit LE?

Static and shared applications? To make sure you got both TLS init paths.

> 	* sysdeps/powerpc/nptl/tcb-offsets.sym: Added new offests
> 	for HWCAP and HWCAP2 in the TCB.
> 	* sysdeps/powerpc/nptl/tls.h: New functionality - stores
> 	the HWCAP and HWCAP2 in the TCB.
> 	(dtv): Added new fields for HWCAP and HWCAP2.
> 	(TLS_INIT_TP): Included calls to add the hwcap/hwcap2
> 	values in the TCB in TP initialization.
> 	(TLS_DEFINE_INIT_TP): Likewise.
> 	(THREAD_GET_HWCAP): New macro.
> 	(THREAD_SET_HWCAP): Likewise.
> 	(THREAD_GET_HWCAP2): Likewise.
> 	(THREAD_SET_HWCAP2): Likewise.
> 	* sysdeps/powerpc/sys/platform/ppc.h: Added new functions
> 	for get/set the HWCAP/HWCAP2 values in the TCB.
> 	(__ppc_get_hwcap): New function.
> 	(__ppc_get_hwcap2): Likewise.
> 	* sysdeps/powerpc/test-get_hwcap.c: Testcase for this
> 	functionality.
> 	* sysdeps/powerpc/test-set_hwcap.c: Testcase for this
> 	functionality.
> 	* sysdeps/powerpc/Makefile: Added testcases to the Makefile.
>

As Joseph pointed out you need to update the manual to describe this new interface.

Added the documentation step to the internals documentation for Platform Headers here:

https://sourceware.org/glibc/wiki/PlatformHeaders

> 
> Index: glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/nptl/tcb-offsets.sym
> +++ glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym
> @@ -20,6 +20,8 @@ TAR_SAVE			(offsetof (tcbhead_t, tar_sav
>  DSO_SLOT1			(offsetof (tcbhead_t, dso_slot1) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>  DSO_SLOT2			(offsetof (tcbhead_t, dso_slot2) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>  TM_CAPABLE			(offsetof (tcbhead_t, tm_capable) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> +TCB_HWCAP			(offsetof (tcbhead_t, hwcap) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> +TCB_HWCAP2			(offsetof (tcbhead_t, hwcap2) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>  #ifndef __ASSUME_PRIVATE_FUTEX
>  PRIVATE_FUTEX_OFFSET		thread_offsetof (header.private_futex)
>  #endif
> Index: glibc-working/sysdeps/powerpc/nptl/tls.h
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/nptl/tls.h
> +++ glibc-working/sysdeps/powerpc/nptl/tls.h
> @@ -63,6 +63,9 @@ typedef union dtv
>     are private.  */

Please update the comment for this structure to reflect the reality of those
fields which are public ABI and those which are not.

>  typedef struct
>  {
> +  /* Reservation for HWCAP data.  */
> +  unsigned int hwcap2;
> +  unsigned int hwcap;
>    /* Indicate if HTM capable (ISA 2.07).  */
>    int tm_capable;
>    /* Reservation for Dynamic System Optimizer ABI.  */
> @@ -134,7 +137,11 @@ register void *__thread_register __asm__
>  # define TLS_INIT_TP(tcbp) \
>    ({ 									      \
>      __thread_register = (void *) (tcbp) + TLS_TCB_OFFSET;		      \
> -    THREAD_SET_TM_CAPABLE (GLRO (dl_hwcap2) & PPC_FEATURE2_HAS_HTM ? 1 : 0);  \
> +    unsigned int hwcap = GLRO(dl_hwcap);				      \
> +    unsigned int hwcap2 = GLRO(dl_hwcap2);				      \
> +    THREAD_SET_TM_CAPABLE (hwcap2 & PPC_FEATURE2_HAS_HTM ? 1 : 0);	      \
> +    THREAD_SET_HWCAP (hwcap);						      \
> +    THREAD_SET_HWCAP2 (hwcap2);	

OK.
					      \
>      NULL;								      \
>    })
>  
> @@ -142,7 +149,11 @@ register void *__thread_register __asm__
>  # define TLS_DEFINE_INIT_TP(tp, pd) \
>      void *tp = (void *) (pd) + TLS_TCB_OFFSET + TLS_PRE_TCB_SIZE;	      \
>      (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].tm_capable) =	      \
> -      THREAD_GET_TM_CAPABLE ();
> +      THREAD_GET_TM_CAPABLE ();						      \
> +    (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap) =	      \
> +      THREAD_GET_HWCAP ();						      \
> +    (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap2) =	      \
> +      THREAD_GET_HWCAP2 ();

OK.

>  
>  /* Return the address of the dtv for the current thread.  */
>  # define THREAD_DTV() \
> @@ -203,6 +214,32 @@ register void *__thread_register __asm__
>  # define THREAD_SET_TM_CAPABLE(value) \
>      (THREAD_GET_TM_CAPABLE () = (value))
>  
> +/* hwcap & hwcap2 fields in TCB head.  */
> +# define THREAD_GET_HWCAP() \
> +    (((tcbhead_t *) ((char *) __thread_register				      \
> +		     - TLS_TCB_OFFSET))[-1].hwcap)
> +# define THREAD_SET_HWCAP(value) \
> +    if (value & PPC_FEATURE_ARCH_2_06)					      \
> +      value |= PPC_FEATURE_ARCH_2_05 |					      \
> +	       PPC_FEATURE_POWER5_PLUS |				      \
> +	       PPC_FEATURE_POWER5 |					      \
> +	       PPC_FEATURE_POWER4;					      \
> +    else if (value & PPC_FEATURE_ARCH_2_05)				      \
> +      value |= PPC_FEATURE_POWER5_PLUS |				      \
> +             PPC_FEATURE_POWER5 |					      \
> +             PPC_FEATURE_POWER4;					      \
> +    else if (value & PPC_FEATURE_POWER5_PLUS)				      \
> +      value |= PPC_FEATURE_POWER5 |					      \
> +             PPC_FEATURE_POWER4;					      \
> +    else if (value & PPC_FEATURE_POWER5)				      \
> +      value |= PPC_FEATURE_POWER4;					      \
> +    (THREAD_GET_HWCAP () = (value))
> +# define THREAD_GET_HWCAP2() \
> +    (((tcbhead_t *) ((char *) __thread_register				      \
> +                     - TLS_TCB_OFFSET))[-1].hwcap2)
> +# define THREAD_SET_HWCAP2(value) \
> +    (THREAD_GET_HWCAP2 () = (value))
> +

OK modulo Adhemerval's comments to try unify this.

>  /* l_tls_offset == 0 is perfectly valid on PPC, so we have to use some
>     different value to mean unset l_tls_offset.  */
>  # define NO_TLS_OFFSET		-1
> Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h
> +++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> @@ -23,6 +23,86 @@
>  #include <stdint.h>
>  #include <bits/ppc.h>
>  
> +
> +/* Get the hwcap/hwcap2 information from the TCB. Offsets taken
> +   from tcb-offsets.h.  */
> +static inline uint32_t

Should this still inline at -O0, do you want always_inline?

Support C90 and use __inline__?

> +__ppc_get_hwcap (void)
> +{
> +
> +  uint32_t __tcb_hwcap;
> +
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("lwz %0,-28772(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));

Adhemerval notes, and I note it too, that volatile is not needed.

> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("lwz %0,-28724(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#endif
> +
> +  return __tcb_hwcap;
> +}
> +
> +static inline uint32_t
> +__ppc_get_hwcap2 (void)

Likewise.

> +{
> +
> +  uint32_t __tcb_hwcap2;
> +
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("lwz %0,-28776(%1)\n"
> +		    : "=r" (__tcb_hwcap2)
> +		    : "r" (__tp));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("lwz %0,-28728(%1)\n"
> +		    : "=r" (__tcb_hwcap2)
> +		    : "r" (__tp));
> +#endif
> +
> +  return __tcb_hwcap2;
> +}
> +
> +/* Set the hwcap/hwcap2 bits into the designated area in the TCB. Offsets
> +   taken from tcb-offsets.h.  */
> +
> +static inline void
> +__ppc_set_hwcap (uint32_t __hwcap_mask)

Likewise.

> +{
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("stw %1,-28772(%0)\n"
> +		    :
> +		    : "r" (__tp), "r" (__hwcap_mask));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("stw %1,-28724(%0)\n"
> +		    :
> +		    : "r" (__tp), "r" (__hwcap_mask));
> +#endif
> +}
> +
> +static inline void
> +__ppc_set_hwcap2 (uint32_t __hwcap2_mask)

Likewise.

> +{
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("stw %1,-28776(%0)\n"
> +		    :
> +		    : "r" (__tp), "r" (__hwcap2_mask));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("stw %1,-28728(%0)\n"
> +		    :
> +		    : "r" (__tp), "r" (__hwcap2_mask));
> +#endif
> +}
> +
>  /* Read the Time Base Register.   */
>  static inline uint64_t
>  __ppc_get_timebase (void)
> Index: glibc-working/sysdeps/powerpc/Makefile
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/Makefile
> +++ glibc-working/sysdeps/powerpc/Makefile
> @@ -28,7 +28,7 @@ endif
>  
>  ifeq ($(subdir),misc)
>  sysdep_headers += sys/platform/ppc.h
> -tests += test-gettimebase
> +tests += test-gettimebase test-get_hwcap test-set_hwcap

Please make this one test for simplicity.

>  endif
>  
>  ifneq (,$(filter %le,$(config-machine)))
> Index: glibc-working/sysdeps/powerpc/test-get_hwcap.c
> ===================================================================
> --- /dev/null
> +++ glibc-working/sysdeps/powerpc/test-get_hwcap.c
> @@ -0,0 +1,73 @@
> +/* Check __ppc_get_hwcap() functionality
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Tests if the hwcap and hwcap2 data is stored in the TCB.  */
> +
> +#include <inttypes.h>
> +#include <stdio.h>
> +#include <stdint.h>
> +
> +#include <sys/auxv.h>
> +#include <sys/platform/ppc.h>
> +
> +static int
> +do_test (void)
> +{
> +  uint32_t h1, h2, hwcap, hwcap2;
> +
> +  h1 = __ppc_get_hwcap ();
> +  h2 = __ppc_get_hwcap2 ();
> +  hwcap = getauxval(AT_HWCAP);
> +  hwcap2 = getauxval(AT_HWCAP2);
> +
> +  /* hwcap contains only the latest supported ISA, the code checks which is
> +     and fills the previous supported ones. This is necessary because the
> +     same is done in tls.h when setting the values to the TCB.   */
> +
> +  if (hwcap & PPC_FEATURE_ARCH_2_06)
> +    hwcap |= PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_POWER5_PLUS |
> +	     PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
> +  else if (hwcap & PPC_FEATURE_ARCH_2_05)
> +    hwcap |= PPC_FEATURE_POWER5_PLUS | PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
> +  else if (hwcap & PPC_FEATURE_POWER5_PLUS)
> +    hwcap |= PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
> +  else if (hwcap & PPC_FEATURE_POWER5)
> +    hwcap |= PPC_FEATURE_POWER4;
> +
> +  if ( h1 != hwcap )
> +    {
> +      printf("Fail: HWCAP is %x. Should be %x\n", h1, hwcap);
> +      return 1;
> +    }
> +
> +  if ( h2 != hwcap2 )
> +    {
> +      printf("Fail: HWCAP2 is %x. Should be %x\n", h2, hwcap2);
> +      return 1;
> +    }
> +
> +    printf("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n");
> +
> +    return 0;
> +
> +}
> +
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"

OK.

> +
> +
> Index: glibc-working/sysdeps/powerpc/test-set_hwcap.c
> ===================================================================
> --- /dev/null
> +++ glibc-working/sysdeps/powerpc/test-set_hwcap.c
> @@ -0,0 +1,63 @@
> +/* Check __ppc_get_hwcap() functionality
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Tests if the hwcap and hwcap2 data can be stored in the TCB
> +   via the ppc.h API.  */
> +
> +#include <inttypes.h>
> +#include <stdio.h>
> +#include <stdint.h>
> +
> +#include <sys/auxv.h>
> +#include <sys/platform/ppc.h>
> +
> +static int
> +do_test (void)
> +{
> +  uint32_t h1, hwcap, hwcap2;
> +
> +  h1 = 0xDEADBEEF;
> +
> +  __ppc_set_hwcap(h1);
> +  hwcap = __ppc_get_hwcap();
> +
> +  if ( h1 != hwcap )
> +    {
> +      printf("Fail: HWCAP is %x. Should be %x\n", h1, hwcap);
> +      return 1;
> +    }
> +
> +  __ppc_set_hwcap2(h1);
> +  hwcap2 = __ppc_get_hwcap2();
> +
> +  if ( h1 != hwcap2 )
> +    {
> +      printf("Fail: HWCAP2 is %x. Should be %x\n", h1, hwcap2);
> +      return 1;
> +    }
> +
> +    printf("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n");
> +
> +    return 0;
> +
> +}
> +
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"
> +
> +

Cheers,
Carlos.
  
Carlos O'Donell July 3, 2015, 5:08 a.m. UTC | #65
On 07/02/2015 11:21 PM, Carlos O'Donell wrote:
> On 07/01/2015 03:12 PM, Steven Munroe wrote:
>> This discussion has metastasizes into so many side discussions, meta
>> discussion, personal opinions etc that i would like to start over at the
>> point where we where still discussing how to implement something
>> reasonable. 
> 
> I want to make few higher level comments as an FSF steward for the glibc project.
> 
> * IBM has consistently provided hardware and developer resources to maintain 
>   POWER for glibc. IBM is the POWER maintainer, and the ultimate responsibility
>   for the machine rests with IBM.  Today that responsibility is with Steven Munroe
>   (IBM) who is the POWER maintainer for glibc. The machine maintainership provides
>   Steven with a veto for machine-specific features, ABIs and APIs, much like all
>   the other machine port maintainers. Steven is expected to use this veto to
>   further the goals of the glibc project, and serve the needs of POWER users, and
>   balance the two.
> 
> * Consensus need not be agreement, it may be that we discuss the options, find
>   no sustained opposition, and move forward with a solution. People can disagree
>   bitterly and we can still have consensus. Developers can have strong and polarizing
>   opinions about exactly how to use the limited resource of `thread-pointer + offset`
>   accessible data, but at the end of the day consensus can be reached.
> 
> * Healthy and active discussion, like the discussion on this particular topic,
>   are good for the community. Topics surrounding optimizations are rife with
>   complex tradeoffs, and require discussion, summaries, and a developer to
>   champion a position of consensus. I see nothing wrong with this kind of behaviour.
>   The discussions should stay on point, be technical, provide feedback, and
>   indicate clearly if the comment amounts to sustained opposition.

Fixed TO: Joseph Myers.

Cheers,
Carlos.
  
Ondrej Bilka July 3, 2015, 8:55 a.m. UTC | #66
On Fri, Jul 03, 2015 at 01:05:03AM -0400, Carlos O'Donell wrote:
> On 06/08/2015 05:03 PM, Carlos Eduardo Seo wrote:
> > The proposed patch adds a new feature for powerpc. In order to get
> > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > This enables users to write versioned code based on the HWCAP bits
> > without going through the overhead of reading them from the auxiliary
> > vector.
> > 
> > A new API is published in ppc.h for get/set the bits in the
> > aforementioned memory area (mainly for gcc to use to create
> > builtins).
> > 
> > Testcases for the API functions were also created.
> > 
> > Tested on ppc32, ppc64 and ppc64le.
> > 
> > Okay to commit?
> 
> (1) Prevent running new applications against old glibc.
> 
> You add a new interface to glibc, but provide no way to prevent
> new applications that compile with this support from crashing
> or behaving badly when run on systems with an older glibc.
> 
> Richard Henderson had suggested to me that you could use a dummy
> versioned symbol in the code to create a dependency against
> GLIBC_2.22 and thus prevent those new binaries from running
> on say GLIBC_2.21. You'd never use the versioned symbol for anything.
> 
> This would seem a much better way to prevent what will obviously
> be a weird failure mode.
> 
> Have you considered this failure mode?
> 
> At the end of the day it's up to IBM to make the best use of the
> tp+offset data stored in the TCB, but every byte you save is another
> byte you can use later for something else.
> 
Carlos a problem with this patch is that they ignored community
feedback. Early in this thread Florian come with better idea to use
GOT+offset that could be accessed as 
&hwcap_hack
and avoids per-thread runtime overhead.

Also I now have additional comment with api as if you want faster checks
wouldn't be faster to save each bit of hwcap into byte field so you
could avoid using mask at each check?
  
Carlos O'Donell July 3, 2015, 1:15 p.m. UTC | #67
On 07/03/2015 04:55 AM, Ondřej Bílka wrote:
>> At the end of the day it's up to IBM to make the best use of the
>> tp+offset data stored in the TCB, but every byte you save is another
>> byte you can use later for something else.
>>
> Carlos a problem with this patch is that they ignored community
> feedback. Early in this thread Florian come with better idea to use
> GOT+offset that could be accessed as 
> &hwcap_hack and avoids per-thread runtime overhead.

Steven and Carlos have not ignored the community feedback, they just
have a different set of priorities and requirements. There is little
to discuss if your priorities and requirements are different.

The use of tp+offset data is indeed a scarce resource that should be
used only when absolutely necessary or when the use case dictates.

It is my opinion as a developer, that Carlos' patch is flawed because
it uses a finite resource, namely tp+offset data, for what I perceive
to be a flawed design pattern that as a free software developer I don't
want to encourage. These are not entirely technical arguments though,
they are subjective and based on my desire to educate and mentor developers
who write such code. I don't present these arguments as sustained
opposition to the patch because they are not technical and Carlos
has a need to accelerate this use case today.

I have only a few substantive technical issues with the patch. Given
that the ABI allocates a large block of tp+offset data, I think it is
OK for IBM to use the data in this way. For example I think it is much
much more serious that such a built application will likely just crash
when run with an older glibc. This is a distribution maintenance issue
that I can't ignore and I'd like to see it solved by a dependency on a
versioned dummy symbol.

Lastly, the symbol address hack is an incomplete solution because Florian
has not provided an implementation. Depending on the implementation it
may require a new relocation, and that is potentially more costly to the
program startup than the present process for filling in HWCAP/HWCAP2.
Without a concrete implementation I can't comment on one or the other.
It is in my opinion overly harsh to force IBM to go implement this new
feature. They have space in the TCB per the ABI and may use it for their
needs. I think the community should investigate symbol address munging
as a method for storing data in addresses and make a generic API from it,
likewise I think the community should investigate standardizing tp+offset
data access behind a set of accessor macros and normalizing the usage
across the 5 or 6 architectures that use it.

> Also I now have additional comment with api as if you want faster checks
> wouldn't be faster to save each bit of hwcap into byte field so you
> could avoid using mask at each check?

That is an *excellent* suggestion, and exactly the type of technical
feedback that we should be giving IBM, and Carlos can confirm if they've
tried such "unpacking" of the bits into byte fields. Such unpacking is
common in other machine implementations.

Cheers,
Carlos.
  
Ondrej Bilka July 3, 2015, 5:11 p.m. UTC | #68
On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote:
> On 07/03/2015 04:55 AM, Ondřej Bílka wrote:
> >> At the end of the day it's up to IBM to make the best use of the
> >> tp+offset data stored in the TCB, but every byte you save is another
> >> byte you can use later for something else.
> >>
> > Carlos a problem with this patch is that they ignored community
> > feedback. Early in this thread Florian come with better idea to use
> > GOT+offset that could be accessed as 
> > &hwcap_hack and avoids per-thread runtime overhead.
> 
> Steven and Carlos have not ignored the community feedback, they just
> have a different set of priorities and requirements. There is little
> to discuss if your priorities and requirements are different.
> 
> The use of tp+offset data is indeed a scarce resource that should be
> used only when absolutely necessary or when the use case dictates.
> 
> It is my opinion as a developer, that Carlos' patch is flawed because
> it uses a finite resource, namely tp+offset data, for what I perceive
> to be a flawed design pattern that as a free software developer I don't
> want to encourage. These are not entirely technical arguments though,
> they are subjective and based on my desire to educate and mentor developers
> who write such code. I don't present these arguments as sustained
> opposition to the patch because they are not technical and Carlos
> has a need to accelerate this use case today.
> 
> I have only a few substantive technical issues with the patch. Given
> that the ABI allocates a large block of tp+offset data, I think it is
> OK for IBM to use the data in this way. For example I think it is much
> much more serious that such a built application will likely just crash
> when run with an older glibc. This is a distribution maintenance issue
> that I can't ignore and I'd like to see it solved by a dependency on a
> versioned dummy symbol.
> 
> Lastly, the symbol address hack is an incomplete solution because Florian
> has not provided an implementation. Depending on the implementation it
> may require a new relocation, and that is potentially more costly to the
> program startup than the present process for filling in HWCAP/HWCAP2.

Thats valid concern. My idea was checking if hwcap_hack relocation exist. 
I didn't realize that it scales with number of libraries.

One of reasons why I didn't like this proposal is that it harms linux
ecosystem as  it increases startup cost of a bit everything while its 
unlikely that cross-platform projects will use this.

But these could be done without much of our help. We need to keep these
writable to support this hack. I don't know exact assembly for powerpc,
it should be similar to how do it on x64:

int x;

int foo()
{
#ifdef SHARED
asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
#else
asm ("lea x(%rip), %rax; movb $32, (%rax)");
#endif
return &x;
}


> Without a concrete implementation I can't comment on one or the other.
> It is in my opinion overly harsh to force IBM to go implement this new
> feature. They have space in the TCB per the ABI and may use it for their
> needs. I think the community should investigate symbol address munging
> as a method for storing data in addresses and make a generic API from it,
> likewise I think the community should investigate standardizing tp+offset
> data access behind a set of accessor macros and normalizing the usage
> across the 5 or 6 architectures that use it.
>
I would like this as with access to that I could improve performance of
several inlines.

 
> > Also I now have additional comment with api as if you want faster checks
> > wouldn't be faster to save each bit of hwcap into byte field so you
> > could avoid using mask at each check?
> 
> That is an *excellent* suggestion, and exactly the type of technical
> feedback that we should be giving IBM, and Carlos can confirm if they've
> tried such "unpacking" of the bits into byte fields. Such unpacking is
> common in other machine implementations.
>
Also with unpacking doing that in userspace becomes more attractive so
we don't have to copy 64 bytes for each thread.
 
> Cheers,
> Carlos.
  
Carlos O'Donell July 3, 2015, 5:12 p.m. UTC | #69
On 07/01/2015 03:12 PM, Steven Munroe wrote:
> If you think about the requirements for a while it becomes clear. As the
> HWCAP cache would have to be defined and initialized in either libgcc or
> libc, accept will be none local from any user library. So all the local
> TLC access optimization's are disallowed. Add the requirement to support
> dl_open() libraries leaves the general dynamic TLS model as the ONLY
> safe option.

That's not true anymore? Alan Modra added pseudo-TLS descriptors to POWER
just recently[1], which means __tls_get_addr call is elided and the offset
returned immediately via a linker stub for use with tp+offset. However,
I agree that even Alan's great work here is still going to be several
more instructions than a raw tp+offset access. However, it would be
interesting to discuss with Alan if his changes are sufficiently good
that the out-of-order execution hides the latency of this additional
instructions and his methods are a sufficient win that you *can* use
TLS variables?

> Now there were a lot of suggestions to just force the HWCAP TLS
> variables into initial exec or local exec TLS model with an attribute.
> This would resolve to direct TLS offset in some special reserved TLS
> space?

It does. Since libc.so is always seen by the linker it can always allocate
static TLS space for that library when it computes the maximum size of
static TLS space.

> How does that work with a library loaded with dl_open()? How does that
> work with a library linked with one toolchain / GLIBC on Distro X and
> run on a system with a different toolchain and GLIBC on Distro Y? With
> different versions of GLIBC? Will HWCAP get the same TLS offset? Do we
> end up with .text relocations that we are also trying to avoid?

(1) Interaction with dlopen?

The two variables in question are always in libc.so.6, and therefore are
always loaded first by DT_NEEDED, and there is always static storage
reserved for that library.

There are 2 scenarios which are problematic.

(a) A static application accessing NSS / ICONV / IDN must dynamically
    load libc.so.6, and there must be enough reserve static TLS space
    for the allocated IE TLS variables or the dynamic loader will abort
    the load indicating that there is not enough space to load any more
    static TLS using DSOs. This is solved today by providing surplus
    static TLS storage space.

(b) Use of dlmopen to load multiple libc.so.6's. In this case you could
    load libc.so.6 into alternate namespaces and eventually run out of
    surplus static TLS. We have never seen this in common practice because
    there are very few users of dlmopen, and to be honest the interface
    is poorly documented and fraught with problems.

Therefore in the average scenario it will work to use static TLS, or
IE TLS variables in glibc in the average case. I consider the above
cases to be outside the normal realm of user applications.

e.g.
extern __thread int foo __attribute__((tls_model("initial-exec")));

(2) Distro to distro compatibility?

With my Red Hat on:

Let me start by saying you have absolutely no guarantee here at all
provided by any distribution. As the Fedora and RHEL glibc maintainer
your vendor is far outside the scope of support and such a scenario is
never possible. You can wish it, but it's not true unless you remain
very very low level and very very simple interfaces. That is to say
that you have no guarantee that a library linked by a vendor with one
toolchain in distro X will work in distro Y. If you need to do that
then build in a container, chroot or VM with distro Y tools. No vendor
I've ever talked to expects or even supports such a scenario.

With my hacker hat on:

Generally for simple features it just works as long as both distros
have the same version of glibc. However, we're talking only about
the glibc parts of the problem. Compatibility with other libraries
is another issue.

(3) Different versions of glibc?

Sure it works, as long as all the versions have the same feature and
are newer than the version in which you introduced the change. That's
what backwards compatibility is for.

(4) Will HWCAP get the same TLS offset? 

That's up to the static linker. You don't care anymore though, the gcc
builtin will reference the IE TLS variables like it would normally as
part of the shared implementation, and that variable is resolved to glibc
and normal library versioning hanppens. The program will now require that
glibc or newer and you'll get proper error messages about that.

(5) Do we end up with .text relocations that we are also trying to avoid?

You should not. The offset is known at link time and inserted by the
static linker.

> Again the TCB avoids all of this as it provides a fixed offset defined
> by the ABI and does not require any up-calls or indirection. And also
> will work in any library without induced hazards. This clearly works
> across distros including previous version of GLIBC as the words where
> previously reserved by the ABI. Application libraries that need to run
> on older distros can add a __built_cpu_init() to their library init or
> if threaded to their thread create function.

You get a crash since previous glibc's don't fill in the data?
And that crash gives you only some information to debug the problem,
namely that you ran code for a processors you didn't support.

I've suggested to Carlos that this is a problem with the use of the
TCB. If one uses the TCB, one should add a dummy symbol that is versioned
and tracks when you added the feature, and thus you can depend upon it,
but not call it, and that way you get the right versioning. The same
problem happened with stack canaries and it's still painfully annoying
at the distribution level.

It is true that you could use LD_PRELOAD to run __builtin_cpu_init()
on older systems, but you need to *know* that, and use that. What
provides this function? libgcc?

Do you want to use the IBM Advance Toolchain for POWER to be able to 
support this feature across all distributions at the same time by not
requiring any particular glibc version and by doing the initialization
out of band via __builtin_cpu_init() for older glibc? It will still result
in a weird crash of the application if the user doesn't know any better.

It is certainly a benefit to using the TCB, that this kind of use case
is supported. However, in doing so you adversely impact the distribution
maintainers for the benefit of?

Cheers,
Carlos.

[1] https://sourceware.org/ml/libc-alpha/2015-03/msg00580.html
  
Carlos O'Donell July 3, 2015, 5:31 p.m. UTC | #70
On 07/03/2015 01:11 PM, Ondřej Bílka wrote:
> On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote:
>> Lastly, the symbol address hack is an incomplete solution because Florian
>> has not provided an implementation. Depending on the implementation it
>> may require a new relocation, and that is potentially more costly to the
>> program startup than the present process for filling in HWCAP/HWCAP2.
> 
> Thats valid concern. My idea was checking if hwcap_hack relocation exist. 
> I didn't realize that it scales with number of libraries.

Exactly. Usually a GOT entry with a reloc, but this one is special since
it's computed by another function. Actually, can't the IFUNC infrastructure
doing this already? If you take the address of an STT_GNU_IFUNC symbol
you should get back the address of the resolved to function? Can the
resolver return `(void *)HWCAP`? It's an abuse of IFUNC to use the resolver
to return a custom function address that can't be executed but means
something dynamic?

> One of reasons why I didn't like this proposal is that it harms linux
> ecosystem as  it increases startup cost of a bit everything while its 
> unlikely that cross-platform projects will use this.

That could be fixed by removing the initialization from glibc and forcing
the developer to call __builtin_cpu_init() to do the initialization?
Then there is no dependency on glibc other than to provide scratch space
in TP+offset? Someone should ask IBM if this is feasible? Then instead
of having to say:

  "For old glibc you must have a constructor which calls __builtin_cpu_init()
   and old glibc varies depending on your distro like this..."

You just say:

  "Always call __builtin_cpu_init(). Period."

Note that while we continue to add things to TP+offset it becomes a target
for security attacks too. The TCB is read-write and now impacts program
control flow with these new bits, and it's easy to find ROP widgets that
store to TP+offset. You would need a program to use these bits and an
attack vector though.

Cheers,
Carlos.
  
Ondrej Bilka July 3, 2015, 6:44 p.m. UTC | #71
On Fri, Jul 03, 2015 at 01:31:04PM -0400, Carlos O'Donell wrote:
> On 07/03/2015 01:11 PM, Ondřej Bílka wrote:
> > On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote:
> >> Lastly, the symbol address hack is an incomplete solution because Florian
> >> has not provided an implementation. Depending on the implementation it
> >> may require a new relocation, and that is potentially more costly to the
> >> program startup than the present process for filling in HWCAP/HWCAP2.
> > 
> > Thats valid concern. My idea was checking if hwcap_hack relocation exist. 
> > I didn't realize that it scales with number of libraries.
> 
> Exactly. Usually a GOT entry with a reloc, but this one is special since
> it's computed by another function. Actually, can't the IFUNC infrastructure
> doing this already? If you take the address of an STT_GNU_IFUNC symbol
> you should get back the address of the resolved to function? Can the
> resolver return `(void *)HWCAP`? It's an abuse of IFUNC to use the resolver
> to return a custom function address that can't be executed but means
> something dynamic?
>
Yes we could, but it needs a LD_BIND_NOW=1, otherwise its lazily
resolved and you would need to initialize it for each dso.

There should be way to force early binding on per-symbol basis as gcc
could easily determine which functions will be called at least once.

Now lot of libc ifunc share that problem, memcpy is resolved for each
dso we load.

 
> > One of reasons why I didn't like this proposal is that it harms linux
> > ecosystem as  it increases startup cost of a bit everything while its 
> > unlikely that cross-platform projects will use this.
> 
> That could be fixed by removing the initialization from glibc and forcing
> the developer to call __builtin_cpu_init() to do the initialization?

It couldn't. We need to copy these for each thread, or user application
would need to interpose pthread_create.

> Then there is no dependency on glibc other than to provide scratch space
> in TP+offset? Someone should ask IBM if this is feasible? Then instead
> of having to say:
> 
>   "For old glibc you must have a constructor which calls __builtin_cpu_init()
>    and old glibc varies depending on your distro like this..."
>
This part isn't problem at all if only access is __builtin_cpu_support.
When compiling linker would check if __builtin_cpu_support was present
and automatically add constructor if it does.
 
> You just say:
> 
>   "Always call __builtin_cpu_init(). Period."
> 
> Note that while we continue to add things to TP+offset it becomes a target
> for security attacks too. The TCB is read-write and now impacts program
> control flow with these new bits, and it's easy to find ROP widgets that
> store to TP+offset. You would need a program to use these bits and an
> attack vector though.
> 
> Cheers,
> Carlos.
  
Ondrej Bilka July 3, 2015, 7:53 p.m. UTC | #72
On Wed, Jul 01, 2015 at 02:12:20PM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 12:38 -0400, Rich Felker wrote:
> > On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> > > 
> > > The proposed patch adds a new feature for powerpc. In order to get
> > > faster access to the HWCAP/HWCAP2 bits, we now store them in the
> > > TCB. This enables users to write versioned code based on the HWCAP
> > > bits without going through the overhead of reading them from the
> > > auxiliary vector.
> > > 
> > > A new API is published in ppc.h for get/set the bits in the
> > > aforementioned memory area (mainly for gcc to use to create
> > > builtins).
> > 
> > Do you have any justification (actual performance figures for a
> > real-world usage case) for adding ABI constraints like this? This is
> > not something that should be done lightly. My understanding is that
> > hwcap bits are normally used in initializing functions pointers (or
> > equivalent things like ifunc resolvers), not again and again at
> > runtime, so I'm having a hard time seeing how this could help even if
> > it does make the individual hwcap accesses measurably faster.
> > 
> > It would also be nice to see some justification for the magic number
> > offsets. Will they be stable under changes to the TCB structure or
> > will preserving them require tip-toeing around them?
> > 
> 
> This discussion has metastasizes into so many side discussions, meta
> discussion, personal opinions etc that i would like to start over at the
> point where we where still discussing how to implement something
> reasonable. 
> 
> First a level set on requirements and goals.
> 
> The intent is allow application developers to develop new application
> for Linux on Power and simply the porting of existing Linux applications
> to Power. And encourage then to apply the same level of platform
> optimization to Power as they do for other Linux platforms.
>
From your proposal it didn't seem so. If this is the goal then you
should reach wider consensus to find cross-platform mechanism as
programmers would be discouraged by having to learn yet another custom
interface for powerpc.

 
> While there are a near infinity of options (some of which some members
> of this community think are stupid) and I have seen them all being used.
> As general rule I find it counter productive to call the customer (All
> Linux Application Developers are our customers) stupid their face, so I
> try to explain the options and encourage them to use many of the
> techniques that this community thinks are not stupid.
> 
That is good policy but that forces you to have higher standard and
strive to newer make a bad suggestion as customers could accept that
without question. So next time make clear that its customer wish and you
personally oppose that otherwise we would rigthfully tell you that you
shouldn't make that mistake.

> But as rule the application developer are busy, don't have much patience
> for nonsense like IFUNC and AT_PLATFORM library search strategies. They
> tend to use what they already know, apply minimal effort to solve the
> immediate problem, and move on!
> 
> One of the "things they already know" is the __built_cpu_is()
> __built_cpu_supports() GCC builtins for X86. To goal of this simple
> proposal is enable that for powerpc powerpc64 and powerpc64le, based on
> the existing AT_HWCAP/AT_HWCAP2 mechanisms.
>
While that is true you are too focused on __builtin_cpu_supports using
hwcap to see bigger picture. You need to distinguish between primitives
(hwcap, ifunc, AT_LIBRARY, fat libraries...) and interfaces. A
__builtin_cpu_supports is one interface and not neccesarry best one.

Finding a best interface is worthwide goal and I don't like to introduce 
worse interfaces as average programmers would use them and it would be
harder to change than if they did rigth one in first place.

How __builtin_cpu_supports is implemented is irrelevant. Gcc could
decide to make fat library that replaces every function by ifunc, use
ifunc for every function with __built_cpu_supports, we could add support
of fat libraries to linker... 

So why teach users ugly interface if they could use better and safer ones?
 
> Another observation is that many of these applications are deployed as
> shared object libraries and frequently are not linked directly to the
> main application but loaded via dl_open() are runtime. So clever
> solutions that are only simple and/or fast from a main programs but
> difficult and/or slow for dl_open() library are not an option.
> 
That removes performance argument why gcc shouldn't use ifunc. As these
use plt it wouldn't slow them down but checking hwcap bit inside
function would.

> They are very firm about a "single binary built" for all supported
> distros and all supported hardware generations.
> 
Needs of linux community are different that needs of your customers. You
have problem that platform-specific code increases size so for
distributions a best way would be split it into several files and they
could transfer package with binaries optimized only for current
cpu+generic ones. That requirement forces fat binaries and would
increase compile time a lot.

> And finally these applications tend to be massive C++ programs composed
> of smallish members functions and byzantine layers of templates. I have
> not observed wide use of private/hidden and so these libraries tend to
> expose every member function as a PLT entry, which resists most
> in-lining opportunities.
> 
And why customer couldn't use gcc -symbolic? This is strong argument
againist hwcap optimization. You could pair each hwcap use with plt
overhead of function and you will probably lose more cycles from
increased instruction cache usage than few cycles in member function
that does single operation.

And when you mentioned templates how often would you see uses like

template foo <bool could_do_x, bool could_do_y> ...

foo <__builtin_cpu_supports (x), __builtin_cpu_supports (y)> f;

> Net this is a harder problem then it looks.
> 
> So lets write down some requirements:
> 
> 0) Something the average application developer will understand and use.

Thats problem with __builtin_cpu_supports. Developers would use that but
not understand. Instead of being fixed on that a easier would be adding
a flag to gcc to handle that. Gcc could support builtins with multiple
implementation that when compiled would generate several variants
depending on cpu.

Or if __builtin_cpu_supports is used then gcc should treat it on higher
level and split that into two functions where one use feature but other
don't.


> 1) In any user library, in including ones loaded via LD_PRELOAD and
> dl_open().
> 2) Across multiple Distro versions and across Distros (using different
> GLIBC versions).
> 
> And goals for the Power implementation:
> 
> 1) As fast as possible accounting for the limits of the ABI, ISA and
> Micro-architecture.
> 1a) Minimal path length to obtain the hwcap bit vector for test
> 1b) Limited exposure to micro-architecture hazards including
> indirection.
> 2) Simple and reliable initialization of the cached values.
> 3) And without relying on .text relocation in libraries.
> 
> First lets dispose of the obvious. Extern static variables.
> 
> This is not horrible for larger grained examples but can be less than
> optimal for fine grained C++ examples. As stated above the hwcap will
> not be local to the user library. As PowerISA does not have PC-relative
> addressing our ABI requires that R2 (AKA the TOC pointer) is set to
> address the local (for this libraries) GOT/TOC/PLT before we access any
> static variable and extern require an indirect load of the extern hwcap
> address from the GOT/TOC.
> 
> In addition, since we are potentially changing R2 (AKA the TOC pointer)
> we are now obligated to save and restore the R2.
> 
> Now the design of POWER assumes that as RISC architecture with lots of
> registers and being designed for massive memory bandwidth and
> out-of-order execution, the processor core does no optimized for
> programs that store to then immediately reload from a memory location.
> In a machine with 16-pipelines per core and capable of dispatching up to
> 8 instructions per cycle, "immediate" has an amazingly broad definition
> (many 10s of instructions).
> 
> So the store and reload of the TOC pointer can hit the Load-hit-store
> hazard (essentially the load got issued (out-of-order) before the store
> it depended on was complete or at a stage where a bypass was available)
> even across the execution of the called function. While the core detects
> and corrects this state, it does so in a heavy handed way (instruction
> rejects (11 cycles each) or instruction fetch flush (worse)). Lets just
> say this is something to avoid if you can.
> 
> So introducing a static variable to C++ functions that would not
> normally access static should be avoided. Many C++ member functions are
> small enough execute completely within the available (volatile) register
> and don't even need a stack-frame. So a __builtin_cpu_supports() design
> based on none local extern static would be a unforced error in these
> cases.
> 
> Of course the TCB based proposal avoids all of this because the TCB
> pointer (R13) is constant across all functions in a thread (not
> save/restored in the user application).
> 

Which isn't obvious at all. Main mistake is assuming that variable needs
to be static. There is no reason why gcc shouldn't generate code
equivalent to including hwcap.h and adding equivalent of hwcap.c when linking:

hwcap.h:
int __hwcap __attribute__ ((visibility ("hidden"))) ;

hwcap.c:

#include <hwcap.h>

// gcc needs to make this first constructor.
extern int __global_hwcap;
void __attribute__ ((constructor)) 
set_hwcap () 
{
  __hwcap = __global_hwcap;
}

Also this is friendlier when we use optimization to use a byte to store
each hwcap bit.
  
Steven Munroe July 6, 2015, 1:16 a.m. UTC | #73
On Fri, 2015-07-03 at 13:12 -0400, Carlos O'Donell wrote:
> On 07/01/2015 03:12 PM, Steven Munroe wrote:
> > If you think about the requirements for a while it becomes clear. As the
> > HWCAP cache would have to be defined and initialized in either libgcc or
> > libc, accept will be none local from any user library. So all the local
> > TLC access optimization's are disallowed. Add the requirement to support
> > dl_open() libraries leaves the general dynamic TLS model as the ONLY
> > safe option.
> 
> That's not true anymore? Alan Modra added pseudo-TLS descriptors to POWER
> just recently[1], which means __tls_get_addr call is elided and the offset
> returned immediately via a linker stub for use with tp+offset. However,
> I agree that even Alan's great work here is still going to be several
> more instructions than a raw tp+offset access. However, it would be
> interesting to discuss with Alan if his changes are sufficiently good
> that the out-of-order execution hides the latency of this additional
> instructions and his methods are a sufficient win that you *can* use
> TLS variables?
> 
I did discuss this with Alan and he agree that with the given
requirements the the standard TLS mechanism is always slower them my
original TCB proposal.

Why would you think I had not talked to Alan?

> > Now there were a lot of suggestions to just force the HWCAP TLS
> > variables into initial exec or local exec TLS model with an attribute.
> > This would resolve to direct TLS offset in some special reserved TLS
> > space?
> 
> It does. Since libc.so is always seen by the linker it can always allocate
> static TLS space for that library when it computes the maximum size of
> static TLS space.
> 
> > How does that work with a library loaded with dl_open()? How does that
> > work with a library linked with one toolchain / GLIBC on Distro X and
> > run on a system with a different toolchain and GLIBC on Distro Y? With
> > different versions of GLIBC? Will HWCAP get the same TLS offset? Do we
> > end up with .text relocations that we are also trying to avoid?
> 
> (1) Interaction with dlopen?
> 
> The two variables in question are always in libc.so.6, and therefore are
> always loaded first by DT_NEEDED, and there is always static storage
> reserved for that library.
> 
> There are 2 scenarios which are problematic.
> 
> (a) A static application accessing NSS / ICONV / IDN must dynamically
>     load libc.so.6, and there must be enough reserve static TLS space
>     for the allocated IE TLS variables or the dynamic loader will abort
>     the load indicating that there is not enough space to load any more
>     static TLS using DSOs. This is solved today by providing surplus
>     static TLS storage space.
> 
> (b) Use of dlmopen to load multiple libc.so.6's. In this case you could
>     load libc.so.6 into alternate namespaces and eventually run out of
>     surplus static TLS. We have never seen this in common practice because
>     there are very few users of dlmopen, and to be honest the interface
>     is poorly documented and fraught with problems.
> 
> Therefore in the average scenario it will work to use static TLS, or
> IE TLS variables in glibc in the average case. I consider the above
> cases to be outside the normal realm of user applications.
> 
> e.g.
> extern __thread int foo __attribute__((tls_model("initial-exec")));
> 
> (2) Distro to distro compatibility?
> 
> With my Red Hat on:
> 
> Let me start by saying you have absolutely no guarantee here at all
> provided by any distribution. As the Fedora and RHEL glibc maintainer
> your vendor is far outside the scope of support and such a scenario is
> never possible. You can wish it, but it's not true unless you remain
> very very low level and very very simple interfaces. That is to say
> that you have no guarantee that a library linked by a vendor with one
> toolchain in distro X will work in distro Y. If you need to do that
> then build in a container, chroot or VM with distro Y tools. No vendor
> I've ever talked to expects or even supports such a scenario.
> 
> With my hacker hat on:
> 
> Generally for simple features it just works as long as both distros
> have the same version of glibc. However, we're talking only about
> the glibc parts of the problem. Compatibility with other libraries
> is another issue.
> 
No! the version of GLIBC does not matter as long as the GLIBC supports
TLS (GLIBC-2.5?)

> (3) Different versions of glibc?
> 
> Sure it works, as long as all the versions have the same feature and
> are newer than the version in which you introduced the change. That's
> what backwards compatibility is for.
> 
> (4) Will HWCAP get the same TLS offset? 
> 
> That's up to the static linker. You don't care anymore though, the gcc
> builtin will reference the IE TLS variables like it would normally as
> part of the shared implementation, and that variable is resolved to glibc
> and normal library versioning hanppens. The program will now require that
> glibc or newer and you'll get proper error messages about that.
> 
> (5) Do we end up with .text relocations that we are also trying to avoid?
> 
> You should not. The offset is known at link time and inserted by the
> static linker.
> 
To avoid the text relocation I believe there is an extra GOT load of the
offset. If this is not true then Alan owes me an update to the ABI
document to explain how this would work. As the current Draft ELF2 ABI
update does not say this is supported.

> > Again the TCB avoids all of this as it provides a fixed offset defined
> > by the ABI and does not require any up-calls or indirection. And also
> > will work in any library without induced hazards. This clearly works
> > across distros including previous version of GLIBC as the words where
> > previously reserved by the ABI. Application libraries that need to run
> > on older distros can add a __built_cpu_init() to their library init or
> > if threaded to their thread create function.
> 
> You get a crash since previous glibc's don't fill in the data?
> And that crash gives you only some information to debug the problem,
> namely that you ran code for a processors you didn't support.
> 
There is NO crash. There never was a crash. There is no additional
security exposure. The only TCB fields that might be a security exposure
where already there, in every other platform.

The worst there can be is is fallback the to base implementation (the
bit is 0 when is should be 1).

As explained the dword is already there and initialized to 0 when the
page is allocate. So the load will work NOW for any GLIBC since TLS was
implemented.

As implemented by Alan and I.


> I've suggested to Carlos that this is a problem with the use of the
> TCB. If one uses the TCB, one should add a dummy symbol that is versioned
> and tracks when you added the feature, and thus you can depend upon it,
> but not call it, and that way you get the right versioning. The same
> problem happened with stack canaries and it's still painfully annoying
> at the distribution level.

This is completely unnecessary. The load associated with
__builtin_cpu_supports() will work with any GLIBC what support TLS and
the worst that will happen is it will load zeros.

You have not convince me that this is necessary.

You are trying to force to me to use a any number of techniques that
either don't actually work (on my ISA and ABI) or add unnecessary
overhead (exposure to pipeline hazards) for no added benefit.

The problems that are claimed either don't actually exist or are greatly
exaggerated.

I have explained all this in great deal. I really don't understand what
this is so hard to accept.

> It is true that you could use LD_PRELOAD to run __builtin_cpu_init()
> on older systems, but you need to *know* that, and use that. What
> provides this function? libgcc?
> 
We will provide a little init routine applications can use. This is not
hard.

> Do you want to use the IBM Advance Toolchain for POWER to be able to 
> support this feature across all distributions at the same time by not
> requiring any particular glibc version and by doing the initialization
> out of band via __builtin_cpu_init() for older glibc? It will still result
> in a weird crash of the application if the user doesn't know any better.
> 
The Advance Toolchain provides it own newer GLIBC. This feature can be
delivered in any of the current AT version within weeks after it goes
upstream.

The customer requirement for the single binary only requires that GLIBC
on the target system or from he AT is as new or newer then the GLIBC it
was linked to in the build.

So not a problem.

> It is certainly a benefit to using the TCB, that this kind of use case
> is supported. However, in doing so you adversely impact the distribution
> maintainers for the benefit of?
> 
I can not think of any adverse impacts on any of the other platform
maintainers, on any the distros.

This is all platform specific code. And a tiny amount at that.

Eventually distro's will pick this up in the normal way. The normal
distro processes used for interim release updates applies.

> Cheers,
> Carlos.
> 
> [1] https://sourceware.org/ml/libc-alpha/2015-03/msg00580.html
>
  
Rich Felker July 6, 2015, 2:13 a.m. UTC | #74
On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote:
> > I've suggested to Carlos that this is a problem with the use of the
> > TCB. If one uses the TCB, one should add a dummy symbol that is versioned
> > and tracks when you added the feature, and thus you can depend upon it,
> > but not call it, and that way you get the right versioning. The same
> > problem happened with stack canaries and it's still painfully annoying
> > at the distribution level.
> 
> This is completely unnecessary. The load associated with
> __builtin_cpu_supports() will work with any GLIBC what support TLS and
> the worst that will happen is it will load zeros.

That's bad enough -- there are applications of hwcap where you NEED
the correct value, not some (possibly empty) subset of the bits. For
example if you need to know which registers to save/restore in an aync
context-switching setup (rolling your own makecontext/swapcontext) or
if you're implementing a function which has a special calling
convention with a contract not to clobber any but a small fixed set of
registers, but it might callback to arbitrary code in a rare case (ala
__tls_get_addr or tls descriptor functions).

However I don't even see how you can be confident that you'll read
zeros. Is the TCB field before this new field you're adding
always-zero?

Rich
  
Steven Munroe July 6, 2015, 1:26 p.m. UTC | #75
On Sun, 2015-07-05 at 22:13 -0400, Rich Felker wrote:
> On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote:
> > > I've suggested to Carlos that this is a problem with the use of the
> > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned
> > > and tracks when you added the feature, and thus you can depend upon it,
> > > but not call it, and that way you get the right versioning. The same
> > > problem happened with stack canaries and it's still painfully annoying
> > > at the distribution level.
> > 
> > This is completely unnecessary. The load associated with
> > __builtin_cpu_supports() will work with any GLIBC what support TLS and
> > the worst that will happen is it will load zeros.
> 
> That's bad enough -- there are applications of hwcap where you NEED
> the correct value, not some (possibly empty) subset of the bits. For
> example if you need to know which registers to save/restore in an aync
> context-switching setup (rolling your own makecontext/swapcontext) or
> if you're implementing a function which has a special calling
> convention with a contract not to clobber any but a small fixed set of
> registers, but it might callback to arbitrary code in a rare case (ala
> __tls_get_addr or tls descriptor functions).
> 
No! any application that uses HWCAP and or __builtin_cpu_supports, has
to program for when the feature is not available. The feature bit is
either true or false.

The dword we are talking about is already allocated and has been since
the initial implementation of TLS. For the PowerPC ABIs we allocated a
full 4K for the TCB and use negative displacement calculations that work
well with our ISA. 

None of the existing TCB field offsets change. So this add is completely
upward compatible with all current GLIBC version.

None on the issues you suggest exist in the this proposal

> However I don't even see how you can be confident that you'll read
> zeros. Is the TCB field before this new field you're adding
> always-zero?
> 
The TCB is allocated as part of the thread stack which is mmap, which
kernel initializes to all zero.

> Rich
>
  
Rich Felker July 6, 2015, 3:52 p.m. UTC | #76
On Mon, Jul 06, 2015 at 08:26:46AM -0500, Steven Munroe wrote:
> On Sun, 2015-07-05 at 22:13 -0400, Rich Felker wrote:
> > On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote:
> > > > I've suggested to Carlos that this is a problem with the use of the
> > > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned
> > > > and tracks when you added the feature, and thus you can depend upon it,
> > > > but not call it, and that way you get the right versioning. The same
> > > > problem happened with stack canaries and it's still painfully annoying
> > > > at the distribution level.
> > > 
> > > This is completely unnecessary. The load associated with
> > > __builtin_cpu_supports() will work with any GLIBC what support TLS and
> > > the worst that will happen is it will load zeros.
> > 
> > That's bad enough -- there are applications of hwcap where you NEED
> > the correct value, not some (possibly empty) subset of the bits. For
> > example if you need to know which registers to save/restore in an aync
> > context-switching setup (rolling your own makecontext/swapcontext) or
> > if you're implementing a function which has a special calling
> > convention with a contract not to clobber any but a small fixed set of
> > registers, but it might callback to arbitrary code in a rare case (ala
> > __tls_get_addr or tls descriptor functions).
> > 
> No! any application that uses HWCAP and or __builtin_cpu_supports, has
> to program for when the feature is not available. The feature bit is
> either true or false.

I don't think you understood what I was saying. False negatives for
__builtin_cpu_supports are not safe because they may wrongly indicate
absence of a register you need to save on behalf of unknown
third-party code. I already gave two examples of situations where this
can arise.

> The dword we are talking about is already allocated and has been since
> the initial implementation of TLS. For the PowerPC ABIs we allocated a
> full 4K for the TCB and use negative displacement calculations that work
> well with our ISA. 

I don't see this in glibc. struct pthread seems to be immediately
below tcbhead_t, and the latter is not 4k. I'm looking at:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD

Rich
  
Steven Munroe July 6, 2015, 9:26 p.m. UTC | #77
On Mon, 2015-07-06 at 11:52 -0400, Rich Felker wrote:
> On Mon, Jul 06, 2015 at 08:26:46AM -0500, Steven Munroe wrote:
> > On Sun, 2015-07-05 at 22:13 -0400, Rich Felker wrote:
> > > On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote:
> > > > > I've suggested to Carlos that this is a problem with the use of the
> > > > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned
> > > > > and tracks when you added the feature, and thus you can depend upon it,
> > > > > but not call it, and that way you get the right versioning. The same
> > > > > problem happened with stack canaries and it's still painfully annoying
> > > > > at the distribution level.
> > > > 
> > > > This is completely unnecessary. The load associated with
> > > > __builtin_cpu_supports() will work with any GLIBC what support TLS and
> > > > the worst that will happen is it will load zeros.
> > > 
> > > That's bad enough -- there are applications of hwcap where you NEED
> > > the correct value, not some (possibly empty) subset of the bits. For
> > > example if you need to know which registers to save/restore in an aync
> > > context-switching setup (rolling your own makecontext/swapcontext) or
> > > if you're implementing a function which has a special calling
> > > convention with a contract not to clobber any but a small fixed set of
> > > registers, but it might callback to arbitrary code in a rare case (ala
> > > __tls_get_addr or tls descriptor functions).
> > > 
> > No! any application that uses HWCAP and or __builtin_cpu_supports, has
> > to program for when the feature is not available. The feature bit is
> > either true or false.
> 
> I don't think you understood what I was saying. False negatives for
> __builtin_cpu_supports are not safe because they may wrongly indicate
> absence of a register you need to save on behalf of unknown
> third-party code. I already gave two examples of situations where this
> can arise.
> 
> > The dword we are talking about is already allocated and has been since
> > the initial implementation of TLS. For the PowerPC ABIs we allocated a
> > full 4K for the TCB and use negative displacement calculations that work
> > well with our ISA. 
> 
> I don't see this in glibc. struct pthread seems to be immediately
> below tcbhead_t, and the latter is not 4k. I'm looking at:
> 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD
> 

The key is the following statement from tls.h:

/* The following assumes that TP (R2 or R13) points to the end of the
   TCB + 0x7000 (per the ABI).  This implies that TCB address is
   TP - 0x7000.  As we define TLS_DTV_AT_TP we can
   assume that the pthread struct is allocated immediately ahead of the
   TCB.  This implies that the pthread_descr address is
   TP - (TLS_PRE_TCB_SIZE + 0x7000).  */

So struct pthread is allocated immediately ahead of the TCB and grows
down (to lower addresses) and the TCB alway ends on the byte before R13
- 0x7000 and grow up (to higher addresses). This is why we always add
new fields to the front of the TCB struct.

This allow the TCB and struct pthread to grow redundantly from either
side of R13-0x7000 and allows the TCB field offsets to remain stable
across releases of the ABI and versions of GLIBC.

The various macros in tls.h handle the details.
  
Rich Felker July 6, 2015, 9:56 p.m. UTC | #78
On Mon, Jul 06, 2015 at 04:26:21PM -0500, Steven Munroe wrote:
> > > The dword we are talking about is already allocated and has been since
> > > the initial implementation of TLS. For the PowerPC ABIs we allocated a
> > > full 4K for the TCB and use negative displacement calculations that work
> > > well with our ISA. 
> > 
> > I don't see this in glibc. struct pthread seems to be immediately
> > below tcbhead_t, and the latter is not 4k. I'm looking at:
> > 
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD
> > 
> 
> The key is the following statement from tls.h:
> 
> /* The following assumes that TP (R2 or R13) points to the end of the
>    TCB + 0x7000 (per the ABI).  This implies that TCB address is
>    TP - 0x7000.  As we define TLS_DTV_AT_TP we can
>    assume that the pthread struct is allocated immediately ahead of the
>    TCB.  This implies that the pthread_descr address is
>    TP - (TLS_PRE_TCB_SIZE + 0x7000).  */
> 
> So struct pthread is allocated immediately ahead of the TCB and grows
> down (to lower addresses) and the TCB alway ends on the byte before R13
> - 0x7000 and grow up (to higher addresses). This is why we always add
> new fields to the front of the TCB struct.
> 
> This allow the TCB and struct pthread to grow redundantly from either
> side of R13-0x7000 and allows the TCB field offsets to remain stable
> across releases of the ABI and versions of GLIBC.
> 
> The various macros in tls.h handle the details.

The layout as I understand it is not compatible with what you
described; there is certainly no way it can allow growth in both
directions, since one direction grows into the local-exec TLS, which
begins at or just above TP-0x7000.

Here is the layout of TLS, from lowest address to highest address:

1. struct pthread  \ These lines 2 and 3 together make up
2. tcbhead_t       / the TLS_PRE_TCB_SIZE in tls.h.
3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h)
4. Local-exec TLS

TP-0x7000 points to the end of 2, or the beginning/end of 3, or the
beginning of 4 (take your pick since they're all the same).

Fields of tcbhead_t can be accessed as ABI since they have a fixed
offset from TP-0x7000, as long as you only add new fields to the
beginning; doing so "pushes struct pthread down", which is harmless.
However, if you access a newly-added field from code assuming it
exists, but you're running with an old glibc version where it did no
exist, you will actually end up accessing the end of struct pthread.

Rich
  
Steven Munroe July 6, 2015, 10:25 p.m. UTC | #79
On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote:
> On Mon, Jul 06, 2015 at 04:26:21PM -0500, Steven Munroe wrote:
> > > > The dword we are talking about is already allocated and has been since
> > > > the initial implementation of TLS. For the PowerPC ABIs we allocated a
> > > > full 4K for the TCB and use negative displacement calculations that work
> > > > well with our ISA. 
> > > 
> > > I don't see this in glibc. struct pthread seems to be immediately
> > > below tcbhead_t, and the latter is not 4k. I'm looking at:
> > > 
> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD
> > > 
> > 
> > The key is the following statement from tls.h:
> > 
> > /* The following assumes that TP (R2 or R13) points to the end of the
> >    TCB + 0x7000 (per the ABI).  This implies that TCB address is
> >    TP - 0x7000.  As we define TLS_DTV_AT_TP we can
> >    assume that the pthread struct is allocated immediately ahead of the
> >    TCB.  This implies that the pthread_descr address is
> >    TP - (TLS_PRE_TCB_SIZE + 0x7000).  */
> > 
> > So struct pthread is allocated immediately ahead of the TCB and grows
> > down (to lower addresses) and the TCB alway ends on the byte before R13
> > - 0x7000 and grow up (to higher addresses). This is why we always add
> > new fields to the front of the TCB struct.
> > 
> > This allow the TCB and struct pthread to grow redundantly from either
> > side of R13-0x7000 and allows the TCB field offsets to remain stable
> > across releases of the ABI and versions of GLIBC.
> > 
> > The various macros in tls.h handle the details.
> 
> The layout as I understand it is not compatible with what you
> described; there is certainly no way it can allow growth in both
> directions, since one direction grows into the local-exec TLS, which
> begins at or just above TP-0x7000.
> 
> Here is the layout of TLS, from lowest address to highest address:
> 
> 1. struct pthread  \ These lines 2 and 3 together make up
> 2. tcbhead_t       / the TLS_PRE_TCB_SIZE in tls.h.
> 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h)
> 4. Local-exec TLS
> 
> TP-0x7000 points to the end of 2, or the beginning/end of 3, or the
> beginning of 4 (take your pick since they're all the same).
> 
> Fields of tcbhead_t can be accessed as ABI since they have a fixed
> offset from TP-0x7000, as long as you only add new fields to the
> beginning; doing so "pushes struct pthread down", which is harmless.
> However, if you access a newly-added field from code assuming it
> exists, but you're running with an old glibc version where it did no
> exist, you will actually end up accessing the end of struct pthread.
> 
No, look again at how the macros are defined. 

As the size tcbhead_t changes the end of the struct tcbhead_t does not
move and as such the previous TCB fields and the struct pthread do not
move.

Alan, tag your it, please explain this to Rick, after your first cup.

its been a long day...
  
Rich Felker July 7, 2015, 1:58 a.m. UTC | #80
On Mon, Jul 06, 2015 at 05:25:27PM -0500, Steven Munroe wrote:
> On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote:
> > On Mon, Jul 06, 2015 at 04:26:21PM -0500, Steven Munroe wrote:
> > > > > The dword we are talking about is already allocated and has been since
> > > > > the initial implementation of TLS. For the PowerPC ABIs we allocated a
> > > > > full 4K for the TCB and use negative displacement calculations that work
> > > > > well with our ISA. 
> > > > 
> > > > I don't see this in glibc. struct pthread seems to be immediately
> > > > below tcbhead_t, and the latter is not 4k. I'm looking at:
> > > > 
> > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD
> > > > 
> > > 
> > > The key is the following statement from tls.h:
> > > 
> > > /* The following assumes that TP (R2 or R13) points to the end of the
> > >    TCB + 0x7000 (per the ABI).  This implies that TCB address is
> > >    TP - 0x7000.  As we define TLS_DTV_AT_TP we can
> > >    assume that the pthread struct is allocated immediately ahead of the
> > >    TCB.  This implies that the pthread_descr address is
> > >    TP - (TLS_PRE_TCB_SIZE + 0x7000).  */
> > > 
> > > So struct pthread is allocated immediately ahead of the TCB and grows
> > > down (to lower addresses) and the TCB alway ends on the byte before R13
> > > - 0x7000 and grow up (to higher addresses). This is why we always add
> > > new fields to the front of the TCB struct.
> > > 
> > > This allow the TCB and struct pthread to grow redundantly from either
> > > side of R13-0x7000 and allows the TCB field offsets to remain stable
> > > across releases of the ABI and versions of GLIBC.
> > > 
> > > The various macros in tls.h handle the details.
> > 
> > The layout as I understand it is not compatible with what you
> > described; there is certainly no way it can allow growth in both
> > directions, since one direction grows into the local-exec TLS, which
> > begins at or just above TP-0x7000.
> > 
> > Here is the layout of TLS, from lowest address to highest address:
> > 
> > 1. struct pthread  \ These lines 2 and 3 together make up
> > 2. tcbhead_t       / the TLS_PRE_TCB_SIZE in tls.h.
> > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h)
> > 4. Local-exec TLS
> > 
> > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the
> > beginning of 4 (take your pick since they're all the same).
> > 
> > Fields of tcbhead_t can be accessed as ABI since they have a fixed
> > offset from TP-0x7000, as long as you only add new fields to the
> > beginning; doing so "pushes struct pthread down", which is harmless.
> > However, if you access a newly-added field from code assuming it
> > exists, but you're running with an old glibc version where it did no
> > exist, you will actually end up accessing the end of struct pthread.
> > 
> No, look again at how the macros are defined. 
> 
> As the size tcbhead_t changes the end of the struct tcbhead_t does not
> move and as such the previous TCB fields and the struct pthread do not
> move.
> 
> Alan, tag your it, please explain this to Rick, after your first cup.
> 
> its been a long day...

I'll wait for Alan to respond since I feel like our conversation is
getting nowhere and the concerns I'm trying to address (which I
believe were raised originally by Carlos, not me) are not getting
across to you clearly. Regardless of whose fault that is, maybe having
a third party look at this can help resolve it.

Rich
  
Alan Modra July 7, 2015, 2:36 a.m. UTC | #81
On Mon, Jul 06, 2015 at 05:25:27PM -0500, Steven Munroe wrote:
> On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote:
> > The layout as I understand it is not compatible with what you
> > described; there is certainly no way it can allow growth in both
> > directions, since one direction grows into the local-exec TLS, which
> > begins at or just above TP-0x7000.
> > 
> > Here is the layout of TLS, from lowest address to highest address:
> > 
> > 1. struct pthread  \ These lines 2 and 3 together make up
> > 2. tcbhead_t       / the TLS_PRE_TCB_SIZE in tls.h.
> > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h)
> > 4. Local-exec TLS
> > 
> > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the
> > beginning of 4 (take your pick since they're all the same).
> > 
> > Fields of tcbhead_t can be accessed as ABI since they have a fixed
> > offset from TP-0x7000, as long as you only add new fields to the
> > beginning; doing so "pushes struct pthread down", which is harmless.

Correct.  If you look into the fine details, the size allocated for
tcbhead_t is rounded up, so there might be some padding between struct
pthread and tcbhead_t.

> No, look again at how the macros are defined. 
>
> As the size tcbhead_t changes the end of the struct tcbhead_t does not
> move and as such the previous TCB fields and the struct pthread do not
> move.
> 
> Alan, tag your it, please explain this to Rick, after your first cup.

I think Rich is 100% correct in the part of his email that I quote
above, modulo omitting the detail on padding.

> > However, if you access a newly-added field from code assuming it
> > exists, but you're running with an old glibc version where it did no
> > exist, you will actually end up accessing the end of struct pthread.

And this concern is true too.  A newly minted program with accesses to
hwcap in tcbhead_t, ie. reads from a uint64_t at tp-0x7068, if run
with an older glibc will instead access struct pthread.  You'll
probably get a wrong hwcap value.  ;)  Fixable by ensuring any newly
built executable using hwcap in tcb has a reference to a versioned
symbol only available with newer glibc.  All quite standard with new
glibc features..  So, no real problem here.
  
Steven Munroe July 7, 2015, 3:01 a.m. UTC | #82
On Tue, 2015-07-07 at 12:06 +0930, Alan Modra wrote:
> On Mon, Jul 06, 2015 at 05:25:27PM -0500, Steven Munroe wrote:
> > On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote:
> > > The layout as I understand it is not compatible with what you
> > > described; there is certainly no way it can allow growth in both
> > > directions, since one direction grows into the local-exec TLS, which
> > > begins at or just above TP-0x7000.
> > > 
> > > Here is the layout of TLS, from lowest address to highest address:
> > > 
> > > 1. struct pthread  \ These lines 2 and 3 together make up
> > > 2. tcbhead_t       / the TLS_PRE_TCB_SIZE in tls.h.
> > > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h)
> > > 4. Local-exec TLS
> > > 
> > > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the
> > > beginning of 4 (take your pick since they're all the same).
> > > 
> > > Fields of tcbhead_t can be accessed as ABI since they have a fixed
> > > offset from TP-0x7000, as long as you only add new fields to the
> > > beginning; doing so "pushes struct pthread down", which is harmless.
> 
> Correct.  If you look into the fine details, the size allocated for
> tcbhead_t is rounded up, so there might be some padding between struct
> pthread and tcbhead_t.
> 
> > No, look again at how the macros are defined. 
> >
> > As the size tcbhead_t changes the end of the struct tcbhead_t does not
> > move and as such the previous TCB fields and the struct pthread do not
> > move.
> > 
> > Alan, tag your it, please explain this to Rick, after your first cup.
> 
> I think Rich is 100% correct in the part of his email that I quote
> above, modulo omitting the detail on padding.
> 
> > > However, if you access a newly-added field from code assuming it
> > > exists, but you're running with an old glibc version where it did no
> > > exist, you will actually end up accessing the end of struct pthread.
> 
> And this concern is true too.  A newly minted program with accesses to
> hwcap in tcbhead_t, ie. reads from a uint64_t at tp-0x7068, if run
> with an older glibc will instead access struct pthread.  You'll
> probably get a wrong hwcap value.  ;)  Fixable by ensuring any newly
> built executable using hwcap in tcb has a reference to a versioned
> symbol only available with newer glibc.  All quite standard with new
> glibc features..  So, no real problem here.
> 
Sorry Rich, Thanks Alan

I really did remember 0x7000 is physical offset and I fixated at on
that.

The final allocation rounds to quadword and the R13 is set to TCB
+0x7000. The offset is logical but no real.

And I am literally exhausted by all this. Which did not help.

We can add the symbol reference to detect old GLIBC and but I believe
that existing GLIBC versioning would catch this anyway.
  
Alan Modra July 7, 2015, 4:02 a.m. UTC | #83
On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
> hwcap.h:
> int __hwcap __attribute__ ((visibility ("hidden"))) ;
> 
> hwcap.c:
> 
> #include <hwcap.h>
> 
> // gcc needs to make this first constructor.
> extern int __global_hwcap;
> void __attribute__ ((constructor)) 
> set_hwcap () 
> {
>   __hwcap = __global_hwcap;
> }

We considered using this scheme.  In fact, I put forward the idea.
However, it was discarded as second best, due to the need to set up
GOT/TOC addressing on a variable access.  Nothing beats Steve's single
instruction "ld r,-0x7068(r13)" to read hwcap.
  
Steven Munroe July 7, 2015, 3:35 p.m. UTC | #84
On Fri, 2015-07-03 at 19:11 +0200, Ondřej Bílka wrote:
> On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote:
> > On 07/03/2015 04:55 AM, Ondřej Bílka wrote:
> > >> At the end of the day it's up to IBM to make the best use of the
> > >> tp+offset data stored in the TCB, but every byte you save is another
> > >> byte you can use later for something else.
> > >>
> > > Carlos a problem with this patch is that they ignored community
> > > feedback. Early in this thread Florian come with better idea to use
> > > GOT+offset that could be accessed as 
> > > &hwcap_hack and avoids per-thread runtime overhead.
> > 
> > Steven and Carlos have not ignored the community feedback, they just
> > have a different set of priorities and requirements. There is little
> > to discuss if your priorities and requirements are different.
> > 
> > The use of tp+offset data is indeed a scarce resource that should be
> > used only when absolutely necessary or when the use case dictates.
> > 
> > It is my opinion as a developer, that Carlos' patch is flawed because
> > it uses a finite resource, namely tp+offset data, for what I perceive
> > to be a flawed design pattern that as a free software developer I don't
> > want to encourage. These are not entirely technical arguments though,
> > they are subjective and based on my desire to educate and mentor developers
> > who write such code. I don't present these arguments as sustained
> > opposition to the patch because they are not technical and Carlos
> > has a need to accelerate this use case today.
> > 
Value judgments about what is precious can vary.

On Power CPU, cycles and hazard avoidance is more precious then a double
word or two. On a machine with 64KB pages, 128-byte cache-lines, and
supported memory configs up to 32TB, this is a good trade-off.

I am not trying to impose this on any one else.

> > I have only a few substantive technical issues with the patch. Given
> > that the ABI allocates a large block of tp+offset data, I think it is
> > OK for IBM to use the data in this way. For example I think it is much
> > much more serious that such a built application will likely just crash
> > when run with an older glibc. This is a distribution maintenance issue
> > that I can't ignore and I'd like to see it solved by a dependency on a
> > versioned dummy symbol.
> > 

We agree to add the symbol check and fail the app it is loading an old
GLIBC.

> > Lastly, the symbol address hack is an incomplete solution because Florian
> > has not provided an implementation. Depending on the implementation it
> > may require a new relocation, and that is potentially more costly to the
> > program startup than the present process for filling in HWCAP/HWCAP2.
> 
> Thats valid concern. My idea was checking if hwcap_hack relocation exist. 
> I didn't realize that it scales with number of libraries.
> 
> One of reasons why I didn't like this proposal is that it harms linux
> ecosystem as  it increases startup cost of a bit everything while its 
> unlikely that cross-platform projects will use this.
> 
> But these could be done without much of our help. We need to keep these
> writable to support this hack. I don't know exact assembly for powerpc,
> it should be similar to how do it on x64:
> 
> int x;
> 
> int foo()
> {
> #ifdef SHARED
> asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
> #else
> asm ("lea x(%rip), %rax; movb $32, (%rax)");
> #endif
> return &x;
> }
> 

Not so simple on PowerISA as we don't have PC-relative addressing.

1) The global entry requires 2 instruction to establish the TOC/GOT
2) Medium model requires two instructions (fused) to load a pointer from
the GOT.
3) Finally we can load the cached hwcap.

None of this is required for the TP+offset.

Telling me how x86 does things is not much help.
> 
> > Without a concrete implementation I can't comment on one or the other.
> > It is in my opinion overly harsh to force IBM to go implement this new
> > feature. They have space in the TCB per the ABI and may use it for their
> > needs. I think the community should investigate symbol address munging
> > as a method for storing data in addresses and make a generic API from it,
> > likewise I think the community should investigate standardizing tp+offset
> > data access behind a set of accessor macros and normalizing the usage
> > across the 5 or 6 architectures that use it.
> >
> I would like this as with access to that I could improve performance of
> several inlines.
> 
> 
> > > Also I now have additional comment with api as if you want faster checks
> > > wouldn't be faster to save each bit of hwcap into byte field so you
> > > could avoid using mask at each check?
> > 
> > That is an *excellent* suggestion, and exactly the type of technical
> > feedback that we should be giving IBM, and Carlos can confirm if they've
> > tried such "unpacking" of the bits into byte fields. Such unpacking is
> > common in other machine implementations.
> >
This does not help on Power, Any (byte, halfword, word, doubleword,
quadword) aligned load is the same performance. Splitting our bits to
bytes just slow things down. Consider:

if (__builtin_cpu_supports(ARCH_2_07) &&   
    __builtin_cpu_supports(VEC_CRYPTO))

This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
byte Boolean. 

Again value judgements about that is fast or slow can vary by platform.
  
Steven Munroe July 7, 2015, 3:47 p.m. UTC | #85
On Wed, 2015-07-01 at 13:55 +0200, Ondřej Bílka wrote:
> On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
> 
> > >> Again this is something, as Steve has pointed out, you only assume without
> > >> knowing the subject in depth: it is operating on *vector* registers and
> > >> thus it will be more costly to move to and back GRP than just do in
> > >> VSX registers.  And as Steven has pointed out, the idea is to *validate*
> > >> on POWER7.
> > > 
> > > If that is really case then using hwcap for that makes absolutely no sense.
> > > Just surround these builtins by #ifdef TESTING and you will compile
> > > power7 binary. When you are releasing production version you will
> > > optimize that for power8. A difference from just using correct -mcpu
> > > could dominate speedups that you try to get with these builtins. Slowing
> > > down production application for validation support makes no sense.
> > 
> > That is a valid point, but as Steve has pointed out the idea is exactly
> > to avoid multiple builds.
> >
> And thats exactly problem that you just ignore solution. Seriously when
> having single build is more important than -mcpu that will give you 1%
> performance boost do you think that a 1% boost from hwcap selection
> matters? I could come with easy suggestions like changing makefile to
> create app_power7 and app_power8 in single build. And a app_power7 could
> check if it supports power8 instruction and exec app_power8. I really
> doubt why you insist on single build when a best practice is separate
> testing and production.
> 
> Insisting that you need single binary would mean that you should stick
> with power7 optimization and don't bother with hwcap instruction
> 
> 
> > > 
> > > 
> > > Also you didn't answered my question, it works in both ways. 
> > > From that example his uses vector register doesn't follow that 
> > > application should use vector registers. If user does
> > > something like in my example, the cost of gpr -> vector conversion will
> > > harm performance and he should keep these in gpr. 
> > 
> > And again you make assumptions that you do not know: what if the program
> > is made with vectors in mind and they want to process it as uint128_t if
> > it is the case?  You do know that neither the program constraints so
> > assuming that it would be better to use GPR may not hold true.
> > 
> I didn't make that assumption. 
> I just said that your assumption that one must use vector
> registers is wrong again. From my previous mail:
> 
> 
> > Customer just wants to do 128 additions. If a fastest way
> > is with GPR registers then he should use gpr registers.
> >
> > My claim was that this leads to slow code on power7. Fallback above
> > takes 14 cycles on power8 and 128bit addition is similarly slow.
> >
> > Yes you could craft expressions that exploit vectors by doing ands/ors
> > with 128bit constants but if you mostly need to sum integers and use 128
> > bits to prevent overflows then gpr is correct choice due to transfer
> > cost.
> 
> Yes it isn't known but its more likely that programmers just used that
> as counter instead of vector magic. So we need to see use case in more
> detail.
> 
> 
> >> >>> I am telling all time that there are better alternatives where this
> > >>> doesn't matter.
> > >>>
> > >>> One example would be write gcc pass that runs after early inlining to
> > >>> find all functions containing __builtin_cpu_supports, cloning them to
> > >>> replace it by constant and adding ifunc to automatically select variant.
> > >>
> > >> Using internal PLT calls to such mechanism is really not the way to handle
> > >> performance for powerpc.  
> > >>
> > > No you are wrong again. I wrote to introduce ifunc after inlining. You
> > > do inlining to eliminate call overhead. So after inlining effect of
> > > adding plt call is minimal, otherwise gcc should inline that to improve
> > > performance in first place.
> > 
> > It is the case if you have the function definition, which might not be
> > true.  But this is not the case since the code could be in a shared
> > library.
> > 
> Seriously? If its function from shared library then it should use ifunc
> and not force every caller to keep hwcap selection in sync with library,
> and you need plt indirection anyway.
> 
if you believe so strongly that ifunc it the best solution then I
suggest you look at the 1000s of packages in a Linux distro and see how
many of them use IFUNC or any of the other suggested techniques.

My survey shows very few.

So your issue is not with me but with the world at large. 

If you want this to be a serious option then you need to convince all of
them.
  
Mike Frysinger July 8, 2015, 6 a.m. UTC | #86
On 08 Jun 2015 18:03, Carlos Eduardo Seo wrote:
> +/* Get the hwcap/hwcap2 information from the TCB. Offsets taken
> +   from tcb-offsets.h.  */
> +static inline uint32_t
> +__ppc_get_hwcap (void)
> +{
> +
> +  uint32_t __tcb_hwcap;
> +
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("lwz %0,-28772(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("lwz %0,-28724(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#endif
> +
> +  return __tcb_hwcap;
> +}

i'm confused ... why can't the offsets you've already calculated via the 
offsets header be used instead of duplicating them all over this file ?
-mike
  
Carlos O'Donell July 8, 2015, 7:51 a.m. UTC | #87
On 07/05/2015 09:16 PM, Steven Munroe wrote:
> On Fri, 2015-07-03 at 13:12 -0400, Carlos O'Donell wrote:
>> On 07/01/2015 03:12 PM, Steven Munroe wrote:
>>> If you think about the requirements for a while it becomes clear. As the
>>> HWCAP cache would have to be defined and initialized in either libgcc or
>>> libc, accept will be none local from any user library. So all the local
>>> TLC access optimization's are disallowed. Add the requirement to support
>>> dl_open() libraries leaves the general dynamic TLS model as the ONLY
>>> safe option.
>>
>> That's not true anymore? Alan Modra added pseudo-TLS descriptors to POWER
>> just recently[1], which means __tls_get_addr call is elided and the offset
>> returned immediately via a linker stub for use with tp+offset. However,
>> I agree that even Alan's great work here is still going to be several
>> more instructions than a raw tp+offset access. However, it would be
>> interesting to discuss with Alan if his changes are sufficiently good
>> that the out-of-order execution hides the latency of this additional
>> instructions and his methods are a sufficient win that you *can* use
>> TLS variables?
>>
> I did discuss this with Alan and he agree that with the given
> requirements the the standard TLS mechanism is always slower them my
> original TCB proposal.

Sounds good, thank you for clarifying that.

> Why would you think I had not talked to Alan?

As a reviewer I can't assume anything you don't tell me.

Let me use a Mark Mitchell anecdote: You walk into class on the first
day of class. The teacher says "What's your job?" You say "To learn!"
The teacher says "No. It's to make it easy for the grader to give you
an A."

You make it easy for the reviewer to accept your patch when the
submission answers all of the questions the reviewer would ask.

>>> Now there were a lot of suggestions to just force the HWCAP TLS
>>> variables into initial exec or local exec TLS model with an attribute.
>>> This would resolve to direct TLS offset in some special reserved TLS
>>> space?
>>
>> It does. Since libc.so is always seen by the linker it can always allocate
>> static TLS space for that library when it computes the maximum size of
>> static TLS space.
>>
>>> How does that work with a library loaded with dl_open()? How does that
>>> work with a library linked with one toolchain / GLIBC on Distro X and
>>> run on a system with a different toolchain and GLIBC on Distro Y? With
>>> different versions of GLIBC? Will HWCAP get the same TLS offset? Do we
>>> end up with .text relocations that we are also trying to avoid?
>>
>> (1) Interaction with dlopen?
>>
>> The two variables in question are always in libc.so.6, and therefore are
>> always loaded first by DT_NEEDED, and there is always static storage
>> reserved for that library.
>>
>> There are 2 scenarios which are problematic.
>>
>> (a) A static application accessing NSS / ICONV / IDN must dynamically
>>     load libc.so.6, and there must be enough reserve static TLS space
>>     for the allocated IE TLS variables or the dynamic loader will abort
>>     the load indicating that there is not enough space to load any more
>>     static TLS using DSOs. This is solved today by providing surplus
>>     static TLS storage space.
>>
>> (b) Use of dlmopen to load multiple libc.so.6's. In this case you could
>>     load libc.so.6 into alternate namespaces and eventually run out of
>>     surplus static TLS. We have never seen this in common practice because
>>     there are very few users of dlmopen, and to be honest the interface
>>     is poorly documented and fraught with problems.
>>
>> Therefore in the average scenario it will work to use static TLS, or
>> IE TLS variables in glibc in the average case. I consider the above
>> cases to be outside the normal realm of user applications.
>>
>> e.g.
>> extern __thread int foo __attribute__((tls_model("initial-exec")));
>>
>> (2) Distro to distro compatibility?
>>
>> With my Red Hat on:
>>
>> Let me start by saying you have absolutely no guarantee here at all
>> provided by any distribution. As the Fedora and RHEL glibc maintainer
>> your vendor is far outside the scope of support and such a scenario is
>> never possible. You can wish it, but it's not true unless you remain
>> very very low level and very very simple interfaces. That is to say
>> that you have no guarantee that a library linked by a vendor with one
>> toolchain in distro X will work in distro Y. If you need to do that
>> then build in a container, chroot or VM with distro Y tools. No vendor
>> I've ever talked to expects or even supports such a scenario.
>>
>> With my hacker hat on:
>>
>> Generally for simple features it just works as long as both distros
>> have the same version of glibc. However, we're talking only about
>> the glibc parts of the problem. Compatibility with other libraries
>> is another issue.
>>
> No! the version of GLIBC does not matter as long as the GLIBC supports
> TLS (GLIBC-2.5?)

You are correct, the runtime glibc version does not strictly matter,
but I think it *might* matter if you use an old glibc (see discussion
about crashes).

>> (3) Different versions of glibc?
>>
>> Sure it works, as long as all the versions have the same feature and
>> are newer than the version in which you introduced the change. That's
>> what backwards compatibility is for.
>>
>> (4) Will HWCAP get the same TLS offset? 
>>
>> That's up to the static linker. You don't care anymore though, the gcc
>> builtin will reference the IE TLS variables like it would normally as
>> part of the shared implementation, and that variable is resolved to glibc
>> and normal library versioning hanppens. The program will now require that
>> glibc or newer and you'll get proper error messages about that.
>>
>> (5) Do we end up with .text relocations that we are also trying to avoid?
>>
>> You should not. The offset is known at link time and inserted by the
>> static linker.
>>
> To avoid the text relocation I believe there is an extra GOT load of the
> offset. If this is not true then Alan owes me an update to the ABI
> document to explain how this would work. As the current Draft ELF2 ABI
> update does not say this is supported.

Sorry, you are correct, for ppc64 there is a R_PPC64_TPREL64 on the GOT
and an indirect load. So this doesn't work for you either because of the
indirect performance penalty.

>>> Again the TCB avoids all of this as it provides a fixed offset defined
>>> by the ABI and does not require any up-calls or indirection. And also
>>> will work in any library without induced hazards. This clearly works
>>> across distros including previous version of GLIBC as the words where
>>> previously reserved by the ABI. Application libraries that need to run
>>> on older distros can add a __built_cpu_init() to their library init or
>>> if threaded to their thread create function.
>>
>> You get a crash since previous glibc's don't fill in the data?
>> And that crash gives you only some information to debug the problem,
>> namely that you ran code for a processors you didn't support.
>>
> There is NO crash. There never was a crash. There is no additional
> security exposure. The only TCB fields that might be a security exposure
> where already there, in every other platform.

Sorry, I don't follow you here, could you expand what you mean by
"already there?" Do you mean to say that "The ABI has always specified
this space as reserved?"

> The worst there can be is is fallback the to base implementation (the
> bit is 0 when is should be 1).

The threading support uses a stack cache that reuses allocated stacks
from other threads, and depending on the requirements of the thread to
have guards or other parameters that consume stack space I don't know
that you can guarantee the reserved space stays at zero for the lifetime
of the program without initializing it every time the thread is started.
A reused stack for a newly started thread might therefore have non-zero
data in the reserved spot and cause the code for an invalid CPU to be
selected. This can't be fixed without the per-thread initialization
code in glibc?

Someone should look at this case minimally, or alternatively version
the interface and only use this support with newer glibc's that carry
out the initialization.

> As explained the dword is already there and initialized to 0 when the
> page is allocate. So the load will work NOW for any GLIBC since TLS was
> implemented.
> 
> As implemented by Alan and I.
 
I don't think this is true per my comments above regarding stack reuse.
 
>> It is true that you could use LD_PRELOAD to run __builtin_cpu_init()
>> on older systems, but you need to *know* that, and use that. What
>> provides this function? libgcc?
>>
> We will provide a little init routine applications can use. This is not
> hard.

I assume they have to use it in every thread before they can call any
of the builtins?

>> It is certainly a benefit to using the TCB, that this kind of use case
>> is supported. However, in doing so you adversely impact the distribution
>> maintainers for the benefit of?
>>
> I can not think of any adverse impacts on any of the other platform
> maintainers, on any the distros.

As described above I think you can get crashes because of stack cache
reuse leaving some of these reserved words potentially non-zero.
I also think a cancelled thread (which might be in an undefined state
and have written into the TCB) can have it's stack reused also.

Cheers,
Carlos.
  
Carlos O'Donell July 8, 2015, 8 a.m. UTC | #88
On 07/07/2015 12:02 AM, Alan Modra wrote:
> On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
>> hwcap.h:
>> int __hwcap __attribute__ ((visibility ("hidden"))) ;
>>
>> hwcap.c:
>>
>> #include <hwcap.h>
>>
>> // gcc needs to make this first constructor.
>> extern int __global_hwcap;
>> void __attribute__ ((constructor)) 
>> set_hwcap () 
>> {
>>   __hwcap = __global_hwcap;
>> }
> 
> We considered using this scheme.  In fact, I put forward the idea.
> However, it was discarded as second best, due to the need to set up
> GOT/TOC addressing on a variable access.  Nothing beats Steve's single
> instruction "ld r,-0x7068(r13)" to read hwcap.

Agreed, and you should be allowed to use it given that you have ABI
space allocated for it. It still makes me sad to see the kind
of code it enables though.

My other comments to Steven still stand though, with stack reuse
via the internal  cache I think you have no guarantees the words you
want will be zero. Therefore you need to version this interface for
it to be generally safe and use it only if you know the hwcap in the
TCB was initialized by glibc. I do not think it is a good trade to
have "hard to debug crashes" along with "support for all versions of
glibc with TLS." I would rather see "never crashes" and "works with
glibc 2.22 and newer."

Cheers,
Carlos.
  
Carlos O'Donell July 8, 2015, 8:03 a.m. UTC | #89
On 07/07/2015 11:35 AM, Steven Munroe wrote:
> We agree to add the symbol check and fail the app it is loading an old
> GLIBC.

In which case I think the next step is a v2 patch with the symbol check.

That would be good with me and acceptable to checkin IMO.

You have reserved ABI space to use it as you see fit.

Cheers,
Carlos.
  
Carlos O'Donell July 8, 2015, 8:10 a.m. UTC | #90
On 07/06/2015 11:01 PM, Steven Munroe wrote:
> We can add the symbol reference to detect old GLIBC and but I believe
> that existing GLIBC versioning would catch this anyway.
 
There is no implicit guarantee. It happens some times that you reference
another symbol that is new enough that it works and your library is
then subsequently dependent on the newer glibc, but there is no guarantee.
To add a guarantee you have to weave into your macros a reference to
a new dummy symbol with the right version.

c.
  
Carlos O'Donell July 8, 2015, 8:15 a.m. UTC | #91
On 07/06/2015 09:58 PM, Rich Felker wrote:
> I'll wait for Alan to respond since I feel like our conversation is
> getting nowhere and the concerns I'm trying to address (which I
> believe were raised originally by Carlos, not me) are not getting
> across to you clearly. Regardless of whose fault that is, maybe having
> a third party look at this can help resolve it.

Correct, I raised it originally when it came to light the requirement
was to support old versions of glibc.

My initial worry was around reused stacks and TCB getting garbage
from those stacks. I had not yet considered that the reserved
ABI space was not reserved in the layout macros for TLS.

Seeing Alan's response clarifies that though, the space is reserved
in the ABI document only, but in glibc we allow struct pthread to
move up into that reserved space to save on allocated pages.

c.
  
Steven Munroe July 8, 2015, 2:21 p.m. UTC | #92
On Wed, 2015-07-08 at 04:00 -0400, Carlos O'Donell wrote:
> On 07/07/2015 12:02 AM, Alan Modra wrote:
> > On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
> >> hwcap.h:
> >> int __hwcap __attribute__ ((visibility ("hidden"))) ;
> >>
> >> hwcap.c:
> >>
> >> #include <hwcap.h>
> >>
> >> // gcc needs to make this first constructor.
> >> extern int __global_hwcap;
> >> void __attribute__ ((constructor)) 
> >> set_hwcap () 
> >> {
> >>   __hwcap = __global_hwcap;
> >> }
> > 
> > We considered using this scheme.  In fact, I put forward the idea.
> > However, it was discarded as second best, due to the need to set up
> > GOT/TOC addressing on a variable access.  Nothing beats Steve's single
> > instruction "ld r,-0x7068(r13)" to read hwcap.
> 
> Agreed, and you should be allowed to use it given that you have ABI
> space allocated for it. It still makes me sad to see the kind
> of code it enables though.
> 
> My other comments to Steven still stand though, with stack reuse
> via the internal  cache I think you have no guarantees the words you
> want will be zero. Therefore you need to version this interface for
> it to be generally safe and use it only if you know the hwcap in the
> TCB was initialized by glibc. I do not think it is a good trade to
> have "hard to debug crashes" along with "support for all versions of
> glibc with TLS." I would rather see "never crashes" and "works with
> glibc 2.22 and newer."
> 
Agreed, I have asked Carlos Seo to update and resubmit the patch for
review.
  
Mike Frysinger July 8, 2015, 5:42 p.m. UTC | #93
On 08 Jul 2015 10:55, Carlos Eduardo Seo wrote:
> tcb-offsets.h is generated from tcb-offsets.sym during the glibc build and isn’t installed. That’s why the offsets are duplicated in ppc.h, which is a public header.

then perhaps tcb-offsets.h or something like it should be installed alongside 
the ppc.h header ?
-mike
  
Steve Munroe July 8, 2015, 6:33 p.m. UTC | #94
Mike Frysinger <vapier@gentoo.org> wrote on 07/08/2015 12:42:25 PM:

> From: Mike Frysinger <vapier@gentoo.org>

> To: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>

> Cc: GLIBC Devel <libc-alpha@sourceware.org>, Steve

Munroe/Rochester/IBM@IBMUS
> Date: 07/08/2015 12:42 PM

> Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB

>

> On 08 Jul 2015 10:55, Carlos Eduardo Seo wrote:

> > tcb-offsets.h is generated from tcb-offsets.sym during the glibc

> build and isn’t installed. That’s why the offsets are duplicated in

> ppc.h, which is a public header.

>

> then perhaps tcb-offsets.h or something like it should be installed

alongside
> the ppc.h header ?


I fear that what you propose would just ignite another endless debate about
the wisdom of exposing the TCB and struct pthreads to users.

The current ./sysdeps/powerpc/nptl/tcb-offsets.sym includes offsets for
header.multiple_threads, header.private_futex, and pointer_guard which I
suspect the community feels (and I agree) are private to GLIBC
implementation.

So for now I would like to just provide nice #defines for the two fields
evolved and then once the community considers and agrees to a general
policy we can work on a more general solution.

I would like to catch the 2.22 train before it leaves.

Ok?


Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center
  
Carlos O'Donell July 8, 2015, 7:11 p.m. UTC | #95
On 07/08/2015 01:47 PM, Carlos Eduardo Seo wrote:
> Hm, not sure if this is the best approach. This particular header was
> intended to be internal to glibc.
> 
> Maybe O’Donell or Adhemerval may want to chime in on this?

The *-offsets.h headers are special and for internal use only
and are auto-generated from the *.sym files.

Nothing prevents one from deploying any header you want as
part of the internal implementation details, however in this
case I think the expedient thing to do is leave ppc.h with
duplicate definitions of these constants for now. They can't
change anyway because they are ABI.

Cheers,
Carlos
  
Ondrej Bilka July 9, 2015, 9:25 a.m. UTC | #96
On Tue, Jul 07, 2015 at 01:32:17PM +0930, Alan Modra wrote:
> On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
> > hwcap.h:
> > int __hwcap __attribute__ ((visibility ("hidden"))) ;
> > 
> > hwcap.c:
> > 
> > #include <hwcap.h>
> > 
> > // gcc needs to make this first constructor.
> > extern int __global_hwcap;
> > void __attribute__ ((constructor)) 
> > set_hwcap () 
> > {
> >   __hwcap = __global_hwcap;
> > }
> 
> We considered using this scheme.  In fact, I put forward the idea.
> However, it was discarded as second best, due to the need to set up
> GOT/TOC addressing on a variable access.  Nothing beats Steve's single
> instruction "ld r,-0x7068(r13)" to read hwcap.
> 
So you have bigger problem that you need TOC addressing for static
variable access.
That could be avoided with better ABI. If you allocate text segment
before TOC then you could use single instructions to read each static
variable.
  
Ondrej Bilka July 9, 2015, 10:34 a.m. UTC | #97
On Tue, Jul 07, 2015 at 10:47:36AM -0500, Steven Munroe wrote:
> On Wed, 2015-07-01 at 13:55 +0200, Ondřej Bílka wrote:
> > On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
 > >> >>> I am telling all time that there are better alternatives where this
> > > >>> doesn't matter.
> > > >>>
> > > >>> One example would be write gcc pass that runs after early inlining to
> > > >>> find all functions containing __builtin_cpu_supports, cloning them to
> > > >>> replace it by constant and adding ifunc to automatically select variant.
> > > >>
> > > >> Using internal PLT calls to such mechanism is really not the way to handle
> > > >> performance for powerpc.  
> > > >>
> > > > No you are wrong again. I wrote to introduce ifunc after inlining. You
> > > > do inlining to eliminate call overhead. So after inlining effect of
> > > > adding plt call is minimal, otherwise gcc should inline that to improve
> > > > performance in first place.
> > > 
> > > It is the case if you have the function definition, which might not be
> > > true.  But this is not the case since the code could be in a shared
> > > library.
> > > 
> > Seriously? If its function from shared library then it should use ifunc
> > and not force every caller to keep hwcap selection in sync with library,
> > and you need plt indirection anyway.
> > 
> if you believe so strongly that ifunc it the best solution then I
> suggest you look at the 1000s of packages in a Linux distro and see how
> many of them use IFUNC or any of the other suggested techniques.
>
> My survey shows very few.

Thats trivial take gentoo, where you could compile with -mcpu 
But I am glad that you did survey.

You could finally answer a questions that I asked in first place.
1) Are among these packages some that use hwcap?
2) Do some use hwcap more than once in early initialization?
3) Did you do profiling to show that a hwcap optimization has some
performance impact?

You still didn't answer an objection that this harms packages that don't
use hwcap and we asked for examples to show that this proposal will help
in some cases. So far you didn't provided any justified example.
 
> 
> So your issue is not with me but with the world at large. 
> 
> If you want this to be a serious option then you need to convince all of
> them.
Could you stop making strawman arguments? I never said that but from
start of mail:

> > > >>> One example would be write gcc pass that runs after early inlining to
> > > >>> find all functions containing __builtin_cpu_supports, cloning them to
> > > >>> replace it by constant and adding ifunc to automatically select variant.

Here you need to only convince gcc developers to use that. I also said
that your idea of application developers using that is a mistake and
they shouldn't touch that. Instead a distribution managers would package
these by adding appropriate gcc flags.
  
Ondrej Bilka July 9, 2015, 7:02 p.m. UTC | #98
On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
 > But these could be done without much of our help. We need to keep these
> > writable to support this hack. I don't know exact assembly for powerpc,
> > it should be similar to how do it on x64:
> > 
> > int x;
> > 
> > int foo()
> > {
> > #ifdef SHARED
> > asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
> > #else
> > asm ("lea x(%rip), %rax; movb $32, (%rax)");
> > #endif
> > return &x;
> > }
> > 
> 
> Not so simple on PowerISA as we don't have PC-relative addressing.
> 
> 1) The global entry requires 2 instruction to establish the TOC/GOT
> 2) Medium model requires two instructions (fused) to load a pointer from
> the GOT.
> 3) Finally we can load the cached hwcap.
> 
> None of this is required for the TP+offset.
>
And why you didn't wrote that when it was first suggested? When you don't answer 
it looks like you don't want to answer because that suggestion is better.

Here problem isn't lack of relative addressing but that you don't start
with GOT in register. 

You certainly could do similar hack as you do with tcb and place hwcap
bits just after that so you could do just one load.

That you require so many instructions on powerpc is gcc bug, rather than
rule. You don't need that many instructions when you place frequent
symbols in -32768..32767 range. For example here you could save one
addition.

int x, y;
int foo()
{
  return x + y;
}

original

00000000000007d0 <foo>:
 7d0:	02 00 4c 3c 	addis   r2,r12,2
 7d4:	30 78 42 38 	addi    r2,r2,30768
 7d8:	00 00 00 60 	nop
 7dc:	30 80 42 e9 	ld      r10,-32720(r2)
 7e0:	00 00 00 60 	nop
 7e4:	38 80 22 e9 	ld      r9,-32712(r2)
 7e8:	00 00 6a 80 	lwz     r3,0(r10)
 7ec:	00 00 29 81 	lwz     r9,0(r9)
 7f0:	14 4a 63 7c 	add     r3,r3,r9
 7f4:	b4 07 63 7c 	extsw   r3,r3
 7f8:	20 00 80 4e 	blr

new

 	addis   r2,r12,2
	ld      r10,-1952(r2)
	ld      r9,-1944(r2)
	lwz     r3,0(r10)
	lwz     r9,0(r9)
	add     r3,r3,r9
	extsw   r3,r3
	blr

 
> Telling me how x86 does things is not much help.

That why we need to know how that would work on powerpc.

> > 
> > > Without a concrete implementation I can't comment on one or the other.
> > > It is in my opinion overly harsh to force IBM to go implement this new
> > > feature. They have space in the TCB per the ABI and may use it for their
> > > needs. I think the community should investigate symbol address munging
> > > as a method for storing data in addresses and make a generic API from it,
> > > likewise I think the community should investigate standardizing tp+offset
> > > data access behind a set of accessor macros and normalizing the usage
> > > across the 5 or 6 architectures that use it.
> > >
> > I would like this as with access to that I could improve performance of
> > several inlines.
> > 
> > 
> > > > Also I now have additional comment with api as if you want faster checks
> > > > wouldn't be faster to save each bit of hwcap into byte field so you
> > > > could avoid using mask at each check?
> > > 
> > > That is an *excellent* suggestion, and exactly the type of technical
> > > feedback that we should be giving IBM, and Carlos can confirm if they've
> > > tried such "unpacking" of the bits into byte fields. Such unpacking is
> > > common in other machine implementations.
> > >
> This does not help on Power, Any (byte, halfword, word, doubleword,
> quadword) aligned load is the same performance. Splitting our bits to
> bytes just slow things down. Consider:
> 
> if (__builtin_cpu_supports(ARCH_2_07) &&   
>     __builtin_cpu_supports(VEC_CRYPTO))
> 
> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> byte Boolean. 
> 
> Again value judgements about that is fast or slow can vary by platform.

Instruction count means nothing if you don't have good intuition about
powerpc platform. If you consider these your three instructions are lot
slower than byte Booleans. 

Use following benchmark. You need separate compilation as to simulate
many calls of function that uses hwcap that are not optimized away by
gcc. I used computation before hwcap selection as without that there
wouldn't be much difference as with OoO execution it would mostly
measure latency of loads. It would still be slower but its 1.90s vs 1.92s

Adding third check makes no difference, and case of one is obviously
faster.

Also how are you sure that checking more flags happens often to justify
any potential savings with more checks if there were any savings?

Benchmark is following:

[neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:;
cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3
c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y

c.c:
volatile int v, w;
volatile int u;
int main()
{
  u= -1;
  v = 1; w = 1;
  long i;
  unsigned long sum = 0;
  for (i=0;i<500000000;i++)
    sum += foo(sum, 42);
  return sum;

}
x.c:
extern int v,w;
int __attribute__((noinline))foo(int x, int y){
 x= 3 * x - 32 + y;
 y = 4 * x + 5;
 if (v & w)
   return 3 * x;
 return 5 * y;
}

y.c:
extern int u;
int __attribute__((noinline))foo(int x, int y){
 x= 3 * x - 32 + y;
 y = 4 * x + 5;
 if (((u&((1<<17)|(1<<21)))==((1<<17)|(1<<21))))
   return 3 * x;
 return 5 * y;
}


real	0m2.390s
user	0m2.389s
sys	0m0.001s

real	0m2.531s
user	0m2.529s
sys	0m0.001s

real	0m2.390s
user	0m2.389s
sys	0m0.001s

real	0m2.532s
user	0m2.530s
sys	0m0.001s
  
Adhemerval Zanella Netto July 9, 2015, 7:31 p.m. UTC | #99
On 09-07-2015 16:02, Ondřej Bílka wrote:
> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>  > But these could be done without much of our help. We need to keep these
>>> writable to support this hack. I don't know exact assembly for powerpc,
>>> it should be similar to how do it on x64:
>>>
>>> int x;
>>>
>>> int foo()
>>> {
>>> #ifdef SHARED
>>> asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
>>> #else
>>> asm ("lea x(%rip), %rax; movb $32, (%rax)");
>>> #endif
>>> return &x;
>>> }
>>>
>>
>> Not so simple on PowerISA as we don't have PC-relative addressing.
>>
>> 1) The global entry requires 2 instruction to establish the TOC/GOT
>> 2) Medium model requires two instructions (fused) to load a pointer from
>> the GOT.
>> 3) Finally we can load the cached hwcap.
>>
>> None of this is required for the TP+offset.
>>
> And why you didn't wrote that when it was first suggested? When you don't answer 
> it looks like you don't want to answer because that suggestion is better.
> 
> Here problem isn't lack of relative addressing but that you don't start
> with GOT in register. 
> 
> You certainly could do similar hack as you do with tcb and place hwcap
> bits just after that so you could do just one load.
> 
> That you require so many instructions on powerpc is gcc bug, rather than
> rule. You don't need that many instructions when you place frequent
> symbols in -32768..32767 range. For example here you could save one
> addition.
> 
> int x, y;
> int foo()
> {
>   return x + y;
> }
> 
> original
> 
> 00000000000007d0 <foo>:
>  7d0:	02 00 4c 3c 	addis   r2,r12,2
>  7d4:	30 78 42 38 	addi    r2,r2,30768
>  7d8:	00 00 00 60 	nop
>  7dc:	30 80 42 e9 	ld      r10,-32720(r2)
>  7e0:	00 00 00 60 	nop
>  7e4:	38 80 22 e9 	ld      r9,-32712(r2)
>  7e8:	00 00 6a 80 	lwz     r3,0(r10)
>  7ec:	00 00 29 81 	lwz     r9,0(r9)
>  7f0:	14 4a 63 7c 	add     r3,r3,r9
>  7f4:	b4 07 63 7c 	extsw   r3,r3
>  7f8:	20 00 80 4e 	blr
> 
> new
> 
>  	addis   r2,r12,2
> 	ld      r10,-1952(r2)
> 	ld      r9,-1944(r2)
> 	lwz     r3,0(r10)
> 	lwz     r9,0(r9)
> 	add     r3,r3,r9
> 	extsw   r3,r3
> 	blr

No you can't, you need to take in consideration powerpc64le ELFv2 ABi has two
entrypoints for every function, global and local, with former being used when
you need to materialize the TOC while latter you can use the same TOC. And
compiler has no information regarding this, it has to be decided by the linker.

For the example you posted, the assembly is:

foo:
0:	addis 2,12,.TOC.-0b@ha
	addi 2,2,.TOC.-0b@l
	.localentry	foo,.-foo
	addis 10,2,.LC0@toc@ha		# gpr load fusion, type long
	ld 10,.LC0@toc@l(10)
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	lwz 3,0(10)
	lwz 9,0(9)
	add 3,3,9
	extsw 3,3
	blr

Even if you place the symbol in the -32768..32767 range you still need
to take in consideration the symbol can be called either by '0:' or
by the '.localentry' and for both cases you need the proper TOC.  And
for POWER8 the addis+ld should be fused, resulting in latency similar
to one load instruction.


> 
>  
>> Telling me how x86 does things is not much help.
> 
> That why we need to know how that would work on powerpc.
> 
>>>
>>>> Without a concrete implementation I can't comment on one or the other.
>>>> It is in my opinion overly harsh to force IBM to go implement this new
>>>> feature. They have space in the TCB per the ABI and may use it for their
>>>> needs. I think the community should investigate symbol address munging
>>>> as a method for storing data in addresses and make a generic API from it,
>>>> likewise I think the community should investigate standardizing tp+offset
>>>> data access behind a set of accessor macros and normalizing the usage
>>>> across the 5 or 6 architectures that use it.
>>>>
>>> I would like this as with access to that I could improve performance of
>>> several inlines.
>>>
>>>
>>>>> Also I now have additional comment with api as if you want faster checks
>>>>> wouldn't be faster to save each bit of hwcap into byte field so you
>>>>> could avoid using mask at each check?
>>>>
>>>> That is an *excellent* suggestion, and exactly the type of technical
>>>> feedback that we should be giving IBM, and Carlos can confirm if they've
>>>> tried such "unpacking" of the bits into byte fields. Such unpacking is
>>>> common in other machine implementations.
>>>>
>> This does not help on Power, Any (byte, halfword, word, doubleword,
>> quadword) aligned load is the same performance. Splitting our bits to
>> bytes just slow things down. Consider:
>>
>> if (__builtin_cpu_supports(ARCH_2_07) &&   
>>     __builtin_cpu_supports(VEC_CRYPTO))
>>
>> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
>> byte Boolean. 
>>
>> Again value judgements about that is fast or slow can vary by platform.
> 
> Instruction count means nothing if you don't have good intuition about
> powerpc platform. If you consider these your three instructions are lot
> slower than byte Booleans. 
> 
> Use following benchmark. You need separate compilation as to simulate
> many calls of function that uses hwcap that are not optimized away by
> gcc. I used computation before hwcap selection as without that there
> wouldn't be much difference as with OoO execution it would mostly
> measure latency of loads. It would still be slower but its 1.90s vs 1.92s
> 
> Adding third check makes no difference, and case of one is obviously
> faster.
> 
> Also how are you sure that checking more flags happens often to justify
> any potential savings with more checks if there were any savings?
> 
> Benchmark is following:
> 
> [neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:;
> cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3
> c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y
> 
> c.c:
> volatile int v, w;
> volatile int u;
> int main()
> {
>   u= -1;
>   v = 1; w = 1;
>   long i;
>   unsigned long sum = 0;
>   for (i=0;i<500000000;i++)
>     sum += foo(sum, 42);
>   return sum;
> 
> }
> x.c:
> extern int v,w;
> int __attribute__((noinline))foo(int x, int y){
>  x= 3 * x - 32 + y;
>  y = 4 * x + 5;
>  if (v & w)
>    return 3 * x;
>  return 5 * y;
> }
> 
> y.c:
> extern int u;
> int __attribute__((noinline))foo(int x, int y){
>  x= 3 * x - 32 + y;
>  y = 4 * x + 5;
>  if (((u&((1<<17)|(1<<21)))==((1<<17)|(1<<21))))
>    return 3 * x;
>  return 5 * y;
> }
> 
> 
> real	0m2.390s
> user	0m2.389s
> sys	0m0.001s
> 
> real	0m2.531s
> user	0m2.529s
> sys	0m0.001s
> 
> real	0m2.390s
> user	0m2.389s
> sys	0m0.001s
> 
> real	0m2.532s
> user	0m2.530s
> sys	0m0.001s
>
  
Ondrej Bilka July 9, 2015, 9:51 p.m. UTC | #100
On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 09-07-2015 16:02, Ondřej Bílka wrote:
> > On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
> >> Not so simple on PowerISA as we don't have PC-relative addressing.
> >>
> >> 1) The global entry requires 2 instruction to establish the TOC/GOT
> >> 2) Medium model requires two instructions (fused) to load a pointer from
> >> the GOT.
> >> 3) Finally we can load the cached hwcap.
> >>
> >> None of this is required for the TP+offset.
> >>
> > And why you didn't wrote that when it was first suggested? When you don't answer 
> > it looks like you don't want to answer because that suggestion is better.
> > 
> > Here problem isn't lack of relative addressing but that you don't start
> > with GOT in register. 
> > 
> > You certainly could do similar hack as you do with tcb and place hwcap
> > bits just after that so you could do just one load.
> > 
> > That you require so many instructions on powerpc is gcc bug, rather than
> > rule. You don't need that many instructions when you place frequent
> > symbols in -32768..32767 range. For example here you could save one
> > addition.
> > 
> > int x, y;
> > int foo()
> > {
> >   return x + y;
> > }
> > 
> > original
> > 
> > 00000000000007d0 <foo>:
> >  7d0:	02 00 4c 3c 	addis   r2,r12,2
> >  7d4:	30 78 42 38 	addi    r2,r2,30768
> >  7d8:	00 00 00 60 	nop
> >  7dc:	30 80 42 e9 	ld      r10,-32720(r2)
> >  7e0:	00 00 00 60 	nop
> >  7e4:	38 80 22 e9 	ld      r9,-32712(r2)
> >  7e8:	00 00 6a 80 	lwz     r3,0(r10)
> >  7ec:	00 00 29 81 	lwz     r9,0(r9)
> >  7f0:	14 4a 63 7c 	add     r3,r3,r9
> >  7f4:	b4 07 63 7c 	extsw   r3,r3
> >  7f8:	20 00 80 4e 	blr
> > 
> > new
> > 
> >  	addis   r2,r12,2
> > 	ld      r10,-1952(r2)
> > 	ld      r9,-1944(r2)
> > 	lwz     r3,0(r10)
> > 	lwz     r9,0(r9)
> > 	add     r3,r3,r9
> > 	extsw   r3,r3
> > 	blr
> 
> No you can't, you need to take in consideration powerpc64le ELFv2 ABi has two
> entrypoints for every function, global and local, with former being used when
> you need to materialize the TOC while latter you can use the same TOC. And
> compiler has no information regarding this, it has to be decided by the linker.
>
Of course I can, reusing TOC is not mandatory. That would just decrease
performance a bit for local.

You need majority of calls be from different dso to use global.
Otherwise if you use local entrypoint there is no reason to use tcb as
hidden variable does same job (and you could use local entrypoint in
plt of same dso.). A example that I previously mentioned is
compiled by

 gcc hw.c h.o -O3 -fPIC  -mcmodel=medium -shared

extern int __hwcap __attribute__ ((visibility ("hidden"))) ;
int foo(int x, int y)
{
  if (__hwcap)
    return x;
  else
    return y;
}

into

0000000000000750 <foo>:
 750:	02 00 4c 3c 	addis   r2,r12,2
 754:	b0 78 42 38 	addi    r2,r2,30896
 758:	00 00 00 60 	nop
 75c:	54 80 22 81 	lwz     r9,-32684(r2)
 760:	00 00 89 2f 	cmpwi   cr7,r9,0
 764:	20 00 9e 4c 	bnelr   cr7
 768:	78 23 83 7c 	mr      r3,r4
 76c:	20 00 80 4e 	blr

which with local entry uses only one load as tcb proposal.
  
Kalle Olavi Niemitalo July 9, 2015, 10:12 p.m. UTC | #101
Steven Munroe <munroesj@linux.vnet.ibm.comcom> writes:

> if (__builtin_cpu_supports(ARCH_2_07) &&   
>     __builtin_cpu_supports(VEC_CRYPTO))
>
> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> byte Boolean. 

I would understand 3 instructions for "||" (test the zero flag) but
how do you do it for "&&"?  I have hardly any powerpc experience
though, so perhaps there is some trick I don't realize.

If not, and if "&&" is more common than "||" in HWCAP tests, then
would it be worthwhile to invert the HWCAP bits in TCB?  I guess
it wouldn't, because such a format would increase the risk that
the program crashes if the bits were not properly initialized
before they were read.
  
Adhemerval Zanella Netto July 9, 2015, 10:17 p.m. UTC | #102
On 09-07-2015 18:51, Ondřej Bílka wrote:
> On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 09-07-2015 16:02, Ondřej Bílka wrote:
>>> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>>>> Not so simple on PowerISA as we don't have PC-relative addressing.
>>>>
>>>> 1) The global entry requires 2 instruction to establish the TOC/GOT
>>>> 2) Medium model requires two instructions (fused) to load a pointer from
>>>> the GOT.
>>>> 3) Finally we can load the cached hwcap.
>>>>
>>>> None of this is required for the TP+offset.
>>>>
>>> And why you didn't wrote that when it was first suggested? When you don't answer 
>>> it looks like you don't want to answer because that suggestion is better.
>>>
>>> Here problem isn't lack of relative addressing but that you don't start
>>> with GOT in register. 
>>>
>>> You certainly could do similar hack as you do with tcb and place hwcap
>>> bits just after that so you could do just one load.
>>>
>>> That you require so many instructions on powerpc is gcc bug, rather than
>>> rule. You don't need that many instructions when you place frequent
>>> symbols in -32768..32767 range. For example here you could save one
>>> addition.
>>>
>>> int x, y;
>>> int foo()
>>> {
>>>   return x + y;
>>> }
>>>
>>> original
>>>
>>> 00000000000007d0 <foo>:
>>>  7d0:	02 00 4c 3c 	addis   r2,r12,2
>>>  7d4:	30 78 42 38 	addi    r2,r2,30768
>>>  7d8:	00 00 00 60 	nop
>>>  7dc:	30 80 42 e9 	ld      r10,-32720(r2)
>>>  7e0:	00 00 00 60 	nop
>>>  7e4:	38 80 22 e9 	ld      r9,-32712(r2)
>>>  7e8:	00 00 6a 80 	lwz     r3,0(r10)
>>>  7ec:	00 00 29 81 	lwz     r9,0(r9)
>>>  7f0:	14 4a 63 7c 	add     r3,r3,r9
>>>  7f4:	b4 07 63 7c 	extsw   r3,r3
>>>  7f8:	20 00 80 4e 	blr
>>>
>>> new
>>>
>>>  	addis   r2,r12,2
>>> 	ld      r10,-1952(r2)
>>> 	ld      r9,-1944(r2)
>>> 	lwz     r3,0(r10)
>>> 	lwz     r9,0(r9)
>>> 	add     r3,r3,r9
>>> 	extsw   r3,r3
>>> 	blr
>>
>> No you can't, you need to take in consideration powerpc64le ELFv2 ABi has two
>> entrypoints for every function, global and local, with former being used when
>> you need to materialize the TOC while latter you can use the same TOC. And
>> compiler has no information regarding this, it has to be decided by the linker.
>>
> Of course I can, reusing TOC is not mandatory. That would just decrease
> performance a bit for local.

Reusing TOC is exactly the optimization linker will do to avoid call the
global entrypoint.  And the problem is 1. it still requires to materialize
the TOC on global entrypoints, where you will need to save/restore it
in PLT stubs and 2. you will need a hwcap copy per TOC/DSO.  I think 
Steven proposal is exactly to avoid these. In fact this was one option
I advocate to him before he remind the issues.

> 
> You need majority of calls be from different dso to use global.
> Otherwise if you use local entrypoint there is no reason to use tcb as
> hidden variable does same job (and you could use local entrypoint in
> plt of same dso.). A example that I previously mentioned is
> compiled by
> 
>  gcc hw.c h.o -O3 -fPIC  -mcmodel=medium -shared
> 
> extern int __hwcap __attribute__ ((visibility ("hidden"))) ;
> int foo(int x, int y)
> {
>   if (__hwcap)
>     return x;
>   else
>     return y;
> }
> 
> into
> 
> 0000000000000750 <foo>:
>  750:	02 00 4c 3c 	addis   r2,r12,2
>  754:	b0 78 42 38 	addi    r2,r2,30896
>  758:	00 00 00 60 	nop
>  75c:	54 80 22 81 	lwz     r9,-32684(r2)
>  760:	00 00 89 2f 	cmpwi   cr7,r9,0
>  764:	20 00 9e 4c 	bnelr   cr7
>  768:	78 23 83 7c 	mr      r3,r4
>  76c:	20 00 80 4e 	blr
> 
> which with local entry uses only one load as tcb proposal.
>
  
Ondrej Bilka July 9, 2015, 10:24 p.m. UTC | #103
On Fri, Jul 10, 2015 at 01:12:46AM +0300, Kalle Olavi Niemitalo wrote:
> Steven Munroe <munroesj@linux.vnet.ibm.comcom> writes:
> 
> > if (__builtin_cpu_supports(ARCH_2_07) &&   
> >     __builtin_cpu_supports(VEC_CRYPTO))
> >
> > This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> > byte Boolean. 
> 
> I would understand 3 instructions for "||" (test the zero flag) but
> how do you do it for "&&"?  I have hardly any powerpc experience
> though, so perhaps there is some trick I don't realize.
> 
> If not, and if "&&" is more common than "||" in HWCAP tests, then
> would it be worthwhile to invert the HWCAP bits in TCB?  I guess
> it wouldn't, because such a format would increase the risk that
> the program crashes if the bits were not properly initialized
> before they were read.

A trick here is just like doing macro expansion. You need to realize
that arguments are masks so you test feature F by (get_hwcap & F) == F

Then this expands into

if (((get_hwcap & ARCH_2_07) == ARCH_2_07) 
      && ((get_hwcap & VEC_CRYPTO) == VEC_CRYPTO))

Then you realize that its true if and only if all bits from ARCH_2_07
and VEC_CRYPTO masks are true. You could write that as 

if ((get_hwcap & (ARCH_2_07 | VEC_CRYPTO)) == (ARCH_2_07 | VEC_CRYPTO))
  
Ondrej Bilka July 9, 2015, 11:27 p.m. UTC | #104
On Thu, Jul 09, 2015 at 07:17:01PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 09-07-2015 18:51, Ondřej Bílka wrote:
> > On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 09-07-2015 16:02, Ondřej Bílka wrote:
> >>> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
> >>>> Not so simple on PowerISA as we don't have PC-relative addressing.
> >>>>
> >>>> 1) The global entry requires 2 instruction to establish the TOC/GOT
> >>>> 2) Medium model requires two instructions (fused) to load a pointer from
> >>>> the GOT.
> >>>> 3) Finally we can load the cached hwcap.
> >>>>
> >>>> None of this is required for the TP+offset.
> >>>>
> >>> And why you didn't wrote that when it was first suggested? When you don't answer 
> >>> it looks like you don't want to answer because that suggestion is better.
> >>>
> >>> Here problem isn't lack of relative addressing but that you don't start
> >>> with GOT in register. 
> >>>
> >>> You certainly could do similar hack as you do with tcb and place hwcap
> >>> bits just after that so you could do just one load.
> >>>
> >>> That you require so many instructions on powerpc is gcc bug, rather than
> >>> rule. You don't need that many instructions when you place frequent
> >>> symbols in -32768..32767 range. For example here you could save one
> >>> addition.
> >>>
> >>> int x, y;
> >>> int foo()
> >>> {
> >>>   return x + y;
> >>> }
> >>>
> >>> original
> >>>
> >>> 00000000000007d0 <foo>:
> >>>  7d0:	02 00 4c 3c 	addis   r2,r12,2
> >>>  7d4:	30 78 42 38 	addi    r2,r2,30768
> >>>  7d8:	00 00 00 60 	nop
> >>>  7dc:	30 80 42 e9 	ld      r10,-32720(r2)
> >>>  7e0:	00 00 00 60 	nop
> >>>  7e4:	38 80 22 e9 	ld      r9,-32712(r2)
> >>>  7e8:	00 00 6a 80 	lwz     r3,0(r10)
> >>>  7ec:	00 00 29 81 	lwz     r9,0(r9)
> >>>  7f0:	14 4a 63 7c 	add     r3,r3,r9
> >>>  7f4:	b4 07 63 7c 	extsw   r3,r3
> >>>  7f8:	20 00 80 4e 	blr
> >>>
> >>> new
> >>>
> >>>  	addis   r2,r12,2
> >>> 	ld      r10,-1952(r2)
> >>> 	ld      r9,-1944(r2)
> >>> 	lwz     r3,0(r10)
> >>> 	lwz     r9,0(r9)
> >>> 	add     r3,r3,r9
> >>> 	extsw   r3,r3
> >>> 	blr
> >>
> >> No you can't, you need to take in consideration powerpc64le ELFv2 ABi has two
> >> entrypoints for every function, global and local, with former being used when
> >> you need to materialize the TOC while latter you can use the same TOC. And
> >> compiler has no information regarding this, it has to be decided by the linker.
> >>
> > Of course I can, reusing TOC is not mandatory. That would just decrease
> > performance a bit for local.
> 
> Reusing TOC is exactly the optimization linker will do to avoid call the
> global entrypoint.  And the problem is 1. it still requires to materialize
> the TOC on global entrypoints, where you will need to save/restore it
> in PLT stubs and 2. you will need a hwcap copy per TOC/DSO.  I think 
> Steven proposal is exactly to avoid these. In fact this was one option
> I advocate to him before he remind the issues.
>
As 1 that isn't problem as when you use PLT stubs then you already have
bigger hazards from entry so you don't have to worry about getting hwcap. 
As for interDSO stubs you could use local entry this happens only when you 
repeatedly call function from different dso. Moreover you must use only
local variables there, otherwise you would need to materialize TOC
anyway and it would be free for hwcap. Also it doesn't looks good as you 
should use ifunc generated by gcc anyway to directly jump after check 
and save few cycles.

2. is one of my main critique. What argument Steven used for convincing
you?

Problem is that while his proposal scales with number of thread which is
greater than 1 this scales with number of dso that use hwcap. Which on
average could be 0.05 or similar as most packages won't use it at all.
So I ask once again where is your evidence to show it will be frequently
used? Particularily to pay cost of binaries where its never used and as
they could create many threads a cost will increase?
  
Segher Boessenkool July 10, 2015, 2:43 a.m. UTC | #105
On Fri, Jul 10, 2015 at 01:12:46AM +0300, Kalle Olavi Niemitalo wrote:
> Steven Munroe <munroesj@linux.vnet.ibm.comcom> writes:
> 
> > if (__builtin_cpu_supports(ARCH_2_07) &&   
> >     __builtin_cpu_supports(VEC_CRYPTO))
> >
> > This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> > byte Boolean. 
> 
> I would understand 3 instructions for "||" (test the zero flag) but
> how do you do it for "&&"?  I have hardly any powerpc experience
> though, so perhaps there is some trick I don't realize.

There is no such trick, you're not missing anything.  And there is no
need to write error-prone manually expanded things, GCC can handle it
just fine (the simpler cases, anyway ;-) )


Segher
  

Patch

2015-06-08  Carlos Eduardo Seo  <cseo@linux.vnet.ibm.com>

	This patch adds a new feature for powerpc. In order to get faster
	access to the HWCAP/HWCAP2 bits, we now store them in the TCB, so
	we don't have to deal with the overhead of reading them via the
	auxiliary vector. A new API is published in ppc.h for get/set the
	bits.

	* sysdeps/powerpc/nptl/tcb-offsets.sym: Added new offests
	for HWCAP and HWCAP2 in the TCB.
	* sysdeps/powerpc/nptl/tls.h: New functionality - stores
	the HWCAP and HWCAP2 in the TCB.
	(dtv): Added new fields for HWCAP and HWCAP2.
	(TLS_INIT_TP): Included calls to add the hwcap/hwcap2
	values in the TCB in TP initialization.
	(TLS_DEFINE_INIT_TP): Likewise.
	(THREAD_GET_HWCAP): New macro.
	(THREAD_SET_HWCAP): Likewise.
	(THREAD_GET_HWCAP2): Likewise.
	(THREAD_SET_HWCAP2): Likewise.
	* sysdeps/powerpc/sys/platform/ppc.h: Added new functions
	for get/set the HWCAP/HWCAP2 values in the TCB.
	(__ppc_get_hwcap): New function.
	(__ppc_get_hwcap2): Likewise.
	* sysdeps/powerpc/test-get_hwcap.c: Testcase for this
	functionality.
	* sysdeps/powerpc/test-set_hwcap.c: Testcase for this
	functionality.
	* sysdeps/powerpc/Makefile: Added testcases to the Makefile.
	

Index: glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym
===================================================================
--- glibc-working.orig/sysdeps/powerpc/nptl/tcb-offsets.sym
+++ glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym
@@ -20,6 +20,8 @@  TAR_SAVE			(offsetof (tcbhead_t, tar_sav
 DSO_SLOT1			(offsetof (tcbhead_t, dso_slot1) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 DSO_SLOT2			(offsetof (tcbhead_t, dso_slot2) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 TM_CAPABLE			(offsetof (tcbhead_t, tm_capable) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
+TCB_HWCAP			(offsetof (tcbhead_t, hwcap) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
+TCB_HWCAP2			(offsetof (tcbhead_t, hwcap2) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 #ifndef __ASSUME_PRIVATE_FUTEX
 PRIVATE_FUTEX_OFFSET		thread_offsetof (header.private_futex)
 #endif
Index: glibc-working/sysdeps/powerpc/nptl/tls.h
===================================================================
--- glibc-working.orig/sysdeps/powerpc/nptl/tls.h
+++ glibc-working/sysdeps/powerpc/nptl/tls.h
@@ -63,6 +63,9 @@  typedef union dtv
    are private.  */
 typedef struct
 {
+  /* Reservation for HWCAP data.  */
+  unsigned int hwcap2;
+  unsigned int hwcap;
   /* Indicate if HTM capable (ISA 2.07).  */
   int tm_capable;
   /* Reservation for Dynamic System Optimizer ABI.  */
@@ -134,7 +137,11 @@  register void *__thread_register __asm__
 # define TLS_INIT_TP(tcbp) \
   ({ 									      \
     __thread_register = (void *) (tcbp) + TLS_TCB_OFFSET;		      \
-    THREAD_SET_TM_CAPABLE (GLRO (dl_hwcap2) & PPC_FEATURE2_HAS_HTM ? 1 : 0);  \
+    unsigned int hwcap = GLRO(dl_hwcap);				      \
+    unsigned int hwcap2 = GLRO(dl_hwcap2);				      \
+    THREAD_SET_TM_CAPABLE (hwcap2 & PPC_FEATURE2_HAS_HTM ? 1 : 0);	      \
+    THREAD_SET_HWCAP (hwcap);						      \
+    THREAD_SET_HWCAP2 (hwcap2);						      \
     NULL;								      \
   })
 
@@ -142,7 +149,11 @@  register void *__thread_register __asm__
 # define TLS_DEFINE_INIT_TP(tp, pd) \
     void *tp = (void *) (pd) + TLS_TCB_OFFSET + TLS_PRE_TCB_SIZE;	      \
     (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].tm_capable) =	      \
-      THREAD_GET_TM_CAPABLE ();
+      THREAD_GET_TM_CAPABLE ();						      \
+    (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap) =	      \
+      THREAD_GET_HWCAP ();						      \
+    (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap2) =	      \
+      THREAD_GET_HWCAP2 ();
 
 /* Return the address of the dtv for the current thread.  */
 # define THREAD_DTV() \
@@ -203,6 +214,32 @@  register void *__thread_register __asm__
 # define THREAD_SET_TM_CAPABLE(value) \
     (THREAD_GET_TM_CAPABLE () = (value))
 
+/* hwcap & hwcap2 fields in TCB head.  */
+# define THREAD_GET_HWCAP() \
+    (((tcbhead_t *) ((char *) __thread_register				      \
+		     - TLS_TCB_OFFSET))[-1].hwcap)
+# define THREAD_SET_HWCAP(value) \
+    if (value & PPC_FEATURE_ARCH_2_06)					      \
+      value |= PPC_FEATURE_ARCH_2_05 |					      \
+	       PPC_FEATURE_POWER5_PLUS |				      \
+	       PPC_FEATURE_POWER5 |					      \
+	       PPC_FEATURE_POWER4;					      \
+    else if (value & PPC_FEATURE_ARCH_2_05)				      \
+      value |= PPC_FEATURE_POWER5_PLUS |				      \
+             PPC_FEATURE_POWER5 |					      \
+             PPC_FEATURE_POWER4;					      \
+    else if (value & PPC_FEATURE_POWER5_PLUS)				      \
+      value |= PPC_FEATURE_POWER5 |					      \
+             PPC_FEATURE_POWER4;					      \
+    else if (value & PPC_FEATURE_POWER5)				      \
+      value |= PPC_FEATURE_POWER4;					      \
+    (THREAD_GET_HWCAP () = (value))
+# define THREAD_GET_HWCAP2() \
+    (((tcbhead_t *) ((char *) __thread_register				      \
+                     - TLS_TCB_OFFSET))[-1].hwcap2)
+# define THREAD_SET_HWCAP2(value) \
+    (THREAD_GET_HWCAP2 () = (value))
+
 /* l_tls_offset == 0 is perfectly valid on PPC, so we have to use some
    different value to mean unset l_tls_offset.  */
 # define NO_TLS_OFFSET		-1
Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h
===================================================================
--- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h
+++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h
@@ -23,6 +23,86 @@ 
 #include <stdint.h>
 #include <bits/ppc.h>
 
+
+/* Get the hwcap/hwcap2 information from the TCB. Offsets taken
+   from tcb-offsets.h.  */
+static inline uint32_t
+__ppc_get_hwcap (void)
+{
+
+  uint32_t __tcb_hwcap;
+
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("lwz %0,-28772(%1)\n"
+		    : "=r" (__tcb_hwcap)
+		    : "r" (__tp));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("lwz %0,-28724(%1)\n"
+		    : "=r" (__tcb_hwcap)
+		    : "r" (__tp));
+#endif
+
+  return __tcb_hwcap;
+}
+
+static inline uint32_t
+__ppc_get_hwcap2 (void)
+{
+
+  uint32_t __tcb_hwcap2;
+
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("lwz %0,-28776(%1)\n"
+		    : "=r" (__tcb_hwcap2)
+		    : "r" (__tp));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("lwz %0,-28728(%1)\n"
+		    : "=r" (__tcb_hwcap2)
+		    : "r" (__tp));
+#endif
+
+  return __tcb_hwcap2;
+}
+
+/* Set the hwcap/hwcap2 bits into the designated area in the TCB. Offsets
+   taken from tcb-offsets.h.  */
+
+static inline void
+__ppc_set_hwcap (uint32_t __hwcap_mask)
+{
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("stw %1,-28772(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap_mask));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("stw %1,-28724(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap_mask));
+#endif
+}
+
+static inline void
+__ppc_set_hwcap2 (uint32_t __hwcap2_mask)
+{
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("stw %1,-28776(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap2_mask));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("stw %1,-28728(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap2_mask));
+#endif
+}
+
 /* Read the Time Base Register.   */
 static inline uint64_t
 __ppc_get_timebase (void)
Index: glibc-working/sysdeps/powerpc/Makefile
===================================================================
--- glibc-working.orig/sysdeps/powerpc/Makefile
+++ glibc-working/sysdeps/powerpc/Makefile
@@ -28,7 +28,7 @@  endif
 
 ifeq ($(subdir),misc)
 sysdep_headers += sys/platform/ppc.h
-tests += test-gettimebase
+tests += test-gettimebase test-get_hwcap test-set_hwcap
 endif
 
 ifneq (,$(filter %le,$(config-machine)))
Index: glibc-working/sysdeps/powerpc/test-get_hwcap.c
===================================================================
--- /dev/null
+++ glibc-working/sysdeps/powerpc/test-get_hwcap.c
@@ -0,0 +1,73 @@ 
+/* Check __ppc_get_hwcap() functionality
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Tests if the hwcap and hwcap2 data is stored in the TCB.  */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include <sys/auxv.h>
+#include <sys/platform/ppc.h>
+
+static int
+do_test (void)
+{
+  uint32_t h1, h2, hwcap, hwcap2;
+
+  h1 = __ppc_get_hwcap ();
+  h2 = __ppc_get_hwcap2 ();
+  hwcap = getauxval(AT_HWCAP);
+  hwcap2 = getauxval(AT_HWCAP2);
+
+  /* hwcap contains only the latest supported ISA, the code checks which is
+     and fills the previous supported ones. This is necessary because the
+     same is done in tls.h when setting the values to the TCB.   */
+
+  if (hwcap & PPC_FEATURE_ARCH_2_06)
+    hwcap |= PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_POWER5_PLUS |
+	     PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
+  else if (hwcap & PPC_FEATURE_ARCH_2_05)
+    hwcap |= PPC_FEATURE_POWER5_PLUS | PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
+  else if (hwcap & PPC_FEATURE_POWER5_PLUS)
+    hwcap |= PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
+  else if (hwcap & PPC_FEATURE_POWER5)
+    hwcap |= PPC_FEATURE_POWER4;
+
+  if ( h1 != hwcap )
+    {
+      printf("Fail: HWCAP is %x. Should be %x\n", h1, hwcap);
+      return 1;
+    }
+
+  if ( h2 != hwcap2 )
+    {
+      printf("Fail: HWCAP2 is %x. Should be %x\n", h2, hwcap2);
+      return 1;
+    }
+
+    printf("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n");
+
+    return 0;
+
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+
Index: glibc-working/sysdeps/powerpc/test-set_hwcap.c
===================================================================
--- /dev/null
+++ glibc-working/sysdeps/powerpc/test-set_hwcap.c
@@ -0,0 +1,63 @@ 
+/* Check __ppc_get_hwcap() functionality
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Tests if the hwcap and hwcap2 data can be stored in the TCB
+   via the ppc.h API.  */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include <sys/auxv.h>
+#include <sys/platform/ppc.h>
+
+static int
+do_test (void)
+{
+  uint32_t h1, hwcap, hwcap2;
+
+  h1 = 0xDEADBEEF;
+
+  __ppc_set_hwcap(h1);
+  hwcap = __ppc_get_hwcap();
+
+  if ( h1 != hwcap )
+    {
+      printf("Fail: HWCAP is %x. Should be %x\n", h1, hwcap);
+      return 1;
+    }
+
+  __ppc_set_hwcap2(h1);
+  hwcap2 = __ppc_get_hwcap2();
+
+  if ( h1 != hwcap2 )
+    {
+      printf("Fail: HWCAP2 is %x. Should be %x\n", h1, hwcap2);
+      return 1;
+    }
+
+    printf("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n");
+
+    return 0;
+
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+