[RFC,BZ,#17943] Use long for int_fast8_t

Message ID 20150208110426.GA28729@domone
State Superseded

Commit Message

Ondrej Bilka Feb. 8, 2015, 11:04 a.m. UTC
  Hi, as in the bugzilla entry: what is the rationale for using char as int_fast8_t?

It is definitely slower with division; the following code is 25% slower on
Haswell with char than with long.

There is also the question of other architectures and atomic operations:
are byte operations better than int ones?

int main ()
{
  int i;
  char x = 32;
  for (i=0; i<1000000000; i++)
    x = 11 * x + 5 + x / 3;
  return x;
}




	* sysdeps/generic/stdint.h: Use long int for int_fast8_t
  

Comments

Maciej W. Rozycki Feb. 9, 2015, 1:36 p.m. UTC | #1
On Sun, 8 Feb 2015, Ondřej Bílka wrote:

> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> 
> It is definitely slower with division, following code is 25% slower on
> haswell with char than when you use long.

 It may boil down to the choice of instructions produced by the
compiler.  I can hardly imagine 8-bit division being slower than 64-bit
division on a processor that implements subword integer arithmetic.

> There is question what about other architectures and atomic operations,
> are byte ones better than int?
> 
> int main ()
> {
>   int i;
>   char x = 32;
>   for (i=0; i<1000000000; i++)
>     x = 11 * x + 5 + x / 3;
>   return x;
> }

 On Intel Atom for example division latencies are as follows[1]:

		latency	throughput
IDIV r/m8	   33	    32
IDIV r/m16	   42	    41
IDIV r/m32	   57	    56
IDIV r/m64	  197	   196

I'd expect the ratio of elapsed times across the corresponding data
widths to be similar for a manual division algorithm on processors that
have no hardware divider.  There is no latency difference between
individual data widths AFAICT for multiplication or general ALU operations.

 For processors that do have a hardware divider, but one implementing
word-width calculation only, I'd expect either a constant latency or,
again, a decrease in operation time depending on the actual width of the
significant data contained in the operands.

 For example the M14Kc MIPS32 processor has an RTL configuration option to 
include either an area-efficient or a high-performance MDU 
(multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks 
for unsigned division (signed division adds up to 2 clocks for sign 
reversal).  The high-performance MDU reduces the latency as follows[2]:

"Divide operations are implemented with a simple 1-bit-per-clock iterative 
algorithm.  An early-in detection checks the sign extension of the 
dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped. 
For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs, 
7 iterations are skipped.  Any attempt to issue a subsequent MDU 
instruction while a divide is still active causes an IU pipeline stall 
until the divide operation has completed."

As this happens automatically, there is no benefit from using a narrower
data type, and the lack of subword arithmetic operations means that using
such a type will require a truncation operation from time to time for
multiplication or general ALU operations.
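
The truncation cost described above can be sketched in C: on a target without subword arithmetic, an 8-bit result computed at word width has to be explicitly re-narrowed before it can be reused.  The helper names below are hypothetical, for illustration only, and the narrowing conversion is implementation-defined in C11 (universally two's-complement wraparound in practice):

```c
#include <stdint.h>

/* What the source expresses: 8-bit arithmetic whose result must end
   up truncated back to 8 bits.  */
int8_t step_narrow (int8_t x)
{
  return 11 * x + 5 + x / 3;
}

/* Roughly what a word-only target must emit: compute at word width,
   then force the value back into the 8-bit range (truncate and
   sign-extend) by hand.  */
int32_t step_word (int32_t x)
{
  int32_t r = 11 * x + 5 + x / 3;
  return (int32_t) (int8_t) r;   /* the extra truncation operation */
}
```

Both helpers agree (e.g. step_narrow (32) and step_word (32) both yield 111); the second form merely makes visible the truncation the compiler must perform.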

> --- a/sysdeps/generic/stdint.h
> +++ b/sysdeps/generic/stdint.h
> @@ -87,12 +87,13 @@ typedef unsigned long long int	uint_least64_t;
>  /* Fast types.  */
>  
>  /* Signed.  */
> -typedef signed char		int_fast8_t;
>  #if __WORDSIZE == 64
> +typedef long int		int_fast8_t;
>  typedef long int		int_fast16_t;
>  typedef long int		int_fast32_t;
>  typedef long int		int_fast64_t;
>  #else
> +typedef int			int_fast8_t;
>  typedef int			int_fast16_t;
>  typedef int			int_fast32_t;
>  __extension__

 So I find the choice of types above to be already questionable for a 
generic header.  By default I'd expect fast data types to have the same 
width as their fixed-width counterparts for the large benefit they provide 
with most architectures that do implement subword arithmetic weighed 
against the small loss they will likely incur with architectures that only 
implement word arithmetic.  Then individual ports could override the 
defaults as they see fit.

 At this point, however, I believe the discussion is moot -- there will
already be relocatable objects out there with data embedded using these
types, so the ABI has been set and I don't see a way of changing it
without breaking binary compatibility.

 References:

[1] "Intel 64 and IA-32 Architectures Optimization Reference Manual", 
    Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2 
    "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21

[2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS 
    Technologies, Inc., Document Number: MD00672, November 2, 2009, 
    Subsection "High-Performance MDU", p. 6

  Maciej
  
H.J. Lu Feb. 9, 2015, 1:41 p.m. UTC | #2
On Mon, Feb 9, 2015 at 5:36 AM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
>
>> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
>>
>> It is definitely slower with division, following code is 25% slower on
>> haswell with char than when you use long.
>
>  It may boil down to the choice of instructions produced made by the
> compiler.  I can hardly imagine 8-bit division to be slower than 64-bit
> one on a processor that implements subword integer arithmetic.
>
>> There is question what about other architectures and atomic operations,
>> are byte ones better than int?
>>
>> int main ()
>> {
>>   int i;
>>   char x = 32;
>>   for (i=0; i<1000000000; i++)
>>     x = 11 * x + 5 + x / 3;
>>   return x;
>> }
>
>  On Intel Atom for example division latencies are as follows[1]:
>
>                 latency throughput
> IDIV r/m8          33       32
> IDIV r/m16         42       41
> IDIV r/m32         57       56
> IDIV r/m64        197      196
>
> I'd expect the ratio of elapsed times for the corresponding data widths
> and a manual division algorithm used with processors that have no hardware
> divider to be similar.  There is no latency difference between individual
> data widths AFAICT for multiplication or general ALU operations.
>
>  For processors that do have a hardware divider implementing word
> calculation only I'd expect either a constant latency or again, a decrease
> in operation time depending on the actual width of significant data
> contained in operands.
>
>  For example the M14Kc MIPS32 processor has an RTL configuration option to
> include either an area-efficient or a high-performance MDU
> (multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks
> for unsigned division (signed division adds up to 2 clocks for sign
> reversal).  The high-performance MDU reduces the latency as follows[2]:
>
> "Divide operations are implemented with a simple 1-bit-per-clock iterative
> algorithm.  An early-in detection checks the sign extension of the
> dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped.
> For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs,
> 7 iterations are skipped.  Any attempt to issue a subsequent MDU
> instruction while a divide is still active causes an IU pipeline stall
> until the divide operation has completed."
>
> As it happens automatically there is no benefit from using a narrower data
> type, and the lack of subword arithmetic operations means that using such
> a type will require a truncation operation from time to time for
> multiplication or general ALU operations.
>
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int    uint_least64_t;
>>  /* Fast types.  */
>>
>>  /* Signed.  */
>> -typedef signed char          int_fast8_t;
>>  #if __WORDSIZE == 64
>> +typedef long int             int_fast8_t;
>>  typedef long int             int_fast16_t;
>>  typedef long int             int_fast32_t;
>>  typedef long int             int_fast64_t;
>>  #else
>> +typedef int                  int_fast8_t;
>>  typedef int                  int_fast16_t;
>>  typedef int                  int_fast32_t;
>>  __extension__
>
>  So I find the choice of types above to be already questionable for a
> generic header.  By default I'd expect fast data types to have the same
> width as their fixed-width counterparts for the large benefit they provide
> with most architectures that do implement subword arithmetic weighed
> against the small loss they will likely incur with architectures that only
> implement word arithmetic.  Then individual ports could override the
> defaults as they see fit.
>
>  At this point the discussion is however I believe moot though -- there
> will have been relocatable objects out there with data embedded using
> these types already so the ABI has been set and I don't see a way of
> changing it without breaking binary compatibility.
>
>  References:
>
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21

This manual is very old, and the Intel Atom microarchitecture has since
been replaced by the Silvermont microarchitecture.

> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
>     Technologies, Inc., Document Number: MD00672, November 2, 2009,
>     Subsection "High-Performance MDU", p. 6
>
>   Maciej
  
Maciej W. Rozycki Feb. 9, 2015, 2:21 p.m. UTC | #3
On Mon, 9 Feb 2015, H.J. Lu wrote:

> > [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
> >     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
> >     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
> 
> This manual is very old and Intel Atom Microarchitecture has been
> replaced by Silvermont Microarchitecture.

 Maybe -- I picked whatever I had handy.  That doesn't change the fact
that this silicon is out there.  There's even older x86 silicon in the
field where the divider has a similar characteristic.

 As I say, for a generic header you need to weigh the balance -- except
that the ABI has been set, so there's little if anything we can do, apart
from new architecture additions, which can get their overrides right from
the beginning.

  Maciej
  
Joseph Myers Feb. 9, 2015, 4:37 p.m. UTC | #4
On Sun, 8 Feb 2015, Ondřej Bílka wrote:

> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> 
> It is definitely slower with division, following code is 25% slower on
> haswell with char than when you use long.

Well, there are trade-offs with space: if a larger type consumes more
cache, that could make things slower; it's not automatically the case
that a type that is faster for individual operations makes programs
faster overall.

In any case, we need to be very wary of changing the size or alignment or 
C++ name mangling of any exported type in a public header, even if it 
doesn't affect existing executables / shared libraries because the type 
isn't involved in the ABIs of existing glibc functions.

In the present case, GCC knows about these types for each target, both so 
as to ensure <stdint.h> is available for freestanding compilations, and to 
provide the Fortran C bindings.  So if you did use different types for new 
glibc configurations, the GCC ports for those targets would also need to 
know.

This patch doesn't change inttypes.h.  If it passed testing, I suppose
that means we are missing a testcase verifying that all the inttypes.h
format macros work with the corresponding types (such a test would mainly
be that the testcase builds OK with -Wformat -Werror, rather than a test
of runtime execution).
  
Rich Felker Feb. 9, 2015, 6:13 p.m. UTC | #5
On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> 
> It is definitely slower with division, following code is 25% slower on
> haswell with char than when you use long.

This claim is nonsense. It's a compiler bug. If the 8-bit divide
instruction is slow, then the compiler should use 32-bit or 64-bit
divide instructions to divide 8-bit types. (Note: there's actually no
such thing as a division of 8-bit types; formally, they're promoted
to int, so it's the compiler being stupid if it generates a slow 8-bit
divide instruction for operands that are formally int!) There's no
reason to use a different type for the _storage_.

The only legitimate reason to change the int_fast types is when the
_storage_ in certain sizes is slow to access.

Rich
  
Richard Earnshaw Feb. 9, 2015, 6:28 p.m. UTC | #6
On 09/02/15 13:36, Maciej W. Rozycki wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
> 
>> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
>>
>> It is definitely slower with division, following code is 25% slower on
>> haswell with char than when you use long.
> 
>  It may boil down to the choice of instructions produced made by the 
> compiler.  I can hardly imagine 8-bit division to be slower than 64-bit 
> one on a processor that implements subword integer arithmetic.
> 
>> There is question what about other architectures and atomic operations,
>> are byte ones better than int?
>>
>> int main ()
>> {
>>   int i;
>>   char x = 32;
>>   for (i=0; i<1000000000; i++)
>>     x = 11 * x + 5 + x / 3;
>>   return x;
>> }
> 
>  On Intel Atom for example division latencies are as follows[1]:
> 
> 		latency	throughput
> IDIV r/m8	   33	    32
> IDIV r/m16	   42	    41
> IDIV r/m32	   57	    56
> IDIV r/m64	  197	   196
> 
> I'd expect the ratio of elapsed times for the corresponding data widths 
> and a manual division algorithm used with processors that have no hardware 
> divider to be similar.  There is no latency difference between individual 
> data widths AFAICT for multiplication or general ALU operations.
> 
>  For processors that do have a hardware divider implementing word 
> calculation only I'd expect either a constant latency or again, a decrease 
> in operation time depending on the actual width of significant data 
> contained in operands.
> 
>  For example the M14Kc MIPS32 processor has an RTL configuration option to 
> include either an area-efficient or a high-performance MDU 
> (multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks 
> for unsigned division (signed division adds up to 2 clocks for sign 
> reversal).  The high-performance MDU reduces the latency as follows[2]:
> 
> "Divide operations are implemented with a simple 1-bit-per-clock iterative 
> algorithm.  An early-in detection checks the sign extension of the 
> dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped. 
> For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs, 
> 7 iterations are skipped.  Any attempt to issue a subsequent MDU 
> instruction while a divide is still active causes an IU pipeline stall 
> until the divide operation has completed."
> 
> As it happens automatically there is no benefit from using a narrower data 
> type, and the lack of subword arithmetic operations means that using such 
> a type will require a truncation operation from time to time for 
> multiplication or general ALU operations.
> 
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int	uint_least64_t;
>>  /* Fast types.  */
>>  
>>  /* Signed.  */
>> -typedef signed char		int_fast8_t;
>>  #if __WORDSIZE == 64
>> +typedef long int		int_fast8_t;
>>  typedef long int		int_fast16_t;
>>  typedef long int		int_fast32_t;
>>  typedef long int		int_fast64_t;
>>  #else
>> +typedef int			int_fast8_t;
>>  typedef int			int_fast16_t;
>>  typedef int			int_fast32_t;
>>  __extension__

On AArch64 there's nothing to be gained in terms of performance from
using a 64-bit type over a 32-bit type when both can hold the required
range of values.  In fact, it's likely to make things slower, since
multiply and divide operations will most likely take longer.

So on AArch64 int_fast8_t, int_fast16_t and int_fast32_t should all map
to int, not long.

R.

> 
>  So I find the choice of types above to be already questionable for a 
> generic header.  By default I'd expect fast data types to have the same 
> width as their fixed-width counterparts for the large benefit they provide 
> with most architectures that do implement subword arithmetic weighed 
> against the small loss they will likely incur with architectures that only 
> implement word arithmetic.  Then individual ports could override the 
> defaults as they see fit.
> 
>  At this point the discussion is however I believe moot though -- there 
> will have been relocatable objects out there with data embedded using 
> these types already so the ABI has been set and I don't see a way of 
> changing it without breaking binary compatibility.
> 
>  References:
> 
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual", 
>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2 
>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
> 
> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS 
>     Technologies, Inc., Document Number: MD00672, November 2, 2009, 
>     Subsection "High-Performance MDU", p. 6
> 
>   Maciej
>
  
Ondrej Bilka Feb. 11, 2015, 1:34 p.m. UTC | #7
On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
> On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
> > Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> > 
> > It is definitely slower with division, following code is 25% slower on
> > haswell with char than when you use long.
> 
> This claim is nonsense. It's a compiler bug. If the 8-bit divide
> instruction is slow, then the compiler should use 32-bit or 64-bit
> divide instructions to divide 8-bit types. (Note: there's actually no
> such thing as a division of 8-byte types; formally, they're promoted
> to int, so it's the compiler being stupid if it generates a slow 8-bit
> divide instruction for operands that are formally int!) There's no
> reason to use a different type for the _storage_.
> 
That is also nonsense: you cannot get the same speed as the 32-bit
instruction without an 8-bit instruction of the same performance.

The compiler must add extra truncation instructions to get a correct
result, which slows things down; otherwise it gets the wrong result for
cases like (128+128)%3.
  
Rich Felker Feb. 11, 2015, 1:59 p.m. UTC | #8
On Wed, Feb 11, 2015 at 02:34:09PM +0100, Ondřej Bílka wrote:
> On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
> > On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
> > > Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> > > 
> > > It is definitely slower with division, following code is 25% slower on
> > > haswell with char than when you use long.
> > 
> > This claim is nonsense. It's a compiler bug. If the 8-bit divide
> > instruction is slow, then the compiler should use 32-bit or 64-bit
> > divide instructions to divide 8-bit types. (Note: there's actually no
> > such thing as a division of 8-byte types; formally, they're promoted
> > to int, so it's the compiler being stupid if it generates a slow 8-bit
> > divide instruction for operands that are formally int!) There's no
> > reason to use a different type for the _storage_.
> > 
> That is also nonsense, you cannot get same speed as 32bit instruction
> without having 8bit instruction with same performance.
> 
> Compiler must add extra truncation instructions to get correct result
> which slows it down, otherwise it gets wrong result for cases like (128+128)%3

The additional need to truncate can arise when using 32-bit types on a
64-bit arch, but normally you defer this truncation until storage
(in which case it's usually automatic), promotion, or division (in
which case it's hopefully dominated by the time to divide). If your
concern is the need to truncate before division/remainder operations,
I think that cost is dwarfed by the added cache/memory pressure of
wasting half your space.

Rich
  
Richard Earnshaw Feb. 11, 2015, 6:26 p.m. UTC | #9
On 11/02/15 13:34, Ondřej Bílka wrote:
> On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
>> On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
>>> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
>>>
>>> It is definitely slower with division, following code is 25% slower on
>>> haswell with char than when you use long.
>>
>> This claim is nonsense. It's a compiler bug. If the 8-bit divide
>> instruction is slow, then the compiler should use 32-bit or 64-bit
>> divide instructions to divide 8-bit types. (Note: there's actually no
>> such thing as a division of 8-byte types; formally, they're promoted
>> to int, so it's the compiler being stupid if it generates a slow 8-bit
>> divide instruction for operands that are formally int!) There's no
>> reason to use a different type for the _storage_.
>>
> That is also nonsense, you cannot get same speed as 32bit instruction
> without having 8bit instruction with same performance.
> 
> Compiler must add extra truncation instructions to get correct result
> which slows it down, otherwise it gets wrong result for cases like (128+128)%3
> 

Only if the intermediate result (128+128) is assigned directly to a
variable with less precision than int.  Otherwise the whole expression
is calculated with int precision.

R.
  
Richard Earnshaw Feb. 11, 2015, 6:28 p.m. UTC | #10
On 11/02/15 18:26, Richard Earnshaw wrote:
> On 11/02/15 13:34, Ondřej Bílka wrote:
>> On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
>>> On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
>>>> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
>>>>
>>>> It is definitely slower with division, following code is 25% slower on
>>>> haswell with char than when you use long.
>>>
>>> This claim is nonsense. It's a compiler bug. If the 8-bit divide
>>> instruction is slow, then the compiler should use 32-bit or 64-bit
>>> divide instructions to divide 8-bit types. (Note: there's actually no
>>> such thing as a division of 8-byte types; formally, they're promoted
>>> to int, so it's the compiler being stupid if it generates a slow 8-bit
>>> divide instruction for operands that are formally int!) There's no
>>> reason to use a different type for the _storage_.
>>>
>> That is also nonsense, you cannot get same speed as 32bit instruction
>> without having 8bit instruction with same performance.
>>
>> Compiler must add extra truncation instructions to get correct result
>> which slows it down, otherwise it gets wrong result for cases like (128+128)%3
>>
> 
> Only if the intermediate result (128+128) is assigned directly to a
> variable with less precision than int. 

Or, of course, if an explicit cast is applied to the intermediate result.


> Otherwise the whole expression
> is calculated with int precision.
> 
> R.
>
  
Rich Felker Feb. 11, 2015, 6:57 p.m. UTC | #11
On Wed, Feb 11, 2015 at 06:26:01PM +0000, Richard Earnshaw wrote:
> On 11/02/15 13:34, Ondřej Bílka wrote:
> > On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
> >> On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
> >>> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> >>>
> >>> It is definitely slower with division, following code is 25% slower on
> >>> haswell with char than when you use long.
> >>
> >> This claim is nonsense. It's a compiler bug. If the 8-bit divide
> >> instruction is slow, then the compiler should use 32-bit or 64-bit
> >> divide instructions to divide 8-bit types. (Note: there's actually no
> >> such thing as a division of 8-byte types; formally, they're promoted
> >> to int, so it's the compiler being stupid if it generates a slow 8-bit
> >> divide instruction for operands that are formally int!) There's no
> >> reason to use a different type for the _storage_.
> >>
> > That is also nonsense, you cannot get same speed as 32bit instruction
> > without having 8bit instruction with same performance.
> > 
> > Compiler must add extra truncation instructions to get correct result
> > which slows it down, otherwise it gets wrong result for cases like (128+128)%3
> > 
> 
> Only if the intermediate result (128+128) is assigned directly to a
> variable with less precision than int.  Otherwise the whole expression
> is calculated with int precision.

I read it as an example with small numbers that's not directly
applicable to the C type ranges. If we're actually talking about the
value 128 in C then of course 128+128 is not subject to truncation.

Rich
  

Patch

diff --git a/sysdeps/generic/stdint.h b/sysdeps/generic/stdint.h
index 5c9f0ff..c6dda3b 100644
--- a/sysdeps/generic/stdint.h
+++ b/sysdeps/generic/stdint.h
@@ -87,12 +87,13 @@  typedef unsigned long long int	uint_least64_t;
 /* Fast types.  */
 
 /* Signed.  */
-typedef signed char		int_fast8_t;
 #if __WORDSIZE == 64
+typedef long int		int_fast8_t;
 typedef long int		int_fast16_t;
 typedef long int		int_fast32_t;
 typedef long int		int_fast64_t;
 #else
+typedef int			int_fast8_t;
 typedef int			int_fast16_t;
 typedef int			int_fast32_t;
 __extension__