Submitter | Adhemerval Zanella Netto |
---|---|

Date | Dec. 5, 2017, 1:47 p.m. |

Message ID | <a505c43a-1a44-34fc-f36b-243e328b34af@linaro.org> |

Download | mbox | patch |

Permalink | /patch/24736/ |

State | New |

Headers | show |

## Comments

On Tue, 5 Dec 2017, Adhemerval Zanella wrote: > +#else > + res = __builtin_floor (d); > +#endif > return res; > } > --- > > As least for GCC7 compiler will expand to an inline version even for generic > tune option. However the code that actually calls __floor is not showing on Which, as per discussions on gcc-patches in August, may not be a good idea with current GCC, because it means that your code built for generic SSE2 is now slower on an SSE4.1 processor than it would have been if it had called the __floor IFUNC. (That issue could be avoided by adding an SSE4.1 IFUNC of sinf, of course. I think we should aim to use floor etc. directly anyway - in cases where we can't avoid floor and just convert directly to integer - and address any such pessimizations in GCC rather than working around them in glibc.)

On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote: > And with a simple modification to avoid int to fp conversion: > > --- > diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c > index 40d3d19..a2fd3cf 100644 > --- a/sysdeps/ieee754/flt-32/s_sinf.c > +++ b/sysdeps/ieee754/flt-32/s_sinf.c > @@ -75,7 +75,7 @@ static const double invpio4_table[] = { > 0x1.0e4107cp-169 > }; > > -static const int ones[] = { +1, -1 }; > +static const double ones[] = { 1.0, -1.0 }; > > /* Compute the sine value using Chebyshev polynomials where > THETA is the range reduced absolute value of the input > @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n, > const double theta2 = theta * theta; > /* We are operating on |x|, so we need to add back the original > signbit for sinf. */ > - int sign; > + double sign; > /* Determine positive or negative primary interval. */ > sign = ones[((n >> 2) & 1) ^ signbit]; > /* Are we in the primary interval of sin or cos? */ > --- > > I get: > > "sinf": { > "": { > "duration": 4.0015e+10, > "iterations": 1.4535e+09, > "max": 640.456, > "min": 11.437, > "mean": 27.5301 > } > > Which is roughly 3% on mean and 11.5% on min. I think we can improve it > even more by avoiding the int to fp conversion to get the sign right > and try operate with sign as double argument. I tried it on Skylake with the current master. Before: "sinf": { "": { "duration": 3.4044e+10, "iterations": 1.9942e+09, "max": 141.106, "min": 7.704, "mean": 17.0715 } } After: "sinf": { "": { "duration": 3.40665e+10, "iterations": 2.03199e+09, "max": 95.994, "min": 7.704, "mean": 16.765 } } Generic is faster than asm now: "sinf": { "": { "duration": 3.40417e+10, "iterations": 1.87792e+09, "max": 138.868, "min": 8.546, "mean": 18.1273 } } Can you submit your patch? Thanks.

On 05/12/2017 11:57, Joseph Myers wrote: > On Tue, 5 Dec 2017, Adhemerval Zanella wrote: > >> +#else >> + res = __builtin_floor (d); >> +#endif >> return res; >> } >> --- >> >> As least for GCC7 compiler will expand to an inline version even for generic >> tune option. However the code that actually calls __floor is not showing on > > Which, as per discussions on gcc-patches in August, may not be a good idea > with current GCC, because it means that your code built for generic SSE2 > is now slower on an SSE4.1 processor than it would have been if it had > called the __floor IFUNC. (That issue could be avoided by adding an > SSE4.1 IFUNC of sinf, of course. I think we should aim to use floor etc. > directly anyway - in cases where we can't avoid floor and just convert > directly to integer - and address any such pessimizations in GCC rather > than working around them in glibc.) > Which indeed seems to be the case here as well (although the difference is quite small, but still noticeable). With latest H.J. Lu changes to remove floor it seems the small change to use double for sign operation also does not yield much change.

On 05/12/2017 14:56, H.J. Lu wrote: > On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella > <adhemerval.zanella@linaro.org> wrote: > >> And with a simple modification to avoid int to fp conversion: >> >> --- >> diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c >> index 40d3d19..a2fd3cf 100644 >> --- a/sysdeps/ieee754/flt-32/s_sinf.c >> +++ b/sysdeps/ieee754/flt-32/s_sinf.c >> @@ -75,7 +75,7 @@ static const double invpio4_table[] = { >> 0x1.0e4107cp-169 >> }; >> >> -static const int ones[] = { +1, -1 }; >> +static const double ones[] = { 1.0, -1.0 }; >> >> /* Compute the sine value using Chebyshev polynomials where >> THETA is the range reduced absolute value of the input >> @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n, >> const double theta2 = theta * theta; >> /* We are operating on |x|, so we need to add back the original >> signbit for sinf. */ >> - int sign; >> + double sign; >> /* Determine positive or negative primary interval. */ >> sign = ones[((n >> 2) & 1) ^ signbit]; >> /* Are we in the primary interval of sin or cos? */ >> --- >> >> I get: >> >> "sinf": { >> "": { >> "duration": 4.0015e+10, >> "iterations": 1.4535e+09, >> "max": 640.456, >> "min": 11.437, >> "mean": 27.5301 >> } >> >> Which is roughly 3% on mean and 11.5% on min. I think we can improve it >> even more by avoiding the int to fp conversion to get the sign right >> and try operate with sign as double argument. > > I tried it on Skylake with the current master. Before: > > "sinf": { > "": { > "duration": 3.4044e+10, > "iterations": 1.9942e+09, > "max": 141.106, > "min": 7.704, > "mean": 17.0715 > } > } > > After: > > "sinf": { > "": { > "duration": 3.40665e+10, > "iterations": 2.03199e+09, > "max": 95.994, > "min": 7.704, > "mean": 16.765 > } > } > > Generic is faster than asm now: > > "sinf": { > "": { > "duration": 3.40417e+10, > "iterations": 1.87792e+09, > "max": 138.868, > "min": 8.546, > "mean": 18.1273 > } > } > > Can you submit your patch? I will do it, thanks for checking this out.

## Patch

diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c index 40d3d19..a2fd3cf 100644 --- a/sysdeps/ieee754/flt-32/s_sinf.c +++ b/sysdeps/ieee754/flt-32/s_sinf.c @@ -75,7 +75,7 @@ static const double invpio4_table[] = { 0x1.0e4107cp-169 }; -static const int ones[] = { +1, -1 }; +static const double ones[] = { 1.0, -1.0 }; /* Compute the sine value using Chebyshev polynomials where THETA is the range reduced absolute value of the input @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n, const double theta2 = theta * theta; /* We are operating on |x|, so we need to add back the original signbit for sinf. */ - int sign; + double sign; /* Determine positive or negative primary interval. */ sign = ones[((n >> 2) & 1) ^ signbit]; /* Are we in the primary interval of sin or cos? */