Patchwork x86-64: Add sinf with FMA

login
register
mail settings
Submitter Adhemerval Zanella Netto
Date Dec. 5, 2017, 1:47 p.m.
Message ID <a505c43a-1a44-34fc-f36b-243e328b34af@linaro.org>
Download mbox | patch
Permalink /patch/24736/
State New
Headers show

Comments

Adhemerval Zanella Netto - Dec. 5, 2017, 1:47 p.m.
On 04/12/2017 20:42, H.J. Lu wrote:
> On Mon, Dec 4, 2017 at 12:59 PM, Joseph Myers <joseph@codesourcery.com> wrote:
>> On Mon, 4 Dec 2017, H.J. Lu wrote:
>>
>>> Is
>>>
>>> (*__errno_location ()) = xxx
>>
>> If anything I'd expect a direct TLS initial-exec access to errno to be
>> faster.
> 
> I will update x86-64 s_sinf.S to use errno.
> 
>>> faster?  On x86-64, there should be no call to __floor.
>>
>> The x86_64 __floor inline in math_private.h is only when compiling glibc
>> for SSE4.1 or later.
>>
>> The case of inlining floor / __floor and related functions for x86_64
>> without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to
>> allow libm to call floor / ceil / trunc etc. directly so the compiler
>> could inline them but __* are still called if not inlined, the default
>> SSE2 inlines would come into play.  But those inlines are slower on SSE4.1
>> hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
>> you're building a generic SSE2 glibc that may well be used on SSE4.1
>> hardware, you may wish either to avoid those inlines or, if there is a
>> significant performance difference in benchmarks, have an SSE4.1 IFUNC of
>> the calling function using floor (or __floor, with the present inline).
>>
>> The expf etc. set of optimized float functions have several different
>> choices of how conversions to integer are handled, which may be configured
>> by an architecture.  That may make sense in other cases as well.
> 
> x86-64 s_sinf.S avoids floor () with cvttsd2si.  I don't think it can be used
> with generic sinf.
> 

I think we can the compiler builtins for math_private.h as a fallback for non
SSE 4.1 as:

---
@@ -109,11 +116,15 @@ extern __always_inline double
 __floor (double d)
 {
   double res;
+#ifdef __SSE4_1__
 # if defined __AVX__ || defined SSE2AVX
   asm ("vroundsd $1, %1, %0, %0" : "=x" (res) : "xm" (d));
 # else
   asm ("roundsd $1, %1, %0" : "=x" (res) : "xm" (d));
 # endif
+#else
+  res = __builtin_floor (d);
+#endif
   return res;
 }
---

As least for GCC7 compiler will expand to an inline version even for generic
tune option.  However the code that actually calls __floor is not showing on
profiling, but rather int to fp conversion.

Generic implementation shows on my system (gcc version 7.1.1, i7-4790K):

  "sinf": {
   "": {
    "duration": 4.00221e+10,
    "iterations": 1.4128e+09,
    "max": 587.555,
    "min": 12.747,
    "mean": 28.3281
   }

And with a simple modification to avoid int to fp conversion:

---
---

I get:

  "sinf": {
   "": {
    "duration": 4.0015e+10,
    "iterations": 1.4535e+09,
    "max": 640.456,
    "min": 11.437,
    "mean": 27.5301
   }

Which is roughly 3% on mean and 11.5% on min. I think we can improve it
even more by avoiding the int to fp conversion to get the sign right
and try operate with sign as double argument.
Joseph Myers - Dec. 5, 2017, 1:57 p.m.
On Tue, 5 Dec 2017, Adhemerval Zanella wrote:

> +#else
> +  res = __builtin_floor (d);
> +#endif
>    return res;
>  }
> ---
> 
> As least for GCC7 compiler will expand to an inline version even for generic
> tune option.  However the code that actually calls __floor is not showing on

Which, as per discussions on gcc-patches in August, may not be a good idea 
with current GCC, because it means that your code built for generic SSE2 
is now slower on an SSE4.1 processor than it would have been if it had 
called the __floor IFUNC.  (That issue could be avoided by adding an 
SSE4.1 IFUNC of sinf, of course.  I think we should aim to use floor etc. 
directly anyway - in cases where we can't avoid floor and just convert 
directly to integer - and address any such pessimizations in GCC rather 
than working around them in glibc.)
H.J. Lu - Dec. 5, 2017, 4:56 p.m.
On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:

> And with a simple modification to avoid int to fp conversion:
>
> ---
> diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
> index 40d3d19..a2fd3cf 100644
> --- a/sysdeps/ieee754/flt-32/s_sinf.c
> +++ b/sysdeps/ieee754/flt-32/s_sinf.c
> @@ -75,7 +75,7 @@ static const double invpio4_table[] = {
>    0x1.0e4107cp-169
>  };
>
> -static const int ones[] = { +1, -1 };
> +static const double ones[] = { 1.0, -1.0 };
>
>  /* Compute the sine value using Chebyshev polynomials where
>     THETA is the range reduced absolute value of the input
> @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
>    const double theta2 = theta * theta;
>    /* We are operating on |x|, so we need to add back the original
>       signbit for sinf.  */
> -  int sign;
> +  double sign;
>    /* Determine positive or negative primary interval.  */
>    sign = ones[((n >> 2) & 1) ^ signbit];
>    /* Are we in the primary interval of sin or cos?  */
> ---
>
> I get:
>
>   "sinf": {
>    "": {
>     "duration": 4.0015e+10,
>     "iterations": 1.4535e+09,
>     "max": 640.456,
>     "min": 11.437,
>     "mean": 27.5301
>    }
>
> Which is roughly 3% on mean and 11.5% on min. I think we can improve it
> even more by avoiding the int to fp conversion to get the sign right
> and try operate with sign as double argument.

I tried it on Skylake with the current master.  Before:

  "sinf": {
   "": {
    "duration": 3.4044e+10,
    "iterations": 1.9942e+09,
    "max": 141.106,
    "min": 7.704,
    "mean": 17.0715
   }
  }

After:

  "sinf": {
   "": {
    "duration": 3.40665e+10,
    "iterations": 2.03199e+09,
    "max": 95.994,
    "min": 7.704,
    "mean": 16.765
   }
  }

Generic is faster than asm now:

  "sinf": {
   "": {
    "duration": 3.40417e+10,
    "iterations": 1.87792e+09,
    "max": 138.868,
    "min": 8.546,
    "mean": 18.1273
   }
  }

Can you submit your patch?

Thanks.
Adhemerval Zanella Netto - Dec. 5, 2017, 5:06 p.m.
On 05/12/2017 11:57, Joseph Myers wrote:
> On Tue, 5 Dec 2017, Adhemerval Zanella wrote:
> 
>> +#else
>> +  res = __builtin_floor (d);
>> +#endif
>>    return res;
>>  }
>> ---
>>
>> As least for GCC7 compiler will expand to an inline version even for generic
>> tune option.  However the code that actually calls __floor is not showing on
> 
> Which, as per discussions on gcc-patches in August, may not be a good idea 
> with current GCC, because it means that your code built for generic SSE2 
> is now slower on an SSE4.1 processor than it would have been if it had 
> called the __floor IFUNC.  (That issue could be avoided by adding an 
> SSE4.1 IFUNC of sinf, of course.  I think we should aim to use floor etc. 
> directly anyway - in cases where we can't avoid floor and just convert 
> directly to integer - and address any such pessimizations in GCC rather 
> than working around them in glibc.)
> 

Which indeed seems to be the case here as well (although the difference is
quite small, but still noticeable).  With latest H.J. Lu changes to remove
floor it seems the small change to use double for sign operation also does
not yield much change.
Adhemerval Zanella Netto - Dec. 5, 2017, 5:09 p.m.
On 05/12/2017 14:56, H.J. Lu wrote:
> On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
> 
>> And with a simple modification to avoid int to fp conversion:
>>
>> ---
>> diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
>> index 40d3d19..a2fd3cf 100644
>> --- a/sysdeps/ieee754/flt-32/s_sinf.c
>> +++ b/sysdeps/ieee754/flt-32/s_sinf.c
>> @@ -75,7 +75,7 @@ static const double invpio4_table[] = {
>>    0x1.0e4107cp-169
>>  };
>>
>> -static const int ones[] = { +1, -1 };
>> +static const double ones[] = { 1.0, -1.0 };
>>
>>  /* Compute the sine value using Chebyshev polynomials where
>>     THETA is the range reduced absolute value of the input
>> @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
>>    const double theta2 = theta * theta;
>>    /* We are operating on |x|, so we need to add back the original
>>       signbit for sinf.  */
>> -  int sign;
>> +  double sign;
>>    /* Determine positive or negative primary interval.  */
>>    sign = ones[((n >> 2) & 1) ^ signbit];
>>    /* Are we in the primary interval of sin or cos?  */
>> ---
>>
>> I get:
>>
>>   "sinf": {
>>    "": {
>>     "duration": 4.0015e+10,
>>     "iterations": 1.4535e+09,
>>     "max": 640.456,
>>     "min": 11.437,
>>     "mean": 27.5301
>>    }
>>
>> Which is roughly 3% on mean and 11.5% on min. I think we can improve it
>> even more by avoiding the int to fp conversion to get the sign right
>> and try operate with sign as double argument.
> 
> I tried it on Skylake with the current master.  Before:
> 
>   "sinf": {
>    "": {
>     "duration": 3.4044e+10,
>     "iterations": 1.9942e+09,
>     "max": 141.106,
>     "min": 7.704,
>     "mean": 17.0715
>    }
>   }
> 
> After:
> 
>   "sinf": {
>    "": {
>     "duration": 3.40665e+10,
>     "iterations": 2.03199e+09,
>     "max": 95.994,
>     "min": 7.704,
>     "mean": 16.765
>    }
>   }
> 
> Generic is faster than asm now:
> 
>   "sinf": {
>    "": {
>     "duration": 3.40417e+10,
>     "iterations": 1.87792e+09,
>     "max": 138.868,
>     "min": 8.546,
>     "mean": 18.1273
>    }
>   }
> 
> Can you submit your patch?

I will do it, thanks for checking this out.

Patch

diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
index 40d3d19..a2fd3cf 100644
--- a/sysdeps/ieee754/flt-32/s_sinf.c
+++ b/sysdeps/ieee754/flt-32/s_sinf.c
@@ -75,7 +75,7 @@  static const double invpio4_table[] = {
   0x1.0e4107cp-169
 };

-static const int ones[] = { +1, -1 };
+static const double ones[] = { 1.0, -1.0 };

 /* Compute the sine value using Chebyshev polynomials where
    THETA is the range reduced absolute value of the input
@@ -92,7 +92,7 @@  reduced (const double theta, const unsigned long int n,
   const double theta2 = theta * theta;
   /* We are operating on |x|, so we need to add back the original
      signbit for sinf.  */
-  int sign;
+  double sign;
   /* Determine positive or negative primary interval.  */
   sign = ones[((n >> 2) & 1) ^ signbit];
   /* Are we in the primary interval of sin or cos?  */