x86-64: Add sinf with FMA

Message ID 20171204180905.GA31592@gmail.com
State New, archived
Headers

Commit Message

Lu, Hongjiu Dec. 4, 2017, 6:09 p.m. UTC
  On Skylake, bench-sinf reports performance improvement:

            Before        After         Improvement
max        153.996       100.094           54%
min        8.546         6.852             25%
mean       18.1223       14.4616           25%

Any comments?

H.J.
---
	* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
	Add s_sinf-sse2 and s_sinf-fma.
	(CFLAGS-s_sinf-fma.c): New.
	* sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
	is undefined.
	* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
	* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
---
 sysdeps/x86_64/fpu/multiarch/Makefile      |  5 ++++-
 sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c  |  3 +++
 sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S |  3 +++
 sysdeps/x86_64/fpu/multiarch/s_sinf.c      | 28 ++++++++++++++++++++++++++++
 sysdeps/x86_64/fpu/s_sinf.S                |  2 ++
 5 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf.c
  

Comments

Adhemerval Zanella Dec. 4, 2017, 6:38 p.m. UTC | #1
On 04/12/2017 16:09, H.J. Lu wrote:
> On Skylake, bench-sinf reports performance improvement:
> 
>             Before        After         Improvement
> max        153.996       100.094           54%
> min        8.546         6.852             25%
> mean       18.1223       14.4616           25%
> 
> Any comments?
> 
> H.J.
> ---
> 	* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
> 	Add s_sinf-sse2 and s_sinf-fma.
> 	(CFLAGS-s_sinf-fma.c): New.
> 	* sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
> 	is undefined.
> 	* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
> 	* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
> 	* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
> ---
>  sysdeps/x86_64/fpu/multiarch/Makefile      |  5 ++++-
>  sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c  |  3 +++
>  sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S |  3 +++

With new s_sinf.c generic implementation, does x86_64 still require an 
assembly one?
  
H.J. Lu Dec. 4, 2017, 6:51 p.m. UTC | #2
On Mon, Dec 4, 2017 at 10:38 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 04/12/2017 16:09, H.J. Lu wrote:
>> On Skylake, bench-sinf reports performance improvement:
>>
>>             Before        After         Improvement
>> max        153.996       100.094           54%
>> min        8.546         6.852             25%
>> mean       18.1223       14.4616           25%
>>
>> Any comments?
>>
>> H.J.
>> ---
>>       * sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
>>       Add s_sinf-sse2 and s_sinf-fma.
>>       (CFLAGS-s_sinf-fma.c): New.
>>       * sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
>>       is undefined.
>>       * sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
>>       * sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
>>       * sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
>> ---
>>  sysdeps/x86_64/fpu/multiarch/Makefile      |  5 ++++-
>>  sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c  |  3 +++
>>  sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S |  3 +++
>
> With new s_sinf.c generic implementation, does x86_64 still require an
> assembly one?

They are very close.  But assembly version is a little bit faster.

Assembly:

  "sinf": {
   "": {
    "duration": 3.40466e+10,
    "iterations": 1.88083e+09,
    "max": 137.398,
    "min": 8.546,
    "mean": 18.1019
   }
  }

Generic:

  "sinf": {
   "": {
    "duration": 3.40946e+10,
    "iterations": 1.54362e+09,
    "max": 205.012,
    "min": 7.704,
    "mean": 22.0875
   }
  }

I think we should keep assembly version.
  
Joseph Myers Dec. 4, 2017, 6:56 p.m. UTC | #3
On Mon, 4 Dec 2017, H.J. Lu wrote:

> > With new s_sinf.c generic implementation, does x86_64 still require an
> > assembly one?
> 
> They are very close.  But assembly version is a little bit faster.

There may well be scope for further performance tuning of the generic 
version.
  
H.J. Lu Dec. 4, 2017, 7:41 p.m. UTC | #4
On Mon, Dec 4, 2017 at 10:56 AM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> > With new s_sinf.c generic implementation, does x86_64 still require an
>> > assembly one?
>>
>> They are very close.  But assembly version is a little bit faster.
>
> There may well be scope for further performance tuning of the generic
> version.


There are a couple differences.  Among them:

Generic:

                 U errno
                 U __floor

Assembly:

                 U __errno_location

Is

(*__errno_location ()) = xxx

faster?  On x86-64, there should be no call to __floor.
  
Joseph Myers Dec. 4, 2017, 8:59 p.m. UTC | #5
On Mon, 4 Dec 2017, H.J. Lu wrote:

> Is
> 
> (*__errno_location ()) = xxx

If anything I'd expect a direct TLS initial-exec access to errno to be 
faster.

> faster?  On x86-64, there should be no call to __floor.

The x86_64 __floor inline in math_private.h is only when compiling glibc 
for SSE4.1 or later.

The case of inlining floor / __floor and related functions for x86_64 
without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to 
allow libm to call floor / ceil / trunc etc. directly so the compiler 
could inline them but __* are still called if not inlined, the default 
SSE2 inlines would come into play.  But those inlines are slower on SSE4.1 
hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if 
you're building a generic SSE2 glibc that may well be used on SSE4.1 
hardware, you may wish either to avoid those inlines or, if there is a 
significant performance difference in benchmarks, have an SSE4.1 IFUNC of 
the calling function using floor (or __floor, with the present inline).

The expf etc. set of optimized float functions have several different 
choices of how conversions to integer are handled, which may be configured 
by an architecture.  That may make sense in other cases as well.
  
H.J. Lu Dec. 4, 2017, 10:42 p.m. UTC | #6
On Mon, Dec 4, 2017 at 12:59 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> Is
>>
>> (*__errno_location ()) = xxx
>
> If anything I'd expect a direct TLS initial-exec access to errno to be
> faster.

I will update x86-64 s_sinf.S to use errno.

>> faster?  On x86-64, there should be no call to __floor.
>
> The x86_64 __floor inline in math_private.h is only when compiling glibc
> for SSE4.1 or later.
>
> The case of inlining floor / __floor and related functions for x86_64
> without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to
> allow libm to call floor / ceil / trunc etc. directly so the compiler
> could inline them but __* are still called if not inlined, the default
> SSE2 inlines would come into play.  But those inlines are slower on SSE4.1
> hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
> you're building a generic SSE2 glibc that may well be used on SSE4.1
> hardware, you may wish either to avoid those inlines or, if there is a
> significant performance difference in benchmarks, have an SSE4.1 IFUNC of
> the calling function using floor (or __floor, with the present inline).
>
> The expf etc. set of optimized float functions have several different
> choices of how conversions to integer are handled, which may be configured
> by an architecture.  That may make sense in other cases as well.

x86-64 s_sinf.S avoids floor () with cvttsd2si.  I don't think it can be used
with generic sinf.
  
Florian Weimer Dec. 5, 2017, 1:44 p.m. UTC | #7
On 12/04/2017 09:59 PM, Joseph Myers wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
> 
>> Is
>>
>> (*__errno_location ()) = xxx
> If anything I'd expect a direct TLS initial-exec access to errno to be
> faster.

On x86-64, direct TLS access usually results in more machine code than 
calling __errno_location:

   Direct access:    7 bytes for the load of the TLS offset
                     7 bytes for storing a constant to the TLS variable
   __errno_location: 5 bytes for calling __errno_location
                     6 bytes for storing the constant

Since errno is typically accessed on error paths only, direct TLS access 
is not a win because it bloats the code.

Even if stack alignment is required, it is typically possible to d this 
with just a push/pop pair, adding two more bytes to the __errno_location 
approach, so it is still one byte shorter than direct access.

Thanks,
Florian
  
Nix Dec. 8, 2017, 4:02 p.m. UTC | #8
On 4 Dec 2017, H. J. Lu uttered the following:

> On Skylake, bench-sinf reports performance improvement:
>
>             Before        After         Improvement
> max        153.996       100.094           54%
> min        8.546         6.852             25%
> mean       18.1223       14.4616           25%
>
> Any comments?

Do we have any benchmark runs on older processors? They're not remotely
obsolete: Intel is still selling Broadwell server parts, and the vast
majority of SSE2-capable parts out there at present are not as new as
Skylake.

Are we penalizing them? (I'd guess not, but it would be nice to know.)
  
Arjan van de Ven Dec. 8, 2017, 4:04 p.m. UTC | #9
On 12/8/2017 8:02 AM, Nick Alcock wrote:
> On 4 Dec 2017, H. J. Lu uttered the following:
> 
>> On Skylake, bench-sinf reports performance improvement:
>>
>>              Before        After         Improvement
>> max        153.996       100.094           54%
>> min        8.546         6.852             25%
>> mean       18.1223       14.4616           25%
>>
>> Any comments?
> 
> Do we have any benchmark runs on older processors? They're not remotely
> obsolete: Intel is still selling Broadwell server parts, and the vast
> majority of SSE2-capable parts out there at present are not as new as
> Skylake.
> 
> Are we penalizing them? (I'd guess not, but it would be nice to know.)

this is just adding another IFUNC for FMA capable CPUs.. the older CPUs
will keep using the code that they are running today....
so no they're not being penalized.
  
Arjan van de Ven Dec. 8, 2017, 4:07 p.m. UTC | #10
On 12/8/2017 8:02 AM, Nick Alcock wrote:
> On 4 Dec 2017, H. J. Lu uttered the following:
> 
>> On Skylake, bench-sinf reports performance improvement:
>>
>>              Before        After         Improvement
>> max        153.996       100.094           54%
>> min        8.546         6.852             25%
>> mean       18.1223       14.4616           25%
>>
>> Any comments?
> 
> Do we have any benchmark runs on older processors? They're not remotely
> obsolete: Intel is still selling Broadwell server parts, and the vast
> majority of SSE2-capable parts out there at present are not as new as
> Skylake.

(oh and Haswell & Broadwell also support FMA already)
  
H.J. Lu Dec. 8, 2017, 4:11 p.m. UTC | #11
On Fri, Dec 8, 2017 at 8:07 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/8/2017 8:02 AM, Nick Alcock wrote:
>>
>> On 4 Dec 2017, H. J. Lu uttered the following:
>>
>>> On Skylake, bench-sinf reports performance improvement:
>>>
>>>              Before        After         Improvement
>>> max        153.996       100.094           54%
>>> min        8.546         6.852             25%
>>> mean       18.1223       14.4616           25%
>>>
>>> Any comments?
>>
>>
>> Do we have any benchmark runs on older processors? They're not remotely
>> obsolete: Intel is still selling Broadwell server parts, and the vast
>> majority of SSE2-capable parts out there at present are not as new as
>> Skylake.
>
>
> (oh and Haswell & Broadwell also support FMA already)

On Haswell, without FMA:

  "sinf": {
   "": {
    "duration": 3.4905e+10,
    "iterations": 1.91281e+09,
    "max": 450.098,
    "min": 8.091,
    "mean": 18.2481
   }
  }

With FMA:

  "sinf": {
   "": {
    "duration": 3.49046e+10,
    "iterations": 2.44188e+09,
    "max": 173.855,
    "min": 7.253,
    "mean": 14.2942
   }
  }

FMA is faster on HSW.  I expect Broadwell should also be faster with FMA.
  
Nix Dec. 8, 2017, 4:16 p.m. UTC | #12
On 8 Dec 2017, H. J. Lu spake thusly:

> On Fri, Dec 8, 2017 at 8:07 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 12/8/2017 8:02 AM, Nick Alcock wrote:
>>>
>>> On 4 Dec 2017, H. J. Lu uttered the following:
>>>
>>>> On Skylake, bench-sinf reports performance improvement:
>>>>
>>>>              Before        After         Improvement
>>>> max        153.996       100.094           54%
>>>> min        8.546         6.852             25%
>>>> mean       18.1223       14.4616           25%
>>>>
>>>> Any comments?
>>>
>>>
>>> Do we have any benchmark runs on older processors? They're not remotely
>>> obsolete: Intel is still selling Broadwell server parts, and the vast
>>> majority of SSE2-capable parts out there at present are not as new as
>>> Skylake.
>>
>> (oh and Haswell & Broadwell also support FMA already)

Yeah, I guess I was wondering if the speedup was actually due to FMA or
some other microarchitectural variation which might not be present on
older FMA-capable processors.

But my worries were for naught:

>     "max": 450.098,
>     "min": 8.091,
>     "mean": 18.2481

>     "max": 173.855,
>     "min": 7.253,
>     "mean": 14.2942

That's a nice speedup! :)
  
Arjan van de Ven Dec. 8, 2017, 4:32 p.m. UTC | #13
On 12/8/2017 8:16 AM, Nix wrote:
> Yeah, I guess I was wondering if the speedup was actually due to FMA or
> some other microarchitectural variation which might not be present on
> older FMA-capable processors.

I'm not aware of any system where a FMA would be slower than discrete mul + add...
Even for HSW/BDW they're faster. Now on SKL the performance increase of FMA is *more*
(FMA is 4 cycles, but so are individual add and mul .. so it's really 2x gain)
but that's more upside, not a downside for others.
  

Patch

diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
index c78624b47d..cab84bff3a 100644
--- a/sysdeps/x86_64/fpu/multiarch/Makefile
+++ b/sysdeps/x86_64/fpu/multiarch/Makefile
@@ -37,14 +37,17 @@  CFLAGS-slowpow-fma.c = -mfma -mavx2
 CFLAGS-s_sin-fma.c = -mfma -mavx2
 CFLAGS-s_tan-fma.c = -mfma -mavx2
 
+libm-sysdep_routines += s_sinf-sse2
+
 libm-sysdep_routines += e_exp2f-fma e_expf-fma e_log2f-fma e_logf-fma \
-			e_powf-fma
+			e_powf-fma s_sinf-fma
 
 CFLAGS-e_exp2f-fma.c = -mfma -mavx2
 CFLAGS-e_expf-fma.c = -mfma -mavx2
 CFLAGS-e_log2f-fma.c = -mfma -mavx2
 CFLAGS-e_logf-fma.c = -mfma -mavx2
 CFLAGS-e_powf-fma.c = -mfma -mavx2
+CFLAGS-s_sinf-fma.c = -mfma -mavx2
 
 libm-sysdep_routines += e_exp-fma4 e_log-fma4 e_pow-fma4 s_atan-fma4 \
 			e_asin-fma4 e_atan2-fma4 s_sin-fma4 s_tan-fma4 \
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c b/sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
new file mode 100644
index 0000000000..5f6e17fc4d
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
@@ -0,0 +1,3 @@ 
+#define SINF __sinf_fma
+
+#include <sysdeps/ieee754/flt-32/s_sinf.c>
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S b/sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
new file mode 100644
index 0000000000..523456c66c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
@@ -0,0 +1,3 @@ 
+#define __sinf __sinf_sse2
+
+#include <sysdeps/x86_64/fpu/s_sinf.S>
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf.c b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
new file mode 100644
index 0000000000..831bc6f131
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
@@ -0,0 +1,28 @@ 
+/* Multiple versions of sinf.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <libm-alias-float.h>
+
+extern float __redirect_sinf (float);
+
+#define SYMBOL_NAME sinf
+#include "ifunc-fma.h"
+
+libc_ifunc_redirected (__redirect_sinf, __sinf, IFUNC_SELECTOR ());
+
+libm_alias_float (__sin, sin)
diff --git a/sysdeps/x86_64/fpu/s_sinf.S b/sysdeps/x86_64/fpu/s_sinf.S
index c505d60091..db7dce0c8a 100644
--- a/sysdeps/x86_64/fpu/s_sinf.S
+++ b/sysdeps/x86_64/fpu/s_sinf.S
@@ -556,4 +556,6 @@  L(SP_ABS_MASK): /* Mask for getting SP absolute value */
 	.type L(SP_ABS_MASK), @object
 	ASM_SIZE_DIRECTIVE(L(SP_ABS_MASK))
 
+#ifndef __sinf
 libm_alias_float (__sin, sin)
+#endif