x86-64: Add sinf with FMA
Commit Message
On Skylake, bench-sinf reports performance improvement:
        Before     After      Improvement
max     153.996    100.094    54%
min     8.546      6.852      25%
mean    18.1223    14.4616    25%
Any comments?
H.J.
---
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_sinf-sse2 and s_sinf-fma.
(CFLAGS-s_sinf-fma.c): New.
* sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
is undefined.
* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
---
sysdeps/x86_64/fpu/multiarch/Makefile | 5 ++++-
sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c | 3 +++
sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S | 3 +++
sysdeps/x86_64/fpu/multiarch/s_sinf.c | 28 ++++++++++++++++++++++++++++
sysdeps/x86_64/fpu/s_sinf.S | 2 ++
5 files changed, 40 insertions(+), 1 deletion(-)
create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf.c
Comments
On 04/12/2017 16:09, H.J. Lu wrote:
> On Skylake, bench-sinf reports performance improvement:
>
> Before After Improvement
> max 153.996 100.094 54%
> min 8.546 6.852 25%
> mean 18.1223 14.4616 25%
>
> Any comments?
>
> H.J.
> ---
> * sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
> Add s_sinf-sse2 and s_sinf-fma.
> (CFLAGS-s_sinf-fma.c): New.
> * sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
> is undefined.
> * sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
> * sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
> * sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
> ---
> sysdeps/x86_64/fpu/multiarch/Makefile | 5 ++++-
> sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c | 3 +++
> sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S | 3 +++
With new s_sinf.c generic implementation, does x86_64 still require an
assembly one?
On Mon, Dec 4, 2017 at 10:38 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 04/12/2017 16:09, H.J. Lu wrote:
>> On Skylake, bench-sinf reports performance improvement:
>>
>> Before After Improvement
>> max 153.996 100.094 54%
>> min 8.546 6.852 25%
>> mean 18.1223 14.4616 25%
>>
>> Any comments?
>>
>> H.J.
>> ---
>> * sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
>> Add s_sinf-sse2 and s_sinf-fma.
>> (CFLAGS-s_sinf-fma.c): New.
>> * sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
>> is undefined.
>> * sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
>> * sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
>> * sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
>> ---
>> sysdeps/x86_64/fpu/multiarch/Makefile | 5 ++++-
>> sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c | 3 +++
>> sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S | 3 +++
>
> With new s_sinf.c generic implementation, does x86_64 still require an
> assembly one?
They are very close, but the assembly version is a little bit faster.
Assembly:
"sinf": {
"": {
"duration": 3.40466e+10,
"iterations": 1.88083e+09,
"max": 137.398,
"min": 8.546,
"mean": 18.1019
}
}
Generic:
"sinf": {
"": {
"duration": 3.40946e+10,
"iterations": 1.54362e+09,
"max": 205.012,
"min": 7.704,
"mean": 22.0875
}
}
I think we should keep assembly version.
On Mon, 4 Dec 2017, H.J. Lu wrote:
> > With new s_sinf.c generic implementation, does x86_64 still require an
> > assembly one?
>
> They are very close. But assembly version is a little bit faster.
There may well be scope for further performance tuning of the generic
version.
On Mon, Dec 4, 2017 at 10:56 AM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> > With new s_sinf.c generic implementation, does x86_64 still require an
>> > assembly one?
>>
>> They are very close. But assembly version is a little bit faster.
>
> There may well be scope for further performance tuning of the generic
> version.
There are a couple of differences. Among them:
Generic:
U errno
U __floor
Assembly:
U __errno_location
Is
(*__errno_location ()) = xxx
faster? On x86-64, there should be no call to __floor.
On Mon, 4 Dec 2017, H.J. Lu wrote:
> Is
>
> (*__errno_location ()) = xxx
If anything I'd expect a direct TLS initial-exec access to errno to be
faster.
> faster? On x86-64, there should be no call to __floor.
The x86_64 __floor inline in math_private.h is only when compiling glibc
for SSE4.1 or later.
The case of inlining floor / __floor and related functions for x86_64
without SSE4.1 is tricky. Supposing we had appropriate asm redirects to
allow libm to call floor / ceil / trunc etc. directly so the compiler
could inline them but __* are still called if not inlined, the default
SSE2 inlines would come into play. But those inlines are slower on SSE4.1
hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
you're building a generic SSE2 glibc that may well be used on SSE4.1
hardware, you may wish either to avoid those inlines or, if there is a
significant performance difference in benchmarks, have an SSE4.1 IFUNC of
the calling function using floor (or __floor, with the present inline).
The expf etc. set of optimized float functions have several different
choices of how conversions to integer are handled, which may be configured
by an architecture. That may make sense in other cases as well.
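To make the inline in question concrete, here is a minimal sketch (assuming GCC targeting x86-64): with -msse4.1 the builtin below compiles to a single roundsd instruction, while a plain SSE2 build emits an out-of-line call, which is the trade-off being discussed. The function name is illustrative, not glibc code.

```c
#include <math.h>

/* Sketch: with -msse4.1 GCC compiles this to one roundsd instruction;
   in a generic SSE2 build it becomes an out-of-line call, which is why
   the math_private.h __floor inline is conditional on SSE4.1.  */
double
inline_floor (double x)
{
  return __builtin_floor (x);
}
```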
On Mon, Dec 4, 2017 at 12:59 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> Is
>>
>> (*__errno_location ()) = xxx
>
> If anything I'd expect a direct TLS initial-exec access to errno to be
> faster.
I will update x86-64 s_sinf.S to use errno.
>> faster? On x86-64, there should be no call to __floor.
>
> The x86_64 __floor inline in math_private.h is only when compiling glibc
> for SSE4.1 or later.
>
> The case of inlining floor / __floor and related functions for x86_64
> without SSE4.1 is tricky. Supposing we had appropriate asm redirects to
> allow libm to call floor / ceil / trunc etc. directly so the compiler
> could inline them but __* are still called if not inlined, the default
> SSE2 inlines would come into play. But those inlines are slower on SSE4.1
> hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
> you're building a generic SSE2 glibc that may well be used on SSE4.1
> hardware, you may wish either to avoid those inlines or, if there is a
> significant performance difference in benchmarks, have an SSE4.1 IFUNC of
> the calling function using floor (or __floor, with the present inline).
>
> The expf etc. set of optimized float functions have several different
> choices of how conversions to integer are handled, which may be configured
> by an architecture. That may make sense in other cases as well.
x86-64 s_sinf.S avoids floor () with cvttsd2si. I don't think it can be used
with generic sinf.
On 12/04/2017 09:59 PM, Joseph Myers wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> Is
>>
>> (*__errno_location ()) = xxx
> If anything I'd expect a direct TLS initial-exec access to errno to be
> faster.
On x86-64, direct TLS access usually results in more machine code than
calling __errno_location:
Direct access:      7 bytes for the load of the TLS offset
                    7 bytes for storing a constant to the TLS variable

__errno_location:   5 bytes for calling __errno_location
                    6 bytes for storing the constant
Since errno is typically accessed on error paths only, direct TLS access
is not a win because it bloats the code.
Even if stack alignment is required, it is typically possible to do this
with just a push/pop pair, adding two more bytes to the __errno_location
approach, so it is still one byte shorter than direct access.
Thanks,
Florian
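The two errno idioms being compared look like this in C (a sketch assuming glibc on Linux, where <errno.h> declares __errno_location; both store the same value and differ only in the machine code emitted):

```c
#include <errno.h>

/* Direct TLS (initial-exec) access: the compiler emits a load of the
   TLS offset plus a store through the thread pointer -- roughly 14
   bytes of code on x86-64, per the byte counts above.  */
void
set_errno_direct (void)
{
  errno = EDOM;
}

/* Out-of-line call: a 5-byte call to __errno_location plus a 6-byte
   store through the returned pointer -- smaller on a cold error path.  */
void
set_errno_call (void)
{
  *__errno_location () = EDOM;
}
```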
On 4 Dec 2017, H. J. Lu uttered the following:
> On Skylake, bench-sinf reports performance improvement:
>
> Before After Improvement
> max 153.996 100.094 54%
> min 8.546 6.852 25%
> mean 18.1223 14.4616 25%
>
> Any comments?
Do we have any benchmark runs on older processors? They're not remotely
obsolete: Intel is still selling Broadwell server parts, and the vast
majority of SSE2-capable parts out there at present are not as new as
Skylake.
Are we penalizing them? (I'd guess not, but it would be nice to know.)
On 12/8/2017 8:02 AM, Nick Alcock wrote:
> On 4 Dec 2017, H. J. Lu uttered the following:
>
>> On Skylake, bench-sinf reports performance improvement:
>>
>> Before After Improvement
>> max 153.996 100.094 54%
>> min 8.546 6.852 25%
>> mean 18.1223 14.4616 25%
>>
>> Any comments?
>
> Do we have any benchmark runs on older processors? They're not remotely
> obsolete: Intel is still selling Broadwell server parts, and the vast
> majority of SSE2-capable parts out there at present are not as new as
> Skylake.
>
> Are we penalizing them? (I'd guess not, but it would be nice to know.)
This is just adding another IFUNC for FMA-capable CPUs; the older CPUs
will keep using the code that they are running today.
So no, they're not being penalized.
On 12/8/2017 8:02 AM, Nick Alcock wrote:
> On 4 Dec 2017, H. J. Lu uttered the following:
>
>> On Skylake, bench-sinf reports performance improvement:
>>
>> Before After Improvement
>> max 153.996 100.094 54%
>> min 8.546 6.852 25%
>> mean 18.1223 14.4616 25%
>>
>> Any comments?
>
> Do we have any benchmark runs on older processors? They're not remotely
> obsolete: Intel is still selling Broadwell server parts, and the vast
> majority of SSE2-capable parts out there at present are not as new as
> Skylake.
(oh and Haswell & Broadwell also support FMA already)
On Fri, Dec 8, 2017 at 8:07 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/8/2017 8:02 AM, Nick Alcock wrote:
>>
>> On 4 Dec 2017, H. J. Lu uttered the following:
>>
>>> On Skylake, bench-sinf reports performance improvement:
>>>
>>> Before After Improvement
>>> max 153.996 100.094 54%
>>> min 8.546 6.852 25%
>>> mean 18.1223 14.4616 25%
>>>
>>> Any comments?
>>
>>
>> Do we have any benchmark runs on older processors? They're not remotely
>> obsolete: Intel is still selling Broadwell server parts, and the vast
>> majority of SSE2-capable parts out there at present are not as new as
>> Skylake.
>
>
> (oh and Haswell & Broadwell also support FMA already)
On Haswell, without FMA:
"sinf": {
"": {
"duration": 3.4905e+10,
"iterations": 1.91281e+09,
"max": 450.098,
"min": 8.091,
"mean": 18.2481
}
}
With FMA:
"sinf": {
"": {
"duration": 3.49046e+10,
"iterations": 2.44188e+09,
"max": 173.855,
"min": 7.253,
"mean": 14.2942
}
}
FMA is faster on HSW. I expect Broadwell will also be faster with FMA.
On 8 Dec 2017, H. J. Lu spake thusly:
> On Fri, Dec 8, 2017 at 8:07 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 12/8/2017 8:02 AM, Nick Alcock wrote:
>>>
>>> On 4 Dec 2017, H. J. Lu uttered the following:
>>>
>>>> On Skylake, bench-sinf reports performance improvement:
>>>>
>>>> Before After Improvement
>>>> max 153.996 100.094 54%
>>>> min 8.546 6.852 25%
>>>> mean 18.1223 14.4616 25%
>>>>
>>>> Any comments?
>>>
>>>
>>> Do we have any benchmark runs on older processors? They're not remotely
>>> obsolete: Intel is still selling Broadwell server parts, and the vast
>>> majority of SSE2-capable parts out there at present are not as new as
>>> Skylake.
>>
>> (oh and Haswell & Broadwell also support FMA already)
Yeah, I guess I was wondering if the speedup was actually due to FMA or
some other microarchitectural variation which might not be present on
older FMA-capable processors.
But my worries were for naught:
> "max": 450.098,
> "min": 8.091,
> "mean": 18.2481
> "max": 173.855,
> "min": 7.253,
> "mean": 14.2942
That's a nice speedup! :)
On 12/8/2017 8:16 AM, Nix wrote:
> Yeah, I guess I was wondering if the speedup was actually due to FMA or
> some other microarchitectural variation which might not be present on
> older FMA-capable processors.
I'm not aware of any system where an FMA would be slower than a discrete mul + add...
Even for HSW/BDW they're faster. Now on SKL the performance increase from FMA is *more*
(FMA is 4 cycles, but so are the individual add and mul, so it's really a 2x gain),
but that's more upside, not a downside for others.
@@ -37,14 +37,17 @@ CFLAGS-slowpow-fma.c = -mfma -mavx2
CFLAGS-s_sin-fma.c = -mfma -mavx2
CFLAGS-s_tan-fma.c = -mfma -mavx2
+libm-sysdep_routines += s_sinf-sse2
+
libm-sysdep_routines += e_exp2f-fma e_expf-fma e_log2f-fma e_logf-fma \
- e_powf-fma
+ e_powf-fma s_sinf-fma
CFLAGS-e_exp2f-fma.c = -mfma -mavx2
CFLAGS-e_expf-fma.c = -mfma -mavx2
CFLAGS-e_log2f-fma.c = -mfma -mavx2
CFLAGS-e_logf-fma.c = -mfma -mavx2
CFLAGS-e_powf-fma.c = -mfma -mavx2
+CFLAGS-s_sinf-fma.c = -mfma -mavx2
libm-sysdep_routines += e_exp-fma4 e_log-fma4 e_pow-fma4 s_atan-fma4 \
e_asin-fma4 e_atan2-fma4 s_sin-fma4 s_tan-fma4 \
new file mode 100644
@@ -0,0 +1,3 @@
+#define SINF __sinf_fma
+
+#include <sysdeps/ieee754/flt-32/s_sinf.c>
new file mode 100644
@@ -0,0 +1,3 @@
+#define __sinf __sinf_sse2
+
+#include <sysdeps/x86_64/fpu/s_sinf.S>
new file mode 100644
@@ -0,0 +1,28 @@
+/* Multiple versions of sinf.
+ Copyright (C) 2017 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <libm-alias-float.h>
+
+extern float __redirect_sinf (float);
+
+#define SYMBOL_NAME sinf
+#include "ifunc-fma.h"
+
+libc_ifunc_redirected (__redirect_sinf, __sinf, IFUNC_SELECTOR ());
+
+libm_alias_float (__sin, sin)
@@ -556,4 +556,6 @@ L(SP_ABS_MASK): /* Mask for getting SP absolute value */
.type L(SP_ABS_MASK), @object
ASM_SIZE_DIRECTIVE(L(SP_ABS_MASK))
+#ifndef __sinf
libm_alias_float (__sin, sin)
+#endif