From patchwork Thu Apr 30 15:03:53 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Senkevich X-Patchwork-Id: 6509 Received: (qmail 94239 invoked by alias); 30 Apr 2015 15:04:40 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 94228 invoked by uid 89); 30 Apr 2015 15:04:38 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL, BAYES_50, FREEMAIL_FROM, RCVD_IN_DNSWL_LOW, SPF_PASS autolearn=ham version=3.3.2 X-HELO: mail-lb0-f182.google.com X-Received: by 10.112.180.201 with SMTP id dq9mr4358361lbc.78.1430406263957; Thu, 30 Apr 2015 08:04:23 -0700 (PDT) MIME-Version: 1.0 From: Andrew Senkevich Date: Thu, 30 Apr 2015 18:03:53 +0300 Message-ID: Subject: [PATCH 1/14] [x86_64] Vector math functions (vector cos) To: libc-alpha Hi, this patch starts series of patches with vectorized implementations of several math functions for x86_64. Here is implementation of cos containing SSE, AVX, AVX2 and AVX512 versions according to Vector ABI which was discussed in https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4. This patch also adds vector math library build and ABI check enabled by default for x86_64. 2015-04-30 Andrew Senkevich * sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S: New file. * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file. * sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S: New file. * sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S: New file. * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file. * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file. * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file. * sysdeps/x86_64/fpu/Versions: New file. * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC. * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos. * sysdeps/x86_64/configure: Regenerated. * sysdeps/x86_64/configure.ac: Options for libmvec build. * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file. * sysdeps/x86_64/fpu/Makefile: New file. * sysdeps/x86_64/fpu/multiarch/Makefile: Added build of SSE version which is IFUNC. Is this patch ok? --- WBR, Andrew diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index e8e5577..85a6a95 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -60,7 +60,7 @@ __MATHCALL (atan,, (_Mdouble_ __x)); __MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x)); /* Cosine of X. */ -__MATHCALL (cos,, (_Mdouble_ __x)); +__MATHCALL_VEC (cos,, (_Mdouble_ __x)); /* Sine of X. */ __MATHCALL (sin,, (_Mdouble_ __x)); /* Tangent of X. */ diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist new file mode 100644 index 0000000..be6eaed --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -0,0 +1,6 @@ +GLIBC_2.22 + GLIBC_2.22 A + _ZGVbN2v_cos F + _ZGVcN4v_cos F + _ZGVdN4v_cos F + _ZGVeN8v_cos F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h new file mode 100644 index 0000000..6540268 --- /dev/null +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -0,0 +1,34 @@ +/* Platform-specific SIMD declarations of math functions. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef _MATH_H +# error "Never include directly;\ + include instead." +#endif + +/* Get default empty definitions for simd declarations. */ +#include + +#if defined __x86_64__ && defined __FAST_MATH__ +# if defined _OPENMP && _OPENMP >= 201307 +/* OpenMP case. */ +# define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch") +# undef __DECL_SIMD_cos +# define __DECL_SIMD_cos __DECL_SIMD_x86_64 +# endif +#endif diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure index 7d4dadd..1493523 100644 --- a/sysdeps/x86_64/configure +++ b/sysdeps/x86_64/configure @@ -275,6 +275,10 @@ fi config_vars="$config_vars config-cflags-avx2 = $libc_cv_cc_avx2" +if test x"$build_mathvec" = xnotset; then + build_mathvec=yes +fi + $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h # work around problem with autoconf and empty lines at the end of files diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac index c9f9a51..1c2b35f 100644 --- a/sysdeps/x86_64/configure.ac +++ b/sysdeps/x86_64/configure.ac @@ -99,6 +99,10 @@ if test $libc_cv_cc_avx2 = yes; then fi LIBC_CONFIG_VAR([config-cflags-avx2], [$libc_cv_cc_avx2]) +if test x"$build_mathvec" = xnotset; then + build_mathvec=yes +fi + dnl It is always possible to access static and hidden symbols in an dnl position independent way. AC_DEFINE(PI_STATIC_AND_HIDDEN) diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile new file mode 100644 index 0000000..c24188f --- /dev/null +++ b/sysdeps/x86_64/fpu/Makefile @@ -0,0 +1,5 @@ +ifeq ($(subdir),mathvec) +libmvec-support += svml_d_cos2_core_sse svml_d_cos4_core_avx \ + svml_d_cos4_core_avx2 svml_d_cos8_core_avx512 \ + svml_d_cos_data init-arch +endif diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions new file mode 100644 index 0000000..b38ed07 --- /dev/null +++ b/sysdeps/x86_64/fpu/Versions @@ -0,0 +1,8 @@ +libmvec { + GLIBC_2.22 { + _ZGVbN2v_cos; + _ZGVcN4v_cos; + _ZGVdN4v_cos; + _ZGVeN8v_cos; + } +} diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile index 12b0526..062d8ce 100644 --- a/sysdeps/x86_64/fpu/multiarch/Makefile +++ b/sysdeps/x86_64/fpu/multiarch/Makefile @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX endif endif + +ifeq ($(subdir),mathvec) +libmvec-support += svml_d_cos2_core +endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S new file mode 100644 index 0000000..29c6afd --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S @@ -0,0 +1,34 @@ +/* Multiple versions of vectorized cos. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include + + .text +ENTRY(_ZGVbN2v_cos) + .type _ZGVbN2v_cos, @gnu_indirect_function + cmpl $0, KIND_OFFSET+__cpu_features(%rip) + jne 1f + call __init_cpu_features +1: leaq _ZGVbN2v_cos_sse4(%rip), %rax + testl $bit_SSE4_1, __cpu_features+CPUID_OFFSET+index_SSE4_1(%rip) + jz 2f + ret +2: leaq _ZGVbN2v_cos_sse2(%rip), %rax + ret +END(_ZGVbN2v_cos) diff --git a/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S new file mode 100644 index 0000000..16da39b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S @@ -0,0 +1,225 @@ +/* Function cos vectorized with SSE2 and SSE4. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include +#include "svml_d_cos_data.h" + + .text +ENTRY(_ZGVbN2v_cos_sse2) +/* SSE2 version as wrapper to scalar. */ + subq $40, %rsp + movaps %xmm0, (%rsp) + call cos@PLT + movsd %xmm0, 16(%rsp) + movsd 8(%rsp), %xmm0 + call cos@PLT + movsd 16(%rsp), %xmm1 + movsd %xmm0, 24(%rsp) + unpcklpd %xmm0, %xmm1 + movaps %xmm1, %xmm0 + addq $40, %rsp + ret +END(_ZGVbN2v_cos_sse2) + +ENTRY(_ZGVbN2v_cos_sse4) +/* ALGORITHM DESCRIPTION: + * + * ( low accuracy ( < 4ulp ) or enhanced performance ( half of correct + * mantissa ) implementation ) + * + * Argument representation: + * arg + Pi/2 = (N*Pi + R) + * + * Result calculation: + * cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R) + * sin(R) is approximated by corresponding polynomial + */ + pushq %rbp + movq %rsp, %rbp + andq $-64, %rsp + subq $320, %rsp + movaps %xmm0, %xmm3 + movq __svml_dcos_data@GOTPCREL(%rip), %rax + movups __dHalfPI(%rax), %xmm2 + +/* ARGUMENT RANGE REDUCTION: + * Add Pi/2 to argument: X' = X+Pi/2 + */ + addpd %xmm3, %xmm2 + movups __dInvPI(%rax), %xmm5 + movups __dAbsMask(%rax), %xmm4 + +/* Get absolute argument value: X' = |X'| */ + andps %xmm2, %xmm4 + +/* Y = X'*InvPi + RS : right shifter add */ + mulpd %xmm5, %xmm2 + +/* Check for large arguments path */ + cmpnlepd __dRangeVal(%rax), %xmm4 + movups __dRShifter(%rax), %xmm6 + addpd %xmm6, %xmm2 + movmskpd %xmm4, %ecx + +/* N = Y - RS : right shifter sub */ + movaps %xmm2, %xmm1 + +/* SignRes = Y<<63 : shift LSB to MSB place for result sign */ + psllq $63, %xmm2 + subpd %xmm6, %xmm1 + +/* N = N - 0.5 */ + subpd __dOneHalf(%rax), %xmm1 + movups __dPI1(%rax), %xmm7 + +/* R = X - N*Pi1 */ + mulpd %xmm1, %xmm7 + movups __dPI2(%rax), %xmm4 + +/* R = R - N*Pi2 */ + mulpd %xmm1, %xmm4 + subpd %xmm7, %xmm0 + movups __dPI3(%rax), %xmm5 + +/* R = R - N*Pi3 */ + mulpd %xmm1, %xmm5 + subpd %xmm4, %xmm0 + +/* R = R - N*Pi4 */ + movups __dPI4(%rax), %xmm6 + mulpd %xmm6, %xmm1 + subpd %xmm5, %xmm0 + subpd %xmm1, %xmm0 + +/* POLYNOMIAL APPROXIMATION: + * R2 = R*R + */ + movaps %xmm0, %xmm4 + mulpd %xmm0, %xmm4 + movups __dC7(%rax), %xmm1 + mulpd %xmm4, %xmm1 + addpd __dC6(%rax), %xmm1 + mulpd %xmm4, %xmm1 + addpd __dC5(%rax), %xmm1 + mulpd %xmm4, %xmm1 + addpd __dC4(%rax), %xmm1 + +/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */ + mulpd %xmm4, %xmm1 + addpd __dC3(%rax), %xmm1 + +/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */ + mulpd %xmm4, %xmm1 + addpd __dC2(%rax), %xmm1 + mulpd %xmm4, %xmm1 + addpd __dC1(%rax), %xmm1 + mulpd %xmm1, %xmm4 + mulpd %xmm0, %xmm4 + addpd %xmm4, %xmm0 + +/* RECONSTRUCTION: + * Final sign setting: Res = Poly^SignRes + */ + xorps %xmm2, %xmm0 + testl %ecx, %ecx + jne .LBL_1_3 + +.LBL_1_2: + movq %rbp, %rsp + popq %rbp + ret + +.LBL_1_3: + movups %xmm3, 192(%rsp) + movups %xmm0, 256(%rsp) + je .LBL_1_2 + + xorb %dl, %dl + xorl %eax, %eax + movups %xmm8, 112(%rsp) + movups %xmm9, 96(%rsp) + movups %xmm10, 80(%rsp) + movups %xmm11, 64(%rsp) + movups %xmm12, 48(%rsp) + movups %xmm13, 32(%rsp) + movups %xmm14, 16(%rsp) + movups %xmm15, (%rsp) + movq %rsi, 136(%rsp) + movq %rdi, 128(%rsp) + movq %r12, 168(%rsp) + movb %dl, %r12b + movq %r13, 160(%rsp) + movl %ecx, %r13d + movq %r14, 152(%rsp) + movl %eax, %r14d + movq %r15, 144(%rsp) + +.LBL_1_6: + btl %r14d, %r13d + jc .LBL_1_12 + +.LBL_1_7: + lea 1(%r14), %esi + btl %esi, %r13d + jc .LBL_1_10 + +.LBL_1_8: + incb %r12b + addl $2, %r14d + cmpb $16, %r12b + jb .LBL_1_6 + + movups 112(%rsp), %xmm8 + movups 96(%rsp), %xmm9 + movups 80(%rsp), %xmm10 + movups 64(%rsp), %xmm11 + movups 48(%rsp), %xmm12 + movups 32(%rsp), %xmm13 + movups 16(%rsp), %xmm14 + movups (%rsp), %xmm15 + movq 136(%rsp), %rsi + movq 128(%rsp), %rdi + movq 168(%rsp), %r12 + movq 160(%rsp), %r13 + movq 152(%rsp), %r14 + movq 144(%rsp), %r15 + movups 256(%rsp), %xmm0 + jmp .LBL_1_2 + +.LBL_1_10: + movzbl %r12b, %r15d + shlq $4, %r15 + movsd 200(%rsp,%r15), %xmm0 + + call cos@PLT + + movsd %xmm0, 264(%rsp,%r15) + jmp .LBL_1_8 + +.LBL_1_12: + movzbl %r12b, %r15d + shlq $4, %r15 + movsd 192(%rsp,%r15), %xmm0 + + call cos@PLT + + movsd %xmm0, 256(%rsp,%r15) + jmp .LBL_1_7 + +END(_ZGVbN2v_cos_sse4) diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S new file mode 100644 index 0000000..5c24f42 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S @@ -0,0 +1,39 @@ +/* Function cos vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include + + .text +ENTRY(_ZGVcN4v_cos) + pushq %rbp + movq %rsp, %rbp + andq $-32, %rsp + subq $32, %rsp + vextractf128 $1, %ymm0, (%rsp) + vzeroupper + call _ZGVbN2v_cos@PLT + vmovapd %xmm0, 16(%rsp) + vmovaps (%rsp), %xmm0 + call _ZGVbN2v_cos@PLT + vmovapd %xmm0, %xmm1 + vmovapd 16(%rsp), %xmm0 + vinsertf128 $1, %xmm1, %ymm0, %ymm0 + movq %rbp, %rsp + popq %rbp + ret +END(_ZGVcN4v_cos) diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S new file mode 100644 index 0000000..28f661c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S @@ -0,0 +1,193 @@ +/* Function cos vectorized with AVX2. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_cos_data.h" + + .text +ENTRY(_ZGVdN4v_cos) + +/* ALGORITHM DESCRIPTION: + * + * ( low accuracy ( < 4ulp ) or enhanced performance + * ( half of correct mantissa ) implementation ) + * + * Argument representation: + * arg + Pi/2 = (N*Pi + R) + * + * Result calculation: + * cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R) + * sin(R) is approximated by corresponding polynomial + */ + pushq %rbp + movq %rsp, %rbp + andq $-64, %rsp + subq $448, %rsp + movq __svml_dcos_data@GOTPCREL(%rip), %rax + vmovapd %ymm0, %ymm1 + vmovupd __dInvPI(%rax), %ymm4 + vmovupd __dRShifter(%rax), %ymm5 + +/* + * ARGUMENT RANGE REDUCTION: + * Add Pi/2 to argument: X' = X+Pi/2 + */ + vaddpd __dHalfPI(%rax), %ymm1, %ymm7 + +/* Get absolute argument value: X' = |X'| */ + vandpd __dAbsMask(%rax), %ymm7, %ymm2 + +/* Y = X'*InvPi + RS : right shifter add */ + vfmadd213pd %ymm5, %ymm4, %ymm7 + vmovupd __dC7(%rax), %ymm4 + +/* Check for large arguments path */ + vcmpnle_uqpd __dRangeVal(%rax), %ymm2, %ymm3 + +/* N = Y - RS : right shifter sub */ + vsubpd %ymm5, %ymm7, %ymm6 + vmovupd __dPI1_FMA(%rax), %ymm2 + +/* SignRes = Y<<63 : shift LSB to MSB place for result sign */ + vpsllq $63, %ymm7, %ymm7 + +/* N = N - 0.5 */ + vsubpd __dOneHalf(%rax), %ymm6, %ymm0 + vmovmskpd %ymm3, %ecx + +/* R = X - N*Pi1 */ + vmovapd %ymm1, %ymm3 + vfnmadd231pd %ymm0, %ymm2, %ymm3 + +/* R = R - N*Pi2 */ + vfnmadd231pd __dPI2_FMA(%rax), %ymm0, %ymm3 + +/* R = R - N*Pi3 */ + vfnmadd132pd __dPI3_FMA(%rax), %ymm3, %ymm0 + +/* + * POLYNOMIAL APPROXIMATION: + * R2 = R*R + */ + vmulpd %ymm0, %ymm0, %ymm5 + vfmadd213pd __dC6(%rax), %ymm5, %ymm4 + vfmadd213pd __dC5(%rax), %ymm5, %ymm4 + vfmadd213pd __dC4(%rax), %ymm5, %ymm4 + +/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */ + vfmadd213pd __dC3(%rax), %ymm5, %ymm4 + +/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */ + vfmadd213pd __dC2(%rax), %ymm5, %ymm4 + vfmadd213pd __dC1(%rax), %ymm5, %ymm4 + vmulpd %ymm5, %ymm4, %ymm6 + vfmadd213pd %ymm0, %ymm0, %ymm6 + +/* + * RECONSTRUCTION: + * Final sign setting: Res = Poly^SignRes + */ + vxorpd %ymm7, %ymm6, %ymm0 + testl %ecx, %ecx + jne .LBL_1_3 + +.LBL_1_2: + movq %rbp, %rsp + popq %rbp + ret + +.LBL_1_3: + vmovupd %ymm1, 320(%rsp) + vmovupd %ymm0, 384(%rsp) + je .LBL_1_2 + + xorb %dl, %dl + xorl %eax, %eax + vmovups %ymm8, 224(%rsp) + vmovups %ymm9, 192(%rsp) + vmovups %ymm10, 160(%rsp) + vmovups %ymm11, 128(%rsp) + vmovups %ymm12, 96(%rsp) + vmovups %ymm13, 64(%rsp) + vmovups %ymm14, 32(%rsp) + vmovups %ymm15, (%rsp) + movq %rsi, 264(%rsp) + movq %rdi, 256(%rsp) + movq %r12, 296(%rsp) + movb %dl, %r12b + movq %r13, 288(%rsp) + movl %ecx, %r13d + movq %r14, 280(%rsp) + movl %eax, %r14d + movq %r15, 272(%rsp) + +.LBL_1_6: + btl %r14d, %r13d + jc .LBL_1_12 + +.LBL_1_7: + lea 1(%r14), %esi + btl %esi, %r13d + jc .LBL_1_10 + +.LBL_1_8: + incb %r12b + addl $2, %r14d + cmpb $16, %r12b + jb .LBL_1_6 + + vmovups 224(%rsp), %ymm8 + vmovups 192(%rsp), %ymm9 + vmovups 160(%rsp), %ymm10 + vmovups 128(%rsp), %ymm11 + vmovups 96(%rsp), %ymm12 + vmovups 64(%rsp), %ymm13 + vmovups 32(%rsp), %ymm14 + vmovups (%rsp), %ymm15 + vmovupd 384(%rsp), %ymm0 + movq 264(%rsp), %rsi + movq 256(%rsp), %rdi + movq 296(%rsp), %r12 + movq 288(%rsp), %r13 + movq 280(%rsp), %r14 + movq 272(%rsp), %r15 + jmp .LBL_1_2 + +.LBL_1_10: + movzbl %r12b, %r15d + shlq $4, %r15 + vmovsd 328(%rsp,%r15), %xmm0 + vzeroupper + + call cos@PLT + + vmovsd %xmm0, 392(%rsp,%r15) + jmp .LBL_1_8 + +.LBL_1_12: + movzbl %r12b, %r15d + shlq $4, %r15 + vmovsd 320(%rsp,%r15), %xmm0 + vzeroupper + + call cos@PLT + + vmovsd %xmm0, 384(%rsp,%r15) + jmp .LBL_1_7 + +END(_ZGVdN4v_cos) diff --git a/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S new file mode 100644 index 0000000..cb65750 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S @@ -0,0 +1,247 @@ +/* Function cos vectorized with AVX-512, KNL version. + Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_cos_data.h" + + .text +ENTRY(_ZGVeN8v_cos) +#ifndef HAVE_AVX512_ASM_SUPPORT +/* Implemented as wrapper to AVX2 version. */ + pushq %rbp + movq %rsp, %rbp + andq $-64, %rsp + subq $64, %rsp +/* Below is encoding for vmovaps %zmm0, (%rsp). */ + .byte 0x62 + .byte 0xf1 + .byte 0x7c + .byte 0x48 + .byte 0x29 + .byte 0x04 + .byte 0x24 +/* Below is encoding for vmovapd (%rsp), %ymm0. */ + .byte 0xc5 + .byte 0xfd + .byte 0x28 + .byte 0x04 + .byte 0x24 + call _ZGVdN4v_cos@PLT +/* Below is encoding for vmovapd 32(%rsp), %ymm0. */ + .byte 0xc5 + .byte 0xfd + .byte 0x28 + .byte 0x44 + .byte 0x24 + .byte 0x20 + call _ZGVdN4v_cos@PLT + movq %rbp, %rsp + popq %rbp + ret +#else +/* + * ALGORITHM DESCRIPTION: + * + * ( low accuracy ( < 4ulp ) or + * enhanced performance ( half of correct mantissa ) implementation ) + * + * Argument representation: + * arg + Pi/2 = (N*Pi + R) + * + * Result calculation: + * cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R) + * sin(R) is approximated by corresponding polynomial + */ + pushq %rbp + movq %rsp, %rbp + andq $-64, %rsp + subq $1280, %rsp + movq __svml_dcos_data@GOTPCREL(%rip), %rax + +/* R = X - N*Pi1 */ + vmovaps %zmm0, %zmm7 + +/* Check for large arguments path */ + movq $-1, %rcx + +/* + * ARGUMENT RANGE REDUCTION: + * Add Pi/2 to argument: X' = X+Pi/2 + */ + vaddpd __dHalfPI(%rax), %zmm0, %zmm5 + vmovups __dInvPI(%rax), %zmm3 + +/* Get absolute argument value: X' = |X'| */ + vpandq __dAbsMask(%rax), %zmm5, %zmm1 + +/* Y = X'*InvPi + RS : right shifter add */ + vfmadd213pd __dRShifter(%rax), %zmm3, %zmm5 + vmovups __dPI1_FMA(%rax), %zmm6 + +/* N = Y - RS : right shifter sub */ + vsubpd __dRShifter(%rax), %zmm5, %zmm4 + +/* SignRes = Y<<63 : shift LSB to MSB place for result sign */ + vpsllq $63, %zmm5, %zmm12 + vmovups __dC7(%rax), %zmm8 + +/* N = N - 0.5 */ + vsubpd __dOneHalf(%rax), %zmm4, %zmm10 + vcmppd $22, __dRangeVal(%rax), %zmm1, %k1 + vpbroadcastq %rcx, %zmm2{%k1}{z} + vfnmadd231pd %zmm10, %zmm6, %zmm7 + vptestmq %zmm2, %zmm2, %k0 + +/* R = R - N*Pi2 */ + vfnmadd231pd __dPI2_FMA(%rax), %zmm10, %zmm7 + kmovw %k0, %ecx + movzbl %cl, %ecx + +/* R = R - N*Pi3 */ + vfnmadd132pd __dPI3_FMA(%rax), %zmm7, %zmm10 + +/* + * POLYNOMIAL APPROXIMATION: + * R2 = R*R + */ + vmulpd %zmm10, %zmm10, %zmm9 + vfmadd213pd __dC6(%rax), %zmm9, %zmm8 + vfmadd213pd __dC5(%rax), %zmm9, %zmm8 + vfmadd213pd __dC4(%rax), %zmm9, %zmm8 + +/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */ + vfmadd213pd __dC3(%rax), %zmm9, %zmm8 + +/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */ + vfmadd213pd __dC2(%rax), %zmm9, %zmm8 + vfmadd213pd __dC1(%rax), %zmm9, %zmm8 + vmulpd %zmm9, %zmm8, %zmm11 + vfmadd213pd %zmm10, %zmm10, %zmm11 + +/* + * RECONSTRUCTION: + * Final sign setting: Res = Poly^SignRes + */ + vpxorq %zmm12, %zmm11, %zmm1 + testl %ecx, %ecx + jne .LBL_1_3 + +.LBL_1_2: + vmovaps %zmm1, %zmm0 + movq %rbp, %rsp + popq %rbp + ret + +.LBL_1_3: + vmovups %zmm0, 1152(%rsp) + vmovups %zmm1, 1216(%rsp) + je .LBL_1_2 + + xorb %dl, %dl + kmovw %k4, 1048(%rsp) + xorl %eax, %eax + kmovw %k5, 1040(%rsp) + kmovw %k6, 1032(%rsp) + kmovw %k7, 1024(%rsp) + vmovups %zmm16, 960(%rsp) + vmovups %zmm17, 896(%rsp) + vmovups %zmm18, 832(%rsp) + vmovups %zmm19, 768(%rsp) + vmovups %zmm20, 704(%rsp) + vmovups %zmm21, 640(%rsp) + vmovups %zmm22, 576(%rsp) + vmovups %zmm23, 512(%rsp) + vmovups %zmm24, 448(%rsp) + vmovups %zmm25, 384(%rsp) + vmovups %zmm26, 320(%rsp) + vmovups %zmm27, 256(%rsp) + vmovups %zmm28, 192(%rsp) + vmovups %zmm29, 128(%rsp) + vmovups %zmm30, 64(%rsp) + vmovups %zmm31, (%rsp) + movq %rsi, 1064(%rsp) + movq %rdi, 1056(%rsp) + movq %r12, 1096(%rsp) + movb %dl, %r12b + movq %r13, 1088(%rsp) + movl %ecx, %r13d + movq %r14, 1080(%rsp) + movl %eax, %r14d + movq %r15, 1072(%rsp) + +.LBL_1_6: + btl %r14d, %r13d + jc .LBL_1_12 + +.LBL_1_7: + lea 1(%r14), %esi + btl %esi, %r13d + jc .LBL_1_10 + +.LBL_1_8: + addb $1, %r12b + addl $2, %r14d + cmpb $16, %r12b + jb .LBL_1_6 + + kmovw 1048(%rsp), %k4 + movq 1064(%rsp), %rsi + kmovw 1040(%rsp), %k5 + movq 1056(%rsp), %rdi + kmovw 1032(%rsp), %k6 + movq 1096(%rsp), %r12 + movq 1088(%rsp), %r13 + kmovw 1024(%rsp), %k7 + vmovups 960(%rsp), %zmm16 + vmovups 896(%rsp), %zmm17 + vmovups 832(%rsp), %zmm18 + vmovups 768(%rsp), %zmm19 + vmovups 704(%rsp), %zmm20 + vmovups 640(%rsp), %zmm21 + vmovups 576(%rsp), %zmm22 + vmovups 512(%rsp), %zmm23 + vmovups 448(%rsp), %zmm24 + vmovups 384(%rsp), %zmm25 + vmovups 320(%rsp), %zmm26 + vmovups 256(%rsp), %zmm27 + vmovups 192(%rsp), %zmm28 + vmovups 128(%rsp), %zmm29 + vmovups 64(%rsp), %zmm30 + vmovups (%rsp), %zmm31 + movq 1080(%rsp), %r14 + movq 1072(%rsp), %r15 + vmovups 1216(%rsp), %zmm1 + jmp .LBL_1_2 + +.LBL_1_10: + movzbl %r12b, %r15d + shlq $4, %r15 + vmovsd 1160(%rsp,%r15), %xmm0 + call cos@PLT + vmovsd %xmm0, 1224(%rsp,%r15) + jmp .LBL_1_8 + +.LBL_1_12: + movzbl %r12b, %r15d + shlq $4, %r15 + vmovsd 1152(%rsp,%r15), %xmm0 + call cos@PLT + vmovsd %xmm0, 1216(%rsp,%r15) + jmp .LBL_1_7 +#endif +END(_ZGVeN8v_cos) diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.S b/sysdeps/x86_64/fpu/svml_d_cos_data.S new file mode 100644 index 0000000..6f36ecd --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cos_data.S @@ -0,0 +1,114 @@ +/* Data for vectorized cos. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include "svml_d_cos_data.h" + +.macro double_vector offset value +.if .-__svml_dcos_data != \offset +.err +.endif +.rept 8 +.quad \value +.endr +.endm + + .section .rodata, "a" + .align 64 + +/* Data table for vector implementations of function cos. + * The table may contain polynomial, reduction, lookup + * coefficients and other constants obtained through different + * methods of research and experimental work. + */ + .globl __svml_dcos_data +__svml_dcos_data: + +/* General purpose constants: + * absolute value mask + */ +double_vector __dAbsMask 0x7fffffffffffffff + +/* working range threshold */ +double_vector __dRangeVal 0x4160000000000000 + +/* PI/2 */ +double_vector __dHalfPI 0x3ff921fb54442d18 + +/* 1/PI */ +double_vector __dInvPI 0x3fd45f306dc9c883 + +/* right-shifter constant */ +double_vector __dRShifter 0x4338000000000000 + +/* 0.5 */ +double_vector __dOneHalf 0x3fe0000000000000 + +/* Range reduction PI-based constants: + * PI high part + */ +double_vector __dPI1 0x400921fb40000000 + +/* PI mid part 1 */ +double_vector __dPI2 0x3e84442d00000000 + +/* PI mid part 2 */ +double_vector __dPI3 0x3d08469880000000 + +/* PI low part */ +double_vector __dPI4 0x3b88cc51701b839a + +/* Range reduction PI-based constants if FMA available: + * PI high part (FMA available) + */ +double_vector __dPI1_FMA 0x400921fb54442d18 + +/* PI mid part (FMA available) */ +double_vector __dPI2_FMA 0x3ca1a62633145c06 + +/* PI low part (FMA available) */ +double_vector __dPI3_FMA 0x395c1cd129024e09 + +/* Polynomial coefficients (relative error 2^(-52.115)): */ +double_vector __dC1 0xbfc55555555554a7 +double_vector __dC2 0x3f8111111110a4a8 +double_vector __dC3 0xbf2a01a019a5b86d +double_vector __dC4 0x3ec71de38030fea0 +double_vector __dC5 0xbe5ae63546002231 +double_vector __dC6 0x3de60e6857a2f220 +double_vector __dC7 0xbd69f0d60811aac8 + +/* + * Additional constants: + * absolute value mask + */ +double_vector __dAbsMask_la 0x7fffffffffffffff + +/* 1/PI */ +double_vector __dInvPI_la 0x3fd45f306dc9c883 + +/* right-shifer for low accuracy version */ +double_vector __dRShifter_la 0x4330000000000000 + +/* right-shifer-1.0 for low accuracy version */ +double_vector __dRShifterm5_la 0x432fffffffffffff + +/* right-shifer with low mask for low accuracy version */ +double_vector __dRXmax_la 0x43300000007ffffe + + .type __svml_dcos_data,@object + .size __svml_dcos_data,.-__svml_dcos_data diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.h b/sysdeps/x86_64/fpu/svml_d_cos_data.h new file mode 100644 index 0000000..21fa806 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cos_data.h @@ -0,0 +1,48 @@ +/* Offsets for data table for vectorized cos. + Copyright (C) 2014 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef D_COS_DATA_H +#define D_COS_DATA_H + +#define __dAbsMask 0 +#define __dRangeVal 64 +#define __dHalfPI 128 +#define __dInvPI 192 +#define __dRShifter 256 +#define __dOneHalf 320 +#define __dPI1 384 +#define __dPI2 448 +#define __dPI3 512 +#define __dPI4 576 +#define __dPI1_FMA 640 +#define __dPI2_FMA 704 +#define __dPI3_FMA 768 +#define __dC1 832 +#define __dC2 896 +#define __dC3 960 +#define __dC4 1024 +#define __dC5 1088 +#define __dC6 1152 +#define __dC7 1216 +#define __dAbsMask_la 1280 +#define __dInvPI_la 1344 +#define __dRShifter_la 1408 +#define __dRShifterm5_la 1472 +#define __dRXmax_la 1536 + +#endif