[1/14,x86_64] Vector math functions (vector cos)

Message ID CAMXFM3t+v7zYEmvWQxV9aRmJB6Ag2hQgQ56cZJTDy=siRBszDw@mail.gmail.com
State Superseded

Commit Message

Andrew Senkevich April 30, 2015, 3:03 p.m. UTC
  Hi,

this patch starts a series of patches with vectorized implementations of
several math functions for x86_64.

Here is an implementation of cos containing SSE, AVX, AVX2 and AVX512
versions according to the Vector ABI, which was discussed in
https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4.
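
The Vector ABI mangles names as _ZGV<isa><mask><len><args>_<name>
(b = SSE, c = AVX, d = AVX2, e = AVX512; N = unmasked; 2/4/8 = vector
length; v = vector argument), so in C terms the new entry points are
(illustration only, not part of the patch):

    __m128d _ZGVbN2v_cos (__m128d x);   /* SSE4 */
    __m256d _ZGVcN4v_cos (__m256d x);   /* AVX */
    __m256d _ZGVdN4v_cos (__m256d x);   /* AVX2 */
    __m512d _ZGVeN8v_cos (__m512d x);   /* AVX512 */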

This patch also adds the vector math library build and ABI check,
enabled by default for x86_64.

2015-04-30  Andrew Senkevich  <andrew.senkevich@intel.com>

        * sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
        * sysdeps/x86_64/fpu/Versions: New file.
        * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
        * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
        * sysdeps/x86_64/configure: Regenerated.
        * sysdeps/x86_64/configure.ac: Options for libmvec build.
        * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
        * sysdeps/x86_64/fpu/Makefile: New file.
        * sysdeps/x86_64/fpu/multiarch/Makefile: Added build of the SSE
        version, which is an IFUNC.


Is this patch ok?


--
WBR,
Andrew
  

Comments

Joseph Myers April 30, 2015, 4:06 p.m. UTC | #1
On Thu, 30 Apr 2015, Andrew Senkevich wrote:

> +   Copyright (C) 2014 Free Software Foundation, Inc.

2014-2015 in all new files.

> +ENTRY(_ZGVbN2v_cos_sse4)
> +/* ALGORITHM DESCRIPTION:
> + *
> + *      ( low accuracy ( < 4ulp ) or enhanced performance ( half of correct
> + *      mantissa ) implementation )
> + *
> + *      Argument representation:
> + *      arg + Pi/2 = (N*Pi + R)
> + *
> + *      Result calculation:
> + *      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
> + *      sin(R) is approximated by corresponding polynomial
> + */
> +        pushq     %rbp
> +        movq      %rsp, %rbp
> +        andq      $-64, %rsp
> +        subq      $320, %rsp

Stack adjustments should have CFI that accurately describes how to unwind 
the stack on any instruction boundary (try stepping by instruction in GDB 
and making sure you can get an accurate backtrace after every step).  
Probably something like

pushq %rbp
cfi_adjust_cfa_offset (8)
cfi_rel_offset (%rbp, 0)
movq %rsp, %rbp
cfi_def_cfa_register (%rbp)

(having defined %rbp to be the CFA, after that %rsp adjustments don't need 
further CFI)

...

movq %rbp, %rsp
cfi_def_cfa_register (%rsp)
popq %rbp
cfi_adjust_cfa_offset (-8)
cfi_restore (%rbp)

(but since you jump past code like that, you need to save and restore the
CFI state appropriately).

If you save, clobber and restore any call-preserved registers, CFI is also 
needed for that.  Again, you need to take care about jumps to ensure the 
CFI state is correct at each label that may be a jump target.
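
While the CFA is still expressed relative to %rsp, plain cfi_rel_offset
is enough; a minimal sketch:

subq $320, %rsp
cfi_adjust_cfa_offset (320)
movq %r12, 168(%rsp)
cfi_rel_offset (%r12, 168)

...

movq 168(%rsp), %r12
cfi_restore (%r12)
addq $320, %rsp
cfi_adjust_cfa_offset (-320)

Once %rbp is the CFA and %rsp has been realigned, an %rsp-relative save
cannot be described that way, and a DWARF expression is needed instead.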

> +ENTRY(_ZGVcN4v_cos)
> +        pushq          %rbp
> +        movq           %rsp, %rbp
> +        andq           $-32, %rsp
> +        subq           $32, %rsp
> +        vextractf128   $1, %ymm0, (%rsp)
> +        vzeroupper
> +        call           _ZGVbN2v_cos@PLT
> +        vmovapd                %xmm0, 16(%rsp)
> +        vmovaps                (%rsp), %xmm0
> +        call           _ZGVbN2v_cos@PLT

Does this pass testing (specifically, the localplt test in the elf/ 
directory)?  I'd expect you to need to use HIDDEN_JUMPTARGET when calling 
any symbol exported from libmvec (together with libm_hidden_def / 
libm_hidden_weak on the definition of the function called, if not already 
present).

If it passes despite there being unnecessary local PLT references in 
libmvec, that points to something missing from the earlier patches: 
libmvec should be included in the libraries for which the localplt test 
runs.
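
Concretely, something like this at the definition (sketch, following
existing libm practice):

ENTRY (_ZGVbN2v_cos)
...
END (_ZGVbN2v_cos)
libm_hidden_def (_ZGVbN2v_cos)

and at the call site, instead of _ZGVbN2v_cos@PLT:

call HIDDEN_JUMPTARGET(_ZGVbN2v_cos)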

> +/* Data table for vector implementations of function cos.
> + * The table may contain polynomial, reduction, lookup
> + * coefficients and other constants obtained through different
> + * methods of research and experimental work.

Comments in glibc don't use an initial '*' on each line after the first.
  
Andrew Senkevich May 15, 2015, 3:34 p.m. UTC | #2
Hi Joseph,

here is the fixed patch with some additions (the wrapper implementations
go to a separate header, and the AVX512 version now has two
implementations).


--
WBR,
Andrew
  
Joseph Myers May 15, 2015, 4:32 p.m. UTC | #3
On Fri, 15 May 2015, Andrew Senkevich wrote:

> diff --git a/include/libc-symbols.h b/include/libc-symbols.h
> index ca3fe00..8992e9e 100644
> --- a/include/libc-symbols.h
> +++ b/include/libc-symbols.h
> @@ -526,7 +526,7 @@ for linking")
>  # define rtld_hidden_data_ver(local, name)
>  #endif
>  
> -#if IS_IN (libm)
> +#if IS_IN (libm) || IS_IN(libmvec)
>  # define libm_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
>  # define libm_hidden_tls_proto(name, attrs...) hidden_tls_proto (name, ##attrs)
>  # define libm_hidden_def(name) hidden_def (name)

I think it would be better to have a completely separate set of 
libmvec_hidden_* definitions (sorry for misleading you in my previous 
comments by referring to libm_hidden_*).  Also remember the space before 
'(' in the IS_IN call, and in other macro calls such as ENTRY, END and 
libmvec_hidden_def.
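
That is, something like this in include/libc-symbols.h (sketch,
mirroring the libm_hidden_* block above):

#if IS_IN (libmvec)
# define libmvec_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
# define libmvec_hidden_def(name) hidden_def (name)
# define libmvec_hidden_weak(name) hidden_weak (name)
# define libmvec_hidden_ver(local, name) hidden_ver (local, name)
#else
# define libmvec_hidden_proto(name, attrs...)
# define libmvec_hidden_def(name)
# define libmvec_hidden_weak(name)
# define libmvec_hidden_ver(local, name)
#endif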

> +        jne       .LBL_1_3
> +
> +.LBL_1_2:
> +        movq      %rbp, %rsp
> +        cfi_def_cfa_register (%rsp)
> +        popq      %rbp
> +        cfi_adjust_cfa_offset (-8)
> +        cfi_restore (%rbp)
> +        ret
> +
> +.LBL_1_3:
> +        movups    %xmm3, 192(%rsp)

This won't yield correct CFI at .LBL_1_3 and afterwards.  At .LBL_1_3, the 
correct CFI is the same as that at .LBL_1_2 - *not* the same as that at 
the ret instruction.  So you should do cfi_remember_state before the CFI 
changes here, and cfi_restore_state after ret, for the CFI to be correct 
for the rest of the function.
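
Applied to the code quoted above, that is roughly:

jne       .LBL_1_3

.LBL_1_2:
cfi_remember_state
movq      %rbp, %rsp
cfi_def_cfa_register (%rsp)
popq      %rbp
cfi_adjust_cfa_offset (-8)
cfi_restore (%rbp)
ret
cfi_restore_state

.LBL_1_3:
movups    %xmm3, 192(%rsp)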

> +        movups    %xmm0, 256(%rsp)
> +        je        .LBL_1_2
> +
> +        xorb      %dl, %dl
> +        xorl      %eax, %eax
> +        movups    %xmm8, 112(%rsp)
> +        movups    %xmm9, 96(%rsp)
> +        movups    %xmm10, 80(%rsp)
> +        movups    %xmm11, 64(%rsp)
> +        movups    %xmm12, 48(%rsp)
> +        movups    %xmm13, 32(%rsp)
> +        movups    %xmm14, 16(%rsp)
> +        movups    %xmm15, (%rsp)
> +        movq      %rsi, 136(%rsp)
> +        movq      %rdi, 128(%rsp)
> +        movq      %r12, 168(%rsp)
> +        movb      %dl, %r12b
> +        movq      %r13, 160(%rsp)
> +        movl      %ecx, %r13d
> +        movq      %r14, 152(%rsp)
> +        movl      %eax, %r14d
> +        movq      %r15, 144(%rsp)

The ABI says that r12-r15 are callee-saved registers.  That means you need 
CFI describing when they are saved and restored (and again need 
cfi_remember_state and cfi_restore_state to deal with how some code that's 
physically after the restoration is logically before it, and so needs the 
CFI describing how those registers are saved on the stack).

Similar issues apply with the CFI in other function implementations here.
  
Szabolcs Nagy May 15, 2015, 4:46 p.m. UTC | #4
On 15/05/15 16:34, Andrew Senkevich wrote:
> diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h
> new file mode 100644
> index 0000000..27294ce
> --- /dev/null
> +++ b/sysdeps/x86/fpu/bits/math-vector.h
> @@ -0,0 +1,34 @@
> +/* Platform-specific SIMD declarations of math functions.
> +   Copyright (C) 2014-2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _MATH_H
> +# error "Never include <bits/math-vector.h> directly;\
> + include <math.h> instead."
> +#endif
> +
> +/* Get default empty definitions for simd declarations.  */
> +#include <bits/libm-simd-decl-stubs.h>
> +
> +#if defined __x86_64__ && defined __FAST_MATH__
> +# if defined _OPENMP && _OPENMP >= 201307
> +/* OpenMP case.  */
> +#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
> +#  undef __DECL_SIMD_cos
> +#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
> +# endif
> +#endif

why does math.h have #ifdef _OPENMP? (and __FAST_MATH__)

can this be implemented in an OpenMP runtime instead?
  
Joseph Myers May 15, 2015, 4:54 p.m. UTC | #5
On Fri, 15 May 2015, Szabolcs Nagy wrote:

> > +#if defined __x86_64__ && defined __FAST_MATH__
> > +# if defined _OPENMP && _OPENMP >= 201307
> > +/* OpenMP case.  */
> > +#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
> > +#  undef __DECL_SIMD_cos
> > +#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
> > +# endif
> > +#endif
> 
> why does math.h have #ifdef _OPENMP ? (and __FAST_MATH__)

Because this pragma to declare that the vector functions are available 
(for the compiler to vectorize calls to scalar functions) is an OpenMP 
pragma, only available when OpenMP is enabled.  They might also be made 
available in other cases in future when appropriate pragmas (e.g. 
CilkPlus) are available.  __FAST_MATH__ is because these functions may not 
meet the normal glibc rules regarding accuracy of results and exceptions.

See the discussions starting last September of the approach for adding 
vector libm functions.
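
For example (user code, not part of glibc; with GCC, -fopenmp defines
_OPENMP to a value >= 201307 once OpenMP 4.0 is supported):

#include <math.h>

/* Compiled with -O3 -ffast-math -fopenmp, the vectorizer may turn
   this loop into calls to _ZGVdN4v_cos and friends instead of
   scalar cos.  */
void
apply_cos (const double *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = cos (x[i]);
}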
  
Andrew Senkevich May 18, 2015, 4:37 p.m. UTC | #6
2015-05-15 19:32 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
> On Fri, 15 May 2015, Andrew Senkevich wrote:
>
>> +        movups    %xmm0, 256(%rsp)
>> +        je        .LBL_1_2
>> +
>> +        xorb      %dl, %dl
>> +        xorl      %eax, %eax
>> +        movups    %xmm8, 112(%rsp)
>> +        movups    %xmm9, 96(%rsp)
>> +        movups    %xmm10, 80(%rsp)
>> +        movups    %xmm11, 64(%rsp)
>> +        movups    %xmm12, 48(%rsp)
>> +        movups    %xmm13, 32(%rsp)
>> +        movups    %xmm14, 16(%rsp)
>> +        movups    %xmm15, (%rsp)
>> +        movq      %rsi, 136(%rsp)
>> +        movq      %rdi, 128(%rsp)
>> +        movq      %r12, 168(%rsp)
>> +        movb      %dl, %r12b
>> +        movq      %r13, 160(%rsp)
>> +        movl      %ecx, %r13d
>> +        movq      %r14, 152(%rsp)
>> +        movl      %eax, %r14d
>> +        movq      %r15, 144(%rsp)
>
> The ABI says that r12-r15 are callee-saved registers.  That means you need
> CFI describing when they are saved and restored (and again need
> cfi_remember_state and cfi_restore_state to deal with how some code that's
> physically after the restoration is logically before it, and so needs the
> CFI describing how those registers are saved on the stack).

How do I calculate the offset for the .cfi_offset directive after
e.g. movq %r12, 168(%rsp)?

If we keep %rsp as the CFA instead of %rbp, how do we specify the
proper CFI after andq $-64, %rsp?


--
WBR,
Andrew
  
Joseph Myers May 18, 2015, 5:27 p.m. UTC | #7
On Mon, 18 May 2015, Andrew Senkevich wrote:

> How do I calculate the offset for the .cfi_offset directive after
> e.g. movq %r12, 168(%rsp)?

I think you'll need to use .cfi_escape in order to generate 
DW_CFA_expression describing the register as saved relative to %rsp 
instead of relative to the CFA.  Something like:

DW_CFA_expression
uleb128 register number (as in the x86_64 ABI)
uleb128 length of what follows
DW_OP_drop
DW_OP_breg7 (%rsp is register 7)
sleb128 offset from %rsp

appropriately encoded by hand and entered with .cfi_escape (with 
appropriate comments saying what operations are being encoded).  You could 
probably write an assembler macro to do the encoding (though if the 
offsets in question don't all have the same length when encoded as 
sleb128, that complicates the macro a bit).
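
For instance, for %r12 (DWARF register 12) saved at 168(%rsp), where
sleb128 (168) encodes as 0xa8, 0x01:

movq %r12, 168(%rsp)
/* DW_CFA_expression (0x10) %r12 (0x0c), length 4:
   DW_OP_drop (0x13); DW_OP_breg7 (0x77) 168 (sleb128 0xa8, 0x01).  */
.cfi_escape 0x10, 0x0c, 0x04, 0x13, 0x77, 0xa8, 0x01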
  
Andrew Senkevich May 19, 2015, 7:38 p.m. UTC | #8
> appropriate comments saying what operations are being encoded).  You could
> probably write an assembler macro to do the encoding (though if the
> offsets in question don't all have the same length when encoded as
> sleb128, that complicates the macro a bit).

Where should that macro go, in sysdeps/x86_64/sysdep.h?


--
WBR,
Andrew
  
Joseph Myers May 20, 2015, 3:30 p.m. UTC | #9
On Tue, 19 May 2015, Andrew Senkevich wrote:

> > appropriate comments saying what operations are being encoded).  You could
> > probably write an assembler macro to do the encoding (though if the
> > offsets in question don't all have the same length when encoded as
> > sleb128, that complicates the macro a bit).
> 
> Where should that macro go, in sysdeps/x86_64/sysdep.h?

That might be plausible, depending on what the macro looks like (whether 
it's something generic for any x86_64 code saving a register on the stack 
when CFA is something else, or more specific to this libmvec code).
  

Patch

diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h
index e8e5577..85a6a95 100644
--- a/math/bits/mathcalls.h
+++ b/math/bits/mathcalls.h
@@ -60,7 +60,7 @@  __MATHCALL (atan,, (_Mdouble_ __x));
 __MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x));

 /* Cosine of X.  */
-__MATHCALL (cos,, (_Mdouble_ __x));
+__MATHCALL_VEC (cos,, (_Mdouble_ __x));
 /* Sine of X.  */
 __MATHCALL (sin,, (_Mdouble_ __x));
 /* Tangent of X.  */
diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
new file mode 100644
index 0000000..be6eaed
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
@@ -0,0 +1,6 @@ 
+GLIBC_2.22
+ GLIBC_2.22 A
+ _ZGVbN2v_cos F
+ _ZGVcN4v_cos F
+ _ZGVdN4v_cos F
+ _ZGVeN8v_cos F
diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h
new file mode 100644
index 0000000..6540268
--- /dev/null
+++ b/sysdeps/x86/fpu/bits/math-vector.h
@@ -0,0 +1,34 @@ 
+/* Platform-specific SIMD declarations of math functions.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _MATH_H
+# error "Never include <bits/math-vector.h> directly;\
+ include <math.h> instead."
+#endif
+
+/* Get default empty definitions for simd declarations.  */
+#include <bits/libm-simd-decl-stubs.h>
+
+#if defined __x86_64__ && defined __FAST_MATH__
+# if defined _OPENMP && _OPENMP >= 201307
+/* OpenMP case.  */
+#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
+#  undef __DECL_SIMD_cos
+#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
+# endif
+#endif
diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure
index 7d4dadd..1493523 100644
--- a/sysdeps/x86_64/configure
+++ b/sysdeps/x86_64/configure
@@ -275,6 +275,10 @@  fi
 config_vars="$config_vars
 config-cflags-avx2 = $libc_cv_cc_avx2"

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h

 # work around problem with autoconf and empty lines at the end of files
diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac
index c9f9a51..1c2b35f 100644
--- a/sysdeps/x86_64/configure.ac
+++ b/sysdeps/x86_64/configure.ac
@@ -99,6 +99,10 @@  if test $libc_cv_cc_avx2 = yes; then
 fi
 LIBC_CONFIG_VAR([config-cflags-avx2], [$libc_cv_cc_avx2])

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 dnl It is always possible to access static and hidden symbols in an
 dnl position independent way.
 AC_DEFINE(PI_STATIC_AND_HIDDEN)
diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile
new file mode 100644
index 0000000..c24188f
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Makefile
@@ -0,0 +1,5 @@ 
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core_sse svml_d_cos4_core_avx \
+                  svml_d_cos4_core_avx2 svml_d_cos8_core_avx512 \
+                  svml_d_cos_data init-arch
+endif
diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions
new file mode 100644
index 0000000..b38ed07
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Versions
@@ -0,0 +1,8 @@ 
+libmvec {
+  GLIBC_2.22 {
+    _ZGVbN2v_cos;
+    _ZGVcN4v_cos;
+    _ZGVdN4v_cos;
+    _ZGVeN8v_cos;
+  }
+}
diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
index 12b0526..062d8ce 100644
--- a/sysdeps/x86_64/fpu/multiarch/Makefile
+++ b/sysdeps/x86_64/fpu/multiarch/Makefile
@@ -51,3 +51,7 @@  CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
 CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
 endif
 endif
+
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core
+endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
new file mode 100644
index 0000000..29c6afd
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
@@ -0,0 +1,34 @@ 
+/* Multiple versions of vectorized cos.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
+       .text
+ENTRY(_ZGVbN2v_cos)
+        .type   _ZGVbN2v_cos, @gnu_indirect_function
+        cmpl    $0, KIND_OFFSET+__cpu_features(%rip)
+        jne     1f
+        call    __init_cpu_features
+1:      leaq    _ZGVbN2v_cos_sse4(%rip), %rax
+        testl   $bit_SSE4_1, __cpu_features+CPUID_OFFSET+index_SSE4_1(%rip)
+        jz      2f
+        ret
+2:      leaq    _ZGVbN2v_cos_sse2(%rip), %rax
+        ret
+END(_ZGVbN2v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
new file mode 100644
index 0000000..16da39b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
@@ -0,0 +1,225 @@ 
+/* Function cos vectorized with SSE2 and SSE4.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+#include "svml_d_cos_data.h"
+
+       .text
+ENTRY(_ZGVbN2v_cos_sse2)
+/* SSE2 version as wrapper to scalar.  */
+        subq      $40, %rsp
+        movaps    %xmm0, (%rsp)
+        call      cos@PLT
+        movsd     %xmm0, 16(%rsp)
+        movsd     8(%rsp), %xmm0
+        call      cos@PLT
+        movsd     16(%rsp), %xmm1
+        movsd     %xmm0, 24(%rsp)
+        unpcklpd  %xmm0, %xmm1
+        movaps    %xmm1, %xmm0
+        addq      $40, %rsp
+        ret
+END(_ZGVbN2v_cos_sse2)
+
+ENTRY(_ZGVbN2v_cos_sse4)
+/* ALGORITHM DESCRIPTION:
+ *
+ *      ( low accuracy ( < 4ulp ) or enhanced performance ( half of correct
+ *      mantissa ) implementation )
+ *
+ *      Argument representation:
+ *      arg + Pi/2 = (N*Pi + R)
+ *
+ *      Result calculation:
+ *      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+ *      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        movq      %rsp, %rbp
+        andq      $-64, %rsp
+        subq      $320, %rsp
+        movaps    %xmm0, %xmm3
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        movups    __dHalfPI(%rax), %xmm2
+
+/* ARGUMENT RANGE REDUCTION:
+ * Add Pi/2 to argument: X' = X+Pi/2
+ */
+        addpd     %xmm3, %xmm2
+        movups    __dInvPI(%rax), %xmm5
+        movups    __dAbsMask(%rax), %xmm4
+
+/* Get absolute argument value: X' = |X'| */
+        andps     %xmm2, %xmm4
+
+/* Y = X'*InvPi + RS : right shifter add */
+        mulpd     %xmm5, %xmm2
+
+/* Check for large arguments path */
+        cmpnlepd  __dRangeVal(%rax), %xmm4
+        movups    __dRShifter(%rax), %xmm6
+        addpd     %xmm6, %xmm2
+        movmskpd  %xmm4, %ecx
+
+/* N = Y - RS : right shifter sub */
+        movaps    %xmm2, %xmm1
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        psllq     $63, %xmm2
+        subpd     %xmm6, %xmm1
+
+/* N = N - 0.5 */
+        subpd     __dOneHalf(%rax), %xmm1
+        movups    __dPI1(%rax), %xmm7
+
+/* R = X - N*Pi1 */
+        mulpd     %xmm1, %xmm7
+        movups    __dPI2(%rax), %xmm4
+
+/* R = R - N*Pi2 */
+        mulpd     %xmm1, %xmm4
+        subpd     %xmm7, %xmm0
+        movups    __dPI3(%rax), %xmm5
+
+/* R = R - N*Pi3 */
+        mulpd     %xmm1, %xmm5
+        subpd     %xmm4, %xmm0
+
+/* R = R - N*Pi4 */
+        movups     __dPI4(%rax), %xmm6
+        mulpd     %xmm6, %xmm1
+        subpd     %xmm5, %xmm0
+        subpd     %xmm1, %xmm0
+
+/* POLYNOMIAL APPROXIMATION:
+ * R2 = R*R
+ */
+        movaps    %xmm0, %xmm4
+        mulpd     %xmm0, %xmm4
+        movups    __dC7(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC6(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC5(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC4(%rax), %xmm1
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC3(%rax), %xmm1
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC2(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC1(%rax), %xmm1
+        mulpd     %xmm1, %xmm4
+        mulpd     %xmm0, %xmm4
+        addpd     %xmm4, %xmm0
+
+/* RECONSTRUCTION:
+ * Final sign setting: Res = Poly^SignRes
+ */
+        xorps     %xmm2, %xmm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        movq      %rbp, %rsp
+        popq      %rbp
+        ret
+
+.LBL_1_3:
+        movups    %xmm3, 192(%rsp)
+        movups    %xmm0, 256(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        movups    %xmm8, 112(%rsp)
+        movups    %xmm9, 96(%rsp)
+        movups    %xmm10, 80(%rsp)
+        movups    %xmm11, 64(%rsp)
+        movups    %xmm12, 48(%rsp)
+        movups    %xmm13, 32(%rsp)
+        movups    %xmm14, 16(%rsp)
+        movups    %xmm15, (%rsp)
+        movq      %rsi, 136(%rsp)
+        movq      %rdi, 128(%rsp)
+        movq      %r12, 168(%rsp)
+        movb      %dl, %r12b
+        movq      %r13, 160(%rsp)
+        movl      %ecx, %r13d
+        movq      %r14, 152(%rsp)
+        movl      %eax, %r14d
+        movq      %r15, 144(%rsp)
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        movups    112(%rsp), %xmm8
+        movups    96(%rsp), %xmm9
+        movups    80(%rsp), %xmm10
+        movups    64(%rsp), %xmm11
+        movups    48(%rsp), %xmm12
+        movups    32(%rsp), %xmm13
+        movups    16(%rsp), %xmm14
+        movups    (%rsp), %xmm15
+        movq      136(%rsp), %rsi
+        movq      128(%rsp), %rdi
+        movq      168(%rsp), %r12
+        movq      160(%rsp), %r13
+        movq      152(%rsp), %r14
+        movq      144(%rsp), %r15
+        movups    256(%rsp), %xmm0
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     200(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 264(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     192(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 256(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END(_ZGVbN2v_cos_sse4)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
new file mode 100644
index 0000000..5c24f42
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
@@ -0,0 +1,39 @@ 
+/* Function cos vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+       .text
+ENTRY(_ZGVcN4v_cos)
+        pushq          %rbp
+        movq           %rsp, %rbp
+        andq           $-32, %rsp
+        subq           $32, %rsp
+        vextractf128   $1, %ymm0, (%rsp)
+        vzeroupper
+        call           _ZGVbN2v_cos@PLT
+        vmovapd                %xmm0, 16(%rsp)
+        vmovaps                (%rsp), %xmm0
+        call           _ZGVbN2v_cos@PLT
+        vmovapd                %xmm0, %xmm1
+        vmovapd                16(%rsp), %xmm0
+        vinsertf128    $1, %xmm1, %ymm0, %ymm0
+        movq           %rbp, %rsp
+        popq           %rbp
+        ret
+END(_ZGVcN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
new file mode 100644
index 0000000..28f661c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
@@ -0,0 +1,193 @@ 
+/* Function cos vectorized with AVX2.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+
+       .text
+ENTRY(_ZGVdN4v_cos)
+
+/* ALGORITHM DESCRIPTION:
+ *
+ *    ( low accuracy ( < 4ulp ) or enhanced performance
+ *      ( half of correct mantissa ) implementation )
+ *
+ *    Argument representation:
+ *    arg + Pi/2 = (N*Pi + R)
+ *
+ *    Result calculation:
+ *    cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+ *    sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        movq      %rsp, %rbp
+        andq      $-64, %rsp
+        subq      $448, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        vmovapd   %ymm0, %ymm1
+        vmovupd __dInvPI(%rax), %ymm4
+        vmovupd __dRShifter(%rax), %ymm5
+
+/*
+ * ARGUMENT RANGE REDUCTION:
+ * Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %ymm1, %ymm7
+
+/* Get absolute argument value: X' = |X'| */
+        vandpd __dAbsMask(%rax), %ymm7, %ymm2
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd %ymm5, %ymm4, %ymm7
+        vmovupd __dC7(%rax), %ymm4
+
+/* Check for large arguments path */
+        vcmpnle_uqpd __dRangeVal(%rax), %ymm2, %ymm3
+
+/* N = Y - RS : right shifter sub */
+        vsubpd    %ymm5, %ymm7, %ymm6
+        vmovupd __dPI1_FMA(%rax), %ymm2
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %ymm7, %ymm7
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %ymm6, %ymm0
+        vmovmskpd %ymm3, %ecx
+
+/* R = X - N*Pi1 */
+        vmovapd   %ymm1, %ymm3
+        vfnmadd231pd %ymm0, %ymm2, %ymm3
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %ymm0, %ymm3
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %ymm3, %ymm0
+
+/*
+ * POLYNOMIAL APPROXIMATION:
+ * R2 = R*R
+ */
+        vmulpd    %ymm0, %ymm0, %ymm5
+        vfmadd213pd __dC6(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC5(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC4(%rax), %ymm5, %ymm4
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %ymm5, %ymm4
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC1(%rax), %ymm5, %ymm4
+        vmulpd    %ymm5, %ymm4, %ymm6
+        vfmadd213pd %ymm0, %ymm0, %ymm6
+
+/*
+ * RECONSTRUCTION:
+ * Final sign setting: Res = Poly^SignRes
+ */
+        vxorpd    %ymm7, %ymm6, %ymm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        movq      %rbp, %rsp
+        popq      %rbp
+        ret
+
+.LBL_1_3:
+        vmovupd   %ymm1, 320(%rsp)
+        vmovupd   %ymm0, 384(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        vmovups   %ymm8, 224(%rsp)
+        vmovups   %ymm9, 192(%rsp)
+        vmovups   %ymm10, 160(%rsp)
+        vmovups   %ymm11, 128(%rsp)
+        vmovups   %ymm12, 96(%rsp)
+        vmovups   %ymm13, 64(%rsp)
+        vmovups   %ymm14, 32(%rsp)
+        vmovups   %ymm15, (%rsp)
+        movq      %rsi, 264(%rsp)
+        movq      %rdi, 256(%rsp)
+        movq      %r12, 296(%rsp)
+        movb      %dl, %r12b
+        movq      %r13, 288(%rsp)
+        movl      %ecx, %r13d
+        movq      %r14, 280(%rsp)
+        movl      %eax, %r14d
+        movq      %r15, 272(%rsp)
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        vmovups   224(%rsp), %ymm8
+        vmovups   192(%rsp), %ymm9
+        vmovups   160(%rsp), %ymm10
+        vmovups   128(%rsp), %ymm11
+        vmovups   96(%rsp), %ymm12
+        vmovups   64(%rsp), %ymm13
+        vmovups   32(%rsp), %ymm14
+        vmovups   (%rsp), %ymm15
+        vmovupd   384(%rsp), %ymm0
+        movq      264(%rsp), %rsi
+        movq      256(%rsp), %rdi
+        movq      296(%rsp), %r12
+        movq      288(%rsp), %r13
+        movq      280(%rsp), %r14
+        movq      272(%rsp), %r15
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    328(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 392(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    320(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 384(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END(_ZGVdN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
new file mode 100644
index 0000000..cb65750
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
@@ -0,0 +1,247 @@ 
+/* Function cos vectorized with AVX-512, KNL version.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+
+       .text
+ENTRY(_ZGVeN8v_cos)
+#ifndef HAVE_AVX512_ASM_SUPPORT
+/* Implemented as wrapper to AVX2 version.  */
+        pushq  %rbp
+        movq   %rsp, %rbp
+        andq   $-64, %rsp
+        subq   $64, %rsp
+/* Below is encoding for vmovaps %zmm0, (%rsp).  */
+        .byte  0x62
+        .byte  0xf1
+        .byte  0x7c
+        .byte  0x48
+        .byte  0x29
+        .byte  0x04
+        .byte  0x24
+/* Below is encoding for vmovapd (%rsp), %ymm0.  */
+        .byte  0xc5
+        .byte  0xfd
+        .byte  0x28
+        .byte  0x04
+        .byte  0x24
+        call   _ZGVdN4v_cos@PLT
+/* Below is encoding for vmovapd 32(%rsp), %ymm0.  */
+        .byte  0xc5
+        .byte  0xfd
+        .byte  0x28
+        .byte  0x44
+        .byte  0x24
+        .byte  0x20
+        call   _ZGVdN4v_cos@PLT
+        movq   %rbp, %rsp
+        popq   %rbp
+        ret
+#else
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *      ( low accuracy ( < 4ulp ) or
+ *        enhanced performance ( half of correct mantissa ) implementation )
+ *
+ *      Argument representation:
+ *      arg + Pi/2 = (N*Pi + R)
+ *
+ *      Result calculation:
+ *      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+ *      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        movq      %rsp, %rbp
+        andq      $-64, %rsp
+        subq      $1280, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+
+/* R = X - N*Pi1 */
+        vmovaps   %zmm0, %zmm7
+
+/* Check for large arguments path */
+        movq      $-1, %rcx
+
+/*
+ * ARGUMENT RANGE REDUCTION:
+ * Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %zmm0, %zmm5
+        vmovups __dInvPI(%rax), %zmm3
+
+/* Get absolute argument value: X' = |X'| */
+        vpandq __dAbsMask(%rax), %zmm5, %zmm1
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd __dRShifter(%rax), %zmm3, %zmm5
+        vmovups __dPI1_FMA(%rax), %zmm6
+
+/* N = Y - RS : right shifter sub */
+        vsubpd __dRShifter(%rax), %zmm5, %zmm4
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %zmm5, %zmm12
+        vmovups __dC7(%rax), %zmm8
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %zmm4, %zmm10
+        vcmppd    $22, __dRangeVal(%rax), %zmm1, %k1
+        vpbroadcastq %rcx, %zmm2{%k1}{z}
+        vfnmadd231pd %zmm10, %zmm6, %zmm7
+        vptestmq  %zmm2, %zmm2, %k0
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %zmm10, %zmm7
+        kmovw     %k0, %ecx
+        movzbl    %cl, %ecx
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %zmm7, %zmm10
+
+/*
+ * POLYNOMIAL APPROXIMATION:
+ * R2 = R*R
+ */
+        vmulpd    %zmm10, %zmm10, %zmm9
+        vfmadd213pd __dC6(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC5(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC4(%rax), %zmm9, %zmm8
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %zmm9, %zmm8
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC1(%rax), %zmm9, %zmm8
+        vmulpd    %zmm9, %zmm8, %zmm11
+        vfmadd213pd %zmm10, %zmm10, %zmm11
+
+/*
+ * RECONSTRUCTION:
+ * Final sign setting: Res = Poly^SignRes
+ */
+        vpxorq    %zmm12, %zmm11, %zmm1
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        vmovaps   %zmm1, %zmm0
+        movq      %rbp, %rsp
+        popq      %rbp
+        ret
+
+.LBL_1_3:
+        vmovups   %zmm0, 1152(%rsp)
+        vmovups   %zmm1, 1216(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        kmovw     %k4, 1048(%rsp)
+        xorl      %eax, %eax
+        kmovw     %k5, 1040(%rsp)
+        kmovw     %k6, 1032(%rsp)
+        kmovw     %k7, 1024(%rsp)
+        vmovups   %zmm16, 960(%rsp)
+        vmovups   %zmm17, 896(%rsp)
+        vmovups   %zmm18, 832(%rsp)
+        vmovups   %zmm19, 768(%rsp)
+        vmovups   %zmm20, 704(%rsp)
+        vmovups   %zmm21, 640(%rsp)
+        vmovups   %zmm22, 576(%rsp)
+        vmovups   %zmm23, 512(%rsp)
+        vmovups   %zmm24, 448(%rsp)
+        vmovups   %zmm25, 384(%rsp)
+        vmovups   %zmm26, 320(%rsp)
+        vmovups   %zmm27, 256(%rsp)
+        vmovups   %zmm28, 192(%rsp)
+        vmovups   %zmm29, 128(%rsp)
+        vmovups   %zmm30, 64(%rsp)
+        vmovups   %zmm31, (%rsp)
+        movq      %rsi, 1064(%rsp)
+        movq      %rdi, 1056(%rsp)
+        movq      %r12, 1096(%rsp)
+        movb      %dl, %r12b
+        movq      %r13, 1088(%rsp)
+        movl      %ecx, %r13d
+        movq      %r14, 1080(%rsp)
+        movl      %eax, %r14d
+        movq      %r15, 1072(%rsp)
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        addb      $1, %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        kmovw     1048(%rsp), %k4
+        movq      1064(%rsp), %rsi
+        kmovw     1040(%rsp), %k5
+        movq      1056(%rsp), %rdi
+        kmovw     1032(%rsp), %k6
+        movq      1096(%rsp), %r12
+        movq      1088(%rsp), %r13
+        kmovw     1024(%rsp), %k7
+        vmovups   960(%rsp), %zmm16
+        vmovups   896(%rsp), %zmm17
+        vmovups   832(%rsp), %zmm18
+        vmovups   768(%rsp), %zmm19
+        vmovups   704(%rsp), %zmm20
+        vmovups   640(%rsp), %zmm21
+        vmovups   576(%rsp), %zmm22
+        vmovups   512(%rsp), %zmm23
+        vmovups   448(%rsp), %zmm24
+        vmovups   384(%rsp), %zmm25
+        vmovups   320(%rsp), %zmm26
+        vmovups   256(%rsp), %zmm27
+        vmovups   192(%rsp), %zmm28
+        vmovups   128(%rsp), %zmm29
+        vmovups   64(%rsp), %zmm30
+        vmovups   (%rsp), %zmm31
+        movq      1080(%rsp), %r14
+        movq      1072(%rsp), %r15
+        vmovups   1216(%rsp), %zmm1
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1160(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1224(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1152(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1216(%rsp,%r15)
+        jmp       .LBL_1_7
+#endif
+END(_ZGVeN8v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.S b/sysdeps/x86_64/fpu/svml_d_cos_data.S
new file mode 100644
index 0000000..6f36ecd
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.S
@@ -0,0 +1,114 @@ 
+/* Data for vectorized cos.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "svml_d_cos_data.h"
+
+.macro double_vector offset value
+.if .-__svml_dcos_data != \offset
+.err
+.endif
+.rept 8
+.quad \value
+.endr
+.endm
+
+       .section .rodata, "a"
+       .align 64
+
+/* Data table for vector implementations of function cos.
+ * The table may contain polynomial, reduction, lookup
+ * coefficients and other constants obtained through different
+ * methods of research and experimental work.
+ */
+       .globl __svml_dcos_data
+__svml_dcos_data:
+
+/* General purpose constants:
+ * absolute value mask
+ */
+double_vector __dAbsMask 0x7fffffffffffffff
+
+/* working range threshold */
+double_vector __dRangeVal 0x4160000000000000
+
+/* PI/2 */
+double_vector __dHalfPI 0x3ff921fb54442d18
+
+/* 1/PI */
+double_vector __dInvPI 0x3fd45f306dc9c883
+
+/* right-shifter constant */
+double_vector __dRShifter 0x4338000000000000
+
+/* 0.5 */
+double_vector __dOneHalf 0x3fe0000000000000
+
+/* Range reduction PI-based constants:
+ * PI high part
+ */
+double_vector __dPI1 0x400921fb40000000
+
+/* PI mid  part 1 */
+double_vector __dPI2 0x3e84442d00000000
+
+/* PI mid  part 2 */
+double_vector __dPI3 0x3d08469880000000
+
+/* PI low  part */
+double_vector __dPI4 0x3b88cc51701b839a
+
+/* Range reduction PI-based constants if FMA available:
+ * PI high part (FMA available)
+ */
+double_vector __dPI1_FMA 0x400921fb54442d18
+
+/* PI mid part  (FMA available) */
+double_vector __dPI2_FMA 0x3ca1a62633145c06
+
+/* PI low part  (FMA available) */
+double_vector __dPI3_FMA 0x395c1cd129024e09
+
+/* Polynomial coefficients (relative error 2^(-52.115)): */
+double_vector __dC1 0xbfc55555555554a7
+double_vector __dC2 0x3f8111111110a4a8
+double_vector __dC3 0xbf2a01a019a5b86d
+double_vector __dC4 0x3ec71de38030fea0
+double_vector __dC5 0xbe5ae63546002231
+double_vector __dC6 0x3de60e6857a2f220
+double_vector __dC7 0xbd69f0d60811aac8
+
+/*
+ * Additional constants:
+ * absolute value mask
+ */
+double_vector __dAbsMask_la 0x7fffffffffffffff
+
+/* 1/PI */
+double_vector __dInvPI_la 0x3fd45f306dc9c883
+
+/* right-shifer for low accuracy version */
+double_vector __dRShifter_la 0x4330000000000000
+
+/* right-shifer-1.0 for low accuracy version */
+double_vector __dRShifterm5_la 0x432fffffffffffff
+
+/* right-shifer with low mask for low accuracy version */
+double_vector __dRXmax_la 0x43300000007ffffe
+
+       .type   __svml_dcos_data,@object
+       .size   __svml_dcos_data,.-__svml_dcos_data
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.h b/sysdeps/x86_64/fpu/svml_d_cos_data.h
new file mode 100644
index 0000000..21fa806
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.h
@@ -0,0 +1,48 @@ 
+/* Offsets for data table for vectorized cos.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef D_COS_DATA_H
+#define D_COS_DATA_H
+
+#define __dAbsMask              0
+#define __dRangeVal             64
+#define __dHalfPI               128
+#define __dInvPI                192
+#define __dRShifter             256
+#define __dOneHalf              320
+#define __dPI1                  384
+#define __dPI2                  448
+#define __dPI3                  512
+#define __dPI4                  576
+#define __dPI1_FMA              640
+#define __dPI2_FMA              704
+#define __dPI3_FMA              768
+#define __dC1                   832
+#define __dC2                   896
+#define __dC3                   960
+#define __dC4                   1024
+#define __dC5                   1088
+#define __dC6                   1152
+#define __dC7                   1216
+#define __dAbsMask_la           1280
+#define __dInvPI_la             1344
+#define __dRShifter_la          1408
+#define __dRShifterm5_la        1472
+#define __dRXmax_la             1536
+
+#endif