[1/14,x86_64] Vector math functions (vector cos)

Message ID CAMXFM3u57e==ySd8TF7+qFwmrswcT1jqzcN-n_JDPF45+77Z0g@mail.gmail.com
State Committed

Commit Message

Andrew Senkevich May 20, 2015, 6:15 p.m. UTC
  Here is the fixed patch and ChangeLog.

2015-05-20  Andrew Senkevich  <andrew.senkevich@intel.com>

        * sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
        * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
        * sysdeps/x86_64/fpu/Makefile: New file.
        * sysdeps/x86_64/fpu/Versions: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-support): Added
        build of the SSE and AVX512 versions, which are selected via IFUNC.
        * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
        * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
        * sysdeps/x86_64/configure.ac: Options for libmvec build.
        * sysdeps/x86_64/configure: Regenerated.
        * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
        * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
        * elf/Makefile (localplt-built-dso): libmvec added to localplt testing.
        * include/libc-symbols.h: Added libmvec_hidden_* macro series.


Is it ok?


--
WBR,
Andrew
  

Comments

Andrew Senkevich May 21, 2015, 4:24 p.m. UTC | #1
Some typos were found; a fixed patch is attached.


--
WBR,
Andrew
  
Joseph Myers May 22, 2015, 3:31 p.m. UTC | #2
On Thu, 21 May 2015, Andrew Senkevich wrote:

> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
> index 12b0526..5ccf97b 100644
> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>  endif
>  endif
> +
> +ifeq ($(subdir),mathvec)
> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
> +endif

As far as I can tell, if you use --disable-multi-arch, then these files 
won't be built into libmvec, and nothing else will provide the 
_ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for 
examples of how such things are handled - typically, the file outside the 
multiarch directory implements things for an always-supported architecture 
variant (in this case, that would be a variant guaranteed to be supported 
if the given entry point gets called), then, in the multiarch directory, 
there are the implementations for other variants, and a file with the same 
name as that outside the multiarch directory, that (a) provides the IFUNC 
resolver and (b) defines some macros before #including the file in the 
directory above, so that the basic version of the function gets defined 
under a different name.

The elf/Makefile and include/libc-symbols.h changes are OK on their own - 
I think it's best for them to go in now rather than together with the 
first function implementations.
  
Andrew Senkevich May 22, 2015, 4:45 p.m. UTC | #3
2015-05-22 18:31 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
> The elf/Makefile and include/libc-symbols.h changes are OK on their own -
> I think it's best for them to go in now rather than together with the
> first function implementations.

Does the extracted part need to be posted as a separate patch?


--
WBR,
Andrew
  
Joseph Myers May 22, 2015, 5:07 p.m. UTC | #4
On Fri, 22 May 2015, Andrew Senkevich wrote:

> 2015-05-22 18:31 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
> > The elf/Makefile and include/libc-symbols.h changes are OK on their own -
> > I think it's best for them to go in now rather than together with the
> > first function implementations.
> 
> Is it needed to post extracted part as separate patch?

I don't think so.
  
Andrew Senkevich May 25, 2015, 6:26 p.m. UTC | #5
2015-05-22 18:31 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
> On Thu, 21 May 2015, Andrew Senkevich wrote:
>
>> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
>> index 12b0526..5ccf97b 100644
>> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
>> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
>> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>>  endif
>>  endif
>> +
>> +ifeq ($(subdir),mathvec)
>> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
>> +endif
>
> As far as I can tell, if you use --disable-multi-arch, then these files
> won't be built into libmvec, and nothing else will provide the
> _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
> examples of how such things are handled - typically, the file outside the
> multiarch directory implements things for an always-supported architecture
> variant (in this case, that would be a variant guaranteed to be supported
> if the given entry point gets called), then, in the multiarch directory,
> there are the implementations for other variants, and a file with the same
> name as that outside the multiarch directory, that (a) provides the IFUNC
> resolver and (b) defines some macros before #including the file in the
> directory above, so that the basic version of the function gets defined
> under a different name.

Hi Joseph, the updated patch is attached. Is it OK?

2015-05-25  Andrew Senkevich  <andrew.senkevich@intel.com>

        * sysdeps/x86_64/fpu/Makefile: New file.
        * sysdeps/x86_64/fpu/Versions: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
        * sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
        * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
        build of the SSE, AVX2 and AVX512 versions, which are selected via IFUNC.
        * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
        * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
        * sysdeps/x86_64/configure.ac: Options for libmvec build.
        * sysdeps/x86_64/configure: Regenerated.
        * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
        * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.


--
WBR,
Andrew
  
Andrew Senkevich June 3, 2015, 11:55 a.m. UTC | #6
2015-05-25 21:26 GMT+03:00 Andrew Senkevich <andrew.n.senkevich@gmail.com>:
> 2015-05-22 18:31 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
>> On Thu, 21 May 2015, Andrew Senkevich wrote:
>>
>>> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> index 12b0526..5ccf97b 100644
>>> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
>>> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>>>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>>>  endif
>>>  endif
>>> +
>>> +ifeq ($(subdir),mathvec)
>>> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
>>> +endif
>>
>> As far as I can tell, if you use --disable-multi-arch, then these files
>> won't be built into libmvec, and nothing else will provide the
>> _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
>> examples of how such things are handled - typically, the file outside the
>> multiarch directory implements things for an always-supported architecture
>> variant (in this case, that would be a variant guaranteed to be supported
>> if the given entry point gets called), then, in the multiarch directory,
>> there are the implementations for other variants, and a file with the same
>> name as that outside the multiarch directory, that (a) provides the IFUNC
>> resolver and (b) defines some macros before #including the file in the
>> directory above, so that the basic version of the function gets defined
>> under a different name.
>
> Hi, Joseph, updated patch is attached. Is it ok?
>
> 2015-05-25  Andrew Senkevich  <andrew.senkevich@intel.com>
>
>         * sysdeps/x86_64/fpu/Makefile: New file.
>         * sysdeps/x86_64/fpu/Versions: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
>         build of SSE, AVX2 and AVX512 versions which are IFUNC.
>         * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
>         * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
>         * sysdeps/x86_64/configure.ac: Options for libmvec build.
>         * sysdeps/x86_64/configure: Regenerated.
>         * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
>         * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.

Ping.


--
WBR,
Andrew
  
Joseph Myers June 4, 2015, 4:50 p.m. UTC | #7
On Mon, 25 May 2015, Andrew Senkevich wrote:

> > As far as I can tell, if you use --disable-multi-arch, then these files
> > won't be built into libmvec, and nothing else will provide the
> > _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
> > examples of how such things are handled - typically, the file outside the
> > multiarch directory implements things for an always-supported architecture
> > variant (in this case, that would be a variant guaranteed to be supported
> > if the given entry point gets called), then, in the multiarch directory,
> > there are the implementations for other variants, and a file with the same
> > name as that outside the multiarch directory, that (a) provides the IFUNC
> > resolver and (b) defines some macros before #including the file in the
> > directory above, so that the basic version of the function gets defined
> > under a different name.
> 
> Hi, Joseph, updated patch is attached. Is it ok?

OK provided you've tested this both with and without --disable-multi-arch.
  
Andrew Senkevich June 4, 2015, 5:22 p.m. UTC | #8
2015-06-04 19:50 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
> On Mon, 25 May 2015, Andrew Senkevich wrote:
>
>> > As far as I can tell, if you use --disable-multi-arch, then these files
>> > won't be built into libmvec, and nothing else will provide the
>> > _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
>> > examples of how such things are handled - typically, the file outside the
>> > multiarch directory implements things for an always-supported architecture
>> > variant (in this case, that would be a variant guaranteed to be supported
>> > if the given entry point gets called), then, in the multiarch directory,
>> > there are the implementations for other variants, and a file with the same
>> > name as that outside the multiarch directory, that (a) provides the IFUNC
>> > resolver and (b) defines some macros before #including the file in the
>> > directory above, so that the basic version of the function gets defined
>> > under a different name.
>>
>> Hi, Joseph, updated patch is attached. Is it ok?
>
> OK provided you've tested this both with and without --disable-multi-arch.

Thank you for the review; of course, both cases were tested.


--
WBR,
Andrew
  
Joseph Myers June 5, 2015, 4:44 p.m. UTC | #9
Note that the addition of the first libmvec functions should be 
accompanied by a NEWS entry describing this new feature in 2.22.  (That 
NEWS entry can then be updated for each new function added - of course 
anything added after 2.22 is released gets a separate NEWS entry for 
2.23.)
  
Adhemerval Zanella Netto June 9, 2015, 6:11 p.m. UTC | #10
I am seeing this issue with a master build using binutils 2.24:

../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Assembler messages:
../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:281: Error: operand type mismatch for `vandpd'
../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:327: Error: operand type mismatch for `vxorpd'


On 25-05-2015 15:26, Andrew Senkevich wrote:
> 2015-05-22 18:31 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
>> On Thu, 21 May 2015, Andrew Senkevich wrote:
>>
>>> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> index 12b0526..5ccf97b 100644
>>> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
>>> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>>>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>>>  endif
>>>  endif
>>> +
>>> +ifeq ($(subdir),mathvec)
>>> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
>>> +endif
>>
>> As far as I can tell, if you use --disable-multi-arch, then these files
>> won't be built into libmvec, and nothing else will provide the
>> _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
>> examples of how such things are handled - typically, the file outside the
>> multiarch directory implements things for an always-supported architecture
>> variant (in this case, that would be a variant guaranteed to be supported
>> if the given entry point gets called), then, in the multiarch directory,
>> there are the implementations for other variants, and a file with the same
>> name as that outside the multiarch directory, that (a) provides the IFUNC
>> resolver and (b) defines some macros before #including the file in the
>> directory above, so that the basic version of the function gets defined
>> under a different name.
> 
> Hi, Joseph, updated patch is attached. Is it ok?
> 
> 2015-05-25  Andrew Senkevich  <andrew.senkevich@intel.com>
> 
>         * sysdeps/x86_64/fpu/Makefile: New file.
>         * sysdeps/x86_64/fpu/Versions: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
>         build of SSE, AVX2 and AVX512 versions which are IFUNC.
>         * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
>         * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
>         * sysdeps/x86_64/configure.ac: Options for libmvec build.
>         * sysdeps/x86_64/configure: Regenerated.
>         * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
>         * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
> 
> 
> --
> WBR,
> Andrew
>
  
Martin Sebor June 9, 2015, 6:40 p.m. UTC | #11
This patch breaks x86_64 builds with Binutils 2.24 (Fedora 21):

../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Assembler messages:
../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:281: Error: operand type mismatch for `vandpd'
../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:327: Error: operand type mismatch for `vxorpd'
/build/glibc-trunk/sysd-rules:1549: recipe for target '/build/glibc-trunk/mathvec/svml_d_cos8_core_avx512.o' failed

(with other similar errors in other files).

The first instruction the assembler complains about is:

         vandpd 0(%rax), %zmm6, %zmm1

The assembler (as) from Binutils 2.25 accepts the code.

Martin

On 05/25/2015 12:26 PM, Andrew Senkevich wrote:
> 2015-05-22 18:31 GMT+03:00 Joseph Myers <joseph@codesourcery.com>:
>> On Thu, 21 May 2015, Andrew Senkevich wrote:
>>
>>> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> index 12b0526..5ccf97b 100644
>>> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
>>> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>>>   CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>>>   endif
>>>   endif
>>> +
>>> +ifeq ($(subdir),mathvec)
>>> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
>>> +endif
>>
>> As far as I can tell, if you use --disable-multi-arch, then these files
>> won't be built into libmvec, and nothing else will provide the
>> _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
>> examples of how such things are handled - typically, the file outside the
>> multiarch directory implements things for an always-supported architecture
>> variant (in this case, that would be a variant guaranteed to be supported
>> if the given entry point gets called), then, in the multiarch directory,
>> there are the implementations for other variants, and a file with the same
>> name as that outside the multiarch directory, that (a) provides the IFUNC
>> resolver and (b) defines some macros before #including the file in the
>> directory above, so that the basic version of the function gets defined
>> under a different name.
>
> Hi, Joseph, updated patch is attached. Is it ok?
>
> 2015-05-25  Andrew Senkevich  <andrew.senkevich@intel.com>
>
>          * sysdeps/x86_64/fpu/Makefile: New file.
>          * sysdeps/x86_64/fpu/Versions: New file.
>          * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
>          * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
>          * sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
>          * sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
>          * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
>          * sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
>          * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
>          * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
>          * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
>          * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
>          * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
>          * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
>          * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
>          * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
>          build of SSE, AVX2 and AVX512 versions which are IFUNC.
>          * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
>          * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
>          * sysdeps/x86_64/configure.ac: Options for libmvec build.
>          * sysdeps/x86_64/configure: Regenerated.
>          * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
>          * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
>
>
> --
> WBR,
> Andrew
>
  
Andrew Senkevich June 9, 2015, 7:48 p.m. UTC | #12
2015-06-09 21:40 GMT+03:00 Martin Sebor <msebor@redhat.com>:
> This patch breaks x86_64 builds with Binutils 2.24 (Fedora 21):
>
> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Assembler
> messages:
> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:281: Error:
> operand type mismatch for `vandpd'
> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:327: Error:
> operand type mismatch for `vxorpd'
> /build/glibc-trunk/sysd-rules:1549: recipe for target
> '/build/glibc-trunk/mathvec/svml_d_cos8_core_avx512.o' failed
>
> (with other similar errors in other files).
>
> The first instruction the assembler complains about is:
>
>         vandpd 0(%rax), %zmm6, %zmm1
>
> As from Binutils 2.25 accepts the code.

I have tested the build with a manually built Binutils 2.24 (downloaded from
ftp://sourceware.org/pub/binutils/snapshots/binutils-2.24.90.tar.bz2)
on x86_64 Fedora 19, configured and built with no additional options.

How was Binutils built on your side, and what is the exact version?
  
Joseph Myers June 9, 2015, 7:59 p.m. UTC | #13
On Tue, 9 Jun 2015, Andrew Senkevich wrote:

> I have tested build with manually built Binutils 2.24 (downloaded from
> ftp://sourceware.org/pub/binutils/snapshots/binutils-2.24.90.tar.bz2)
> on x86_64 Fedora 19 with configure/make with no addition options.

2.24.90 means a development version in between 2.24 and 2.25.  Try the
actual 2.24 release.

If you can't get consensus on requiring a binutils version recent enough 
for this code, you'll need to make the x86_64 configure fragment disable 
libmvec by default if the assembler is too old (and make NEWS and 
install.texi note the requirement when saying it's on by default for 
x86_64).
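
A configure-time probe along these lines could look roughly like the sketch below (hand-written, not glibc's actual configure fragment; `check_avx512_as` is a made-up name, and the probed instruction is one of the two that binutils 2.24 rejects):

```shell
# Try to assemble the EVEX-encoded instruction that old binutils
# rejects; print yes/no so a configure fragment could disable libmvec
# (or fall back) when the assembler is too old.
check_avx512_as () {
  # $1 is the assembler command to probe
  dir=$(mktemp -d) || return 1
  printf 'vandpd (%%rax), %%zmm6, %%zmm1\n' > "$dir/conftest.s"
  if "$1" --64 -o "$dir/conftest.o" "$dir/conftest.s" 2>/dev/null; then
    result=yes
  else
    result=no
  fi
  rm -rf "$dir"
  printf '%s\n' "$result"
}

check_avx512_as "${AS:-as}"
```

Per the reports in this thread, this would print no against the 2.24 release and yes against the 2.24.90 snapshot or 2.25.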
  
Martin Sebor June 9, 2015, 8:07 p.m. UTC | #14
On 06/09/2015 01:48 PM, Andrew Senkevich wrote:
> 2015-06-09 21:40 GMT+03:00 Martin Sebor <msebor@redhat.com>:
>> This patch breaks x86_64 builds with Binutils 2.24 (Fedora 21):
>>
>> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Assembler
>> messages:
>> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:281: Error:
>> operand type mismatch for `vandpd'
>> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:327: Error:
>> operand type mismatch for `vxorpd'
>> /build/glibc-trunk/sysd-rules:1549: recipe for target
>> '/build/glibc-trunk/mathvec/svml_d_cos8_core_avx512.o' failed
>>
>> (with other similar errors in other files).
>>
>> The first instruction the assembler complains about is:
>>
>>          vandpd 0(%rax), %zmm6, %zmm1
>>
>> As from Binutils 2.25 accepts the code.
>
> I have tested build with manually built Binutils 2.24 (downloaded from
> ftp://sourceware.org/pub/binutils/snapshots/binutils-2.24.90.tar.bz2)
> on x86_64 Fedora 19 with configure/make with no addition options.
>
> How Binutils on your side was built and what is exact version?

Fedora 21's current Binutils is 2.24-30.fc21, a little older
than the snapshot you used.

Martin
  
Adhemerval Zanella Netto June 9, 2015, 8:35 p.m. UTC | #15
On 09-06-2015 17:07, Martin Sebor wrote:
> On 06/09/2015 01:48 PM, Andrew Senkevich wrote:
>> 2015-06-09 21:40 GMT+03:00 Martin Sebor <msebor@redhat.com>:
>>> This patch breaks x86_64 builds with Binutils 2.24 (Fedora 21):
>>>
>>> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Assembler
>>> messages:
>>> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:281: Error:
>>> operand type mismatch for `vandpd'
>>> ../sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S:327: Error:
>>> operand type mismatch for `vxorpd'
>>> /build/glibc-trunk/sysd-rules:1549: recipe for target
>>> '/build/glibc-trunk/mathvec/svml_d_cos8_core_avx512.o' failed
>>>
>>> (with other similar errors in other files).
>>>
>>> The first instruction the assembler complains about is:
>>>
>>>          vandpd 0(%rax), %zmm6, %zmm1
>>>
>>> As from Binutils 2.25 accepts the code.
>>
>> I have tested build with manually built Binutils 2.24 (downloaded from
>> ftp://sourceware.org/pub/binutils/snapshots/binutils-2.24.90.tar.bz2)
>> on x86_64 Fedora 19 with configure/make with no addition options.
>>
>> How Binutils on your side was built and what is exact version?
> 
> Fedora 21's current Binutils is 2.24-30.fc21, a little older
> than the snapshot you used.
> 
> Martin
> 

Ubuntu 14.04.2's binutils shows the same issue (and based on the source
contents, I'd say it is the 2.24 release plus some backports).
  
Ondrej Bilka June 10, 2015, 8:04 a.m. UTC | #16
On Tue, Jun 09, 2015 at 07:59:57PM +0000, Joseph Myers wrote:
> On Tue, 9 Jun 2015, Andrew Senkevich wrote:
> 
> > I have tested build with manually built Binutils 2.24 (downloaded from
> > ftp://sourceware.org/pub/binutils/snapshots/binutils-2.24.90.tar.bz2)
> > on x86_64 Fedora 19 with configure/make with no addition options.
> 
> 2.24.90 means a development version in between 2.24 and 2.25.  Try actual 
> 2.24 release.
> 
> If you can't get consensus on requiring a binutils version recent enough 
> for this code, you'll need to make the x86_64 configure fragment disable 
> libmvec by default if the assembler is too old (and make NEWS and 
> install.texi note the requirement when saying it's on by default for 
> x86_64).
> 
No, Joseph, that's the wrong solution. You don't have to disable the entire
libmvec just because AVX512 isn't handled.

Instead, add a configure test for AVX512, change the makefile, and surround
the selection with ifdefs. We still check for SSE4 and don't add memcmp_sse4
if that configure test fails.
  
Joseph Myers June 10, 2015, 11:45 a.m. UTC | #17
On Wed, 10 Jun 2015, Ondřej Bílka wrote:

> No joseph, thats wrong solution. You don't have to disable entire mvec
> just because you don't handle avx512.
> 
> Instead add configure test for avx512 and change makefile and surround
> selection by ifdefs. We still check for sse4 and dont add memcmp_sse4 if
> that configure option failed.

The libmvec ABI must not depend on assembler features.  The shared library 
may or may not exist, and the functions may or may not simply be wrappers 
to the scalar versions with .byte encodings of AVX512 instructions, but 
building the library with AVX512-ABI functions omitted is not an option.
  

Patch

diff --git a/elf/Makefile b/elf/Makefile
index 34450ea..b06e0a7 100644
--- a/elf/Makefile
+++ b/elf/Makefile
@@ -990,6 +990,9 @@  localplt-built-dso := $(addprefix $(common-objpfx),\
   resolv/libresolv.so \
   crypt/libcrypt.so \
        )
+ifeq ($(build-mathvec),yes)
+localplt-built-dso += $(addprefix $(common-objpfx), mathvec/libmvec.so)
+endif
 ifeq ($(have-thread-library),yes)
 localplt-built-dso += $(filter-out %_nonshared.a, $(shared-thread-library))
 endif
diff --git a/include/libc-symbols.h b/include/libc-symbols.h
index ca3fe00..743b6f6 100644
--- a/include/libc-symbols.h
+++ b/include/libc-symbols.h
@@ -546,6 +546,26 @@  for linking")
 # define libm_hidden_data_ver(local, name)
 #endif

+#if IS_IN (libmvec)
+# define libmvec_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
+# define libmvec_hidden_tls_proto(name, attrs...) hidden_tls_proto (name, ##attrs)
+# define libmvec_hidden_def(name) hidden_def (name)
+# define libmvec_hidden_weak(name) hidden_weak (name)
+# define libmvec_hidden_ver(local, name) hidden_ver (local, name)
+# define libmvec_hidden_data_def(name) hidden_data_def (name)
+# define libmvec_hidden_data_weak(name) hidden_data_weak (name)
+# define libmvec_hidden_data_ver(local, name) hidden_data_ver (local, name)
+#else
+# define libmvec_hidden_proto(name, attrs...)
+# define libmvec_hidden_tls_proto(name, attrs...)
+# define libmvec_hidden_def(name)
+# define libmvec_hidden_weak(name)
+# define libmvec_hidden_ver(local, name)
+# define libmvec_hidden_data_def(name)
+# define libmvec_hidden_data_weak(name)
+# define libmvec_hidden_data_ver(local, name)
+#endif
+
 #if IS_IN (libresolv)
 # define libresolv_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
 # define libresolv_hidden_tls_proto(name, attrs...) \
diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h
index e8e5577..85a6a95 100644
--- a/math/bits/mathcalls.h
+++ b/math/bits/mathcalls.h
@@ -60,7 +60,7 @@  __MATHCALL (atan,, (_Mdouble_ __x));
 __MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x));

 /* Cosine of X.  */
-__MATHCALL (cos,, (_Mdouble_ __x));
+__MATHCALL_VEC (cos,, (_Mdouble_ __x));
 /* Sine of X.  */
 __MATHCALL (sin,, (_Mdouble_ __x));
 /* Tangent of X.  */
diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
new file mode 100644
index 0000000..be6eaed
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
@@ -0,0 +1,6 @@ 
+GLIBC_2.22
+ GLIBC_2.22 A
+ _ZGVbN2v_cos F
+ _ZGVcN4v_cos F
+ _ZGVdN4v_cos F
+ _ZGVeN8v_cos F
diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h
new file mode 100644
index 0000000..27294ce
--- /dev/null
+++ b/sysdeps/x86/fpu/bits/math-vector.h
@@ -0,0 +1,34 @@ 
+/* Platform-specific SIMD declarations of math functions.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _MATH_H
+# error "Never include <bits/math-vector.h> directly;\
+ include <math.h> instead."
+#endif
+
+/* Get default empty definitions for simd declarations.  */
+#include <bits/libm-simd-decl-stubs.h>
+
+#if defined __x86_64__ && defined __FAST_MATH__
+# if defined _OPENMP && _OPENMP >= 201307
+/* OpenMP case.  */
+#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
+#  undef __DECL_SIMD_cos
+#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
+# endif
+#endif
diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure
index 7d4dadd..1493523 100644
--- a/sysdeps/x86_64/configure
+++ b/sysdeps/x86_64/configure
@@ -275,6 +275,10 @@  fi
 config_vars="$config_vars
 config-cflags-avx2 = $libc_cv_cc_avx2"

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h

 # work around problem with autoconf and empty lines at the end of files
diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac
index c9f9a51..1c2b35f 100644
--- a/sysdeps/x86_64/configure.ac
+++ b/sysdeps/x86_64/configure.ac
@@ -99,6 +99,10 @@  if test $libc_cv_cc_avx2 = yes; then
 fi
 LIBC_CONFIG_VAR([config-cflags-avx2], [$libc_cv_cc_avx2])

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 dnl It is always possible to access static and hidden symbols in an
 dnl position independent way.
 AC_DEFINE(PI_STATIC_AND_HIDDEN)
diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile
new file mode 100644
index 0000000..9cbf68b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Makefile
@@ -0,0 +1,5 @@ 
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core_sse svml_d_cos4_core_avx \
+   svml_d_cos4_core_avx2 svml_d_cos8_core_avx512 \
+   svml_d_cos_data init-arch
+endif
diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions
new file mode 100644
index 0000000..b38ed07
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Versions
@@ -0,0 +1,8 @@ 
+libmvec {
+  GLIBC_2.22 {
+    _ZGVbN2v_cos;
+    _ZGVcN4v_cos;
+    _ZGVdN4v_cos;
+    _ZGVeN8v_cos;
+  }
+}
diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
index 12b0526..5ccf97b 100644
--- a/sysdeps/x86_64/fpu/multiarch/Makefile
+++ b/sysdeps/x86_64/fpu/multiarch/Makefile
@@ -51,3 +51,7 @@  CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
 CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
 endif
 endif
+
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core svml_d_cos8_core
+endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
new file mode 100644
index 0000000..dcf0925
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
@@ -0,0 +1,35 @@ 
+/* Multiple versions of vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
+ .text
+ENTRY (_ZGVbN2v_cos)
+        .type   _ZGVbN2v_cos, @gnu_indirect_function
+        cmpl    $0, KIND_OFFSET+__cpu_features(%rip)
+        jne     1f
+        call    __init_cpu_features
+1:      leaq    _ZGVbN2v_cos_sse4(%rip), %rax
+        testl   $bit_SSE4_1, __cpu_features+CPUID_OFFSET+index_SSE4_1(%rip)
+        jz      2f
+        ret
+2:      leaq    _ZGVbN2v_cos_sse2(%rip), %rax
+        ret
+END (_ZGVbN2v_cos)
+libmvec_hidden_def (_ZGVbN2v_cos)
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S
new file mode 100644
index 0000000..27314f9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S
@@ -0,0 +1,36 @@ 
+/* Multiple versions of vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
+ .text
+ENTRY (_ZGVeN8v_cos)
+        .type   _ZGVeN8v_cos, @gnu_indirect_function
+        cmpl    $0, KIND_OFFSET+__cpu_features(%rip)
+        jne     1f
+        call    __init_cpu_features
+1:      leaq    _ZGVeN8v_cos_skx(%rip), %rax
+        testl   $bit_AVX512DQ_Usable, __cpu_features+FEATURE_OFFSET+index_AVX512DQ_Usable(%rip)
+        jnz     3f
+2:      leaq    _ZGVeN8v_cos_knl(%rip), %rax
+        testl   $bit_AVX512F_Usable, __cpu_features+FEATURE_OFFSET+index_AVX512F_Usable(%rip)
+        jnz     3f
+        leaq    _ZGVeN8v_cos_avx2_wrapper(%rip), %rax
+3:      ret
+END (_ZGVeN8v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
new file mode 100644
index 0000000..658fe68
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
@@ -0,0 +1,228 @@ 
+/* Function cos vectorized with SSE2 and SSE4.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVbN2v_cos_sse2)
+WRAPPER_IMPL_SSE2 cos
+END (_ZGVbN2v_cos_sse2)
+
+ENTRY (_ZGVbN2v_cos_sse4)
+/* ALGORITHM DESCRIPTION:
+
+        ( low accuracy ( < 4ulp ) or enhanced performance
+         ( half of correct mantissa ) implementation )
+
+        Argument representation:
+        arg + Pi/2 = (N*Pi + R)
+
+        Result calculation:
+        cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+        sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $320, %rsp
+        movaps    %xmm0, %xmm3
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        movups    __dHalfPI(%rax), %xmm2
+
+/* ARGUMENT RANGE REDUCTION:
+   Add Pi/2 to argument: X' = X+Pi/2
+ */
+        addpd     %xmm3, %xmm2
+        movups    __dInvPI(%rax), %xmm5
+        movups    __dAbsMask(%rax), %xmm4
+
+/* Get absolute argument value: X' = |X'| */
+        andps     %xmm2, %xmm4
+
+/* Y = X'*InvPi + RS : right shifter add */
+        mulpd     %xmm5, %xmm2
+
+/* Check for large arguments path */
+        cmpnlepd  __dRangeVal(%rax), %xmm4
+        movups    __dRShifter(%rax), %xmm6
+        addpd     %xmm6, %xmm2
+        movmskpd  %xmm4, %ecx
+
+/* N = Y - RS : right shifter sub */
+        movaps    %xmm2, %xmm1
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        psllq     $63, %xmm2
+        subpd     %xmm6, %xmm1
+
+/* N = N - 0.5 */
+        subpd     __dOneHalf(%rax), %xmm1
+        movups    __dPI1(%rax), %xmm7
+
+/* R = X - N*Pi1 */
+        mulpd     %xmm1, %xmm7
+        movups    __dPI2(%rax), %xmm4
+
+/* R = R - N*Pi2 */
+        mulpd     %xmm1, %xmm4
+        subpd     %xmm7, %xmm0
+        movups    __dPI3(%rax), %xmm5
+
+/* R = R - N*Pi3 */
+        mulpd     %xmm1, %xmm5
+        subpd     %xmm4, %xmm0
+
+/* R = R - N*Pi4 */
+        movups     __dPI4(%rax), %xmm6
+        mulpd     %xmm6, %xmm1
+        subpd     %xmm5, %xmm0
+        subpd     %xmm1, %xmm0
+
+/* POLYNOMIAL APPROXIMATION: R2 = R*R */
+        movaps    %xmm0, %xmm4
+        mulpd     %xmm0, %xmm4
+        movups    __dC7(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC6(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC5(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC4(%rax), %xmm1
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC3(%rax), %xmm1
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC2(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC1(%rax), %xmm1
+        mulpd     %xmm1, %xmm4
+        mulpd     %xmm0, %xmm4
+        addpd     %xmm4, %xmm0
+
+/* RECONSTRUCTION:
+   Final sign setting: Res = Poly^SignRes */
+        xorps     %xmm2, %xmm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        cfi_remember_state
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_1_3:
+        cfi_restore_state
+        movups    %xmm3, 192(%rsp)
+        movups    %xmm0, 256(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        movups    %xmm8, 112(%rsp)
+        movups    %xmm9, 96(%rsp)
+        movups    %xmm10, 80(%rsp)
+        movups    %xmm11, 64(%rsp)
+        movups    %xmm12, 48(%rsp)
+        movups    %xmm13, 32(%rsp)
+        movups    %xmm14, 16(%rsp)
+        movups    %xmm15, (%rsp)
+        movq      %rsi, 136(%rsp)
+        movq      %rdi, 128(%rsp)
+        movq      %r12, 168(%rsp)
+        cfi_offset_rel_rsp (12, 168)
+        movb      %dl, %r12b
+        movq      %r13, 160(%rsp)
+        cfi_offset_rel_rsp (13, 160)
+        movl      %ecx, %r13d
+        movq      %r14, 152(%rsp)
+        cfi_offset_rel_rsp (14, 152)
+        movl      %eax, %r14d
+        movq      %r15, 144(%rsp)
+        cfi_offset_rel_rsp (15, 144)
+        cfi_remember_state
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        movups    112(%rsp), %xmm8
+        movups    96(%rsp), %xmm9
+        movups    80(%rsp), %xmm10
+        movups    64(%rsp), %xmm11
+        movups    48(%rsp), %xmm12
+        movups    32(%rsp), %xmm13
+        movups    16(%rsp), %xmm14
+        movups    (%rsp), %xmm15
+        movq      136(%rsp), %rsi
+        movq      128(%rsp), %rdi
+        movq      168(%rsp), %r12
+        cfi_restore (%r12)
+        movq      160(%rsp), %r13
+        cfi_restore (%r13)
+        movq      152(%rsp), %r14
+        cfi_restore (%r14)
+        movq      144(%rsp), %r15
+        cfi_restore (%r15)
+        movups    256(%rsp), %xmm0
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     200(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 264(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     192(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 256(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END (_ZGVbN2v_cos_sse4)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
new file mode 100644
index 0000000..bf10b01
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
@@ -0,0 +1,25 @@ 
+/* Function cos vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVcN4v_cos)
+WRAPPER_IMPL_AVX _ZGVbN2v_cos
+END (_ZGVcN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
new file mode 100644
index 0000000..ec8548f
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
@@ -0,0 +1,208 @@ 
+/* Function cos vectorized with AVX2.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+
+ .text
+ENTRY (_ZGVdN4v_cos)
+
+/* ALGORITHM DESCRIPTION:
+
+      ( low accuracy ( < 4ulp ) or enhanced performance
+       ( half of correct mantissa ) implementation )
+
+      Argument representation:
+      arg + Pi/2 = (N*Pi + R)
+
+      Result calculation:
+      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $448, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        vmovapd   %ymm0, %ymm1
+        vmovupd __dInvPI(%rax), %ymm4
+        vmovupd __dRShifter(%rax), %ymm5
+
+/*
+   ARGUMENT RANGE REDUCTION:
+   Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %ymm1, %ymm7
+
+/* Get absolute argument value: X' = |X'| */
+        vandpd __dAbsMask(%rax), %ymm7, %ymm2
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd %ymm5, %ymm4, %ymm7
+        vmovupd __dC7(%rax), %ymm4
+
+/* Check for large arguments path */
+        vcmpnle_uqpd __dRangeVal(%rax), %ymm2, %ymm3
+
+/* N = Y - RS : right shifter sub */
+        vsubpd    %ymm5, %ymm7, %ymm6
+        vmovupd __dPI1_FMA(%rax), %ymm2
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %ymm7, %ymm7
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %ymm6, %ymm0
+        vmovmskpd %ymm3, %ecx
+
+/* R = X - N*Pi1 */
+        vmovapd   %ymm1, %ymm3
+        vfnmadd231pd %ymm0, %ymm2, %ymm3
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %ymm0, %ymm3
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %ymm3, %ymm0
+
+/* POLYNOMIAL APPROXIMATION: R2 = R*R */
+        vmulpd    %ymm0, %ymm0, %ymm5
+        vfmadd213pd __dC6(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC5(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC4(%rax), %ymm5, %ymm4
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %ymm5, %ymm4
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC1(%rax), %ymm5, %ymm4
+        vmulpd    %ymm5, %ymm4, %ymm6
+        vfmadd213pd %ymm0, %ymm0, %ymm6
+
+/*
+   RECONSTRUCTION:
+   Final sign setting: Res = Poly^SignRes */
+        vxorpd    %ymm7, %ymm6, %ymm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        cfi_remember_state
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_1_3:
+        cfi_restore_state
+        vmovupd   %ymm1, 320(%rsp)
+        vmovupd   %ymm0, 384(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        vmovups   %ymm8, 224(%rsp)
+        vmovups   %ymm9, 192(%rsp)
+        vmovups   %ymm10, 160(%rsp)
+        vmovups   %ymm11, 128(%rsp)
+        vmovups   %ymm12, 96(%rsp)
+        vmovups   %ymm13, 64(%rsp)
+        vmovups   %ymm14, 32(%rsp)
+        vmovups   %ymm15, (%rsp)
+        movq      %rsi, 264(%rsp)
+        movq      %rdi, 256(%rsp)
+        movq      %r12, 296(%rsp)
+        cfi_offset_rel_rsp (12, 296)
+        movb      %dl, %r12b
+        movq      %r13, 288(%rsp)
+        cfi_offset_rel_rsp (13, 288)
+        movl      %ecx, %r13d
+        movq      %r14, 280(%rsp)
+        cfi_offset_rel_rsp (14, 280)
+        movl      %eax, %r14d
+        movq      %r15, 272(%rsp)
+        cfi_offset_rel_rsp (15, 272)
+        cfi_remember_state
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        vmovups   224(%rsp), %ymm8
+        vmovups   192(%rsp), %ymm9
+        vmovups   160(%rsp), %ymm10
+        vmovups   128(%rsp), %ymm11
+        vmovups   96(%rsp), %ymm12
+        vmovups   64(%rsp), %ymm13
+        vmovups   32(%rsp), %ymm14
+        vmovups   (%rsp), %ymm15
+        vmovupd   384(%rsp), %ymm0
+        movq      264(%rsp), %rsi
+        movq      256(%rsp), %rdi
+        movq      296(%rsp), %r12
+        cfi_restore (%r12)
+        movq      288(%rsp), %r13
+        cfi_restore (%r13)
+        movq      280(%rsp), %r14
+        cfi_restore (%r14)
+        movq      272(%rsp), %r15
+        cfi_restore (%r15)
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    328(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 392(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    320(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 384(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END (_ZGVdN4v_cos)
+libmvec_hidden_def (_ZGVdN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
new file mode 100644
index 0000000..daa77a9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
@@ -0,0 +1,467 @@ 
+/* Function cos vectorized with AVX-512, wrapper to AVX2, KNL and SKX versions.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVeN8v_cos_avx2_wrapper)
+WRAPPER_IMPL_AVX512 _ZGVdN4v_cos
+END (_ZGVeN8v_cos_avx2_wrapper)
+
+ENTRY (_ZGVeN8v_cos_knl)
+#ifndef HAVE_AVX512_ASM_SUPPORT
+WRAPPER_IMPL_AVX512 _ZGVdN4v_cos
+#else
+/*
+  ALGORITHM DESCRIPTION:
+
+       ( low accuracy ( < 4ulp ) or enhanced performance
+        ( half of correct mantissa ) implementation )
+
+        Argument representation:
+        arg + Pi/2 = (N*Pi + R)
+
+        Result calculation:
+        cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+        sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $1280, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+
+/* R = X - N*Pi1 */
+        vmovaps   %zmm0, %zmm7
+
+/* Check for large arguments path */
+        movq      $-1, %rcx
+
+/*
+  ARGUMENT RANGE REDUCTION:
+  Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %zmm0, %zmm5
+        vmovups __dInvPI(%rax), %zmm3
+
+/* Get absolute argument value: X' = |X'| */
+        vpandq __dAbsMask(%rax), %zmm5, %zmm1
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd __dRShifter(%rax), %zmm3, %zmm5
+        vmovups __dPI1_FMA(%rax), %zmm6
+
+/* N = Y - RS : right shifter sub */
+        vsubpd __dRShifter(%rax), %zmm5, %zmm4
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %zmm5, %zmm12
+        vmovups __dC7(%rax), %zmm8
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %zmm4, %zmm10
+        vcmppd    $22, __dRangeVal(%rax), %zmm1, %k1
+        vpbroadcastq %rcx, %zmm2{%k1}{z}
+        vfnmadd231pd %zmm10, %zmm6, %zmm7
+        vptestmq  %zmm2, %zmm2, %k0
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %zmm10, %zmm7
+        kmovw     %k0, %ecx
+        movzbl    %cl, %ecx
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %zmm7, %zmm10
+
+/*
+  POLYNOMIAL APPROXIMATION:
+  R2 = R*R
+ */
+        vmulpd    %zmm10, %zmm10, %zmm9
+        vfmadd213pd __dC6(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC5(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC4(%rax), %zmm9, %zmm8
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %zmm9, %zmm8
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC1(%rax), %zmm9, %zmm8
+        vmulpd    %zmm9, %zmm8, %zmm11
+        vfmadd213pd %zmm10, %zmm10, %zmm11
+
+/*
+  RECONSTRUCTION:
+  Final sign setting: Res = Poly^SignRes
+ */
+        vpxorq    %zmm12, %zmm11, %zmm1
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        cfi_remember_state
+        vmovaps   %zmm1, %zmm0
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_1_3:
+        cfi_restore_state
+        vmovups   %zmm0, 1152(%rsp)
+        vmovups   %zmm1, 1216(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        kmovw     %k4, 1048(%rsp)
+        xorl      %eax, %eax
+        kmovw     %k5, 1040(%rsp)
+        kmovw     %k6, 1032(%rsp)
+        kmovw     %k7, 1024(%rsp)
+        vmovups   %zmm16, 960(%rsp)
+        vmovups   %zmm17, 896(%rsp)
+        vmovups   %zmm18, 832(%rsp)
+        vmovups   %zmm19, 768(%rsp)
+        vmovups   %zmm20, 704(%rsp)
+        vmovups   %zmm21, 640(%rsp)
+        vmovups   %zmm22, 576(%rsp)
+        vmovups   %zmm23, 512(%rsp)
+        vmovups   %zmm24, 448(%rsp)
+        vmovups   %zmm25, 384(%rsp)
+        vmovups   %zmm26, 320(%rsp)
+        vmovups   %zmm27, 256(%rsp)
+        vmovups   %zmm28, 192(%rsp)
+        vmovups   %zmm29, 128(%rsp)
+        vmovups   %zmm30, 64(%rsp)
+        vmovups   %zmm31, (%rsp)
+        movq      %rsi, 1064(%rsp)
+        movq      %rdi, 1056(%rsp)
+        movq      %r12, 1096(%rsp)
+        cfi_offset_rel_rsp (12, 1096)
+        movb      %dl, %r12b
+        movq      %r13, 1088(%rsp)
+        cfi_offset_rel_rsp (13, 1088)
+        movl      %ecx, %r13d
+        movq      %r14, 1080(%rsp)
+        cfi_offset_rel_rsp (14, 1080)
+        movl      %eax, %r14d
+        movq      %r15, 1072(%rsp)
+        cfi_offset_rel_rsp (15, 1072)
+        cfi_remember_state
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        addb      $1, %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        kmovw     1048(%rsp), %k4
+        movq      1064(%rsp), %rsi
+        kmovw     1040(%rsp), %k5
+        movq      1056(%rsp), %rdi
+        kmovw     1032(%rsp), %k6
+        movq      1096(%rsp), %r12
+        cfi_restore (%r12)
+        movq      1088(%rsp), %r13
+        cfi_restore (%r13)
+        kmovw     1024(%rsp), %k7
+        vmovups   960(%rsp), %zmm16
+        vmovups   896(%rsp), %zmm17
+        vmovups   832(%rsp), %zmm18
+        vmovups   768(%rsp), %zmm19
+        vmovups   704(%rsp), %zmm20
+        vmovups   640(%rsp), %zmm21
+        vmovups   576(%rsp), %zmm22
+        vmovups   512(%rsp), %zmm23
+        vmovups   448(%rsp), %zmm24
+        vmovups   384(%rsp), %zmm25
+        vmovups   320(%rsp), %zmm26
+        vmovups   256(%rsp), %zmm27
+        vmovups   192(%rsp), %zmm28
+        vmovups   128(%rsp), %zmm29
+        vmovups   64(%rsp), %zmm30
+        vmovups   (%rsp), %zmm31
+        movq      1080(%rsp), %r14
+        cfi_restore (%r14)
+        movq      1072(%rsp), %r15
+        cfi_restore (%r15)
+        vmovups   1216(%rsp), %zmm1
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1160(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1224(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1152(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1216(%rsp,%r15)
+        jmp       .LBL_1_7
+#endif
+END (_ZGVeN8v_cos_knl)
+
+ENTRY (_ZGVeN8v_cos_skx)
+#ifndef HAVE_AVX512_ASM_SUPPORT
+WRAPPER_IMPL_AVX512 _ZGVdN4v_cos
+#else
+/*
+   ALGORITHM DESCRIPTION:
+
+      ( low accuracy ( < 4ulp ) or enhanced performance
+       ( half of correct mantissa ) implementation )
+
+      Argument representation:
+      arg + Pi/2 = (N*Pi + R)
+
+      Result calculation:
+      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $1280, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+
+/* R = X - N*Pi1 */
+        vmovaps   %zmm0, %zmm8
+
+/* Check for large arguments path */
+        vpbroadcastq .L_2il0floatpacket.16(%rip), %zmm2
+
+/*
+  ARGUMENT RANGE REDUCTION:
+  Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %zmm0, %zmm6
+        vmovups __dInvPI(%rax), %zmm3
+        vmovups __dRShifter(%rax), %zmm4
+        vmovups __dPI1_FMA(%rax), %zmm7
+        vmovups __dC7(%rax), %zmm9
+
+/* Get absolute argument value: X' = |X'| */
+        vandpd __dAbsMask(%rax), %zmm6, %zmm1
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd %zmm4, %zmm3, %zmm6
+        vcmppd    $18, __dRangeVal(%rax), %zmm1, %k1
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %zmm6, %zmm13
+
+/* N = Y - RS : right shifter sub */
+        vsubpd    %zmm4, %zmm6, %zmm5
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %zmm5, %zmm11
+        vfnmadd231pd %zmm11, %zmm7, %zmm8
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %zmm11, %zmm8
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %zmm8, %zmm11
+
+/*
+  POLYNOMIAL APPROXIMATION:
+  R2 = R*R
+ */
+        vmulpd    %zmm11, %zmm11, %zmm10
+        vfmadd213pd __dC6(%rax), %zmm10, %zmm9
+        vfmadd213pd __dC5(%rax), %zmm10, %zmm9
+        vfmadd213pd __dC4(%rax), %zmm10, %zmm9
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %zmm10, %zmm9
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %zmm10, %zmm9
+        vfmadd213pd __dC1(%rax), %zmm10, %zmm9
+        vmulpd    %zmm10, %zmm9, %zmm12
+        vfmadd213pd %zmm11, %zmm11, %zmm12
+        vpandnq   %zmm1, %zmm1, %zmm2{%k1}
+        vcmppd    $3, %zmm2, %zmm2, %k0
+
+/*
+  RECONSTRUCTION:
+  Final sign setting: Res = Poly^SignRes
+ */
+        vxorpd    %zmm13, %zmm12, %zmm1
+        kmovw     %k0, %ecx
+        testl     %ecx, %ecx
+        jne       .LBL_2_3
+
+.LBL_2_2:
+        cfi_remember_state
+        vmovaps   %zmm1, %zmm0
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_2_3:
+        cfi_restore_state
+        vmovups   %zmm0, 1152(%rsp)
+        vmovups   %zmm1, 1216(%rsp)
+        je        .LBL_2_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        kmovw     %k4, 1048(%rsp)
+        kmovw     %k5, 1040(%rsp)
+        kmovw     %k6, 1032(%rsp)
+        kmovw     %k7, 1024(%rsp)
+        vmovups   %zmm16, 960(%rsp)
+        vmovups   %zmm17, 896(%rsp)
+        vmovups   %zmm18, 832(%rsp)
+        vmovups   %zmm19, 768(%rsp)
+        vmovups   %zmm20, 704(%rsp)
+        vmovups   %zmm21, 640(%rsp)
+        vmovups   %zmm22, 576(%rsp)
+        vmovups   %zmm23, 512(%rsp)
+        vmovups   %zmm24, 448(%rsp)
+        vmovups   %zmm25, 384(%rsp)
+        vmovups   %zmm26, 320(%rsp)
+        vmovups   %zmm27, 256(%rsp)
+        vmovups   %zmm28, 192(%rsp)
+        vmovups   %zmm29, 128(%rsp)
+        vmovups   %zmm30, 64(%rsp)
+        vmovups   %zmm31, (%rsp)
+        movq      %rsi, 1064(%rsp)
+        movq      %rdi, 1056(%rsp)
+        movq      %r12, 1096(%rsp)
+        cfi_offset_rel_rsp (12, 1096)
+        movb      %dl, %r12b
+        movq      %r13, 1088(%rsp)
+        cfi_offset_rel_rsp (13, 1088)
+        movl      %ecx, %r13d
+        movq      %r14, 1080(%rsp)
+        cfi_offset_rel_rsp (14, 1080)
+        movl      %eax, %r14d
+        movq      %r15, 1072(%rsp)
+        cfi_offset_rel_rsp (15, 1072)
+        cfi_remember_state
+
+.LBL_2_6:
+        btl       %r14d, %r13d
+        jc        .LBL_2_12
+
+.LBL_2_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_2_10
+
+.LBL_2_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_2_6
+
+        kmovw     1048(%rsp), %k4
+        kmovw     1040(%rsp), %k5
+        kmovw     1032(%rsp), %k6
+        kmovw     1024(%rsp), %k7
+        vmovups   960(%rsp), %zmm16
+        vmovups   896(%rsp), %zmm17
+        vmovups   832(%rsp), %zmm18
+        vmovups   768(%rsp), %zmm19
+        vmovups   704(%rsp), %zmm20
+        vmovups   640(%rsp), %zmm21
+        vmovups   576(%rsp), %zmm22
+        vmovups   512(%rsp), %zmm23
+        vmovups   448(%rsp), %zmm24
+        vmovups   384(%rsp), %zmm25
+        vmovups   320(%rsp), %zmm26
+        vmovups   256(%rsp), %zmm27
+        vmovups   192(%rsp), %zmm28
+        vmovups   128(%rsp), %zmm29
+        vmovups   64(%rsp), %zmm30
+        vmovups   (%rsp), %zmm31
+        vmovups   1216(%rsp), %zmm1
+        movq      1064(%rsp), %rsi
+        movq      1056(%rsp), %rdi
+        movq      1096(%rsp), %r12
+        cfi_restore (%r12)
+        movq      1088(%rsp), %r13
+        cfi_restore (%r13)
+        movq      1080(%rsp), %r14
+        cfi_restore (%r14)
+        movq      1072(%rsp), %r15
+        cfi_restore (%r15)
+        jmp       .LBL_2_2
+
+.LBL_2_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vzeroupper
+        vmovsd    1160(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 1224(%rsp,%r15)
+        jmp       .LBL_2_8
+
+.LBL_2_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vzeroupper
+        vmovsd    1152(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 1216(%rsp,%r15)
+        jmp       .LBL_2_7
+#endif
+END (_ZGVeN8v_cos_skx)
+
+ .section .rodata, "a"
+.L_2il0floatpacket.16:
+ .long 0xffffffff,0xffffffff
+ .type .L_2il0floatpacket.16,@object
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.S
b/sysdeps/x86_64/fpu/svml_d_cos_data.S
new file mode 100644
index 0000000..c9bfd63
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.S
@@ -0,0 +1,114 @@ 
+/* Data for vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "svml_d_cos_data.h"
+
+.macro double_vector offset value
+.if .-__svml_dcos_data != \offset
+.err
+.endif
+.rept 8
+.quad \value
+.endr
+.endm
+
+ .section .rodata, "a"
+ .align 64
+
+/* Data table for vector implementations of function cos.
+   The table may contain polynomial, reduction and lookup
+   coefficients, as well as other constants obtained through
+   different methods of research and experimental work.
+ */
+ .globl __svml_dcos_data
+__svml_dcos_data:
+
+/* General purpose constants:
+   absolute value mask
+ */
+double_vector __dAbsMask 0x7fffffffffffffff
+
+/* working range threshold */
+double_vector __dRangeVal 0x4160000000000000
+
+/* PI/2 */
+double_vector __dHalfPI 0x3ff921fb54442d18
+
+/* 1/PI */
+double_vector __dInvPI 0x3fd45f306dc9c883
+
+/* right-shifter constant */
+double_vector __dRShifter 0x4338000000000000
+
+/* 0.5 */
+double_vector __dOneHalf 0x3fe0000000000000
+
+/* Range reduction PI-based constants:
+   PI high part
+ */
+double_vector __dPI1 0x400921fb40000000
+
+/* PI mid  part 1 */
+double_vector __dPI2 0x3e84442d00000000
+
+/* PI mid  part 2 */
+double_vector __dPI3 0x3d08469880000000
+
+/* PI low  part */
+double_vector __dPI4 0x3b88cc51701b839a
+
+/* Range reduction PI-based constants if FMA available:
+   PI high part (FMA available)
+ */
+double_vector __dPI1_FMA 0x400921fb54442d18
+
+/* PI mid part  (FMA available) */
+double_vector __dPI2_FMA 0x3ca1a62633145c06
+
+/* PI low part  (FMA available) */
+double_vector __dPI3_FMA 0x395c1cd129024e09
+
+/* Polynomial coefficients (relative error 2^(-52.115)): */
+double_vector __dC1 0xbfc55555555554a7
+double_vector __dC2 0x3f8111111110a4a8
+double_vector __dC3 0xbf2a01a019a5b86d
+double_vector __dC4 0x3ec71de38030fea0
+double_vector __dC5 0xbe5ae63546002231
+double_vector __dC6 0x3de60e6857a2f220
+double_vector __dC7 0xbd69f0d60811aac8
+
+/*
+   Additional constants:
+   absolute value mask
+ */
+double_vector __dAbsMask_la 0x7fffffffffffffff
+
+/* 1/PI */
+double_vector __dInvPI_la 0x3fd45f306dc9c883
+
+/* right-shifter for low accuracy version */
+double_vector __dRShifter_la 0x4330000000000000
+
+/* right-shifter-1.0 for low accuracy version */
+double_vector __dRShifterm5_la 0x432fffffffffffff
+
+/* right-shifter with low mask for low accuracy version */
+double_vector __dRXmax_la 0x43300000007ffffe
+
+ .type __svml_dcos_data,@object
+ .size __svml_dcos_data,.-__svml_dcos_data
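
[Editorial note: the `double_vector` macro above broadcasts each 64-bit constant into a full 64-byte slot (eight copies of the same double) and aborts assembly if the running offset disagrees with the value from svml_d_cos_data.h, so the table and the header cannot drift apart. A small C sketch of the layout rule being enforced, with illustrative names:]

```c
/* Each named constant in __svml_dcos_data occupies one 64-byte slot
   holding eight copies of the same double, so a single full-width
   load serves any vector length up to 512 bits.  The header offsets
   are therefore just index * 64.  */
#define DOUBLE_VECTOR_SLOT 64          /* 8 lanes * sizeof (double) */

static unsigned int
slot_offset (unsigned int index)       /* position of the constant in the table */
{
  return index * DOUBLE_VECTOR_SLOT;   /* e.g. __dHalfPI is entry 2 -> 128 */
}
```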
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.h
b/sysdeps/x86_64/fpu/svml_d_cos_data.h
new file mode 100644
index 0000000..4d28e6e
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.h
@@ -0,0 +1,48 @@ 
+/* Offsets for data table for vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef D_COS_DATA_H
+#define D_COS_DATA_H
+
+#define __dAbsMask              0
+#define __dRangeVal             64
+#define __dHalfPI               128
+#define __dInvPI                192
+#define __dRShifter             256
+#define __dOneHalf              320
+#define __dPI1                  384
+#define __dPI2                  448
+#define __dPI3                  512
+#define __dPI4                  576
+#define __dPI1_FMA              640
+#define __dPI2_FMA              704
+#define __dPI3_FMA              768
+#define __dC1                   832
+#define __dC2                   896
+#define __dC3                   960
+#define __dC4                   1024
+#define __dC5                   1088
+#define __dC6                   1152
+#define __dC7                   1216
+#define __dAbsMask_la           1280
+#define __dInvPI_la             1344
+#define __dRShifter_la          1408
+#define __dRShifterm5_la        1472
+#define __dRXmax_la             1536
+
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_wrapper_impl.h
b/sysdeps/x86_64/fpu/svml_d_wrapper_impl.h
new file mode 100644
index 0000000..4da5e61
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_wrapper_impl.h
@@ -0,0 +1,101 @@ 
+/* Wrapper implementations of several versions of vector math functions.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* SSE2 ISA version as wrapper to scalar.  */
+.macro WRAPPER_IMPL_SSE2 callee
+        subq      $40, %rsp
+        cfi_adjust_cfa_offset(40)
+        movaps    %xmm0, (%rsp)
+        call      \callee@PLT
+        movsd     %xmm0, 16(%rsp)
+        movsd     8(%rsp), %xmm0
+        call      \callee@PLT
+        movsd     16(%rsp), %xmm1
+        movsd     %xmm0, 24(%rsp)
+        unpcklpd  %xmm0, %xmm1
+        movaps    %xmm1, %xmm0
+        addq      $40, %rsp
+        cfi_adjust_cfa_offset(-40)
+        ret
+.endm
+
+/* AVX ISA version as wrapper to SSE4 ISA version.  */
+.macro WRAPPER_IMPL_AVX callee
+        pushq %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq $-32, %rsp
+        subq $32, %rsp
+        vextractf128 $1, %ymm0, (%rsp)
+        vzeroupper
+        call HIDDEN_JUMPTARGET(\callee)
+        vmovapd %xmm0, 16(%rsp)
+        vmovaps (%rsp), %xmm0
+        call HIDDEN_JUMPTARGET(\callee)
+        vmovapd %xmm0, %xmm1
+        vmovapd 16(%rsp), %xmm0
+        vinsertf128 $1, %xmm1, %ymm0, %ymm0
+        movq %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+.endm
+
+/* AVX512 ISA version as wrapper to AVX2 ISA version.  */
+.macro WRAPPER_IMPL_AVX512 callee
+        pushq %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq $-64, %rsp
+        subq $128, %rsp
+/* Below is encoding for vmovaps %zmm0, (%rsp).  */
+        .byte 0x62
+        .byte 0xf1
+        .byte 0x7c
+        .byte 0x48
+        .byte 0x29
+        .byte 0x04
+        .byte 0x24
+/* Below is encoding for vmovapd (%rsp), %ymm0.  */
+        .byte 0xc5
+        .byte 0xfd
+        .byte 0x28
+        .byte 0x04
+        .byte 0x24
+        call HIDDEN_JUMPTARGET(\callee)
+        vmovapd %ymm0, 64(%rsp)
+/* Below is encoding for vmovapd 32(%rsp), %ymm0.  */
+        .byte 0xc5
+        .byte 0xfd
+        .byte 0x28
+        .byte 0x44
+        .byte 0x24
+        .byte 0x20
+        call HIDDEN_JUMPTARGET(\callee)
+        vmovapd %ymm0, 96(%rsp)
+/* Below is encoding for vmovaps 64(%rsp), %zmm0.  */
+        .byte 0x62
+        .byte 0xf1
+        .byte 0x7c
+        .byte 0x48
+        .byte 0x28
+        .byte 0x44
+        .byte 0x24
+        .byte 0x01
+        movq %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+.endm
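
[Editorial note: all three wrapper macros above follow the same scheme: an N-lane vector entry point is built from two calls to an (N/2)-lane implementation on the low and high halves of the input, with the halves staged on a suitably aligned stack frame. A plain-C stand-in for that scheme; the vec2/vec4 types and names are illustrative only:]

```c
/* Sketch of the half-width wrapper scheme: split the wide input,
   call the narrow callee twice, and recombine the two results in
   lane order.  The assembly versions do the same with stack slots
   and xmm/ymm/zmm registers.  */
typedef struct { double d[2]; } vec2;
typedef struct { double d[4]; } vec4;

static vec4
wrap4_from2 (vec2 (*callee2) (vec2), vec4 x)
{
  vec2 lo = { { x.d[0], x.d[1] } };   /* low half of the input  */
  vec2 hi = { { x.d[2], x.d[3] } };   /* high half of the input */
  vec2 rlo = callee2 (lo);
  vec2 rhi = callee2 (hi);
  vec4 r = { { rlo.d[0], rlo.d[1], rhi.d[0], rhi.d[1] } };
  return r;
}

/* Tiny demo callee: doubles each lane (stands in for a 2-lane cos).  */
static vec2
demo2 (vec2 v)
{
  vec2 r = { { 2.0 * v.d[0], 2.0 * v.d[1] } };
  return r;
}
```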
diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h
index e652171..e79a397 100644
--- a/sysdeps/x86_64/sysdep.h
+++ b/sysdeps/x86_64/sysdep.h
@@ -25,6 +25,13 @@ 

 /* Syntactic details of assembler.  */

+/* This macro sets proper CFI, using DW_CFA_expression to describe the
+   register as saved at an offset relative to %rsp rather than relative
+   to the CFA.  The expression is: DW_OP_drop, DW_OP_breg7 (%rsp is
+   register 7), sleb128 offset from %rsp.  */
+#define cfi_offset_rel_rsp(regn, off) .cfi_escape 0x10, regn, 0x4, 0x13, \
+ 0x77, off & 0x7F | 0x80, off >> 7
+
 /* ELF uses byte-counts for .align, most others use log2 of count of bytes.  */
 #define ALIGNARG(log2) 1<<log2
 #define ASM_SIZE_DIRECTIVE(name) .size name,.-name;
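
[Editorial note: the two operands `off & 0x7F | 0x80, off >> 7` in cfi_offset_rel_rsp above are a fixed two-byte LEB128 split of the offset: the first byte carries the low 7 bits with the continuation bit set, the second the remaining bits. A C sketch of that arithmetic, assuming offsets in the range this macro is used for:]

```c
#include <stdint.h>

/* Two-byte LEB128 split as emitted by cfi_offset_rel_rsp: valid as a
   positive sleb128 for 0 <= off < 8192, since bit 6 of the final byte
   doubles as the sign bit and bit 7 must be clear to terminate.  */
static void
leb128_two_bytes (unsigned int off, uint8_t bytes[2])
{
  bytes[0] = (off & 0x7f) | 0x80;   /* low 7 bits, continuation bit set */
  bytes[1] = off >> 7;              /* high bits, terminates the encoding */
}
```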