From patchwork Tue Mar  6 21:33:08 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Steve Ellcey <sellcey@cavium.com>
X-Patchwork-Id: 26221
Received: (qmail 27255 invoked by alias); 6 Mar 2018 21:33:26 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 26700 invoked by uid 89); 6 Mar 2018 21:33:25 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-24.0 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	KAM_ASCII_DIVIDERS, KAM_NUMSUBJECT, KAM_SHORT,
	RCVD_IN_DNSWL_NONE, SPF_HELO_PASS,
	SPF_PASS autolearn=ham version=3.3.2 spammy=ux, 19545, u1x, 6*
X-HELO: NAM02-BL2-obe.outbound.protection.outlook.com
Authentication-Results: spf=none (sender IP is )
	smtp.mailfrom=Steve.Ellcey@cavium.com;
Message-ID: <1520371988.6774.82.camel@cavium.com>
Subject: RFC: Proposal for implementing libmvec on aarch64
From: Steve Ellcey <sellcey@cavium.com>
Reply-To: sellcey@cavium.com
To: libc-alpha <libc-alpha@sourceware.org>
Date: Tue, 06 Mar 2018 13:33:08 -0800
Mime-Version: 1.0
X-ClientProxiedBy: CO2PR18CA0057.namprd18.prod.outlook.com
	(2603:10b6:104:2::25) To DM6PR07MB4571.namprd07.prod.outlook.com
	(2603:10b6:5:c1::21)
X-MS-PublicTrafficType: Email
X-MS-Office365-Filtering-Correlation-Id: baef91e4-91b8-4d03-7513-08d583a9e096
X-Microsoft-Antispam: UriScan:; BCL:0; PCL:0;
	RULEID:(7020095)(4652020)(5600026)(4604075)(4534165)(4627221)(201703031133081)(201702281549075)(2017052603328)(7153060)(49563074)(7193020);
	SRVR:DM6PR07MB4571;
X-Microsoft-Exchange-Diagnostics: 1; DM6PR07MB4571;
	3:Mu0xB4cdpdUX7zUduHuhjlahyauBMM7eg6OvPhwVUpUET/gqfpvzV4C4jfmOMMlcXifOPH8Ue6J8Gc4OvTE6hYj+eGObP8xUsMZllaNXH2CCJYh2qfrY+qDuy+1FLM5uMnMyTfD1zcjeUD/fq5VLip2qTnXM++2DxQuttJpRDfYJ9x/WgwUHuQlbHHpQlXFHGqY1v4X90XJQ6kIyQbXcq1CYshndBMirP0cu5MYLrKqmd7AhV6P/BAP+r5r7VYNd;
	25:wSF2dKaM29dj4/Mzy3amCn1fwQAy/pR1UN447X7KfiGOE0M97jGS1QQsikWHp3JwiJ848drDUHAZS6D9lN71y2uZHoZote2cxfxv80n8FC9UlH1KqMqm2NGQi2egHRwYbbukF6XpzqZyU40mPiHWjztiMZEC1k8H3NAkzZrYCophBgs8kDJKsuXGee6HoFJPVlhVo7TxVTSkBrcVp4URyYKH08cReQ8f0HkZk4cOJwo6tuJ7Pnib3znm9qG3YqHdgdDhmTXnskQNvVyflc/wUdJitz/9JRY+7N6Kt5aC75wSWqzTao73oWizYsPomO9WT76yfAL5cymlflEe/Enuiw==;
	31:KAWEg34KXjfggbv9yOUhCo6AdCz++31Q/JXEcDJPcO0EHnXNiMA82cIexApjqRuw7y2BimOVCjAZpOTNTPq6Ib6QMCi4QKsvP5AzR2NdOi7Zve1zX6xrjz0AozvVMxBVBPMFUnSeLRgQ3hV+VvxOVa274r44tVoQ68idIwo4gq/qxzVCzBvJuMk7DBY85F5OxMzIy9TI84MYfbX635Oh8AAh8sLtBZ5px4O352/LUlo=
X-MS-TrafficTypeDiagnostic: DM6PR07MB4571:
X-Microsoft-Exchange-Diagnostics: 1; DM6PR07MB4571;
	20:2R5njgdXshznXpY0CWjG6VIenUckX0DXAYtz6/gVinWD81hwfaFaPkm2x9qnBpfjMNSUV/v0ve2iV7MMgfmhLr5gPYiJ1ND8xuPFFmR06KX1vpexes/k3wMwgmTYt/AK+7qEYWJrhMkdtChkxNzaNcGx0VI4fwsji37eFlqU8SeGCq7zsWziJSlViy3OHxoDeJnF1aZLuRakmmHH0TtcmLGq1O3U4VykysqL//UqYjl6CI+fBHKFKc9J44e5pgpprdARDxv/sC8JB6RkQX35xFjpVFAtBR5MISVyTzb8I8Zp30Ar6t9IfIyAj4y1ozmv7KrDtEhNwK4ZmLM7u0Dxdj62e0GuvZjo3MoE7KoGuD0bECW3Q/UmNi7oRDbA5J+AZt6fu1CWvTUyKutTcH8FBbiYij8qHCrLrkbow0xc+5aDMklScgQTGzwDm6ktR9NpO/9VXV58IuiP0ZeJBFaB9ugBDlK/ZjSu68Ey6Qpn+n++jdH0uB9WumFK1A5CzqMG;
	4:lUF5nvElKh90NanjUBoW21OYNGQ+2ylJkqRa4oL1EijcpabFP9vXUimcr1Nuhc+ixaKcUFtkJ0eru2fhNWr7/LWZQvm0tb/6Vm3VvTSgWcB8gD/bNkMiHXaI2/p+ZaZkHdNWzkHTUvZEpIgnsMOt+to9tUtvrSibREyPgrxGln+RYLpFDL10BvkYy+yYrZX/7y9fpT9vRQ6KZrwcdHcdt1WFeTFWVBwNorCj/D/H8q7ANa61fbq8RgX20W7oTccHqlIoZjnuPaVe4rq9P5OGSw==
X-Microsoft-Antispam-PRVS: 
 <DM6PR07MB4571A8866F5B544D09199856F5D90@DM6PR07MB4571.namprd07.prod.outlook.com>
X-Exchange-Antispam-Report-Test: UriScan:;
X-Exchange-Antispam-Report-CFA-Test: BCL:0; PCL:0;
	RULEID:(8211001083)(102415395)(6040501)(2401047)(5005006)(8121501046)(93006095)(93001095)(10201501046)(3231220)(944501244)(52105095)(3002001)(6041288)(20161123564045)(201703131423095)(201702281528075)(20161123555045)(201703061421075)(201703061406153)(20161123558120)(20161123560045)(20161123562045)(6072148)(201708071742011);
	SRVR:DM6PR07MB4571; BCL:0; PCL:0; RULEID:; SRVR:DM6PR07MB4571;
X-Forefront-PRVS: 06036BD506
X-Forefront-Antispam-Report: SFV:NSPM;
	SFS:(10009020)(346002)(396003)(39860400002)(39380400002)(366004)(376002)(199004)(189003)(6916009)(186003)(25786009)(50226002)(7736002)(68736007)(105586002)(3450700001)(3846002)(6116002)(6486002)(6666003)(305945005)(81166006)(81156014)(26005)(16526019)(4610100001)(103116003)(69596002)(66066001)(84326002)(386003)(316002)(6506007)(33964004)(43066004)(568964002)(2906002)(53936002)(16586007)(478600001)(4810100001)(53416004)(5660300001)(72206003)(8936002)(6512007)(8676002)(36756003)(97736004)(106356001)(52116002)(5890100001)(2476003)(99106002);
	DIR:OUT; SFP:1101; SCL:1; SRVR:DM6PR07MB4571;
	H:sellcey-dt.caveonetworks.com; FPR:; SPF:None;
	PTR:InfoNoRecords; A:1; MX:1; LANG:en;
Received-SPF: None (protection.outlook.com: cavium.com does not designate
	permitted sender hosts)
X-Microsoft-Exchange-Diagnostics: =?us-ascii?Q?1; DM6PR07MB4571;
	23:oM0bzXrLrU33DzzX2H5xiGj3UlXCtgc6ZNqiT9xPx?=
	=?us-ascii?Q?4ZVhgSErgxCHwZWsWLk08ndT3/yaAcxE1rFC/Uve1Z2OIRaLKrmLvUxiGTjb?=
	=?us-ascii?Q?5MDE3DlJVVJ3dBdA6qoSpbS/mAvdbQ43wK5v4E7Q87q0nM2VCHrjrU7WSRwj?=
	=?us-ascii?Q?cXepItIOk0Q8crvEUJNJXzQlCmHCC78WUgua/3VawhNHHiOOv8bJoF9AKUDS?=
	=?us-ascii?Q?2iubeeozkqPlXs6/qnblOsRAcCKz80uPbEF2Cg2mW+wWKjwhITRz4roBb677?=
	=?us-ascii?Q?kpY6XG9+yxZprRBafZFnpha203BYE8fkN/hRky6ze3GYED5JypGCDNoGZfHz?=
	=?us-ascii?Q?AeKHYOq0hrGfjm2V1Jdig2KYTX96H0op7zzyjxS61b7bzwQ4DMEJqWBJ9I32?=
	=?us-ascii?Q?4F0vZdJMNkDKzNhDCYAnhx8C7ul4FSmFee63i58cIPwp6sJltilg/oQ5uwqh?=
	=?us-ascii?Q?KMZjcqlvWyWtalR3+T7YYLvfIgUQTVgY3ciUJFno3KYzUtHcFrdc67NWbHOJ?=
	=?us-ascii?Q?mJNxn8St0SN7vu2eHHq/YMv0W9M+V9A6+IVZyiixzeYtMmimzFZaX1/RFqxl?=
	=?us-ascii?Q?GPLCcYbcXKRrljJ6v7O7CQvtPzlbxsLqwePy26QnNI7C9axjKqUjhJ3kWQf0?=
	=?us-ascii?Q?tyj7p/ec14p2ZgUKOWApKGVZixyJJ2mj3dTFeWqsAq3zIn/TU2ZZMMJi7Z2b?=
	=?us-ascii?Q?sHmBIdmPYIFEu643xI2meBr5Fe7VYaLQO8mM1YC7J+8yPZ6hFr6xM9rdNvRD?=
	=?us-ascii?Q?XrRev9uRVt/6FSRpTaq0iZpdj8okphmwTNptGvI/Tmn1BF0Y26CJZ03mX9C8?=
	=?us-ascii?Q?tZH5De+T3tyUExqX0odyNoARm08G55YrtR0r3k+XYM7pJNtT0MTUZm0nqwgv?=
	=?us-ascii?Q?Kfdtub7j2dSpBIV7nNqIHOxq3MgGHjprs4YeYhXBDyVRHxBI+kff6o86Q6LT?=
	=?us-ascii?Q?JM2Fus37qFVuUsVoDMsgfJXcTCDI/BYF0xNRaP0qwdLXoSqnvJrwExSCrnJC?=
	=?us-ascii?Q?GdJUwgsx7NN6/8nyK7UDHi7tts/UO7rgsY0e1ECjzLIRuUhHcBzB+NZKEjc5?=
	=?us-ascii?Q?pORthttVD2fcFBxj2HK/8sBaJxz1B2yY75ZBCiE8gAgKXVmuWAeO3uG1ZaAu?=
	=?us-ascii?Q?dQGrmCAqySpldtdlrzz6s5uXPdOrvFj4xPoZx2PKBVPBsvqtWWqR+yInfgzg?=
	=?us-ascii?Q?JXXwlEhfSXDeGQzzGo3D9dbhNJi5LZzPF4SBf7Bn3IYRsiPOT1Q4gAdEo8Nw?=
	=?us-ascii?Q?itGpPGsXkL55BM5XNk=3D?=
X-Microsoft-Antispam-Message-Info: 
 dalNY5vUNxHdRJn6MeW8tvohcZdDMCUmDBoLyKyZBlHOhB5Hqe47YV0lH8ao+C/ZX6g7gbdksSxLgzfUmTVyW2fk0vShLBqyvR0ONZpXKgObpt/1vKdK5WY3nYQmY9UpcJ9jdofnlmoGS0rqu2JkslIQ+MuJtSVPLk6Pgo1HMDeDslJtNE8BooON830KMLeV
X-Microsoft-Exchange-Diagnostics: 1; DM6PR07MB4571;
	6:5Vj9rjt5WffV5MeuB2I1Mh/YM9/TJ8yzlZdYZocqD4dWVHd7nLUYnZcofl4snRB38xBA7WolBdbv6Q8FPE1pMU8SoSPyjQPIBkx72pKQ+P7cpZr1C31AzCsGN5sDf6fTQ9ry21hirWfSHAY6FbjwvUjlWBObRl+lYp9ZUInILMrnkzcU9g6B+dcDAVbJppXf50xBJ0Jlu8+wWUCZYBcxdVfDeYTgthh6fBqziLAFf9jSKC0ukiyTHfkIR2+k04GsmyaePvKCDYD3vAXyGCSDCYHX9iBYYzhNbuoLTYQeIuEOnZ9M+Xu1o54sOd2TDBvnPNfLU6S22q5GhrjJqGWIhIRKLO7M43U1wVZmiBxbTL8=;
	5:99vzE96O3yg8fSGN0NPWmTxeVn7s5vljfHTJvs2PHgeC36QFN23xGUY2IpDqUk00efZiG2Juz8CXmLFbZtyjc60o/az1JYbFbM/bX8UunRXY2Y9Mx1b+cGEKPDmjlSHWY6QWzFB6yDYEsef/7EMuxrroXofVKJPr4gI98Tu7Ce0=;
	24:p9H0yHh5wWjGXWrxrZ6/H3tF1MMIPBAt47Tx3sP2fCqBevrHsy9/z6IZiZWRJWuINQA6bwKMDqiEP6iscifh9ZierI5d2rzm+r/lzLgPJTk=;
	7:TnLDezzRaJqdZSTn+UV2TS2/THIlRZO4T5olnNsQsS34td0UFhfROtD7GGAWy8Jj2ZeXrn+qbFNKGv7hu8LevbKavEnPn4814/TCRl7LAIr+vw37oTc2scu0hO+NPRXyiGbg9dCC1kP8aaVBO0DcKlLou46kcAxRJofIOIdlnUe7NpZc9FJX494E0k2887/CkQdpJCykia6B0Pp2i9TDs+8WPsN/JC4B1OBoXeodjluN0Atpq3LNqiHihki8L3T9
SpamDiagnosticOutput: 1:99
SpamDiagnosticMetadata: NSPM
X-OriginatorOrg: cavium.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 06 Mar 2018 21:33:18.1274
	(UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 baef91e4-91b8-4d03-7513-08d583a9e096
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 711e4ccf-2e9b-4bcf-a551-4094005b6194
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM6PR07MB4571

This is an RFC for an approach to implementing libmvec vector math routines
for aarch64.  There are three patches attached to this email, one is a GCC
patch that allows aarch64 to generate calls to vector functions that are
marked with __attribute__ ((simd)), the second is the infrastructure part
of glibc changes that mark sin and cos with this attribute in the header files,
and the third is an implementation of vector sin and cos by adding ifdefs to
the existing scalar sin/cos functions in sysdeps/ieee754/dbl-64/s_sin.c.

I have several questions that I would like to get some feedback on:

Does the approach of modifying the existing scalar math routines to also
work on vector types, as opposed to having completely seperate vector
routines sound like a good approach?  I like it because it reduced the
amount of duplicated code between scalar and vector functions and should
keep the accuracy of scalar and vector routines in sync.  And since it is
written in C it can also be shared among multiple platforms.  The main
downside right now is the slow paths are still mostly handled serially
and so I am only seeing about a 5% improvement in a test that does sin
calls on an array of doubles.  I think more changes could be done to improve
the amount of parallelism in the vector versions but this is a start.
Perhaps we could remove some of the slow paths in the vector versions like
we do with the sincos function.

The current implementation does not build because when linking libmvec
it gets undefined externals for __branred, __docos, __dubsin, __mpcos,
__mpsin, and __sincostab.  How should this be handled?  Should they be
exported from libm so that libmvec can call them or should they be 
copied into libmvec so that they don't have to be externally visible?
Are there any issues with making libmvec depend on libm?

The vector ABI for aarch64 is not yet defined, I believe ARM is working
on that.  As you can see from the GCC patch I am just using the same
approach as x86 and just changed the vecsize_mangle variable to 'n' for
128 bit vectors on Aarch64.

Comments?

Steve Ellcey
sellcey@cavium.com

diff --git a/sysdeps/aarch64/fpu/libmvec_d_sincos2.c b/sysdeps/aarch64/fpu/libmvec_d_sincos2.c
index e69de29..f92291f 100644
--- a/sysdeps/aarch64/fpu/libmvec_d_sincos2.c
+++ b/sysdeps/aarch64/fpu/libmvec_d_sincos2.c
@@ -0,0 +1,21 @@
+
+/* We need the includes before the defines so that the define of __sin
+   and __cos do not mess up the bits/mathcalls.h declaration of __sin
+   and __cos.  */
+
+#include <float.h>
+#include "endian.h"
+#include "mydefs.h"
+#include "usncs.h"
+#include "MathLib.h"
+#include <math.h>
+#include <math_private.h>
+#include <libm-alias-double.h>
+#include <fenv.h>
+
+#define VEC_SIZE 2
+#define OP_TYPE __Float64x2_t
+#define __sin _ZGVnN2v_sin
+#define __cos _ZGVnN2v_cos
+
+#include "sysdeps/ieee754/dbl-64/s_sin.c"
diff --git a/sysdeps/ieee754/dbl-64/s_sin.c b/sysdeps/ieee754/dbl-64/s_sin.c
index 8c589cb..e1db599 100644
--- a/sysdeps/ieee754/dbl-64/s_sin.c
+++ b/sysdeps/ieee754/dbl-64/s_sin.c
@@ -55,6 +55,18 @@
 #include <libm-alias-double.h>
 #include <fenv.h>
 
+#ifndef VEC_SIZE
+# define VEC_SIZE 1
+#endif
+
+#if VEC_SIZE != 1 && VEC_SIZE != 2
+# error "This file must be updated to support other vector lengths."
+#endif
+
+#ifndef OP_TYPE
+# define OP_TYPE double
+#endif
+
 /* Helper macros to compute sin of the input values.  */
 #define POLYNOMIAL2(xx) ((((s5 * (xx) + s4) * (xx) + s3) * (xx) + s2) * (xx))
 
@@ -67,13 +79,25 @@
 
    The constants s1, s2, s3, etc. are pre-computed values of 1/3!, 1/5! and so
    on.  The result is returned to LHS and correction in COR.  */
-#define TAYLOR_SIN(xx, a, da, cor) \
-({									      \
-  double t = ((POLYNOMIAL (xx)  * (a) - 0.5 * (da))  * (xx) + (da));	      \
-  double res = (a) + t;							      \
-  (cor) = ((a) - res) + t;						      \
-  res;									      \
-})
+static inline double
+__always_inline
+taylor_sin_scalar (double xx, double a, double da, double *cor)
+{
+  double t = (POLYNOMIAL (xx) * a - 0.5 * da)  * xx + da;
+  double res = a + t;
+  *cor = (a - res) + t;
+  return res;
+}
+
+static inline OP_TYPE
+__always_inline
+taylor_sin_vector (OP_TYPE xx, OP_TYPE a, OP_TYPE da, OP_TYPE *cor)
+{
+  OP_TYPE t = (POLYNOMIAL (xx) * a - 0.5 * da)  * xx + da;
+  OP_TYPE res = a + t;
+  *cor = (a - res) + t;
+  return res;
+}
 
 /* This is again a variation of the Taylor series expansion with the term
    x^3/3! expanded into the following for better accuracy:
@@ -126,10 +150,11 @@ static const double
 
 static const double t22 = 0x1.8p22;
 
-void __dubsin (double x, double dx, double w[]);
-void __docos (double x, double dx, double w[]);
-double __mpsin (double x, double dx, bool reduce_range);
-double __mpcos (double x, double dx, bool reduce_range);
+extern void __dubsin (double x, double dx, double w[]);
+extern void __docos (double x, double dx, double w[]);
+extern double __mpsin (double x, double dx, bool reduce_range);
+extern double __mpcos (double x, double dx, bool reduce_range);
+
 static double slow (double x);
 static double slow1 (double x);
 static double slow2 (double x);
@@ -148,7 +173,7 @@ static double cslow2 (double x);
    get the result in RES and a correction value in COR.  */
 static inline double
 __always_inline
-do_cos (double x, double dx, double *corp)
+do_cos_scalar (double x, double dx, double *corp)
 {
   mynumber u;
 
@@ -170,6 +195,45 @@ do_cos (double x, double dx, double *corp)
   return res;
 }
 
+static inline OP_TYPE
+__always_inline
+do_cos_vector (OP_TYPE x, OP_TYPE dx, OP_TYPE *corp)
+{
+#if VEC_SIZE == 1
+  mynumber u;
+  if (x < 0)
+    dx = -dx;
+  u.x = big + fabs (x);
+  x = fabs (x) - (u.x - big) + dx;
+#else
+  mynumber u0, u1;
+  if (x[0] < 0)
+    dx[0] = -dx[0];
+  if (x[1] < 0)
+    dx[1] = -dx[1];
+  u0.x = big + fabs (x[0]);
+  u1.x = big + fabs (x[1]);
+  x[0] = fabs (x[0]) - (u0.x - big) + dx[0];
+  x[1] = fabs (x[1]) - (u1.x - big) + dx[1];
+#endif
+
+  OP_TYPE xx, s, sn, ssn, c, cs, ccs, res, cor;
+  xx = x * x;
+  s = x + x * xx * (sn3 + xx * sn5);
+  c = xx * (cs2 + xx * (cs4 + xx * cs6));
+#if VEC_SIZE == 1
+  SINCOS_TABLE_LOOKUP (u, sn, ssn, cs, ccs);
+#else
+  SINCOS_TABLE_LOOKUP (u0, sn[0], ssn[0], cs[0], ccs[0]);
+  SINCOS_TABLE_LOOKUP (u1, sn[1], ssn[1], cs[1], ccs[1]);
+#endif
+  cor = (ccs - s * ssn - cs * c) - sn * s;
+  res = cs + cor;
+  cor = (cs - res) + cor;
+  *corp = cor;
+  return res;
+}
+
 /* A more precise variant of DO_COS.  EPS is the adjustment to the correction
    COR.  */
 static inline double
@@ -210,7 +274,7 @@ do_cos_slow (double x, double dx, double eps, double *corp)
    the result in RES and a correction value in COR.  */
 static inline double
 __always_inline
-do_sin (double x, double dx, double *corp)
+do_sin_scalar (double x, double dx, double *corp)
 {
   mynumber u;
 
@@ -231,6 +295,45 @@ do_sin (double x, double dx, double *corp)
   return res;
 }
 
+static inline OP_TYPE
+__always_inline
+do_sin_vector (OP_TYPE x, OP_TYPE dx, OP_TYPE *corp)
+{
+#if VEC_SIZE == 1
+  mynumber u;
+  if (x <= 0)
+    dx = -dx;
+  u.x = big + fabs (x);
+  x = fabs (x) - (u.x - big);
+#else
+  mynumber u0, u1;
+  if (x[0] <= 0)
+    dx[0] = -dx[0];
+  if (x[1] <= 0)
+    dx[1] = -dx[1];
+  u0.x = big + fabs (x[0]);
+  u1.x = big + fabs (x[1]);
+  x[0] = fabs (x[0]) - (u0.x - big);
+  x[1] = fabs (x[1]) - (u1.x - big);
+#endif
+
+  OP_TYPE xx, s, sn, ssn, c, cs, ccs, cor, res;
+  xx = x * x;
+  s = x + (dx + x * xx * (sn3 + xx * sn5));
+  c = x * dx + xx * (cs2 + xx * (cs4 + xx * cs6));
+#if VEC_SIZE == 1
+  SINCOS_TABLE_LOOKUP (u, sn, ssn, cs, ccs);
+#else
+  SINCOS_TABLE_LOOKUP (u0, sn[0], ssn[0], cs[0], ccs[0]);
+  SINCOS_TABLE_LOOKUP (u1, sn[1], ssn[1], cs[1], ccs[1]);
+#endif
+  cor = (ssn + s * ccs - sn * c) + cs * s;
+  res = sn + cor;
+  cor = (sn - res) + cor;
+  *corp = cor;
+  return res;
+}
+
 /* A more precise variant of DO_SIN.  EPS is the adjustment to the correction
    COR.  */
 static inline double
@@ -337,13 +440,13 @@ do_sincos_1 (double a, double da, double x, int4 n, bool shift_quadrant)
       if (xx < 0.01588)
 	{
 	  /* Taylor series.  */
-	  res = TAYLOR_SIN (xx, a, da, cor);
+	  res = taylor_sin_scalar (xx, a, da, &cor);
 	  cor = 1.02 * cor + __copysign (eps, cor);
 	  retval = (res == res + cor) ? res : sloww (a, da, x, shift_quadrant);
 	}
       else
 	{
-	  res = do_sin (a, da, &cor);
+	  res = do_sin_scalar (a, da, &cor);
 	  cor = 1.035 * cor + __copysign (eps, cor);
 	  retval = ((res == res + cor) ? __copysign (res, a)
 		    : sloww1 (a, da, x, shift_quadrant));
@@ -352,7 +455,7 @@ do_sincos_1 (double a, double da, double x, int4 n, bool shift_quadrant)
 
     case 1:
     case 3:
-      res = do_cos (a, da, &cor);
+      res = do_cos_scalar (a, da, &cor);
       cor = 1.025 * cor + __copysign (eps, cor);
       retval = ((res == res + cor) ? ((n & 2) ? -res : res)
 		: sloww2 (a, da, x, n));
@@ -411,13 +514,13 @@ do_sincos_2 (double a, double da, double x, int4 n, bool shift_quadrant)
       if (xx < 0.01588)
 	{
 	  /* Taylor series.  */
-	  res = TAYLOR_SIN (xx, a, da, cor);
+	  res = taylor_sin_scalar (xx, a, da, &cor);
 	  cor = 1.02 * cor + __copysign (eps, cor);
 	  retval = (res == res + cor) ? res : bsloww (a, da, x, n);
 	}
       else
 	{
-	  res = do_sin (a, da, &cor);
+	  res = do_sin_scalar (a, da, &cor);
 	  cor = 1.035 * cor + __copysign (eps, cor);
 	  retval = ((res == res + cor) ? __copysign (res, a)
 		    : bsloww1 (a, da, x, n));
@@ -426,7 +529,7 @@ do_sincos_2 (double a, double da, double x, int4 n, bool shift_quadrant)
 
     case 1:
     case 3:
-      res = do_cos (a, da, &cor);
+      res = do_cos_scalar (a, da, &cor);
       cor = 1.025 * cor + __copysign (eps, cor);
       retval = ((res == res + cor) ? ((n & 2) ? -res : res)
 		: bsloww2 (a, da, x, n));
@@ -436,33 +539,84 @@ do_sincos_2 (double a, double da, double x, int4 n, bool shift_quadrant)
   return retval;
 }
 
+static double
+__always_inline
+do_sincos_tail (double x, int4 k, int4 kl, bool shift_quadrant)
+{
+  int4 n;
+  double retval;
+  if (k < 0x419921FB)
+    {
+      double a, da;
+      n = reduce_sincos_1 (x, &a, &da);
+      retval = do_sincos_1 (a, da, x, n, shift_quadrant);
+    }
+  else if (k < 0x42F00000)
+    {
+      double a, da;
+      n = reduce_sincos_2 (x, &a, &da);
+      retval = do_sincos_2 (a, da, x, n, shift_quadrant);
+    }
+  else if (k < 0x7ff00000)
+    {
+      retval = reduce_and_compute (x, shift_quadrant);
+    }
+  else
+    {
+      if (k == 0x7ff00000 && kl == 0)
+	__set_errno (EDOM);
+      retval = x / x;
+    }
+  return retval;
+}
+
 /*******************************************************************/
 /* An ultimate sin routine. Given an IEEE double machine number x   */
 /* it computes the correctly rounded (to nearest) value of sin(x)  */
 /*******************************************************************/
 #ifdef IN_SINCOS
-static double
+static OP_TYPE
 #else
-double
+OP_TYPE
 SECTION
 #endif
-__sin (double x)
+__sin (OP_TYPE x)
 {
-  double xx, res, t, cor;
-  mynumber u;
-  int4 k, m;
-  double retval = 0;
+  OP_TYPE xx, res, res2, t, cor;
+  OP_TYPE retval;
 
 #ifndef IN_SINCOS
   SET_RESTORE_ROUND_53BIT (FE_TONEAREST);
 #endif
 
+#if VEC_SIZE == 1
+  mynumber u;
+  int4 m, k;
+  OP_TYPE zero = 0.0;
   u.x = x;
   m = u.i[HIGH_HALF];
   k = 0x7fffffff & m;		/* no sign           */
+#else /* VEC_SIZE == 2 */
+  mynumber u0, u1;
+  int4 m0, m1, k0, k1, k;
+  OP_TYPE zero = {0.0, 0.0};
+  u0.x = x[0];
+  u1.x = x[1];
+  m0 = u0.i[HIGH_HALF];
+  m1 = u1.i[HIGH_HALF];
+  k0 = 0x7fffffff & m0;
+  k1 = 0x7fffffff & m1;
+  k = k0 > k1 ? k0 : k1;
+#endif
+
   if (k < 0x3e500000)		/* if x->0 =>sin(x)=x */
     {
+#if VEC_SIZE == 1
       math_check_force_underflow (x);
+#else
+      math_check_force_underflow (x[0]);
+      math_check_force_underflow (x[1]);
+#endif
       retval = x;
     }
  /*---------------------------- 2^-26 < |x|< 0.25 ----------------------*/
@@ -473,54 +627,64 @@ __sin (double x)
       t = POLYNOMIAL (xx) * (xx * x);
       res = x + t;
       cor = (x - res) + t;
-      retval = (res == res + 1.07 * cor) ? res : slow (x);
+      res2 = res + 1.07 * cor;
+#if VEC_SIZE == 1
+      retval = (res == res2) ? res : slow (x);
+#else
+      retval[0] = (res[0] == res2[0]) ? res[0] : slow (x[0]);
+      retval[1] = (res[1] == res2[1]) ? res[1] : slow (x[1]);
+#endif
     }				/*  else  if (k < 0x3fd00000)    */
+
 /*---------------------------- 0.25<|x|< 0.855469---------------------- */
   else if (k < 0x3feb6000)
     {
-      res = do_sin (x, 0, &cor);
-      retval = (res == res + 1.096 * cor) ? res : slow1 (x);
+      res = do_sin_vector (x, zero, &cor);
+      res2 = res + 1.096 * cor;
+#if VEC_SIZE == 1
+      retval = (res == res2) ? res : slow1 (x);
       retval = __copysign (retval, x);
+#else
+      retval[0] = (res[0] == res2[0]) ? res[0] : slow1 (x[0]);
+      retval[1] = (res[1] == res2[1]) ? res[1] : slow1 (x[1]);
+      retval[0] = __copysign (retval[0], x[0]);
+      retval[1] = __copysign (retval[1], x[1]);
+#endif
     }				/*   else  if (k < 0x3feb6000)    */
 
 /*----------------------- 0.855469  <|x|<2.426265  ----------------------*/
   else if (k < 0x400368fd)
     {
-
+#if VEC_SIZE == 1
       t = hp0 - fabs (x);
-      res = do_cos (t, hp1, &cor);
-      retval = (res == res + 1.020 * cor) ? res : slow2 (x);
+      res = do_cos_vector (t, hp1, &cor);
+      res2 = res + 1.020 * cor;
+      retval = (res == res2) ? res : slow2 (x);
       retval = __copysign (retval, x);
+#else
+      OP_TYPE hp1_vec;
+      t[0] = hp0 - fabs (x[0]);
+      t[1] = hp0 - fabs (x[1]);
+      hp1_vec[0] = hp1_vec[1] = hp1;
+      res = do_cos_vector (t, hp1_vec, &cor);
+      res2 = res + 1.020 * cor;
+      retval[0] = (res[0] == res2[0]) ? res[0] : slow2 (x[0]);
+      retval[1] = (res[1] == res2[1]) ? res[1] : slow2 (x[1]);
+      retval[0] = __copysign (retval[0], x[0]);
+      retval[1] = __copysign (retval[1], x[1]);
+#endif
     }				/*   else  if (k < 0x400368fd)    */
 
 #ifndef IN_SINCOS
 /*-------------------------- 2.426265<|x|< 105414350 ----------------------*/
-  else if (k < 0x419921FB)
-    {
-      double a, da;
-      int4 n = reduce_sincos_1 (x, &a, &da);
-      retval = do_sincos_1 (a, da, x, n, false);
-    }				/*   else  if (k <  0x419921FB )    */
-
-/*---------------------105414350 <|x|< 281474976710656 --------------------*/
-  else if (k < 0x42F00000)
-    {
-      double a, da;
-
-      int4 n = reduce_sincos_2 (x, &a, &da);
-      retval = do_sincos_2 (a, da, x, n, false);
-    }				/*   else  if (k <  0x42F00000 )   */
-
-/* -----------------281474976710656 <|x| <2^1024----------------------------*/
-  else if (k < 0x7ff00000)
-    retval = reduce_and_compute (x, false);
-
-/*--------------------- |x| > 2^1024 ----------------------------------*/
   else
     {
-      if (k == 0x7ff00000 && u.i[LOW_HALF] == 0)
-	__set_errno (EDOM);
-      retval = x / x;
+#if VEC_SIZE == 1
+      retval = do_sincos_tail (x, k, u.i[LOW_HALF], false);
+#else
+      retval[0] = do_sincos_tail (x[0], k0, u0.i[LOW_HALF], false);
+      retval[1] = do_sincos_tail (x[1], k1, u1.i[LOW_HALF], false);
+#endif
     }
 #endif
 
@@ -534,85 +698,122 @@ __sin (double x)
 /*******************************************************************/
 
 #ifdef IN_SINCOS
-static double
+static OP_TYPE
 #else
-double
+OP_TYPE
 SECTION
 #endif
-__cos (double x)
+__cos (OP_TYPE x)
 {
-  double y, xx, res, cor, a, da;
-  mynumber u;
-  int4 k, m;
-
-  double retval = 0;
+  OP_TYPE y, xx, res, res2, a, da, cor;
+  OP_TYPE retval;
 
 #ifndef IN_SINCOS
   SET_RESTORE_ROUND_53BIT (FE_TONEAREST);
 #endif
 
+#if VEC_SIZE == 1
+  mynumber u;
+  int4 m, k;
+  OP_TYPE zero = 0.0;
+  OP_TYPE one = 1.0;
   u.x = x;
   m = u.i[HIGH_HALF];
-  k = 0x7fffffff & m;
+  k = 0x7fffffff & m;           /* no sign           */
+#else /* VEC_SIZE == 2 */
+  mynumber u0, u1;
+  int4 m0, m1, k0, k1, k;
+  OP_TYPE zero = {0.0, 0.0};
+  OP_TYPE one = {1.0, 1.0};
+  u0.x = x[0];
+  u1.x = x[1];
+  m0 = u0.i[HIGH_HALF];
+  m1 = u1.i[HIGH_HALF];
+  k0 = 0x7fffffff & m0;
+  k1 = 0x7fffffff & m1;
+  k = k0 > k1 ? k0 : k1;
+#endif
 
   /* |x|<2^-27 => cos(x)=1 */
   if (k < 0x3e400000)
-    retval = 1.0;
+      retval = one;
 
   else if (k < 0x3feb6000)
     {				/* 2^-27 < |x| < 0.855469 */
-      res = do_cos (x, 0, &cor);
-      retval = (res == res + 1.020 * cor) ? res : cslow2 (x);
+      res = do_cos_vector(x, zero, &cor);
+      res2 = res + 1.020 * cor;
+#if VEC_SIZE == 1
+      retval = (res == res2) ? res : cslow2 (x);
+#else
+      retval[0] = (res[0] == res2[0]) ? res[0] : cslow2 (x[0]);
+      retval[1] = (res[1] == res2[1]) ? res[1] : cslow2 (x[1]);
+#endif
     }				/*   else  if (k < 0x3feb6000)    */
 
   else if (k < 0x400368fd)
     { /* 0.855469  <|x|<2.426265  */ ;
+#if VEC_SIZE == 1
       y = hp0 - fabs (x);
       a = y + hp1;
       da = (y - a) + hp1;
       xx = a * a;
       if (xx < 0.01588)
 	{
-	  res = TAYLOR_SIN (xx, a, da, cor);
+	  res = taylor_sin_scalar (xx, a, da, &cor);
 	  cor = 1.02 * cor + __copysign (1.0e-31, cor);
 	  retval = (res == res + cor) ? res : sloww (a, da, x, true);
 	}
       else
 	{
-	  res = do_sin (a, da, &cor);
+	  res = do_sin_scalar (a, da, &cor);
 	  cor = 1.035 * cor + __copysign (1.0e-31, cor);
 	  retval = ((res == res + cor) ? __copysign (res, a)
 		    : sloww1 (a, da, x, true));
 	}
-
+#else
+      OP_TYPE hp1_vec, cors;
+      hp1_vec[0] = hp1_vec[1] = hp1;
+      y[0] = hp0 - fabs (x[0]);
+      y[1] = hp0 - fabs (x[1]);
+      a = y + hp1_vec;
+      da = (y - a) + hp1_vec;
+      xx = a * a;
+      if (xx[0] < 0.01588 && x[1] < 0.01588)
+	{
+	  res = taylor_sin_vector (xx, a, da, &cor);
+          cors[0] = __copysign (1.0e-31, cor[0]);
+          cors[1] = __copysign (1.0e-31, cor[1]);
+	  cor = 1.02 * cor + cors;
+          res2 = res + cor;
+	  retval[0] = (res[0] == res2[0]) ? res[0]
+					  : sloww (a[0], da[0], x[0], true);
+	  retval[1] = (res[1] == res2[1]) ? res[1]
+					  : sloww (a[1], da[1], x[1], true);
+	}
+      else
+	{
+	  res = do_sin_vector (a, da, &cor);
+	  cors[0] = __copysign (1.0e-31, cor[0]);
+	  cors[1] = __copysign (1.0e-31, cor[1]);
+	  cor = 1.035 * cor + cors;
+	  res2 = res + cor;
+          retval[0] = __copysign (res[0], a[0]);
+          retval[1] = __copysign (res[1], a[1]);
+	  retval[0] = (res[0] == res2[0]) ? retval[0] : sloww1 (a[0], da[0], x[0], true);
+	  retval[1] = (res[1] == res2[1]) ? retval[1] : sloww1 (a[1], da[1], x[1], true);
+	}
+#endif
     }				/*   else  if (k < 0x400368fd)    */
 
-
 #ifndef IN_SINCOS
-  else if (k < 0x419921FB)
-    {				/* 2.426265<|x|< 105414350 */
-      double a, da;
-      int4 n = reduce_sincos_1 (x, &a, &da);
-      retval = do_sincos_1 (a, da, x, n, true);
-    }				/*   else  if (k <  0x419921FB )    */
-
-  else if (k < 0x42F00000)
-    {
-      double a, da;
-
-      int4 n = reduce_sincos_2 (x, &a, &da);
-      retval = do_sincos_2 (a, da, x, n, true);
-    }				/*   else  if (k <  0x42F00000 )    */
-
-  /* 281474976710656 <|x| <2^1024 */
-  else if (k < 0x7ff00000)
-    retval = reduce_and_compute (x, true);
-
   else
     {
-      if (k == 0x7ff00000 && u.i[LOW_HALF] == 0)
-	__set_errno (EDOM);
-      retval = x / x;		/* |x| > 2^1024 */
+#if VEC_SIZE == 1
+      retval = do_sincos_tail (x, k, u.i[LOW_HALF], true);
+#else
+      retval[0] = do_sincos_tail (x[0], k0, u0.i[LOW_HALF], true);
+      retval[1] = do_sincos_tail (x[1], k1, u1.i[LOW_HALF], true);
+#endif
     }
 #endif