Patchwork [2/6] Remove slow paths from sin/cos

Submitter Wilco Dijkstra
Date March 9, 2018, 3:46 p.m.
Message ID <DB6PR0801MB2053788EB64591EC936E598683DE0@DB6PR0801MB2053.eurprd08.prod.outlook.com>
Permalink /patch/26251/
State New

Comments

Wilco Dijkstra - March 9, 2018, 3:46 p.m.
This patch removes the 2nd of the 3 range reduction cases and defers to the final one.
Input values above 2^27 are extremely rare, so this case doesn't need to be
optimized as heavily as smaller inputs.

ChangeLog:
2018-03-09  Wilco Dijkstra  <wdijkstr@arm.com>

	* sysdeps/ieee754/dbl-64/s_sin.c (reduce_sincos_2): Remove function.
	(do_sincos_2): Likewise.
	(__sin): Remove middle range reduction case.
	(__cos): Likewise.
	* sysdeps/ieee754/dbl-64/s_sincos.c (__sincos): Remove middle range
	reduction case.

--
Siddhesh Poyarekar - March 9, 2018, 4:17 p.m.
On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote:
> This patch removes the 2nd of the 3 range reduction cases and defers to the final one.
> Input values above 2^27 are extremely rare, so this case doesn't need to be
> optimized as heavily as smaller inputs.

Please elaborate on two more points:

- The basis for your claim that values above 2^27 are extremely rare,
  i.e. papers that you may have referred that conclude this and/or
  applications you may have profiled

- The quantum of performance loss due to dropping the 2nd case.

Thanks,
Siddhesh
Ondrej Bilka - March 9, 2018, 6:19 p.m.
On Fri, Mar 09, 2018 at 09:47:09PM +0530, Siddhesh Poyarekar wrote:
> On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote:
> > This patch removes the 2nd of the 3 range reduction cases and defers to the final one.
> > Input values above 2^27 are extremely rare, so this case doesn't need to be
> > optimized as heavily as smaller inputs.
> 
> Please elaborate on two more points:
> 
> - The basis for your claim that values above 2^27 are extremely rare,
>   i.e. papers that you may have referred that conclude this and/or
>   applications you may have profiled
> 
The main reason is that for these inputs accuracy doesn't make sense.
There is already a millions-of-times bigger error caused by the limited
precision when rounding the input to float.
Zack Weinberg - March 9, 2018, 6:52 p.m.
On Fri, Mar 9, 2018 at 1:19 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Fri, Mar 09, 2018 at 09:47:09PM +0530, Siddhesh Poyarekar wrote:
>> On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote:
>> > This patch removes the 2nd of the 3 range reduction cases and defers to the final one.
>> > Input values above 2^27 are extremely rare, so this case doesn't need to be
>> > optimized as heavily as smaller inputs.
>>
>> Please elaborate on two more points:
>>
>> - The basis for your claim that values above 2^27 are extremely rare,
>>   i.e. papers that you may have referred that conclude this and/or
>>   applications you may have profiled
>>
> The main reason is that for these inputs accuracy doesn't make sense.
> There is already a millions-of-times bigger error caused by the limited
> precision when rounding the input to float.

More concretely, for IEEE single, the gap between representable values
is bigger than 2π for values of magnitude 2^26 or above.  Since
the sine function is periodic over 2π, that means the result of sinf()
is effectively meaningless for any input at least that big - _any
value_ within the input period could have been rounded to that
representable value, so the "true" answer could be anything.  (I am
tempted to suggest that we return NaN for inputs this large, without
even bothering to do argument reduction.)

For double, this happens at 2^55, and for x86-64 long double it
happens at 2^66.  I _think_ x86-64 long double is 80-bit IEEE
extended, not quad, but I could be wrong.

zw
Wilco Dijkstra - March 9, 2018, 7:06 p.m.
Ondřej Bílka wrote:
> On Fri, Mar 09, 2018 at 09:47:09PM +0530, Siddhesh Poyarekar wrote:
>> On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote:
>> > This patch removes the 2nd of the 3 range reduction cases and defers to the final one.
>> > Input values above 2^27 are extremely rare, so this case doesn't need to be
>> > optimized as heavily as smaller inputs.
>> 
>> Please elaborate on two more points:
>> 
>> - The basis for your claim that values above 2^27 are extremely rare,
>>   i.e. papers that you may have referred that conclude this and/or
>>   applications you may have profiled
>> 
> The main reason is that for these inputs accuracy doesn't make sense.
> There is already a millions-of-times bigger error caused by the limited
> precision when rounding the input to float.

Exactly, it's rare for sin/cos inputs to be > 10; the largest I've ever seen was around 1000.
There is a case to be made for making wildly out-of-range inputs an error or just returning 0.0
rather than using insanely accurate range reduction (the result is going to be equally wrong
either way), but that's a different discussion.

Fast range reduction up to 2^27 is huge overkill; we could get a significant speedup by
reducing this range.

The 2nd level range reduction (2^27 till 2^48) is so pointless that there is absolutely no
reason to keep it. There is about a 2.3x slowdown in this range due to __branred being
braindead slow.

Basically I can only see a reason for 3 levels of range reduction if the first range reduction
only handles a tiny range (say up to 10), the final range reduction is extremely slow and we
can find proof of an application relying on fast range reduction up to say 1000 or 1000000.

I think a 3-4x speedup of __branred should be feasible using a much simpler algorithm and
integer arithmetic. This will not only cover a much wider range but then we will also be sure
it works correctly!

Wilco
Siddhesh Poyarekar - March 9, 2018, 7:30 p.m.
On Saturday 10 March 2018 12:36 AM, Wilco Dijkstra wrote:
> Exactly, it's rare for sin/cos inputs to be > 10; the largest I've ever seen was around 1000.
> There is a case to be made for making wildly out-of-range inputs an error or just returning 0.0
> rather than using insanely accurate range reduction (the result is going to be equally wrong
> either way), but that's a different discussion.

To be clear, I do not object to the patchset, it is the right way
forward.  I would like to see clearer documentation as to why we decided
to drop this path in this patch and what the impact of that is.  In that
sense, it would be nice to have a better rationale in the commit log
than "I've never seen inputs that big".

> Fast range reduction up to 2^27 is huge overkill; we could get a significant speedup by
> reducing this range.
>
> The 2nd level range reduction (2^27 till 2^48) is so pointless that there is absolutely no
> reason to keep it. There is about a 2.3x slowdown in this range due to __branred being
> braindead slow.

Thanks, all this should be part of the commit log.  However, from what I
understand, this particular patch will result in a regression (since
IIRC reduce_and_compute is slower) but the following patch will fix that
and improve things.

Siddhesh
Steve Ellcey - March 9, 2018, 11:05 p.m.
On Fri, 2018-03-09 at 13:52 -0500, Zack Weinberg wrote:

> More concretely, for IEEE single, the gap between representable values
> is bigger than 2π for values whose exponent is 2^26 or above.  Since
> the sine function is periodic over 2π, that means the result of sinf()
> is effectively meaningless for any input at least that big - _any
> value_ within the input period could have been rounded to the
> representable, so the "true" answer could be anything.  (I am tempted
> to suggest that we return NaN for inputs this large, without even
> bothering to do argument reduction.)
> 
> For double, this happens at 2^55, and for x86-64 long double it
> happens at 2^66.  I _think_ x86-64 long double is 80-bit IEEE
> extended, not quad, but I could be wrong.
> 
> zw

And yet we have tests for large numbers in auto-libm-test-out-sin
and auto-libm-test-out-cos.  I was testing a different vector sin/cos
implementation on aarch64 and I got some big diffs for very large
inputs.  Maybe we shouldn't test for these values if they are not
of any value.

testing double (vector length 2)
Failure: Test: sin_vlen2 (-0x2p+64)
Result:
 is:         -0.0000000000000000e+00  -0x0.0000000000000p+0
 should be:   4.7183876212354675e-02   0x1.8287c2a760e04p-5
 difference:  4.7183876212354675e-02   0x1.8287c2a760e04p-5
 ulp       :  6799913194491396.0000
 max.ulp   :  1.0000
Failure: Test: sin_vlen2 (-0x3.3de320f6be87ep+1020)
Result:
 is:         -0.0000000000000000e+00  -0x0.0000000000000p+0
 should be:  -9.9219562491895441e-01  -0x1.fc0110a085bafp-1
 difference:  9.9219562491895441e-01   0x1.fc0110a085bafp-1
 ulp       :  8936903693327279.0000
 max.ulp   :  1.0000
Joseph Myers - March 10, 2018, 12:52 a.m.
On Fri, 9 Mar 2018, Steve Ellcey wrote:

> And yet we have tests for large numbers in auto-libm-test-out-sin
> and auto-libm-test-out-cos.  I was testing a different vector sin/cos

Yes.  A floating-point number represents a particular real number, not an 
interval, so all the usual accuracy goals (of results within a few ulps of 
the correct answer) apply for large inputs (but performance is not a 
concern for those inputs).  Only for IBM long double is this relaxed, to 
treat values not representable in 106 mantissa bits as if they do 
represent intervals.
Wilco Dijkstra - March 12, 2018, 3:36 p.m.
Joseph Myers wrote:

> Yes.  A floating-point number represents a particular real number, not an 
> interval, so all the usual accuracy goals (of results within a few ulps of 
> the correct answer) apply for large inputs (but performance is not a 
> concern for those inputs).  Only for IBM long double is this relaxed, to 
> treat values not representable in 106 mantissa bits as if they do 
> represent intervals.

Though these patches keep the ULP accuracy across the full range as is, 
we could agree on higher ULP errors for large/huge range reduction cases 
in the future. The main complexity is for certain rare inputs which happen to
be extremely close to an integer multiple of PI/2, and those few cases mean
you need significant extra work to guarantee 0.5 ULP error bound on range
reduction.

Wilco
Zack Weinberg - March 12, 2018, 3:45 p.m.
On Mon, Mar 12, 2018 at 11:36 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Joseph Myers wrote:
>
>> Yes.  A floating-point number represents a particular real number, not an
>> interval, so all the usual accuracy goals (of results within a few ulps of
>> the correct answer) apply for large inputs (but performance is not a
>> concern for those inputs).  Only for IBM long double is this relaxed, to
>> treat values not representable in 106 mantissa bits as if they do
>> represent intervals.
>
> Though these patches keep the ULP accuracy across the full range as is,
> we could agree on higher ULP errors for large/huge range reduction cases
> in the future. The main complexity is for certain rare inputs which happen to
> be extremely close to an integer multiple of PI/2, and those few cases mean
> you need significant extra work to guarantee 0.5 ULP error bound on range
> reduction.

I think 0.5 ULP error bound for values close to _small_ integer
multiples of pi/2 is worth preserving.  I'm not sure what the right
cutoff is, but if we need to do that "significant extra work" for,
like, values close to 10*pi and the most natural way to implement that
also gives us the same error bound for values close to 1e6*pi, then I
don't see why we shouldn't keep it for both.

zw
Joseph Myers - March 12, 2018, 4:10 p.m.
On Mon, 12 Mar 2018, Wilco Dijkstra wrote:

> Though these patches keep the ULP accuracy across the full range as is, 
> we could agree on higher ULP errors for large/huge range reduction cases 
> in the future. The main complexity is for certain rare inputs which happen to
> be extremely close to an integer multiple of PI/2, and those few cases mean
> you need significant extra work to guarantee 0.5 ULP error bound on range
> reduction.

I don't think the work for having an error bound not much more than 0.5ulp 
on the final result of sin/cos for large arguments is significantly 
different from the work for having an error bound of say 3ulp (the 
testsuite has a global maximum of 9ulp (16ulp for IBM long double) beyond 
which it will not accept errors even if those large errors are listed in 
libm-test-ulps files - that bound is simply based on the errors 
empirically observed at present, for functions other than Bessel functions 
and cpow which are known to have cases with much larger errors).
Wilco Dijkstra - March 12, 2018, 6:08 p.m.
Siddhesh Poyarekar wrote:
> On Saturday 10 March 2018 12:36 AM, Wilco Dijkstra wrote:
>> Exactly, it's rare for sin/cos inputs to be > 10; the largest I've ever seen was around 1000.
>> There is a case to be made for making wildly out-of-range inputs an error or just returning 0.0
>> rather than using insanely accurate range reduction (the result is going to be equally wrong
>> either way), but that's a different discussion.
>
> To be clear, I do not object to the patchset, it is the right way
> forward.  I would like to see clearer documentation as to why we decided
> to drop this path in this patch and what the impact of that is.  In that
> sense, it would be nice to have a better rationale in the commit log
> than "I've never seen inputs that big".

There are multiple reasons: correctness (the implementation wasn't accurate enough),
premature optimization, complexity, etc. If we are to spend time on developing a better
range reduction, the time is best spent on (a) common cases, i.e. small values, and
(b) a much faster general range reducer (which could be designed to be faster on
smaller inputs rather than constant-time like __branred).

A lot of the speedup is achieved by just making the code simpler. If we want to support 3-level
range reduction, there must be a really good reason for it (I've not seen 3-level reduction used
anywhere else).

>> Fast range reduction up to 2^27 is huge overkill; we could get a significant speedup by
>> reducing this range.
>>
>> The 2nd level range reduction (2^27 till 2^48) is so pointless that there is absolutely no
>> reason to keep it. There is about a 2.3x slowdown in this range due to __branred being
>> braindead slow.
>
> Thanks, all this should be part of the commit log.  However, from what I
> understand, this particular patch will result in a regression (since
> IIRC reduce_and_compute is slower) but the following patch will fix that
> and improve things.

I'll improve the commit log in the next version (assuming there are comments on the actual
patch!).

Wilco
Wilco Dijkstra - March 12, 2018, 9:13 p.m.
Joseph Myers wrote:
> On Mon, 12 Mar 2018, Wilco Dijkstra wrote:
>
>> Though these patches keep the ULP accuracy across the full range as is, 
>> we could agree on higher ULP errors for large/huge range reduction cases 
>> in the future. The main complexity is for certain rare inputs which happen to
>> be extremely close to an integer multiple of PI/2, and those few cases mean
>> you need significant extra work to guarantee 0.5 ULP error bound on range
>> reduction.
>
> I don't think the work for having an error bound not much more than 0.5ulp 
> on the final result of sin/cos for large arguments is significantly 
> different from the work for having an error bound of say 3ulp (the 
> testsuite has a global maximum of 9ulp (16ulp for IBM long double) beyond 
> which it will not accept errors even if those large errors are listed in 
> libm-test-ulps files - that bound is simply based on the errors 
> empirically observed at present, for functions other than Bessel functions 
> and cpow which are known to have cases with much larger errors).

Having to do another reduction step can increase latency significantly. To give
an example, I had to increase the accuracy of the first-level range reduction because
a handful of huge multiples of PI (6-7 digits) had 2-3 ULP reduction error. The additional
accuracy means range reduction became 50% slower while being way too accurate
for almost all inputs.

Obviously it would be possible to redesign it from scratch using a much smaller input
range, FMA when available, etc.

Wilco
Siddhesh Poyarekar - March 13, 2018, 8:53 a.m.
On Monday 12 March 2018 11:38 PM, Wilco Dijkstra wrote:
> I'll improve the commit log in the next version (assuming there are comments on the actual
> patch!).

Overall the patchset looks good to me but I don't think I'll have time
to take a closer look any time this week or the next.  In any case I
will defer to Joseph for final approval of the patchset given his more
extensive experience in the area.

Siddhesh

Patch

diff --git a/sysdeps/ieee754/dbl-64/s_sin.c b/sysdeps/ieee754/dbl-64/s_sin.c
index 9673a461ac592fc2bf3babc755dae336312e4c56..1f98e29278183d1fccd7c2b3fd467d6b16c245ed 100644
--- a/sysdeps/ieee754/dbl-64/s_sin.c
+++ b/sysdeps/ieee754/dbl-64/s_sin.c
@@ -362,80 +362,6 @@  do_sincos_1 (double a, double da, double x, int4 n, bool shift_quadrant)
   return retval;
 }
 
-static inline int4
-__always_inline
-reduce_sincos_2 (double x, double *a, double *da)
-{
-  mynumber v;
-
-  double t = (x * hpinv + toint);
-  double xn = t - toint;
-  v.x = t;
-  double xn1 = (xn + 8.0e22) - 8.0e22;
-  double xn2 = xn - xn1;
-  double y = ((((x - xn1 * mp1) - xn1 * mp2) - xn2 * mp1) - xn2 * mp2);
-  int4 n = v.i[LOW_HALF] & 3;
-  double db = xn1 * pp3;
-  t = y - db;
-  db = (y - t) - db;
-  db = (db - xn2 * pp3) - xn * pp4;
-  double b = t + db;
-  db = (t - b) + db;
-
-  *a = b;
-  *da = db;
-
-  return n;
-}
-
-/* Compute sin (A + DA).  cos can be computed by passing SHIFT_QUADRANT as
-   true, which results in shifting the quadrant N clockwise.  */
-static double
-__always_inline
-do_sincos_2 (double a, double da, double x, int4 n, bool shift_quadrant)
-{
-  double res, retval, cor, xx;
-
-  double eps = 1.0e-24;
-
-  int4 k = (n + shift_quadrant) & 3;
-
-  switch (k)
-    {
-    case 2:
-      a = -a;
-      da = -da;
-      /* Fall through.  */
-    case 0:
-      xx = a * a;
-      if (xx < 0.01588)
-	{
-	  /* Taylor series.  */
-	  res = TAYLOR_SIN (xx, a, da, cor);
-	  cor = 1.02 * cor + __copysign (eps, cor);
-	  retval = (res == res + cor) ? res : bsloww (a, da, x, n);
-	}
-      else
-	{
-	  res = do_sin (a, da, &cor);
-	  cor = 1.035 * cor + __copysign (eps, cor);
-	  retval = ((res == res + cor) ? __copysign (res, a)
-		    : bsloww1 (a, da, x, n));
-	}
-      break;
-
-    case 1:
-    case 3:
-      res = do_cos (a, da, &cor);
-      cor = 1.025 * cor + __copysign (eps, cor);
-      retval = ((res == res + cor) ? ((n & 2) ? -res : res)
-		: bsloww2 (a, da, x, n));
-      break;
-    }
-
-  return retval;
-}
-
 /*******************************************************************/
 /* An ultimate sin routine. Given an IEEE double machine number x   */
 /* it computes the correctly rounded (to nearest) value of sin(x)  */
@@ -498,16 +424,7 @@  __sin (double x)
       retval = do_sincos_1 (a, da, x, n, false);
     }				/*   else  if (k <  0x419921FB )    */
 
-/*---------------------105414350 <|x|< 281474976710656 --------------------*/
-  else if (k < 0x42F00000)
-    {
-      double a, da;
-
-      int4 n = reduce_sincos_2 (x, &a, &da);
-      retval = do_sincos_2 (a, da, x, n, false);
-    }				/*   else  if (k <  0x42F00000 )   */
-
-/* -----------------281474976710656 <|x| <2^1024----------------------------*/
+/* --------------------105414350 <|x| <2^1024------------------------------*/
   else if (k < 0x7ff00000)
     retval = reduce_and_compute (x, false);
 
@@ -584,15 +501,7 @@  __cos (double x)
       retval = do_sincos_1 (a, da, x, n, true);
     }				/*   else  if (k <  0x419921FB )    */
 
-  else if (k < 0x42F00000)
-    {
-      double a, da;
-
-      int4 n = reduce_sincos_2 (x, &a, &da);
-      retval = do_sincos_2 (a, da, x, n, true);
-    }				/*   else  if (k <  0x42F00000 )    */
-
-  /* 281474976710656 <|x| <2^1024 */
+  /* 105414350 <|x| <2^1024 */
   else if (k < 0x7ff00000)
     retval = reduce_and_compute (x, true);
 
diff --git a/sysdeps/ieee754/dbl-64/s_sincos.c b/sysdeps/ieee754/dbl-64/s_sincos.c
index e1977ea7e93c32cca5369677f23e68f8f797a9f4..a9af8ce526bfe78c06cfafa65de0815ec69585c5 100644
--- a/sysdeps/ieee754/dbl-64/s_sincos.c
+++ b/sysdeps/ieee754/dbl-64/s_sincos.c
@@ -86,16 +86,6 @@  __sincos (double x, double *sinx, double *cosx)
 
       return;
     }
-  if (k < 0x42F00000)
-    {
-      double a, da;
-      int4 n = reduce_sincos_2 (x, &a, &da);
-
-      *sinx = do_sincos_2 (a, da, x, n, false);
-      *cosx = do_sincos_2 (a, da, x, n, true);
-
-      return;
-    }
   if (k < 0x7ff00000)
     {
       reduce_and_compute_sincos (x, sinx, cosx);