Patchwork s_sinf.c: Replace floor with FLOOR_DOUBLE_TO_INT/FLOOR_INT_TO_DOUBLE_HALF

login
register
mail settings
Submitter H.J. Lu
Date Dec. 5, 2017, 1:08 p.m.
Message ID <20171205130832.GA3696@intel.com>
Download mbox | patch
Permalink /patch/24735/
State New
Headers show

Comments

H.J. Lu - Dec. 5, 2017, 1:08 p.m.
Since s_sinf.c either assigns the return value of floor to integer
or passes double converted from integer to floor, this patch adds
FLOOR_DOUBLE_TO_INT and FLOOR_INT_TO_DOUBLE_HALF to replace floor.
They are default to floor.  A target can define FLOOR_DOUBLE_TO_INT
and FLOOR_INT_TO_DOUBLE_HALF as simple casts to avoid calling floor.

Also since long == int for 32-bit targets, we can use long instead of
int to avoid 64-bit integer for 64-bit targets.

On Skylake, bench-sinf reports performance improvement:

           Before        After         Improvement
max        130.566       129.564           0.8%
min        7.704         7.706             0%
mean       21.8188       19.1363           14%

Any comments?

H.J.
---
	* sysdeps/generic/math_private.h (FLOOR_DOUBLE_TO_INT): New.
	(FLOOR_INT_TO_DOUBLE_HALF): Likewise.
	* sysdeps/ieee754/flt-32/s_sinf.c (reduced): Replace long with
	int.
	(SINF_FUNC): Likewise.  Replace floor with FLOOR_DOUBLE_TO_INT
	and FLOOR_INT_TO_DOUBLE_HALF.
	* sysdeps/x86_64/fpu/math_private.h (__floor_double_to_int):
	New.
	(FLOOR_DOUBLE_TO_INT): Likewise.
	(__floor_int_to_double_half): Likewise.
	(FLOOR_INT_TO_DOUBLE_HALF): Likewise.
---
 sysdeps/generic/math_private.h    |  8 ++++++++
 sysdeps/ieee754/flt-32/s_sinf.c   | 13 +++++++------
 sysdeps/x86_64/fpu/math_private.h | 17 +++++++++++++++++
 3 files changed, 32 insertions(+), 6 deletions(-)
Joseph Myers - Dec. 5, 2017, 1:32 p.m.
On Tue, 5 Dec 2017, H.J. Lu wrote:

> diff --git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h
> index f29898c19c..a2cdce5b6a 100644
> --- a/sysdeps/generic/math_private.h
> +++ b/sysdeps/generic/math_private.h
> @@ -184,6 +184,14 @@ do {								\
>  } while (0)
>  #endif
>  
> +#ifndef FLOOR_DOUBLE_TO_INT
> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x))
> +#endif
> +
> +#ifndef FLOOR_INT_TO_DOUBLE_HALF
> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0)
> +#endif

These need comments defining their semantics (argument types and ranges, 
return values and types).

The existing code is using unsigned int.  You're using int here for the 
return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int in 
the caller.  (The semantics in the caller do in fact include that the 
argument is positive but in the range of int.)

I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by 
2.0 and calling floor, as opposed to dividing by integer 2 and converting 
the result to double (which is just a right shift, given the argument is 
unsigned) is optimal anywhere, unless the compiler optimizes it into an 
integer shift anyway.  Thus, it would make sense just to use "n / 2" 
unconditionally in place of "__floor (n / 2.0)".

Similarly, I'd expect a direct conversion to int (without a call to 
__floor) to be at least as good as the present code for 
FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it 
OK to do so.

So I think the natural change would not need to add any new macros, but 
would change s_sinf.c unconditionally to do essentially what your patch 
does in the x86_64 case.  As the original patch of Rajalakshmi's was 
benchmarked on s390, you could ask her to confirm such a change makes 
performance no worse there (and possibly improves it).
H.J. Lu - Dec. 5, 2017, 1:59 p.m.
On Tue, Dec 5, 2017 at 5:32 AM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Tue, 5 Dec 2017, H.J. Lu wrote:
>
>> diff --git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h
>> index f29898c19c..a2cdce5b6a 100644
>> --- a/sysdeps/generic/math_private.h
>> +++ b/sysdeps/generic/math_private.h
>> @@ -184,6 +184,14 @@ do {                                                             \
>>  } while (0)
>>  #endif
>>
>> +#ifndef FLOOR_DOUBLE_TO_INT
>> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x))
>> +#endif
>> +
>> +#ifndef FLOOR_INT_TO_DOUBLE_HALF
>> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0)
>> +#endif
>
> These need comments defining their semantics (argument types and ranges,
> return values and types).
>
> The existing code is using unsigned int.  You're using int here for the
> return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int in
> the caller.  (The semantics in the caller do in fact include that the
> argument is positive but in the range of int.)
>
> I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by
> 2.0 and calling floor, as opposed to dividing by integer 2 and converting
> the result to double (which is just a right shift, given the argument is
> unsigned) is optimal anywhere, unless the compiler optimizes it into an
> integer shift anyway.  Thus, it would make sense just to use "n / 2"
> unconditionally in place of "__floor (n / 2.0)".
>
> Similarly, I'd expect a direct conversion to int (without a call to
> __floor) to be at least as good as the present code for
> FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it
> OK to do so.
>
> So I think the natural change would not need to add any new macros, but
> would change s_sinf.c unconditionally to do essentially what your patch
> does in the x86_64 case.  As the original patch of Rajalakshmi's was
> benchmarked on s390, you could ask her to confirm such a change makes
> performance no worse there (and possibly improves it).

Here is the patch to replace floor with simple casts.  It improves performance
of max and mean by 30% on x86-64.  Rajalakshmi, can you try it on s390?

Thanks.
Rajalakshmi S - Dec. 5, 2017, 2:29 p.m.
On 12/05/2017 07:29 PM, H.J. Lu wrote:
> On Tue, Dec 5, 2017 at 5:32 AM, Joseph Myers<joseph@codesourcery.com>  wrote:
>> On Tue, 5 Dec 2017, H.J. Lu wrote:
>>
>>> diff --git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h
>>> index f29898c19c..a2cdce5b6a 100644
>>> --- a/sysdeps/generic/math_private.h
>>> +++ b/sysdeps/generic/math_private.h
>>> @@ -184,6 +184,14 @@ do {                                                             \
>>>   } while (0)
>>>   #endif
>>>
>>> +#ifndef FLOOR_DOUBLE_TO_INT
>>> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x))
>>> +#endif
>>> +
>>> +#ifndef FLOOR_INT_TO_DOUBLE_HALF
>>> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0)
>>> +#endif
>> These need comments defining their semantics (argument types and ranges,
>> return values and types).
>>
>> The existing code is using unsigned int.  You're using int here for the
>> return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int in
>> the caller.  (The semantics in the caller do in fact include that the
>> argument is positive but in the range of int.)
>>
>> I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by
>> 2.0 and calling floor, as opposed to dividing by integer 2 and converting
>> the result to double (which is just a right shift, given the argument is
>> unsigned) is optimal anywhere, unless the compiler optimizes it into an
>> integer shift anyway.  Thus, it would make sense just to use "n / 2"
>> unconditionally in place of "__floor (n / 2.0)".
>>
>> Similarly, I'd expect a direct conversion to int (without a call to
>> __floor) to be at least as good as the present code for
>> FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it
>> OK to do so.
>>
>> So I think the natural change would not need to add any new macros, but
>> would change s_sinf.c unconditionally to do essentially what your patch
>> does in the x86_64 case.  As the original patch of Rajalakshmi's was
>> benchmarked on s390, you could ask her to confirm such a change makes
>> performance no worse there (and possibly improves it).
> Here is the patch to replace floor with simple casts.  It improves performance
> of max and mean by 30% on x86-64.  Rajalakshmi, can you try it on s390?

Mean value reduced from 22.0 to 20.0 with this patch on s390x.
H.J. Lu - Dec. 5, 2017, 4 p.m.
On Tue, Dec 5, 2017 at 6:29 AM, Rajalakshmi Srinivasaraghavan
<raji@linux.vnet.ibm.com> wrote:
>
>
> On 12/05/2017 07:29 PM, H.J. Lu wrote:
>>
>> On Tue, Dec 5, 2017 at 5:32 AM, Joseph Myers<joseph@codesourcery.com>
>> wrote:
>>>
>>> On Tue, 5 Dec 2017, H.J. Lu wrote:
>>>
>>>> diff --git a/sysdeps/generic/math_private.h
>>>> b/sysdeps/generic/math_private.h
>>>> index f29898c19c..a2cdce5b6a 100644
>>>> --- a/sysdeps/generic/math_private.h
>>>> +++ b/sysdeps/generic/math_private.h
>>>> @@ -184,6 +184,14 @@ do {
>>>> \
>>>>   } while (0)
>>>>   #endif
>>>>
>>>> +#ifndef FLOOR_DOUBLE_TO_INT
>>>> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x))
>>>> +#endif
>>>> +
>>>> +#ifndef FLOOR_INT_TO_DOUBLE_HALF
>>>> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0)
>>>> +#endif
>>>
>>> These need comments defining their semantics (argument types and ranges,
>>> return values and types).
>>>
>>> The existing code is using unsigned int.  You're using int here for the
>>> return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int
>>> in
>>> the caller.  (The semantics in the caller do in fact include that the
>>> argument is positive but in the range of int.)
>>>
>>> I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by
>>> 2.0 and calling floor, as opposed to dividing by integer 2 and converting
>>> the result to double (which is just a right shift, given the argument is
>>> unsigned) is optimal anywhere, unless the compiler optimizes it into an
>>> integer shift anyway.  Thus, it would make sense just to use "n / 2"
>>> unconditionally in place of "__floor (n / 2.0)".
>>>
>>> Similarly, I'd expect a direct conversion to int (without a call to
>>> __floor) to be at least as good as the present code for
>>> FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it
>>> OK to do so.
>>>
>>> So I think the natural change would not need to add any new macros, but
>>> would change s_sinf.c unconditionally to do essentially what your patch
>>> does in the x86_64 case.  As the original patch of Rajalakshmi's was
>>> benchmarked on s390, you could ask her to confirm such a change makes
>>> performance no worse there (and possibly improves it).
>>
>> Here is the patch to replace floor with simple casts.  It improves
>> performance
>> of max and mean by 30% on x86-64.  Rajalakshmi, can you try it on s390?
>
>
> Mean value reduced from 22.0 to 20.0 with this patch on s390x.
>

Any objections/comenrts for my patch:

https://sourceware.org/ml/libc-alpha/2017-12/msg00121.html

to master?

Thanks.
Joseph Myers - Dec. 5, 2017, 4:20 p.m.
On Tue, 5 Dec 2017, Rajalakshmi Srinivasaraghavan wrote:

> > Here is the patch to replace floor with simple casts.  It improves
> > performance
> > of max and mean by 30% on x86-64.  Rajalakshmi, can you try it on s390?
> 
> Mean value reduced from 22.0 to 20.0 with this patch on s390x.

Thanks.  The patch is OK.

Patch

diff --git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h
index f29898c19c..a2cdce5b6a 100644
--- a/sysdeps/generic/math_private.h
+++ b/sysdeps/generic/math_private.h
@@ -184,6 +184,14 @@  do {								\
 } while (0)
 #endif
 
+#ifndef FLOOR_DOUBLE_TO_INT
+# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x))
+#endif
+
+#ifndef FLOOR_INT_TO_DOUBLE_HALF
+# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0)
+#endif
+
 /* We need to guarantee an expansion of name when building
    ldbl-128 files as another type (e.g _Float128).  */
 #define mathx_hidden_def(name) hidden_def(name)
diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
index 40d3d197a8..13a49ceb1b 100644
--- a/sysdeps/ieee754/flt-32/s_sinf.c
+++ b/sysdeps/ieee754/flt-32/s_sinf.c
@@ -85,8 +85,8 @@  static const int ones[] = { +1, -1 };
    SIGNBIT is used to add the correct sign after the Chebyshev
    polynomial is computed.  */
 static inline float
-reduced (const double theta, const unsigned long int n,
-	 const unsigned long int signbit)
+reduced (const double theta, const unsigned int n,
+	 const unsigned int signbit)
 {
   double sx;
   const double theta2 = theta * theta;
@@ -162,14 +162,14 @@  SINF_FUNC (float x)
     }
   else                          /* |x| >= Pi/4.  */
     {
-      unsigned long int signbit = (x < 0);
+      unsigned int signbit = (x < 0);
       if (abstheta < 9 * M_PI_4)        /* |x| < 9*Pi/4.  */
 	{
 	  /* There are cases where FE_UPWARD rounding mode can
 	     produce a result of abstheta * inv_PI_4 == 9,
 	     where abstheta < 9pi/4, so the domain for
 	     pio2_table must go to 5 (9 / 2 + 1).  */
-	  unsigned long int n = (abstheta * inv_PI_4) + 1;
+	  unsigned int n = (abstheta * inv_PI_4) + 1;
 	  theta = abstheta - pio2_table[n / 2];
 	  return reduced (theta, n, signbit);
 	}
@@ -177,8 +177,9 @@  SINF_FUNC (float x)
 	{
 	  if (abstheta < 0x1p+23)     /* |x| < 2^23.  */
 	    {
-	      unsigned long int n = __floor (abstheta * inv_PI_4) + 1.0;
-	      double x = __floor (n / 2.0);
+	      unsigned int n
+		= FLOOR_DOUBLE_TO_INT (abstheta * inv_PI_4) + 1.0;
+	      double x = FLOOR_INT_TO_DOUBLE_HALF (n);
 	      theta = x * PI_2_lo + (x * PI_2_hi + abstheta);
 	      /* Argument reduction needed.  */
 	      return reduced (theta, n, signbit);
diff --git a/sysdeps/x86_64/fpu/math_private.h b/sysdeps/x86_64/fpu/math_private.h
index 027a6a3a4d..738897c9c6 100644
--- a/sysdeps/x86_64/fpu/math_private.h
+++ b/sysdeps/x86_64/fpu/math_private.h
@@ -45,6 +45,23 @@ 
     f = f__;								      \
   } while (0)
 
+extern inline int
+__floor_double_to_int (double x)
+{
+  return x;
+}
+
+#define FLOOR_DOUBLE_TO_INT(x) __floor_double_to_int (x)
+
+extern inline double
+__floor_int_to_double_half (int x)
+{
+  x /= 2;
+  return x;
+}
+
+#define FLOOR_INT_TO_DOUBLE_HALF(x) __floor_int_to_double_half (x)
+
 #include <sysdeps/i386/fpu/fenv_private.h>
 #include_next <math_private.h>