libm,ieee754: New algorithm of fmod function for dbl-64/wordsize-64

Message ID 0c028906-f5a6-deed-a0ff-f741aee9c5e6@gmail.com
State Superseded
Series libm,ieee754: New algorithm of fmod function for dbl-64/wordsize-64

Commit Message

Kirill Okhotnikov Nov. 19, 2020, 1:28 p.m. UTC
I propose a new algorithm for fmod calculation for the double type on
wordsize-64 architectures. The algorithm description is in the file.
I successfully ran the internal tests on x86_64 processors.
I also did extensive tests and benchmarks with my own test suite on
x86_64 and ARM64. See
https://github.com/orex/test_fmod
My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm
is up to 20 times faster for "extreme" cases, and up to two times
faster for regular uses of the function.
I also did unit testing which shows that the old and new algorithms
give binary-equivalent results for each of billions of different
pairs (x, y) over a wide range of numbers, including normal,
subnormal, and special values (NaN, INF, 0).
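
For what it's worth, a bit-exact comparison of this kind can be sketched
as follows (a hypothetical harness: the name __ieee754_fmod_new is an
assumption standing in for the proposed routine, and a real run would use
a proper 64-bit PRNG and far more iterations):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern double __ieee754_fmod_new (double, double); /* hypothetical name */

/* Raw IEEE-754 bits; distinguishes -0.0, 0.0 and NaN payloads.  */
static uint64_t
bits (double d)
{
  uint64_t w;
  memcpy (&w, &d, sizeof w);
  return w;
}

int
main (void)
{
  for (long i = 0; i < 1000000; i++)
    {
      /* Random bit patterns hit normals, subnormals, NaN, INF and 0.  */
      uint64_t wx = ((uint64_t) rand () << 32) ^ (uint64_t) rand ();
      uint64_t wy = ((uint64_t) rand () << 32) ^ (uint64_t) rand ();
      double x, y;
      memcpy (&x, &wx, sizeof x);
      memcpy (&y, &wy, sizeof y);
      if (bits (fmod (x, y)) != bits (__ieee754_fmod_new (x, y)))
        printf ("mismatch: x=%a y=%a\n", x, y);
    }
  return 0;
}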

Reply to adhemerval.zanella@linaro.org from the libc-help thread.

> I won't comment on your implementation until it is submitted to
> libc-alpha with proper Copyright assignment, but some remarks:
FSF status: In progress.
>
>   1. If you are benchmarking on i686, it won't use the generic
>      sysdeps/ieee754/dbl-64/e_fmod.c but rather an asm-specific
>      implementation (sysdeps/i386/fpu/e_fmod.S).  It would be good
>      if you check if we can remove the i386 assembly implementation
>      in favor of the generic one.
You can remove it, of course, in favour of the generic implementation,
but I assume you will get a strong performance degradation. It is
worth checking. I don't have an i686 machine, so I don't know how
relevant tests on my CPU would be.
>
>   2. __builtin_clz might be a libcall on some targets, which might
>      be worse than using the loop.  It is usually not an issue on
>      some implementations (it is used in some float128 code which is
>      usually a soft implementation anyway), but it is something to
>      keep in mind.
I checked the code of libm. The builtin is already used (except for ldbl-128) in

sysdeps/ieee754/flt-32/s_logbf.c:      rix -= __builtin_clz (ix) - 9;
sysdeps/ieee754/dbl-64/s_logb.c:      int m = __builtin_clzll (ix);

The second thing is that the loop is not efficient anyway. There is a
more efficient table-based approach (for CLZ and CTZ):

https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/

Perhaps it is already implemented in GCC?
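
For reference, a minimal sketch of the table approach for 64-bit CTZ
(the well-known De Bruijn-multiplication variant; the constant and the
table are standard published values, not taken from glibc or from this
patch):

#include <stdint.h>

static const int ctz_table[64] = {
   0,  1, 48,  2, 57, 49, 28,  3,
  61, 58, 50, 42, 38, 29, 17,  4,
  62, 55, 59, 36, 53, 51, 43, 22,
  45, 39, 33, 30, 24, 18, 12,  5,
  63, 47, 56, 27, 60, 41, 37, 16,
  54, 35, 52, 21, 44, 32, 23, 11,
  46, 26, 40, 15, 34, 20, 31, 10,
  25, 14, 19,  9, 13,  8,  7,  6,
};

/* Count trailing zeros of a nonzero x: isolate the lowest set bit and
   hash it to a unique 6-bit table index via the De Bruijn constant.  */
static int
ctz64 (uint64_t x)
{
  return ctz_table[((x & -x) * UINT64_C (0x03f79d71b4cb0a89)) >> 58];
}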

>
>   3. It would be good if we could get rid of the wordsize-64
>      implementation and just have a generic one good enough regardless
>      of word size.
I can't imagine such an algorithm. The size of a double is 64 bits;
splitting it over two 32-bit variables will of course have a
performance impact. The algorithm I propose (see the description)
works strictly with 64-bit variables.
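
To illustrate the point, the h * 2^i split needs only a single uint64_t
per value; a minimal sketch (a hypothetical helper mirroring what the
patch does with EXTRACT_WORDS64 and SET_NORMAL_VALUE):

#include <stdint.h>
#include <string.h>

static void
split_double (double x, uint64_t *h, int32_t *i)
{
  uint64_t w;
  memcpy (&w, &x, sizeof w);               /* raw IEEE-754 bits */
  *i = (int32_t) ((w >> 52) & 0x7ff);      /* biased exponent */
  *h = w & UINT64_C (0x000fffffffffffff);  /* 52-bit mantissa field */
  if (*i != 0)
    *h |= UINT64_C (0x0010000000000000);   /* implicit leading 1, normals */
}

On a 32-bit target the same logic forces either a pair of 32-bit words
or compiler-synthesized 64-bit arithmetic, which is where the expected
slowdown comes from.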

Best,
Kirill.
  

Comments

Adhemerval Zanella Nov. 19, 2020, 2:54 p.m. UTC | #1
On 19/11/2020 10:28, Kirill Okhotnikov wrote:
> I propose a new algorithm for fmod calculation for the double type on
> wordsize-64 architectures. The algorithm description is in the file.
> I successfully ran the internal tests on x86_64 processors.
> I also did extensive tests and benchmarks with my own test suite on x86_64 and ARM64. See
> https://github.com/orex/test_fmod
> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is up to 20 times faster for
> "extreme" cases, and up to two times faster for regular uses of the function.
> I also did unit testing which shows that the old and new algorithms give binary-equivalent
> results for each of billions of different pairs (x, y) over a wide range of numbers,
> including normal, subnormal, and special values (NaN, INF, 0).
> 
> Reply to adhemerval.zanella@linaro.org from libc-help thread.
> 
>> I won't comment on your implementation until it is submitted to
>> libc-alpha with proper Copyright assignment, but some remarks:
> FSF status: In progress.

Thanks (I think other glibc developers will only check the patch itself
once the papers are sorted out).

>>
>>   1. If you are benchmarking on i686, it won't use the generic
>>      sysdeps/ieee754/dbl-64/e_fmod.c but rather an asm-specific
>>      implementation (sysdeps/i386/fpu/e_fmod.S).  It would be good
>>      if you check if we can remove the i386 assembly implementation
>>      in favor of the generic one.
> You can remove it, of course, in favour of the generic implementation, but I assume you will get a strong performance degradation. It is worth checking. I don't have an i686 machine, so I don't know how relevant tests on my CPU would be.

You can check i686 performance on an x86_64 machine; the idea is to see
what the resulting performance is, and depending on the result we might
remove the assembly implementation.

>>
>>   2. __builtin_clz might be a libcall on some targets, which might
>>      be worse than using the loop.  It is usually not an issue on
>>      some implementations (it is used in some float128 code which is
>>      usually a soft implementation anyway), but it is something to
>>      keep in mind.
> I checked the code of libm. The builtin is already used (except for ldbl-128) in
> 
> sysdeps/ieee754/flt-32/s_logbf.c:      rix -= __builtin_clz (ix) - 9;
> sysdeps/ieee754/dbl-64/s_logb.c:      int m = __builtin_clzll (ix);
> 
> The second thing is that the loop is not efficient anyway. There is a more efficient table-based approach (for CLZ and CTZ):
> 
> https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/
> 
> Perhaps it is already implemented in GCC?

It really depends on the architecture; stdlib/longlong.h has both
count_leading_zeros/count_trailing_zeros, which either call
__builtin_clzll/__builtin_ctzll or use an optimized version without
a libgcc libcall.

libgcc also has multiple optimized versions depending on the
architecture, and the generic one uses a lookup table AFAIK.

> 
>>
>>   3. It would be good if we could get rid of the wordsize-64
>>      implementation and just have a generic one good enough regardless
>>      of word size.
> I can't imagine such an algorithm. The size of a double is 64 bits; splitting it over two 32-bit variables will of course have a performance impact. The algorithm I propose (see the description) works strictly with 64-bit variables.

The question is whether the 64-bit variable version yields good enough
performance on 32-bit architectures that we can have only one
implementation instead of multiple ones.  It would mean less
maintenance effort and a simpler code base.
  
Joseph Myers Nov. 19, 2020, 9:43 p.m. UTC | #2
On Thu, 19 Nov 2020, Kirill Okhotnikov via Libc-alpha wrote:

> I propose a new algorithm for fmod calculation for the double type on
> wordsize-64 architectures. The algorithm description is in the file.
> I successfully ran the internal tests on x86_64 processors.
> I also did extensive tests and benchmarks with my own test suite on x86_64
> and ARM64. See
> https://github.com/orex/test_fmod
> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is up
> to 20 times faster for
> "extreme" cases, and up to two times faster for regular uses of the
> function.

We'd like benchmarks that justify performance improvements to be added to 
glibc's benchtests, and for the corresponding performance data (before and 
after) to be included in the patch submission.

In the case of large exponent differences there are other possible (more 
complicated) approaches that might improve performance further - but it 
makes sense to get a good version of the approach in this patch first 
before trying for any further improvements.

Repeated squaring to compute the relevant power of 2 to the relevant 
modulus means an exponent difference of n should only take time O(log(n)) 
not O(n) (but then the intermediate products, to be reduced to that 
modulus, would be more than 64 bits).  And Montgomery multiplication could 
also help improve performance there.
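
A minimal sketch of that idea, computing 2^n mod m in O(log(n))
multiply-reduce steps (unsigned __int128 is a GCC extension, standing in
here for the wider-than-64-bit intermediate products mentioned above):

#include <stdint.h>

/* (a * b) % m with a 128-bit intermediate product.  */
static uint64_t
mulmod (uint64_t a, uint64_t b, uint64_t m)
{
  return (uint64_t) (((unsigned __int128) a * b) % m);
}

/* 2^n mod m by binary exponentiation: O(log(n)) squarings.  */
static uint64_t
pow2_mod (uint32_t n, uint64_t m)
{
  uint64_t result = 1 % m, base = 2 % m;
  while (n > 0)
    {
      if (n & 1)
        result = mulmod (result, base, m);
      base = mulmod (base, base, m);
      n >>= 1;
    }
  return result;
}

With hx, hy and an exponent difference d as in the patch, the reduction
(hx << d) % hy would then become mulmod (hx % hy, pow2_mod (d, hy), hy).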

> + * Performance:
> + *   Various tests on AArch64 and x86_64 show that this algorithm is
> + *   faster than the previous one by a factor of 1.5-10 depending on
> + *   input.

I think performance information belongs in the proposed commit message, 
not in comments in the code which should describe the code as it is in a 
self-contained way rather than referring to a previous version.

Note that source lines should not generally be longer than 80 columns.

> +#define CLZ(a) __builtin_clzl(a)
> +#define CTZ(a) __builtin_ctzl(a)

long can be 32-bit in dbl-64/wordsize-64, in the x32 or MIPS n32 case; you 
can't rely on functions for long being suitable for uint64_t, so those for 
long long should be used instead.
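
So presumably the fix would just be to switch to the long long variants,
which match uint64_t on all glibc targets:

#define CLZ(a) __builtin_clzll (a)
#define CTZ(a) __builtin_ctzll (a)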

> +  sx.value *= one;  /* Optimized out? May be math_check_force_underflow? */

math_check_force_underflow would be incorrect here.  We don't try to 
ensure exact underflow exceptions (the case where IEEE signals 
FE_UNDERFLOW without FE_INEXACT and default exception handling does not 
raise the corresponding flag).  fmod always produces exact results 
("invalid" is the only exception it can raise the flag for, for x infinite 
or y zero if the other is not a NaN, or if either argument is a signaling 
NaN).
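
For illustration, a small self-contained check of that behaviour
(standard C99 <fenv.h>; strictly speaking this also wants
#pragma STDC FENV_ACCESS ON):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int
main (void)
{
  feclearexcept (FE_ALL_EXCEPT);
  double r = fmod (INFINITY, 2.0);  /* x infinite: "invalid", result NaN */
  printf ("fmod(inf, 2) = %f invalid=%d inexact=%d\n", r,
          fetestexcept (FE_INVALID) != 0, fetestexcept (FE_INEXACT) != 0);

  feclearexcept (FE_ALL_EXCEPT);
  r = fmod (5.5, 2.0);              /* finite case: exact, no flags */
  printf ("fmod(5.5, 2) = %f invalid=%d inexact=%d\n", r,
          fetestexcept (FE_INVALID) != 0, fetestexcept (FE_INEXACT) != 0);
  return 0;
}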
  
Kirill Okhotnikov Nov. 19, 2020, 10:31 p.m. UTC | #3
On 11/19/20 3:54 PM, Adhemerval Zanella wrote:
>
> On 19/11/2020 10:28, Kirill Okhotnikov wrote:
>> I propose a new algorithm for fmod calculation for the double type on
>> wordsize-64 architectures. The algorithm description is in the file.
>> I successfully ran the internal tests on x86_64 processors.
>> I also did extensive tests and benchmarks with my own test suite on x86_64 and ARM64. See
>> https://github.com/orex/test_fmod
>> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is up to 20 times faster for
>> "extreme" cases, and up to two times faster for regular uses of the function.
>> I also did unit testing which shows that the old and new algorithms give binary-equivalent
>> results for each of billions of different pairs (x, y) over a wide range of numbers,
>> including normal, subnormal, and special values (NaN, INF, 0).
>>
>> Reply to adhemerval.zanella@linaro.org from libc-help thread.
>>
>>> I won't comment on your implementation until it is submitted to
>>> libc-alpha with proper Copyright assignment, but some remarks:
>> FSF status: In progress.
> Thanks (I think other glibc developers will only check the patch itself
> once the papers are sorted out).
>
>>>    1. If you are benchmarking on i686, it won't use the generic
>>>      sysdeps/ieee754/dbl-64/e_fmod.c but rather an asm-specific
>>>       implementation (sysdeps/i386/fpu/e_fmod.S).  It would be good
>>>       if you check if we can remove the i386 assembly implementation
>>>       in favor of the generic one.
>> You can remove it, of course, in favour of the generic implementation, but I assume you will get a strong performance degradation. It is worth checking. I don't have an i686 machine, so I don't know how relevant tests on my CPU would be.
> You can check i686 performance on an x86_64 machine; the idea is to see
> what the resulting performance is, and depending on the result we might
> remove the assembly implementation.

I've checked the i686 approach. It is a simple FPU call, which is
quite efficient. I did extra checks on an i5 machine comparing my
function with the internal one (64-bit version) and the FPU one. The
approximate ratios (FPU performance / my function's performance) are
(test details on github):

Main test: ~2.5x (FPU 2.5 times better).
Wrap 2pi: ~0.75x (proposed 25% better).
Wrap 1.0: 1.75x
Extreme cases: 0.55x

Although the proposed function is better in (from my point of view) the
most frequent cases (wrapping pi), it is still debatable which function
should be used for x86_64: the FPU one or the proposed one.

>
>>>    2. __builtin_clz might be a libcall on some targets, which might
>>>       be worse than using the loop.  It is usually not an issue on
>>>       some implementations (it is used in some float128 code which is
>>>       usually a soft implementation anyway), but it is something to
>>>       keep in mind.
>> I checked the code of libm. The builtin is already used (except for ldbl-128) in
>>
>> sysdeps/ieee754/flt-32/s_logbf.c:      rix -= __builtin_clz (ix) - 9;
>> sysdeps/ieee754/dbl-64/s_logb.c:      int m = __builtin_clzll (ix);
>>
>> The second thing is that the loop is not efficient anyway. There is a more efficient table-based approach (for CLZ and CTZ):
>>
>> https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/
>>
>> Perhaps it is already implemented in GCC?
> It really depends on the architecture; stdlib/longlong.h has both
> count_leading_zeros/count_trailing_zeros, which either call
> __builtin_clzll/__builtin_ctzll or use an optimized version without
> a libgcc libcall.
>
> libgcc also has multiple optimized versions depending on the
> architecture, and the generic one uses a lookup table AFAIK.
>
>>>    3. It would be good if we could get rid of the wordsize-64
>>>       implementation and just have a generic one good enough regardless
>>>       of word size.
>> I can't imagine such an algorithm. The size of a double is 64 bits; splitting it over two 32-bit variables will of course have a performance impact. The algorithm I propose (see the description) works strictly with 64-bit variables.
> The question is whether the 64-bit variable version yields good enough
> performance on 32-bit architectures that we can have only one
> implementation instead of multiple ones.  It would mean less
> maintenance effort and a simpler code base.

I compiled and ran a 32-bit program on my computer. I used Docker; it
should not change performance significantly. I got a performance
degradation of about 1.5 times on average compared to the internal
function.

Best,
Kirill.
  
Kirill Okhotnikov Nov. 20, 2020, 12:25 a.m. UTC | #4
On 11/19/20 10:43 PM, Joseph Myers wrote:
> On Thu, 19 Nov 2020, Kirill Okhotnikov via Libc-alpha wrote:
>
>> I propose a new algorithm for fmod calculation for the double type on
>> wordsize-64 architectures. The algorithm description is in the file.
>> I successfully ran the internal tests on x86_64 processors.
>> I also did extensive tests and benchmarks with my own test suite on x86_64
>> and ARM64. See
>> https://github.com/orex/test_fmod
>> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is up
>> to 20 times faster for
>> "extreme" cases, and up to two times faster for regular uses of the
>> function.
> We'd like benchmarks that justify performance improvements to be added to
> glibc's benchtests, and for the corresponding performance data (before and
> after) to be included in the patch submission.

> In the case of large exponent differences there are other possible (more
> complicated) approaches that might improve performance further - but it
> makes sense to get a good version of the approach in this patch first
> before trying for any further improvements.

> Repeated squaring to compute the relevant power of 2 to the relevant
> modulus means an exponent difference of n should only take time O(log(n))
> not O(n) (but then the intermediate products, to be reduced to that
> modulus, would be more than 64 bits).  And Montgomery multiplication could
> also help improve performance there.

If I understand your idea right, it is indeed possible to get O(log(n)).
You are absolutely right! I had this idea, but left it aside for larger
variables. Do I understand you correctly that you propose using
(a * b) mod n = ((a mod n) * (b mod n)) mod n?  With x = a * b = a * 2^i,
the factor 2^i can then be reduced in logarithmic time:
2^i mod n = ((2^(i/2) mod n) * (2^(i/2) mod n) * ((i & 1) ? 2 : 1)) mod n.

The reason I did not implement it is that it requires either Montgomery
multiplication or, as you said, int128, i.e. the MULQ (x86) or MULH/MULL
(ARM64) assembler instructions. In any case it requires several
additional multiplications, which can decrease performance significantly
for intermediate cases where |x/y| < 1e20, so such an approach would
need one more branch of code. The absolute maximum number of iterations
for double is 187 in the current version. Under the same conditions the
algorithm described above would need recursion. Of course it can be
flattened, but that needs extra space, and it takes about 6 iterations
to reach 2^63, each of which can be handled by a simple division and 2
or 3 Montgomery multiplications. Therefore I assume such an iteration
can be at least 10 times more complex than the original one, so the
difference would be only about a factor of 3 even for such an extreme
case. Moreover, when the number of significant bits in y is less than
53, the current algorithm decreases the number of steps, so for
subnormals the performance increases.


>
>> + * Performance:
>> + *   Various tests on AArch64 and x86_64 show that this algorithm is
>> + *   faster than the previous one by a factor of 1.5-10 depending on
>> + *   input.
> I think performance information belongs in the proposed commit message,
> not in comments in the code which should describe the code as it is in a
> self-contained way rather than referring to a previous version.
Sorry. I did not get the idea.
>
> Note that source lines should not generally be longer than 80 columns.
>
>> +#define CLZ(a) __builtin_clzl(a)
>> +#define CTZ(a) __builtin_ctzl(a)
Of course!
> long can be 32-bit in dbl-64/wordsize-64, in the x32 or MIPS n32 case; you
> can't rely on functions for long being suitable for uint64_t, so those for
> long long should be used instead.
>
>> +  sx.value *= one;  /* Optimized out? May be math_check_force_underflow? */
> math_check_force_underflow would be incorrect here.  We don't try to
> ensure exact underflow exceptions (the case where IEEE signals
> FE_UNDERFLOW without FE_INEXACT and default exception handling does not
> raise the corresponding flag).  fmod always produces exact results
> ("invalid" is the only exception it can raise the flag for, for x infinite
> or y zero if the other is not a NaN, or if either argument is a signaling
> NaN).
>
I just took it from the previous code. It was also questionable to me.
Should I fully remove it?
  
Joseph Myers Nov. 20, 2020, 1:02 a.m. UTC | #5
On Fri, 20 Nov 2020, Kirill Okhotnikov via Libc-alpha wrote:

> > > + * Performance:
> > > + *   Various tests on AArch64 and x86_64 show that this algorithm is
> > > + *   faster than the previous one by a factor of 1.5-10 depending on
> > > + *   input.
> > I think performance information belongs in the proposed commit message,
> > not in comments in the code which should describe the code as it is in a
> > self-contained way rather than referring to a previous version.
> Sorry. I did not get the idea.

I don't think comments in the code should refer to the previous algorithm; 
I think they should only relate to the version of the code containing 
those comments.

Comparisons of performance of the two versions are fully appropriate in 
the commit message, however.

> > > +  sx.value *= one;  /* Optimized out? May be math_check_force_underflow?
> > > */
> > math_check_force_underflow would be incorrect here.  We don't try to
> > ensure exact underflow exceptions (the case where IEEE signals
> > FE_UNDERFLOW without FE_INEXACT and default exception handling does not
> > raise the corresponding flag).  fmod always produces exact results
> > ("invalid" is the only exception it can raise the flag for, for x infinite
> > or y zero if the other is not a NaN, or if either argument is a signaling
> > NaN).
> > 
> I just took it from the previous code. It was also questionable to me.
> Should I fully remove it?

Yes, I think remove it.
  

Patch

From e88e7fc7b24eeaeabbe9b4649aa39d2ee9977a09 Mon Sep 17 00:00:00 2001
From: kirill <kirill.okhotnikov@gmail.com>
Date: Thu, 19 Nov 2020 13:29:41 +0100
Subject: [PATCH] libm,ieee754: New algorithm of fmod function for
 dbl-64/wordsize-64
To: libc-alpha@sourceware.org
Cc: adhemerval.zanella@linaro.org

I propose a new algorithm for fmod calculation for the double type on
wordsize-64 architectures. The algorithm description is in the file.
I successfully ran the internal tests on x86_64 processors.
I also did extensive tests and benchmarks with my own test suite on x86_64 and ARM64. See
https://github.com/orex/test_fmod
My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is up to 20 times faster for
"extreme" cases, and up to two times faster for regular uses of the function.
I also did unit testing which shows that the old and new algorithms give binary-equivalent
results for each of billions of different pairs (x, y) over a wide range of numbers,
including normal, subnormal, and special values (NaN, INF, 0).
---
 sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c | 282 ++++++++++++++------
 1 file changed, 194 insertions(+), 88 deletions(-)

diff --git a/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c b/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
index 52a8687448..b3c7c4a28c 100644
--- a/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
+++ b/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
@@ -1,19 +1,61 @@ 
-/* Rewritten for 64-bit machines by Ulrich Drepper <drepper@gmail.com>.  */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
+/* Author: Kirill Okhotnikov <kirill.okhotnikov@gmail.com>
+ *
+ * Background:
+ *   Like the previous algorithm written by Sun Microsystems, this code
+ *   splits the double values into an exponent part (ix, iy) and a
+ *   mantissa part (hx, hy).  The variable names are kept the same as in
+ *   the previous code to simplify understanding.  The checking of special
+ *   cases (NaN, INF etc.) is copied from Ulrich Drepper's previous code;
+ *   the rest uses a different algorithm for the fmod calculation.
+ *   An input x or y is represented (mathematically) as h * 2^i.  This
+ *   representation is ambiguous, e.g. h * 2^i = (2 * h) * 2^(i-1).  The
+ *   previous algorithm removed the ambiguity by "aligning" the bits in
+ *   hx and hy; for this code the ambiguity is essential.
+ *
+ * Mathematics:
+ *   The algorithm uses the fact that r = a % b = (a % (N * b)) % b for
+ *   positive a and b and any positive integer N.  With the
+ *   representation above, a = hx * 2^ix, b = hy * 2^iy and N = 2^k:
+ *   r(k) = a % b = (hx * 2^ix) % (2^k * hy * 2^iy)
+ *        = 2^(iy + k) * (hx * 2^(ix - iy - k) % hy)
+ *   For k = ix - iy this gives r = (hx % hy) * 2^(iy + k) = hr * 2^ir,
+ *   with hr = 2^p * (hx % hy) and ir = iy + k - p.
+ *
+ * Algorithm description:
+ *   The examples below use a byte (not uint64_t) for simplicity.
+ *   1) Shift hy right as far as possible without losing bits, increasing
+ *      iy: hy = 0b00101100, iy = 20 becomes hy = 0b00001011, iy = 22.
+ *   2) hx = hx % hy.
+ *   3) Shift hx left as far as possible (after (2), CLZ of hx >= CLZ of
+ *      hy): hx = 0b00001001, ix = 100 becomes hx = 0b10010000, ix = 96.
+ *   4) Repeat (2) and (3) until ix == iy.
+ *
+ * Complexity analysis:
+ *   Converting x, y to (hx, ix), (hy, iy): CTZ/shift/AND/OR/if.
+ *   Loop count: (ix - iy) / (64 - "length of hy"), with max("length of
+ *   hy") = 53 and max(ix - iy) = 2048, giving at most 186 iterations,
+ *   and that only for rare "unrealistic" cases.  The previous algorithm
+ *   had ix - iy complexity: 11 times more iterations in the worst case.
+ *   Converting hx, ix back to double: CLZ/shift/AND/OR/if.  The previous
+ *   algorithm used loops for CLZ when converting from/to double.
+ *
+ * Special cases:
+ *   Assuming that the case where |y| > 1e-292 and |x/y| < 2000 is very
+ *   common, special processing is implemented: no hy alignment, no loop,
+ *   result = (hx * 2^(ix - iy)) % hy.  When x and y are both subnormal
+ *   (a rare case) the result is hx % hy, with a simplified conversion
+ *   back to double.
+ *
+ * Performance:
+ *   Various tests on AArch64 and x86_64 show that this algorithm is
+ *   faster than the previous one by a factor of 1.5-10 depending on
+ *   input.
+ */
 
 /*
  * __ieee754_fmod(x,y)
  * Return x mod y in exact arithmetic
- * Method: shift and subtract
+ * Method: divide and shift
  */
 
 #include <math.h>
@@ -21,86 +63,150 @@ 
 #include <stdint.h>
 #include <libm-alias-finite.h>
 
-static const double one = 1.0, Zero[] = {0.0, -0.0,};
+static const double one = 1.0;
+
+#define SIGN_POS_MASK UINT64_C(0x8000000000000000)
+#define SIGN_NEG_MASK UINT64_C(0x7fffffffffffffff)
+#define DOUBLE_INF_NAN UINT64_C(0x7ff0000000000000)
+#define LAST_EXP_BIT UINT64_C(0x0010000000000000)
+#define SET_NORMAL_VALUE(a) a = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&(a))
+#define CLZ(a) __builtin_clzl(a)
+#define CTZ(a) __builtin_ctzl(a)
+
+/* Convert back to double */
+inline
+double make_double (ieee_double_shape_type sx, uint64_t num, int32_t ep)
+{
+  int32_t lz;
+  lz = CLZ(num) - 11;
+  num <<= lz;
+  ep -= lz;
+
+  if (__glibc_likely(ep >= 0))
+    {
+      num = ((num - LAST_EXP_BIT) | ((uint64_t) (ep + 1) << 52));
+    }
+  else
+    {
+      num >>= -ep;
+    }
+  sx.word += num;
+
+  sx.value *= one;  /* Optimized out? May be math_check_force_underflow? */
+  return sx.value;
+}
 
 double
 __ieee754_fmod (double x, double y)
 {
-	int32_t n,ix,iy;
-	int64_t hx,hy,hz,sx,i;
-
-	EXTRACT_WORDS64(hx,x);
-	EXTRACT_WORDS64(hy,y);
-	sx = hx&UINT64_C(0x8000000000000000);	/* sign of x */
-	hx ^=sx;				/* |x| */
-	hy &= UINT64_C(0x7fffffffffffffff);	/* |y| */
-
-    /* purge off exception values */
-	if(__builtin_expect(hy==0
-			    || hx >= UINT64_C(0x7ff0000000000000)
-			    || hy > UINT64_C(0x7ff0000000000000), 0))
-	  /* y=0,or x not finite or y is NaN */
-	    return (x*y)/(x*y);
-	if(__builtin_expect(hx<=hy, 0)) {
-	    if(hx<hy) return x;	/* |x|<|y| return x */
-	    return Zero[(uint64_t)sx>>63];	/* |x|=|y| return x*0*/
-	}
-
-    /* determine ix = ilogb(x) */
-	if(__builtin_expect(hx<UINT64_C(0x0010000000000000), 0)) {
-	  /* subnormal x */
-	  for (ix = -1022,i=(hx<<11); i>0; i<<=1) ix -=1;
-	} else ix = (hx>>52)-1023;
-
-    /* determine iy = ilogb(y) */
-	if(__builtin_expect(hy<UINT64_C(0x0010000000000000), 0)) {	/* subnormal y */
-	  for (iy = -1022,i=(hy<<11); i>0; i<<=1) iy -=1;
-	} else iy = (hy>>52)-1023;
-
-    /* set up hx, hy and align y to x */
-	if(__builtin_expect(ix >= -1022, 1))
-	    hx = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hx);
-	else {		/* subnormal x, shift x to normal */
-	    n = -1022-ix;
-	    hx<<=n;
-	}
-	if(__builtin_expect(iy >= -1022, 1))
-	    hy = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hy);
-	else {		/* subnormal y, shift y to normal */
-	    n = -1022-iy;
-	    hy<<=n;
-	}
-
-    /* fix point fmod */
-	n = ix - iy;
-	while(n--) {
-	    hz=hx-hy;
-	    if(hz<0){hx = hx+hx;}
-	    else {
-		if(hz==0)		/* return sign(x)*0 */
-		    return Zero[(uint64_t)sx>>63];
-		hx = hz+hz;
-	    }
-	}
-	hz=hx-hy;
-	if(hz>=0) {hx=hz;}
-
-    /* convert back to floating value and restore the sign */
-	if(hx==0)			/* return sign(x)*0 */
-	    return Zero[(uint64_t)sx>>63];
-	while(hx<UINT64_C(0x0010000000000000)) {	/* normalize x */
-	    hx = hx+hx;
-	    iy -= 1;
-	}
-	if(__builtin_expect(iy>= -1022, 1)) {	/* normalize output */
-	  hx = ((hx-UINT64_C(0x0010000000000000))|((uint64_t)(iy+1023)<<52));
-	    INSERT_WORDS64(x,hx|sx);
-	} else {		/* subnormal output */
-	    n = -1022 - iy;
-	    hx>>=n;
-	    INSERT_WORDS64(x,hx|sx);
-	    x *= one;		/* create necessary signal */
-	}
-	return x;		/* exact output */
+  int32_t n, ix, iy, lzhy, tzhy, hyltzeroes;
+  uint64_t hx, hy, d;
+  ieee_double_shape_type sx;
+
+  EXTRACT_WORDS64 (hx, x);
+  EXTRACT_WORDS64 (hy, y);
+  sx.word = hx & SIGN_POS_MASK;    /* sign of x */
+  hx ^= sx.word;            /* |x| */
+  hy &= SIGN_NEG_MASK;    /* |y| */
+
+  /* purge off exception values */
+  if (__glibc_unlikely (hy == 0
+                        || hx >= DOUBLE_INF_NAN
+                        || hy > DOUBLE_INF_NAN))
+    /* y=0,or x not finite or y is NaN */
+    return (x * y) / (x * y);
+  if (__glibc_likely(hx <= hy))
+    {
+      if (hx < hy) return x;    /* |x|<|y| return x */
+      return sx.value;    /* |x|=|y| return x*0*/
+    }
+
+  ix = hx >> 52;
+  iy = hy >> 52;
+
+  /* Most common case where |y| > 1e-292 and |x/y|<2000 */
+  if (__glibc_likely(ix >= 53 && ix - iy <= 11))
+    {
+      SET_NORMAL_VALUE(hx);
+      SET_NORMAL_VALUE(hy);
+      d = (ix == iy) ? (hx - hy) : (hx << (ix - iy)) % hy;
+      if (d == 0)
+        return sx.value;
+      return make_double (sx, d, iy - 1); /* iy - 1 because of "zero power" for number with power 1 */
+    }
+
+  /* Both subnormal special case. */
+  if (__glibc_unlikely (ix == 0 && iy == 0))
+    {
+      uint64_t d = hx % hy;
+      sx.word += d;
+      sx.value *= one;  /* Optimized out? need for math signals */
+      return sx.value;
+    }
+
+  /* Assume that hx is not subnormal by conditions above. */
+  SET_NORMAL_VALUE(hx);
+  ix--;
+  if (__builtin_expect (iy > 0, 1))
+    {
+      SET_NORMAL_VALUE(hy);
+      lzhy = 11;
+      iy--;
+    }
+  else
+    {
+      lzhy = CLZ(hy);
+    }
+
+  /* Assume hy != 0 */
+  tzhy = CTZ(hy);
+  hyltzeroes = lzhy + tzhy;
+  n = ix - iy;
+
+  /* Shift hy right until the end or n = 0 */
+  if (n > tzhy)
+    {
+      hy >>= tzhy;
+      n -= tzhy;
+      iy += tzhy;
+    }
+  else
+    {
+      hy >>= n;
+      iy += n;
+      n = 0;
+    }
+
+  /* Shift hx left until the end or n = 0 */
+  if (n > 11)
+    {
+      hx <<= 11;
+      n -= 11;
+    }
+  else
+    {
+      hx <<= n;
+      n = 0;
+    }
+  hx %= hy;
+  if (hx == 0)
+    return sx.value;
+
+  if (n == 0)
+    return make_double (sx, hx, iy);
+
+  /* hx cannot become 0: hx < hy and hy % 2 == 1 imply hx * 2^i % hy != 0 */
+
+  while (n > hyltzeroes)
+    {
+      n -= hyltzeroes;
+      hx <<= hyltzeroes;
+      hx %= hy;
+    }
+
+  hx <<= n;
+  hx %= hy;
+
+  return make_double (sx, hx, iy);
 }
 libm_alias_finite (__ieee754_fmod, __fmod)
-- 
2.25.1