Message ID: 0c028906-f5a6-deed-a0ff-f741aee9c5e6@gmail.com
State: Superseded
Series: libm,ieee754: New algorithm of fmod function for dbl-64/wordsize-64

On 19/11/2020 10:28, Kirill Okhotnikov wrote:
> I proposed a new algorithm for fmod calculations for double type and
> wordsize-64 architectures. Algorithm description is in the file.
> I successfully ran internal tests on x86_64 processors.
> Also I did some extensive tests and benchmarks with my own test suite on
> x86_64 and ARM64. See
> https://github.com/orex/test_fmod
> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is
> up to 20 times faster for "extreme" cases, and up to two times faster for
> regular cases of using the function.
> Also, I did some unit testing which shows that the old and new algorithms
> give binary-equivalent results for each of billions of different pairs
> (x, y) over a wide range of numbers, including normal, subnormal, and
> special ones (NaN, INF, 0).
>
> Reply to adhemerval.zanella@linaro.org from the libc-help thread.
>
>> I won't comment on your implementation until it is submitted to
>> libc-alpha with proper Copyright assignment, but some remarks:
> FSF status: In progress.

Thanks (I think other glibc developers will only check the patch itself
once the papers are sorted out).

>> 1. If you are benchmarking on i686, it won't use the generic
>> sysdeps/ieee754/dbl-64/e_fmod.c but rather an asm-specific
>> implementation (sysdeps/i386/fpu/e_fmod.S). It would be good
>> if you check whether we can remove the i386 assembly implementation
>> in favor of the generic one.
> You can remove it, of course, in favour of the generic implementation,
> but I assume you will get a strong performance degradation. It is worth
> checking. I don't have an i686 machine, and I don't know how relevant
> tests on my CPU would be.

You can check the i686 performance on an x86_64 machine; the idea is to
see the resulting performance, and depending on the result we might
remove the assembly implementation.

>> 2. __builtin_clz might be a libcall on some targets, which might
>> be worse than using the loop. It is usually not an issue on
>> some implementations (it is used on some float128 code which is
>> usually a soft implementation anyway), but it is something to
>> keep in mind.
> I checked the code of libm. The function is already used (except
> ldbl-128) in
>
> sysdeps/ieee754/flt-32/s_logbf.c: rix -= __builtin_clz (ix) - 9;
> sysdeps/ieee754/dbl-64/s_logb.c: int m = __builtin_clzll (ix);
>
> The second thing is that the loop is not efficient anyway. There is a
> more efficient approach with a lookup table (for CLZ and CTZ):
>
> https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/
>
> Probably it is already implemented in GCC?

It really depends on the architecture; stdlib/longlong.h has both
count_leading_zeros/count_trailing_zeros, which should either call
__builtin_clzll/__builtin_ctzll or an optimized version without a libgcc
libcall.

The libgcc also has multiple optimized versions depending on the
architecture, and the generic one does use a lookup table AFAIK.

>> 3. It would be good if we could get rid of the wordsize-64
>> implementation and just have a generic one good enough regardless
>> of word size.
> I can't imagine such an algorithm. The size of double is 64 bits;
> splitting it over two 32-bit variables will, of course, have a
> performance impact. The algorithm I propose (see the description) works
> strictly with 64-bit variables.

The question is whether using the 64-bit variable version yields good
enough performance on 32-bit architectures so we can have only one
implementation instead of multiple ones. It would incur less maintenance
effort and a simpler code base.

On Thu, 19 Nov 2020, Kirill Okhotnikov via Libc-alpha wrote:

> I proposed a new algorithm for fmod calculations for double type and
> wordsize-64 architectures. Algorithm description is in the file.
> I successfully ran internal tests on x86_64 processors.
> Also I did some extensive tests and benchmarks with my own test suite on
> x86_64 and ARM64. See
> https://github.com/orex/test_fmod
> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is
> up to 20 times faster for "extreme" cases, and up to two times faster for
> regular cases of using the function.

We'd like benchmarks that justify performance improvements to be added to
glibc's benchtests, and the corresponding performance data (before and
after) to be included in the patch submission.

In the case of large exponent differences there are other possible (more
complicated) approaches that might improve performance further - but it
makes sense to get a good version of the approach in this patch first
before trying for any further improvements. Repeated squaring to compute
the relevant power of 2 to the relevant modulus means an exponent
difference of n should only take time O(log(n)), not O(n) (but then the
intermediate products, to be reduced to that modulus, would be more than
64 bits). And Montgomery multiplication could also help improve
performance there.

> + * Performance:
> + * Various tests on AArch64 and x86_64 shows that the current algorithm
> + * is faster than the previous one by factor 1.5-10 depend on input.

I think performance information belongs in the proposed commit message,
not in comments in the code, which should describe the code as it is in a
self-contained way rather than referring to a previous version.

Note that source lines should not generally be longer than 80 columns.

> +#define CLZ(a) __builtin_clzl(a)
> +#define CTZ(a) __builtin_ctzl(a)

long can be 32-bit in dbl-64/wordsize-64, in the x32 or MIPS n32 case; you
can't rely on functions for long being suitable for uint64_t, so those for
long long should be used instead.

> + sx.value *= one; /* Optimized out? May be math_check_force_underflow? */

math_check_force_underflow would be incorrect here. We don't try to
ensure exact underflow exceptions (the case where IEEE signals
FE_UNDERFLOW without FE_INEXACT and default exception handling does not
raise the corresponding flag). fmod always produces exact results
("invalid" is the only exception it can raise the flag for: for x infinite
or y zero if the other is not a NaN, or if either argument is a signaling
NaN).

On 11/19/20 3:54 PM, Adhemerval Zanella wrote:
> On 19/11/2020 10:28, Kirill Okhotnikov wrote:
>> I proposed a new algorithm for fmod calculations for double type and
>> wordsize-64 architectures. Algorithm description is in the file.
>> I successfully ran internal tests on x86_64 processors.
>> Also I did some extensive tests and benchmarks with my own test suite on
>> x86_64 and ARM64. See
>> https://github.com/orex/test_fmod
>> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is
>> up to 20 times faster for "extreme" cases, and up to two times faster
>> for regular cases of using the function.
>> Also, I did some unit testing which shows that the old and new
>> algorithms give binary-equivalent results for each of billions of
>> different pairs (x, y) over a wide range of numbers, including normal,
>> subnormal, and special ones (NaN, INF, 0).
>>
>> Reply to adhemerval.zanella@linaro.org from the libc-help thread.
>>
>>> I won't comment on your implementation until it is submitted to
>>> libc-alpha with proper Copyright assignment, but some remarks:
>> FSF status: In progress.
> Thanks (I think other glibc developers will only check the patch itself
> once the papers are sorted out).
>
>>> 1. If you are benchmarking on i686, it won't use the generic
>>> sysdeps/ieee754/dbl-64/e_fmod.c but rather an asm-specific
>>> implementation (sysdeps/i386/fpu/e_fmod.S). It would be good
>>> if you check whether we can remove the i386 assembly implementation
>>> in favor of the generic one.
>> You can remove it, of course, in favour of the generic implementation,
>> but I assume you will get a strong performance degradation. It is worth
>> checking. I don't have an i686 machine, and I don't know how relevant
>> tests on my CPU would be.
> You can check the i686 performance on an x86_64 machine; the idea is to
> see the resulting performance, and depending on the result we might
> remove the assembly implementation.

I've checked the i686 approach. It is a simple FPU call, which is quite
efficient. I did some extra checks on an i5 machine with my function vs.
the internal one (64-bit version) vs. the FPU one. The approximate rates
(FPU performance / my function's performance; test details on GitHub) are:

  Main test:     ~2.5x  (FPU 2.5 times better)
  Wrap 2pi:      ~0.75x (proposed 25% better)
  Wrap 1.0:       1.75x
  Extreme cases:  0.55x

Although the proposed function is better in (from my point of view) the
most frequent cases (wrap pi), it is still debatable which function should
be used for x86_64: the FPU one or the proposed one.

>>> 2. __builtin_clz might be a libcall on some targets, which might
>>> be worse than using the loop. It is usually not an issue on
>>> some implementations (it is used on some float128 code which is
>>> usually a soft implementation anyway), but it is something to
>>> keep in mind.
>> I checked the code of libm. The function is already used (except
>> ldbl-128) in
>>
>> sysdeps/ieee754/flt-32/s_logbf.c: rix -= __builtin_clz (ix) - 9;
>> sysdeps/ieee754/dbl-64/s_logb.c: int m = __builtin_clzll (ix);
>>
>> The second thing is that the loop is not efficient anyway. There is a
>> more efficient approach with a lookup table (for CLZ and CTZ):
>>
>> https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/
>>
>> Probably it is already implemented in GCC?
> It really depends on the architecture; stdlib/longlong.h has both
> count_leading_zeros/count_trailing_zeros, which should either call
> __builtin_clzll/__builtin_ctzll or an optimized version without a libgcc
> libcall.
>
> The libgcc also has multiple optimized versions depending on the
> architecture, and the generic one does use a lookup table AFAIK.
>
>>> 3. It would be good if we could get rid of the wordsize-64
>>> implementation and just have a generic one good enough regardless
>>> of word size.
>> I can't imagine such an algorithm. The size of double is 64 bits.
>> Splitting it over two 32-bit variables will, of course, have a
>> performance impact. The algorithm I propose (see the description) works
>> strictly with 64-bit variables.
> The question is whether using the 64-bit variable version yields good
> enough performance on 32-bit architectures so we can have only one
> implementation instead of multiple ones. It would incur less maintenance
> effort and a simpler code base.

I compiled and ran a 32-bit program on my computer. I used docker; it
should not change performance significantly. I got a performance
degradation, compared to the internal function, of about 1.5 times on
average.

Best,
Kirill.

On 11/19/20 10:43 PM, Joseph Myers wrote:
> On Thu, 19 Nov 2020, Kirill Okhotnikov via Libc-alpha wrote:
>
>> I proposed a new algorithm for fmod calculations for double type and
>> wordsize-64 architectures. Algorithm description is in the file.
>> I successfully ran internal tests on x86_64 processors.
>> Also I did some extensive tests and benchmarks with my own test suite on
>> x86_64 and ARM64. See
>> https://github.com/orex/test_fmod
>> My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is
>> up to 20 times faster for "extreme" cases, and up to two times faster
>> for regular cases of using the function.
> We'd like benchmarks that justify performance improvements to be added to
> glibc's benchtests, and the corresponding performance data (before and
> after) to be included in the patch submission.
> In the case of large exponent differences there are other possible (more
> complicated) approaches that might improve performance further - but it
> makes sense to get a good version of the approach in this patch first
> before trying for any further improvements.
> Repeated squaring to compute the relevant power of 2 to the relevant
> modulus means an exponent difference of n should only take time O(log(n))
> not O(n) (but then the intermediate products, to be reduced to that
> modulus, would be more than 64 bits). And Montgomery multiplication could
> also help improve performance there.

If I understand your idea right, it is indeed possible to get to
O(log(n)). You are absolutely right! I had this idea, but left it for
larger variables. Do I understand you correctly that you propose
something like (a * b) mod n = ((a mod n) * (b mod n)) mod n? If
x = a * b = a * 2^i, then we can reduce 2^i in logarithmic time, of
course: 2^i mod n = (2^(i/2) mod n) * (2^(i/2) mod n) * ((i & 1) ? 2 : 1).
The reason I did not implement it is that it requires either Montgomery
multiplication or, as you said, int128, or the MULQ (x86) or MULH/MULL
(ARM64) assembler instructions.

Anyhow, it requires several additional multiplications, which can decrease
performance significantly for intermediate cases where |x/y| < 1e20, so
such an approach would need one more branch of code. The absolute maximum
number of iterations for double is 187 in the current version. For the
same conditions, the algorithm described above would need recursion. Of
course it can be flattened, but extra space is needed, and it takes 6
iterations until it reaches 2^63, which can be solved by a simple division
and 2 or 3 Montgomery multiplications each step. Therefore I assume that
such an iteration can be at least 10 times more complex than the initial
one, so the difference becomes only 3 times for such an extreme case.
Moreover, when the number of significant bits in the y variable is less
than 53, the current algorithm decreases the number of steps, so for
subnormals the performance increases.

>> + * Performance:
>> + * Various tests on AArch64 and x86_64 shows that the current algorithm
>> + * is faster than the previous one by factor 1.5-10 depend on input.
> I think performance information belongs in the proposed commit message,
> not in comments in the code which should describe the code as it is in a
> self-contained way rather than referring to a previous version.

Sorry, I did not get the idea.

> Note that source lines should not generally be longer than 80 columns.

>> +#define CLZ(a) __builtin_clzl(a)
>> +#define CTZ(a) __builtin_ctzl(a)

Of course!

> long can be 32-bit in dbl-64/wordsize-64, in the x32 or MIPS n32 case;
> you can't rely on functions for long being suitable for uint64_t, so
> those for long long should be used instead.

>> + sx.value *= one; /* Optimized out? May be math_check_force_underflow? */
> math_check_force_underflow would be incorrect here. We don't try to
> ensure exact underflow exceptions (the case where IEEE signals
> FE_UNDERFLOW without FE_INEXACT and default exception handling does not
> raise the corresponding flag). fmod always produces exact results
> ("invalid" is the only exception it can raise the flag for, for x
> infinite or y zero if the other is not a NaN, or if either argument is a
> signaling NaN).

I just took it from the previous code; it was also questionable to me.
Should I fully remove it?

On Fri, 20 Nov 2020, Kirill Okhotnikov via Libc-alpha wrote:

> > > + * Performance:
> > > + * Various tests on AArch64 and x86_64 shows that the current
> > > + * algorithm is faster than the previous one by factor 1.5-10
> > > + * depend on input.
> > I think performance information belongs in the proposed commit message,
> > not in comments in the code which should describe the code as it is in
> > a self-contained way rather than referring to a previous version.
> Sorry. I did not get the idea.

I don't think comments in the code should refer to the previous algorithm;
I think they should only relate to the version of the code containing
those comments. Comparisons of the performance of the two versions are
fully appropriate in the commit message, however.

> > > + sx.value *= one; /* Optimized out? May be
> > > math_check_force_underflow? */
> > math_check_force_underflow would be incorrect here. We don't try to
> > ensure exact underflow exceptions (the case where IEEE signals
> > FE_UNDERFLOW without FE_INEXACT and default exception handling does not
> > raise the corresponding flag). fmod always produces exact results
> > ("invalid" is the only exception it can raise the flag for, for x
> > infinite or y zero if the other is not a NaN, or if either argument is
> > a signaling NaN).
> I just took it from the previous code; it was also questionable to me.
> Should I fully remove it?

Yes, I think remove it.

From e88e7fc7b24eeaeabbe9b4649aa39d2ee9977a09 Mon Sep 17 00:00:00 2001
From: kirill <kirill.okhotnikov@gmail.com>
Date: Thu, 19 Nov 2020 13:29:41 +0100
Subject: [PATCH] libm,ieee754: New algorithm of fmod function for
 dbl-64/wordsize-64
To: libc-alpha@sourceware.org
Cc: adhemerval.zanella@linaro.org

I proposed a new algorithm for fmod calculations for double type and
wordsize-64 architectures. Algorithm description is in the file.
I successfully ran internal tests on x86_64 processors.
Also I did some extensive tests and benchmarks with my own test suite on
x86_64 and ARM64. See https://github.com/orex/test_fmod
My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is
up to 20 times faster for "extreme" cases and up to two times faster for
regular cases of using the function. Also, I did some unit testing which
shows that the old and new algorithms give binary-equivalent results for
each of billions of different pairs (x, y) over a wide range of numbers,
including normal, subnormal, and special ones (NaN, INF, 0).
---
 sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c | 282 ++++++++++++++------
 1 file changed, 194 insertions(+), 88 deletions(-)

diff --git a/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c b/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
index 52a8687448..b3c7c4a28c 100644
--- a/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
+++ b/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
@@ -1,19 +1,61 @@
-/* Rewritten for 64-bit machines by Ulrich Drepper <drepper@gmail.com>. */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
+/* Author: Kirill Okhotnikov <kirill.okhotnikov@gmail.com>
+ *
+ * Background:
+ * Like the previous algorithm written by Sun Microsystems, the code also
+ * splits the double value into an exponent part (ix, iy) and a mantissa
+ * part (hx, hy).  The variable names are kept common with the previous
+ * code in order to simplify understanding.  The checking of special
+ * cases (NaN, INF etc.) is copied from Ulrich Drepper's
+ * <drepper@gmail.com> previous code.
+ * The rest of the code uses a different algorithm for the fmod
+ * calculation.  An input x, y is represented (mathematically) as
+ * h(x|y) * 2^i(x|y).
+ * This representation is ambiguous; for example
+ * h * 2^i = (2 * h) * 2^(i-1).
+ * The previous algorithm removes the ambiguity by "aligning" bits in hx
+ * and hy.  For this code such ambiguity is essential.
+ *
+ * Mathematics:
+ * The algorithm uses the fact that
+ * r = a % b = (a % (N * b)) % b, where N is a positive integer,
+ * and a and b are positive.
+ * Adopting the representation above:
+ * a = hx * 2^ix, b = hy * 2^iy, N = 2^k
+ * r(k) = a % b = (hx * 2^ix) % (2^k * hy * 2^iy)
+ *      = 2^(iy + k) * (hx * 2^(ix - iy - k) % hy)
+ * r(k) = hr * 2^ir = (hx % hy) * 2^(iy + k)
+ *      = (2^p * (hx % hy)) * 2^(iy + k - p)
+ * hr = 2^p * (hx % hy)
+ * ir = iy + k - p
+ *
+ * Algorithm description:
+ * The examples below use a byte (not uint64_t) for simplicity.
+ * 1) Shift hy maximally to the right without losing bits and increase
+ *    the iy value: hy = 0b00101100, iy = 20 becomes
+ *    hy = 0b00001011, iy = 22.
+ * 2) hx = hx % hy.
+ * 3) Shift hx maximally to the left.  Note that after (hx = hx % hy)
+ *    the CLZ of hx is not lower than the CLZ of hy.
+ *    hx = 0b00001001, ix = 100 becomes hx = 0b10010000,
+ *    ix = 100 - 4 = 96.
+ * 4) Repeat from (2) until ix == iy.
+ *
+ * Complexity analysis:
+ * Converting x, y to (hx, ix), (hy, iy): CTZ/shift/AND/OR/if.
+ * Loop count: (ix - iy) / (64 - "length of hy").
+ * max("length of hy") = 53, max(ix - iy) = 2048.
+ * The maximum iteration count is 186, for rare "unrealistic" cases.
+ * The previous algorithm had ix - iy complexity: 11 times more
+ * iterations in the worst case.
+ *
+ * Converting hx, ix back to double: CLZ/shift/AND/OR/if.
+ * The previous algorithm uses loops for CLZ when converting from/to
+ * double.
+ *
+ * Special cases:
+ * Supposing that the case where |y| > 1e-292 and |x/y| < 2000 is very
+ * common, special processing is implemented:
+ * no hy alignment, no loop: result = (hx * 2^(ix - iy)) % hy.
+ * When x and y are both subnormal (a rare case, but...) the result is
+ * hx % hy, with a simplified conversion back to double.
+ *
+ * Performance:
+ * Various tests on AArch64 and x86_64 show that the current algorithm
+ * is faster than the previous one by a factor of 1.5-10 depending on
+ * input.
+ */
 /*
  * __ieee754_fmod(x,y)
  * Return x mod y in exact arithmetic
- * Method: shift and subtract
+ * Method: divide and shift
  */
 
 #include <math.h>
@@ -21,86 +63,150 @@
 #include <stdint.h>
 #include <libm-alias-finite.h>
 
-static const double one = 1.0, Zero[] = {0.0, -0.0,};
+static const double one = 1.0;
+
+#define SIGN_POS_MASK UINT64_C(0x8000000000000000)
+#define SIGN_NEG_MASK UINT64_C(0x7fffffffffffffff)
+#define DOUBLE_INF_NAN UINT64_C(0x7ff0000000000000)
+#define LAST_EXP_BIT UINT64_C(0x0010000000000000)
+#define SET_NORMAL_VALUE(a) a = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&(a))
+#define CLZ(a) __builtin_clzl(a)
+#define CTZ(a) __builtin_ctzl(a)
+
+/* Convert back to double */
+inline
+double make_double (ieee_double_shape_type sx, uint64_t num, int32_t ep)
+{
+  int32_t lz;
+  lz = CLZ (num) - 11;
+  num <<= lz;
+  ep -= lz;
+
+  if (__glibc_likely (ep >= 0))
+    {
+      num = ((num - LAST_EXP_BIT) | ((uint64_t) (ep + 1) << 52));
+    }
+  else
+    {
+      num >>= -ep;
+    }
+  sx.word += num;
+
+  sx.value *= one; /* Optimized out? May be math_check_force_underflow? */
+  return sx.value;
+}
 
 double
 __ieee754_fmod (double x, double y)
 {
-  int32_t n,ix,iy;
-  int64_t hx,hy,hz,sx,i;
-
-  EXTRACT_WORDS64(hx,x);
-  EXTRACT_WORDS64(hy,y);
-  sx = hx&UINT64_C(0x8000000000000000);	/* sign of x */
-  hx ^=sx;				/* |x| */
-  hy &= UINT64_C(0x7fffffffffffffff);	/* |y| */
-
-  /* purge off exception values */
-  if(__builtin_expect(hy==0
-		      || hx >= UINT64_C(0x7ff0000000000000)
-		      || hy > UINT64_C(0x7ff0000000000000), 0))
-    /* y=0,or x not finite or y is NaN */
-    return (x*y)/(x*y);
-  if(__builtin_expect(hx<=hy, 0)) {
-    if(hx<hy) return x;			/* |x|<|y| return x */
-    return Zero[(uint64_t)sx>>63];	/* |x|=|y| return x*0*/
-  }
-
-  /* determine ix = ilogb(x) */
-  if(__builtin_expect(hx<UINT64_C(0x0010000000000000), 0)) {
-    /* subnormal x */
-    for (ix = -1022,i=(hx<<11); i>0; i<<=1) ix -=1;
-  } else ix = (hx>>52)-1023;
-
-  /* determine iy = ilogb(y) */
-  if(__builtin_expect(hy<UINT64_C(0x0010000000000000), 0)) {	/* subnormal y */
-    for (iy = -1022,i=(hy<<11); i>0; i<<=1) iy -=1;
-  } else iy = (hy>>52)-1023;
-
-  /* set up hx, hy and align y to x */
-  if(__builtin_expect(ix >= -1022, 1))
-    hx = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hx);
-  else {		/* subnormal x, shift x to normal */
-    n = -1022-ix;
-    hx<<=n;
-  }
-  if(__builtin_expect(iy >= -1022, 1))
-    hy = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hy);
-  else {		/* subnormal y, shift y to normal */
-    n = -1022-iy;
-    hy<<=n;
-  }
-
-  /* fix point fmod */
-  n = ix - iy;
-  while(n--) {
-    hz=hx-hy;
-    if(hz<0){hx = hx+hx;}
-    else {
-      if(hz==0)		/* return sign(x)*0 */
-	return Zero[(uint64_t)sx>>63];
-      hx = hz+hz;
-    }
-  }
-  hz=hx-hy;
-  if(hz>=0) {hx=hz;}
-
-  /* convert back to floating value and restore the sign */
-  if(hx==0)		/* return sign(x)*0 */
-    return Zero[(uint64_t)sx>>63];
-  while(hx<UINT64_C(0x0010000000000000)) {	/* normalize x */
-    hx = hx+hx;
-    iy -= 1;
-  }
-  if(__builtin_expect(iy>= -1022, 1)) {	/* normalize output */
-    hx = ((hx-UINT64_C(0x0010000000000000))|((uint64_t)(iy+1023)<<52));
-    INSERT_WORDS64(x,hx|sx);
-  } else {		/* subnormal output */
-    n = -1022 - iy;
-    hx>>=n;
-    INSERT_WORDS64(x,hx|sx);
-    x *= one;		/* create necessary signal */
-  }
-  return x;		/* exact output */
+  int32_t n, ix, iy, lzhy, tzhy, hyltzeroes;
+  uint64_t hx, hy, d;
+  ieee_double_shape_type sx;
+
+  EXTRACT_WORDS64 (hx, x);
+  EXTRACT_WORDS64 (hy, y);
+  sx.word = hx & SIGN_POS_MASK;	/* sign of x */
+  hx ^= sx.word;		/* |x| */
+  hy &= SIGN_NEG_MASK;		/* |y| */
+
+  /* purge off exception values */
+  if (__glibc_unlikely (hy == 0
+			|| hx >= DOUBLE_INF_NAN
+			|| hy > DOUBLE_INF_NAN))
+    /* y=0,or x not finite or y is NaN */
+    return (x * y) / (x * y);
+  if (__glibc_likely (hx <= hy))
+    {
+      if (hx < hy) return x;	/* |x|<|y| return x */
+      return sx.value;		/* |x|=|y| return x*0 */
+    }
+
+  ix = hx >> 52;
+  iy = hy >> 52;
+
+  /* Most common case where |y| > 1e-292 and |x/y| < 2000 */
+  if (__glibc_likely (ix >= 53 && ix - iy <= 11))
+    {
+      SET_NORMAL_VALUE (hx);
+      SET_NORMAL_VALUE (hy);
+      d = (ix == iy) ? (hx - hy) : (hx << (ix - iy)) % hy;
+      if (d == 0)
+	return sx.value;
+      /* iy - 1 because of "zero power" for number with power 1 */
+      return make_double (sx, d, iy - 1);
+    }
+
+  /* Both subnormal special case. */
+  if (__glibc_unlikely (ix == 0 && iy == 0))
+    {
+      uint64_t d = hx % hy;
+      sx.word += d;
+      sx.value *= one; /* Optimized out? need for math signals */
+      return sx.value;
+    }
+
+  /* Assume that hx is not subnormal by conditions above. */
+  SET_NORMAL_VALUE (hx);
+  ix--;
+  if (__builtin_expect (iy > 0, 1))
+    {
+      SET_NORMAL_VALUE (hy);
+      lzhy = 11;
+      iy--;
+    }
+  else
+    {
+      lzhy = CLZ (hy);
+    }
+
+  /* Assume hy != 0 */
+  tzhy = CTZ (hy);
+  hyltzeroes = lzhy + tzhy;
+  n = ix - iy;
+
+  /* Shift hy right until the end or n = 0 */
+  if (n > tzhy)
+    {
+      hy >>= tzhy;
+      n -= tzhy;
+      iy += tzhy;
+    }
+  else
+    {
+      hy >>= n;
+      iy += n;
+      n = 0;
+    }
+
+  /* Shift hx left until the end or n = 0 */
+  if (n > 11)
+    {
+      hx <<= 11;
+      n -= 11;
+    }
+  else
+    {
+      hx <<= n;
+      n = 0;
+    }
+  hx %= hy;
+  if (hx == 0)
+    return sx.value;
+
+  if (n == 0)
+    return make_double (sx, hx, iy);
+
+  /* hx in the next code can never become 0, because hx < hy,
+     hy % 2 == 1, and hx * 2^i % hy != 0 */
+  while (n > hyltzeroes)
+    {
+      n -= hyltzeroes;
+      hx <<= hyltzeroes;
+      hx %= hy;
+    }
+
+  hx <<= n;
+  hx %= hy;
+
+  return make_double (sx, hx, iy);
 }
 libm_alias_finite (__ieee754_fmod, __fmod)
-- 
2.25.1