From patchwork Thu Nov 19 13:28:02 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Kirill Okhotnikov <kirill.okhotnikov@gmail.com>
X-Patchwork-Id: 41131
Return-Path: <libc-alpha-bounces@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 28008398D06E;
	Thu, 19 Nov 2020 13:28:09 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 28008398D06E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1605792489;
	bh=l8+eUpj6iIL+5WSFppjjWspyzmscEFkgXSIEEdOwjIs=;
	h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post:
	 List-Help:List-Subscribe:From:Reply-To:From;
	b=U4f5jB6UUZMZMMndHhzuFPv4CIoaCyLDjKriEKEMJ/VvfKUJ2tzGHsFr/cT5znV77
	 PAUQN3fGOLdJo1S7MpOS1vK2wdASwBW3IB/F7fezNJzg7lMPn2s7r1OGtXemS40KuV
	 m54dmIWqCQP8KV5PUokwIXgogNQ0sR57Qj5sKk74=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from mail-lf1-x130.google.com (mail-lf1-x130.google.com
 [IPv6:2a00:1450:4864:20::130])
 by sourceware.org (Postfix) with ESMTPS id 123EA387088C
 for <libc-alpha@sourceware.org>; Thu, 19 Nov 2020 13:28:05 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 123EA387088C
Received: by mail-lf1-x130.google.com with SMTP id u18so8210555lfd.9
 for <libc-alpha@sourceware.org>; Thu, 19 Nov 2020 05:28:05 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:to:cc:from:subject:message-id:date:user-agent
 :mime-version:content-language;
 bh=l8+eUpj6iIL+5WSFppjjWspyzmscEFkgXSIEEdOwjIs=;
 b=OiQqXMyVnzLHPmpIk/dfCdOLUSo25IyPIrA6ukbMMJD24bdPqcYgPAm5wTq6G8JS0g
 uz9v9EaScULZ6AxUFofIxyDzBDk3L+jz6nRQN9z1vYEf5zZeEPqPoS7EuX91DJrvJx82
 nkIKcAzBp8hxnJg/PEPhxv9rkEcq5VvCrsi0NNLFFSBQAAdCt8LmYrTuPdT3qOK91td2
 KiOraf7V0sggtxartlgYJGHRwyolbv2nJtjuKU2QhOJgqE7t8g66we+ybOqGccIUeCj9
 e16vW13fZQ21Bn2FvcsuKqQUQxpe6L0QdO1tFgZZ1WN2w8nXFtuYwq6LchXqSY5KZ798
 fsFA==
X-Gm-Message-State: AOAM531kkAan3CBK9zIOE7/JmBVtSxFccZbw2SbUzqa4m4tywuRtyxHj
 WH80+EjKtuiTse5iHXhqexc=
X-Google-Smtp-Source: 
 ABdhPJynVfECC58+obZXca974/CqYnMit94fYeReadl8nIMmOkzcVfQ85poXDMN6ha8NdYSeS7sw7Q==
X-Received: by 2002:ac2:4a79:: with SMTP id q25mr5521613lfp.495.1605792483858;
 Thu, 19 Nov 2020 05:28:03 -0800 (PST)
Received: from [192.168.5.169] ([91.224.181.33])
 by smtp.googlemail.com with ESMTPSA id
 x123sm3993457lfa.154.2020.11.19.05.28.03
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Thu, 19 Nov 2020 05:28:03 -0800 (PST)
To: libc-alpha@sourceware.org
Subject: [PATCH] libm,ieee754:New algorithm of fmod function for,
 dbl-64/wordsize-64
Message-ID: <0c028906-f5a6-deed-a0ff-f741aee9c5e6@gmail.com>
Date: Thu, 19 Nov 2020 14:28:02 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.10.0
MIME-Version: 1.0
Content-Language: en-GB
X-Spam-Status: No, score=-8.1 required=5.0 tests=BAYES_00, BODY_8BITS,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM,
 GIT_PATCH_0, KAM_ASCII_DIVIDERS, KAM_NUMSUBJECT, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Kirill Okhotnikov via Libc-alpha
 <libc-alpha@sourceware.org>
From: Kirill Okhotnikov <kirill.okhotnikov@gmail.com>
Reply-To: Kirill Okhotnikov <kirill.okhotnikov@gmail.com>
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>

I proposed a new algorithm for fmod calculations for double type and
wordsize-64 architectures. Algorithm description is in the file.
I successfully ran internal tests on x86_64 processors.
Also I did some extensive tests and benchmark with my own test-suite on 
x86_64 and ARM64. See
https://github.com/orex/test_fmod
My tests on x86_64 (Intel, AMD) and ARM64 shows that the new algorithm 
up to 20 times faster for
"extreme" cases. And up to two times faster for regular cases of using 
the function.
Also, I did some unit testing which shows that old and a new algorithms 
gives binary
equivalent result for each of billions different pairs (x, y) with wide 
range of numbers
including normal, subnormal, and special one (NaN INF, 0).

Reply to adhemerval.zanella@linaro.org from libc-help thread.

> I won't comment to your implementation until it is submitted to
> libc-alpha with proper Copyright assignment, but some remarks:
FSF status: In progress.
>
>   1. If you are benchmarking on i686, it won't use the generic
>      sysdeps/ieee754/dbl-64/e_fmod.c but rather a asm specific
>      implementation (sysdeps/i386/fpu/e_fmod.S).  It would be good
>      if you check if we can remove the i386 assembly implementation
>      in favor of the generic one.
You can remove it of course, but in favour to generic implementation, 
but I assume, you will get a strong performance degradation. It is worth 
to check it. I don't have a i686 machine. I don't know how relevant will 
be tests on my CPU.
>
>   2. __builtin_clz might be a libcall on some targets, which might
>      be worse than using the loop.  It is usually not an issue on
>      some implementation (it is used on some float128 which is
>      usually soft implementation anyway), but it is something to
>      keep in mind.
I check the code of libm. The function is already used (except ldbl-128) in

sysdeps/ieee754/flt-32/s_logbf.c:      rix -= __builtin_clz (ix) - 9;
sysdeps/ieee754/dbl-64/s_logb.c:      int m = __builtin_clzll (ix);

The second thing is that loop is not efficient anyhow. There is more 
efficient approach with table (for CLZ and CTZ).

https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/

Probably it is already implemented in GCC?

>
>   3. It would be good if could get rid of the wordsize-64
>      implementation and just have a generic one good enough regardless
>      of word size.
I can't imagine such algorithm. The size of double is 64 bit. Splitting 
it over two 32 bits variable, of course, will have a performance impact. 
Algorithm, proposed by me (see description), can work strictly with 64 
bits variables.

Best,
Kirill.

From e88e7fc7b24eeaeabbe9b4649aa39d2ee9977a09 Mon Sep 17 00:00:00 2001
From: kirill <kirill.okhotnikov@gmail.com>
Date: Thu, 19 Nov 2020 13:29:41 +0100
Subject: [PATCH] libm,ieee754:New algorithm of fmod function for
 dbl-64/wordsize-64
To: libc-alpha@sourceware.org
Cc: adhemerval.zanella@linaro.org

I proposed a new algorithm for fmod calculations for double type and
wordsize-64 architectures. Algorithm description is in the file.
I successfully ran internal tests on x86_64 processors.
Also I did some extensive tests and benchmark with my own test-suite on x86_64 and ARM64. See
https://github.com/orex/test_fmod
My tests on x86_64 (Intel, AMD) and ARM64 shows that the new algorithm up to 20 times faster for
"extreme" cases. And up to two times faster for regular cases of using the function.
Also, I did some unit testing which shows that old and a new algorithms gives binary
equivalent result for each of billions different pairs (x, y) with wide range of numbers
including normal, subnormal, and special one (NaN INF, 0).
---
 sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c | 282 ++++++++++++++------
 1 file changed, 194 insertions(+), 88 deletions(-)

diff --git a/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c b/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
index 52a8687448..b3c7c4a28c 100644
--- a/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
+++ b/sysdeps/ieee754/dbl-64/wordsize-64/e_fmod.c
@@ -1,19 +1,61 @@
-/* Rewritten for 64-bit machines by Ulrich Drepper <drepper@gmail.com>.  */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
+/* Author: Kirill Okhotnikov <kirill.okhotnikov@gmail.com>
+ * 
+ * Background:
+ *   Like the previous algorithm written by Sun Microsystems, the code also
+ *   splits double value to integers with exponential (ix, iy) 
+ *   and mantissa part (hx, hy). The variables names are kept common with previous code 
+ *   in order to simplify understanding. The part of checking special cases, numbers NaN, INF etc
+ *   is copied from Ulrich Drepper <drepper@gmail.com> previous code.
+ *   The rest of the code uses different algorithms for fmod calculation.
+ *   An input x, y in the algorithm is represented (mathematically) like h(x|y) * 2^i(x|y).
+ *   There is an unambiguous number representation. For example h * 2^i = (2 * h) * 2^(i-1).
+ *   The previous algorithm removes such unambiguous, "aligning" bits in hx, and hy. But for this code
+ *   such unambiguity is essential.
+ * 
+ * Mathematics:
+ *   The algorithm uses the fact that
+ *   r = a % b = (a % (N * b)) % b, where N is positive integer number. a and b - positive.
+ *   Let's adopt the formula for representation above.
+ *   a = hx * 2^ix, b = hy * 2^iy, N = 2^k
+ *   r(k) = a % b = (hx * 2^ix) % (2 ^k * hy * 2^iy) = 2^(iy + k) * (hx * 2^(ix - iy - k) % hy)
+ *   r(k) = hr * 2^ir = (hx % hy) * 2^(iy + k) = (2^p * (hx % hy) * 2^(iy + k - p))
+ *      hr = 2^p * (hx % hy)
+ *      ir = iy + k - p  
+ *    
+ *     
+ * Algorithm description:
+ *   The examples below use byte (not uint64_t) for simplicity.
+ *   1) Shift hy maximum to right without losing bits and increase iy value
+ *      hy = 0b00101100 iy = 20 after shift hy = 0b00001011 iy = 22.
+ *   2) hx = hx % hy.
+ *   3) Move hx maximum to left. Note that after (hx = hx% hy) CLZ in hx is not lower than CLZ in hy.
+ *      hx=0b00001001 ix = 100, hx=0b10010000, ix = 100-4 = 96.
+ *   4) Repeat (2) until ix == iy. 
+ * 
+ * Complexity analysis:
+ *   Converting x, y to (hx, ix), (hy, iy): CTZ/shift/AND/OR/if.
+ *   Loop count: (ix - iy) / (64 - "length of hy"). max("length of hy") = 53, max(ix - iy) = 2048
+ *                Maximum operation is 186. For rare "unrealistic" cases. 
+ *                Previous algorithm had ix - iy complexity. 11 times more iterations for worst case. 
+ *   
+ *   Converting hx, ix back to double: CLZ/shift/AND/OR/if
+ *   Previous algorithm uses loops for CLZ when converting from/to double.
+ * 
+ * Special cases:
+ *   Supposing that case where |y| > 1e-292 and |x/y|<2000 is very common special processing is implemented.
+ *   No hy alignment, no loop: result = (hx * 2^(ix - iy)) % hy.
+ *   When x and y are both subnormal (rare case but...) the result = hx % hy. Simplified conversion back to double.
+ * 
+ * Performance: 
+ *   Various tests on AArch64 and x86_64 shows that the current algorithm is faster than the previous one by factor 1.5-10 depend
+ *   on input.
+ * 
  */
 
 /*
  * __ieee754_fmod(x,y)
  * Return x mod y in exact arithmetic
- * Method: shift and subtract
+ * Method: divide and shift
  */
 
 #include <math.h>
@@ -21,86 +63,150 @@
 #include <stdint.h>
 #include <libm-alias-finite.h>
 
-static const double one = 1.0, Zero[] = {0.0, -0.0,};
+static const double one = 1.0;
+
+#define SIGN_POS_MASK UINT64_C(0x8000000000000000)
+#define SIGN_NEG_MASK UINT64_C(0x7fffffffffffffff)
+#define DOUBLE_INF_NAN UINT64_C(0x7ff0000000000000)
+#define LAST_EXP_BIT UINT64_C(0x0010000000000000)
+#define SET_NORMAL_VALUE(a) a = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&(a))
+#define CLZ(a) __builtin_clzl(a)
+#define CTZ(a) __builtin_ctzl(a)
+
+/* Convert back to double */
+inline
+double make_double (ieee_double_shape_type sx, uint64_t num, int32_t ep)
+{
+  int32_t lz;
+  lz = CLZ(num) - 11;
+  num <<= lz;
+  ep -= lz;
+
+  if (__glibc_likely(ep >= 0))
+    {
+      num = ((num - LAST_EXP_BIT) | ((uint64_t) (ep + 1) << 52));
+    }
+  else
+    {
+      num >>= -ep;
+    }
+  sx.word += num;
+
+  sx.value *= one;  /* Optimized out? May be math_check_force_underflow? */
+  return sx.value;
+}
 
 double
 __ieee754_fmod (double x, double y)
 {
-	int32_t n,ix,iy;
-	int64_t hx,hy,hz,sx,i;
-
-	EXTRACT_WORDS64(hx,x);
-	EXTRACT_WORDS64(hy,y);
-	sx = hx&UINT64_C(0x8000000000000000);	/* sign of x */
-	hx ^=sx;				/* |x| */
-	hy &= UINT64_C(0x7fffffffffffffff);	/* |y| */
-
-    /* purge off exception values */
-	if(__builtin_expect(hy==0
-			    || hx >= UINT64_C(0x7ff0000000000000)
-			    || hy > UINT64_C(0x7ff0000000000000), 0))
-	  /* y=0,or x not finite or y is NaN */
-	    return (x*y)/(x*y);
-	if(__builtin_expect(hx<=hy, 0)) {
-	    if(hx<hy) return x;	/* |x|<|y| return x */
-	    return Zero[(uint64_t)sx>>63];	/* |x|=|y| return x*0*/
-	}
-
-    /* determine ix = ilogb(x) */
-	if(__builtin_expect(hx<UINT64_C(0x0010000000000000), 0)) {
-	  /* subnormal x */
-	  for (ix = -1022,i=(hx<<11); i>0; i<<=1) ix -=1;
-	} else ix = (hx>>52)-1023;
-
-    /* determine iy = ilogb(y) */
-	if(__builtin_expect(hy<UINT64_C(0x0010000000000000), 0)) {	/* subnormal y */
-	  for (iy = -1022,i=(hy<<11); i>0; i<<=1) iy -=1;
-	} else iy = (hy>>52)-1023;
-
-    /* set up hx, hy and align y to x */
-	if(__builtin_expect(ix >= -1022, 1))
-	    hx = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hx);
-	else {		/* subnormal x, shift x to normal */
-	    n = -1022-ix;
-	    hx<<=n;
-	}
-	if(__builtin_expect(iy >= -1022, 1))
-	    hy = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hy);
-	else {		/* subnormal y, shift y to normal */
-	    n = -1022-iy;
-	    hy<<=n;
-	}
-
-    /* fix point fmod */
-	n = ix - iy;
-	while(n--) {
-	    hz=hx-hy;
-	    if(hz<0){hx = hx+hx;}
-	    else {
-		if(hz==0)		/* return sign(x)*0 */
-		    return Zero[(uint64_t)sx>>63];
-		hx = hz+hz;
-	    }
-	}
-	hz=hx-hy;
-	if(hz>=0) {hx=hz;}
-
-    /* convert back to floating value and restore the sign */
-	if(hx==0)			/* return sign(x)*0 */
-	    return Zero[(uint64_t)sx>>63];
-	while(hx<UINT64_C(0x0010000000000000)) {	/* normalize x */
-	    hx = hx+hx;
-	    iy -= 1;
-	}
-	if(__builtin_expect(iy>= -1022, 1)) {	/* normalize output */
-	  hx = ((hx-UINT64_C(0x0010000000000000))|((uint64_t)(iy+1023)<<52));
-	    INSERT_WORDS64(x,hx|sx);
-	} else {		/* subnormal output */
-	    n = -1022 - iy;
-	    hx>>=n;
-	    INSERT_WORDS64(x,hx|sx);
-	    x *= one;		/* create necessary signal */
-	}
-	return x;		/* exact output */
+  int32_t n, ix, iy, lzhy, tzhy, hyltzeroes;
+  uint64_t hx, hy, d;
+  ieee_double_shape_type sx;
+
+  EXTRACT_WORDS64 (hx, x);
+  EXTRACT_WORDS64 (hy, y);
+  sx.word = hx & SIGN_POS_MASK;    /* sign of x */
+  hx ^= sx.word;            /* |x| */
+  hy &= SIGN_NEG_MASK;    /* |y| */
+
+  /* purge off exception values */
+  if (__glibc_unlikely (hy == 0
+                        || hx >= DOUBLE_INF_NAN
+                        || hy > DOUBLE_INF_NAN))
+    /* y=0,or x not finite or y is NaN */
+    return (x * y) / (x * y);
+  if (__glibc_likely(hx <= hy))
+    {
+      if (hx < hy) return x;    /* |x|<|y| return x */
+      return sx.value;    /* |x|=|y| return x*0*/
+    }
+
+  ix = hx >> 52;
+  iy = hy >> 52;
+
+  /* Most common case where |y| > 1e-292 and |x/y|<2000 */
+  if (__glibc_likely(ix >= 53 && ix - iy <= 11))
+    {
+      SET_NORMAL_VALUE(hx);
+      SET_NORMAL_VALUE(hy);
+      d = (ix == iy) ? (hx - hy) : (hx << (ix - iy)) % hy;
+      if (d == 0)
+        return sx.value;
+      return make_double (sx, d, iy - 1); /* iy - 1 because of "zero power" for number with power 1 */
+    }
+
+  /* Both subnormal special case. */
+  if (__glibc_unlikely (ix == 0 && iy == 0))
+    {
+      uint64_t d = hx % hy;
+      sx.word += d;
+      sx.value *= one;  /* Optimized out? need for math signals */
+      return sx.value;
+    }
+
+  /* Assume that hx is not subnormal by conditions above. */
+  SET_NORMAL_VALUE(hx);
+  ix--;
+  if (__builtin_expect (iy > 0, 1))
+    {
+      SET_NORMAL_VALUE(hy);
+      lzhy = 11;
+      iy--;
+    }
+  else
+    {
+      lzhy = CLZ(hy);
+    }
+
+  /* Assume hy != 0 */
+  tzhy = CTZ(hy);
+  hyltzeroes = lzhy + tzhy;
+  n = ix - iy;
+
+  /* Shift hy right until the end or n = 0 */
+  if (n > tzhy)
+    {
+      hy >>= tzhy;
+      n -= tzhy;
+      iy += tzhy;
+    }
+  else
+    {
+      hy >>= n;
+      iy += n;
+      n = 0;
+    }
+
+  /* Shift hx left until the end or n = 0 */
+  if (n > 11)
+    {
+      hx <<= 11;
+      n -= 11;
+    }
+  else
+    {
+      hx <<= n;
+      n = 0;
+    }
+  hx %= hy;
+  if (hx == 0)
+    return sx.value;
+
+  if (n == 0)
+    return make_double (sx, hx, iy);
+
+  /* hx in next code can become 0, because hx < hy, hy % 2 == 1 hx * 2^i % hy != 0 */
+
+  while (n > hyltzeroes)
+    {
+      n -= hyltzeroes;
+      hx <<= hyltzeroes;
+      hx %= hy;
+    }
+
+  hx <<= n;
+  hx %= hy;
+
+  return make_double (sx, hx, iy);
 }
 libm_alias_finite (__ieee754_fmod, __fmod)
-- 
2.25.1