[v5,00/12] Improve hypot

Message ID 20211207190353.3282666-1-adhemerval.zanella@linaro.org
Headers
Series Improve hypot |

Message

Adhemerval Zanella Dec. 7, 2021, 7:03 p.m. UTC
  This patchset add a different algorithm along with Wilco [1] performance
improvements.  The default implementation is based on the 'An Improved 
Algorithm for hypot(a,b)' by Carlos F. Borges [2] with some fixes and
improvements (although the dbl-64 one uses a slight different approach
when fast FMA is avaliable).  This method is also used by Julia language
runtime [2].

The motivation for this change are:

  1. Use a newer algotihm that favor FP operations over interger ones
     (which tends to show better performance specially on newer
     processor with multiple FP units).  It also allows consolidate
     the multiple implementation (for instance, the powerpc one).

  2. The new algorithm is more precise without minimum performance
     difference.

The current hypot() implementation seems already to be bounded to a
maximum of 1 ulp of error, however the new proposed algorithm shows an
slight precision improvement by showing more correctly rounded results.

With a random 1e9 inputs for different float format I see:

  - An improvement from 3427362 to 18457 results with 1 ulp of
    error for Binary64.
  - An improvement from 233442 to 1274 results with 1 ulp of
    error for Binary96 (x86_64).
  - An improvement from 453045 to 1294 results with 1 ulp of
    error for Binary96 (x86_64).

Also for the maximal known error master shows (in ulps, with
corresponding inputs), determined with [3]:

  binary32  0.500 0x1.3ac98p+67,-0x1.ba5ec2p+77
  binary64  0.987 -0x0.5a934b7eac967p-1022,-0x0.b5265a7e06b82p-1022
  binary96  0.981
0x1.73f339f61eda21dp-16384l,0x2.e45f9f9500877e2p-16384l
  binary128 0.985
-0x2.d8311789103b76133ea1d5bc38c4p-16384,-0x1.6d85492006d7dcc6cc52938684p-16384

With the new implementation:

  binary32  0.500 0x1.3ac98p+67,-0x1.ba5ec2p+77 [same]
  binary64  0.792 0x0.603e52daf0bfdp-1022,-0x0.a622d0a9a433bp-1022
  binary96  0.584
-0x2.97b86706043d619p+7240l,0x1.8256bdd12d2e163ep+7240l
  binary128 0.749
0x2.2d5faf4036d6e68566f01054612p-8192,0x3.5738e8e2505f5d1fc2973716f05p-8192

If FMA is uses the binary64 shows a slight worse precision:

I have adapted the dbl-64, ldbl-96, and ldbl-128, the flt-32 is not
required since it calls the dbl-64 one.  I have not adapated ldbl-128ibm
since the format has a lot of caveats and IBM aims to move to ldbl-128.

[1] https://sourceware.org/pipermail/libc-alpha/2021-November/133523.html
[1] https://arxiv.org/pdf/1904.09481.pdf
[2] https://github.com/JuliaLang/julia/commit/4a046009a3362ab5e17d369641dbbc9657eb680c
[3] https://gitlab.inria.fr/zimmerma/math_accuracy/-/blob/master/binary64/check_sample2.c

---
Changes from v5:
 * Add missing cast on generic hypotf.
 * Add missing math_narrow_eval on generic hypot.
 * Fixed return value on generic ldbl-96 hypotl.
 * Fixed return value on generic ldbl-128 hypotl.
 * Add POWER10 performance values.
 * Add missing cast on i686 hypotf.
 * Rewrite math-use-builtinds-fmax.h and math-use-builtinds-fmin.h
   based on Joseph's suggestions.
---

Adhemerval Zanella (9):
  math: Simplify hypotf implementation
  math: Use an improved algorithm for hypotl (ldbl-96)
  math: Use an improved algorithm for hypotl (ldbl-128)
  i386: Move hypot implementation to C
  math: Remove powerpc e_hypot
  math: Add math-use-builtinds-fmax.h
  math: Add math-use-builtinds-fmin.h
  aarch64: Add math-use-builtins-f{max,min}.h
  math: Remove the error handling wrapper from hypot and hypotf

Wilco Dijkstra (3):
  math: Use an improved algorithm for hypot (dbl-64)
  math: Improve hypot performance with FMA
  math: Use fmin/fmax on hypot

 math/Versions                                 |   2 +
 math/s_fmax_template.c                        |   5 +
 math/s_fmin_template.c                        |   6 +-
 math/w_hypot.c                                |   8 +
 math/w_hypot_compat.c                         |  13 +-
 math/w_hypotf.c                               |   8 +
 math/w_hypotf_compat.c                        |   6 +-
 sysdeps/aarch64/fpu/math-use-builtins-fmax.h  |   4 +
 sysdeps/aarch64/fpu/math-use-builtins-fmin.h  |   4 +
 sysdeps/aarch64/fpu/s_fmax.c                  |  28 --
 sysdeps/aarch64/fpu/s_fmaxf.c                 |  28 --
 sysdeps/aarch64/fpu/s_fmin.c                  |  28 --
 sysdeps/aarch64/fpu/s_fminf.c                 |  28 --
 sysdeps/generic/math-type-macros-double.h     |   1 +
 sysdeps/generic/math-type-macros-float.h      |   1 +
 sysdeps/generic/math-type-macros-float128.h   |   1 +
 sysdeps/generic/math-type-macros-ldouble.h    |   1 +
 sysdeps/generic/math-use-builtins-fmax.h      |   4 +
 sysdeps/generic/math-use-builtins-fmin.h      |   4 +
 sysdeps/generic/math-use-builtins.h           |   2 +
 sysdeps/i386/fpu/e_hypot.S                    |  75 -----
 sysdeps/i386/fpu/e_hypot.c                    |  58 ++++
 sysdeps/i386/fpu/e_hypotf.S                   |  64 -----
 sysdeps/ieee754/dbl-64/e_hypot.c              | 270 ++++++++----------
 sysdeps/ieee754/dbl-64/w_hypot.c              |   1 +
 sysdeps/ieee754/flt-32/e_hypotf.c             |  79 ++---
 sysdeps/ieee754/flt-32/math_config.h          |   9 +
 sysdeps/ieee754/flt-32/w_hypotf.c             |   1 +
 sysdeps/ieee754/ldbl-128/e_hypotl.c           | 226 +++++++--------
 sysdeps/ieee754/ldbl-96/e_hypotl.c            | 231 +++++++--------
 sysdeps/mach/hurd/i386/libm.abilist           |   2 +
 sysdeps/powerpc/fpu/e_hypot.c                 |  87 ------
 sysdeps/powerpc/fpu/e_hypotf.c                |  78 -----
 .../powerpc32/power4/fpu/multiarch/Makefile   |   5 +-
 .../power4/fpu/multiarch/e_hypot-power7.c     |  23 --
 .../power4/fpu/multiarch/e_hypot-ppc32.c      |  23 --
 .../powerpc32/power4/fpu/multiarch/e_hypot.c  |  33 ---
 .../power4/fpu/multiarch/e_hypotf-power7.c    |  23 --
 .../power4/fpu/multiarch/e_hypotf-ppc32.c     |  23 --
 .../powerpc32/power4/fpu/multiarch/e_hypotf.c |  33 ---
 sysdeps/unix/sysv/linux/aarch64/libm.abilist  |   2 +
 sysdeps/unix/sysv/linux/alpha/libm.abilist    |   2 +
 sysdeps/unix/sysv/linux/arm/be/libm.abilist   |   2 +
 sysdeps/unix/sysv/linux/arm/le/libm.abilist   |   2 +
 sysdeps/unix/sysv/linux/hppa/libm.abilist     |   2 +
 sysdeps/unix/sysv/linux/i386/libm.abilist     |   2 +
 .../sysv/linux/m68k/coldfire/libm.abilist     |   2 +
 .../unix/sysv/linux/m68k/m680x0/libm.abilist  |   2 +
 .../sysv/linux/microblaze/be/libm.abilist     |   2 +
 .../sysv/linux/microblaze/le/libm.abilist     |   2 +
 .../unix/sysv/linux/mips/mips32/libm.abilist  |   2 +
 .../unix/sysv/linux/mips/mips64/libm.abilist  |   2 +
 sysdeps/unix/sysv/linux/nios2/libm.abilist    |   2 +
 .../linux/powerpc/powerpc32/fpu/libm.abilist  |   2 +
 .../powerpc/powerpc32/nofpu/libm.abilist      |   2 +
 .../linux/powerpc/powerpc64/be/libm.abilist   |   2 +
 .../linux/powerpc/powerpc64/le/libm.abilist   |   2 +
 .../unix/sysv/linux/s390/s390-32/libm.abilist |   2 +
 .../unix/sysv/linux/s390/s390-64/libm.abilist |   2 +
 sysdeps/unix/sysv/linux/sh/be/libm.abilist    |   2 +
 sysdeps/unix/sysv/linux/sh/le/libm.abilist    |   2 +
 .../sysv/linux/sparc/sparc32/libm.abilist     |   2 +
 .../sysv/linux/sparc/sparc64/libm.abilist     |   2 +
 .../unix/sysv/linux/x86_64/64/libm.abilist    |   2 +
 .../unix/sysv/linux/x86_64/x32/libm.abilist   |   2 +
 65 files changed, 547 insertions(+), 1029 deletions(-)
 create mode 100644 math/w_hypot.c
 create mode 100644 math/w_hypotf.c
 create mode 100644 sysdeps/aarch64/fpu/math-use-builtins-fmax.h
 create mode 100644 sysdeps/aarch64/fpu/math-use-builtins-fmin.h
 delete mode 100644 sysdeps/aarch64/fpu/s_fmax.c
 delete mode 100644 sysdeps/aarch64/fpu/s_fmaxf.c
 delete mode 100644 sysdeps/aarch64/fpu/s_fmin.c
 delete mode 100644 sysdeps/aarch64/fpu/s_fminf.c
 create mode 100644 sysdeps/generic/math-use-builtins-fmax.h
 create mode 100644 sysdeps/generic/math-use-builtins-fmin.h
 delete mode 100644 sysdeps/i386/fpu/e_hypot.S
 create mode 100644 sysdeps/i386/fpu/e_hypot.c
 delete mode 100644 sysdeps/i386/fpu/e_hypotf.S
 create mode 100644 sysdeps/ieee754/dbl-64/w_hypot.c
 create mode 100644 sysdeps/ieee754/flt-32/w_hypotf.c
 delete mode 100644 sysdeps/powerpc/fpu/e_hypot.c
 delete mode 100644 sysdeps/powerpc/fpu/e_hypotf.c
 delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypot-power7.c
 delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypot-ppc32.c
 delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypot.c
 delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypotf-power7.c
 delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypotf-ppc32.c
 delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypotf.c
  

Comments

Paul Zimmermann Dec. 8, 2021, 5:51 a.m. UTC | #1
Dear Adhemerval,

some of the typos I reported in the v4 are not fixed in v5:

https://sourceware.org/pipermail/libc-alpha/2021-December/133588.html

Best regards,
Paul
  
Adhemerval Zanella Dec. 10, 2021, 7:31 p.m. UTC | #2
If no one oposes it I will commit this, this is moslty a fixup from the
v4 already reviewed by Wilco.

On 07/12/2021 16:03, Adhemerval Zanella wrote:
> This patchset add a different algorithm along with Wilco [1] performance
> improvements.  The default implementation is based on the 'An Improved 
> Algorithm for hypot(a,b)' by Carlos F. Borges [2] with some fixes and
> improvements (although the dbl-64 one uses a slight different approach
> when fast FMA is avaliable).  This method is also used by Julia language
> runtime [2].
> 
> The motivation for this change are:
> 
>   1. Use a newer algotihm that favor FP operations over interger ones
>      (which tends to show better performance specially on newer
>      processor with multiple FP units).  It also allows consolidate
>      the multiple implementation (for instance, the powerpc one).
> 
>   2. The new algorithm is more precise without minimum performance
>      difference.
> 
> The current hypot() implementation seems already to be bounded to a
> maximum of 1 ulp of error, however the new proposed algorithm shows an
> slight precision improvement by showing more correctly rounded results.
> 
> With a random 1e9 inputs for different float format I see:
> 
>   - An improvement from 3427362 to 18457 results with 1 ulp of
>     error for Binary64.
>   - An improvement from 233442 to 1274 results with 1 ulp of
>     error for Binary96 (x86_64).
>   - An improvement from 453045 to 1294 results with 1 ulp of
>     error for Binary96 (x86_64).
> 
> Also for the maximal known error master shows (in ulps, with
> corresponding inputs), determined with [3]:
> 
>   binary32  0.500 0x1.3ac98p+67,-0x1.ba5ec2p+77
>   binary64  0.987 -0x0.5a934b7eac967p-1022,-0x0.b5265a7e06b82p-1022
>   binary96  0.981
> 0x1.73f339f61eda21dp-16384l,0x2.e45f9f9500877e2p-16384l
>   binary128 0.985
> -0x2.d8311789103b76133ea1d5bc38c4p-16384,-0x1.6d85492006d7dcc6cc52938684p-16384
> 
> With the new implementation:
> 
>   binary32  0.500 0x1.3ac98p+67,-0x1.ba5ec2p+77 [same]
>   binary64  0.792 0x0.603e52daf0bfdp-1022,-0x0.a622d0a9a433bp-1022
>   binary96  0.584
> -0x2.97b86706043d619p+7240l,0x1.8256bdd12d2e163ep+7240l
>   binary128 0.749
> 0x2.2d5faf4036d6e68566f01054612p-8192,0x3.5738e8e2505f5d1fc2973716f05p-8192
> 
> If FMA is uses the binary64 shows a slight worse precision:
> 
> I have adapted the dbl-64, ldbl-96, and ldbl-128, the flt-32 is not
> required since it calls the dbl-64 one.  I have not adapated ldbl-128ibm
> since the format has a lot of caveats and IBM aims to move to ldbl-128.
> 
> [1] https://sourceware.org/pipermail/libc-alpha/2021-November/133523.html
> [1] https://arxiv.org/pdf/1904.09481.pdf
> [2] https://github.com/JuliaLang/julia/commit/4a046009a3362ab5e17d369641dbbc9657eb680c
> [3] https://gitlab.inria.fr/zimmerma/math_accuracy/-/blob/master/binary64/check_sample2.c
> 
> ---
> Changes from v5:
>  * Add missing cast on generic hypotf.
>  * Add missing math_narrow_eval on generic hypot.
>  * Fixed return value on generic ldbl-96 hypotl.
>  * Fixed return value on generic ldbl-128 hypotl.
>  * Add POWER10 performance values.
>  * Add missing cast on i686 hypotf.
>  * Rewrite math-use-builtinds-fmax.h and math-use-builtinds-fmin.h
>    based on Joseph's suggestions.
> ---
> 
> Adhemerval Zanella (9):
>   math: Simplify hypotf implementation
>   math: Use an improved algorithm for hypotl (ldbl-96)
>   math: Use an improved algorithm for hypotl (ldbl-128)
>   i386: Move hypot implementation to C
>   math: Remove powerpc e_hypot
>   math: Add math-use-builtinds-fmax.h
>   math: Add math-use-builtinds-fmin.h
>   aarch64: Add math-use-builtins-f{max,min}.h
>   math: Remove the error handling wrapper from hypot and hypotf
> 
> Wilco Dijkstra (3):
>   math: Use an improved algorithm for hypot (dbl-64)
>   math: Improve hypot performance with FMA
>   math: Use fmin/fmax on hypot
> 
>  math/Versions                                 |   2 +
>  math/s_fmax_template.c                        |   5 +
>  math/s_fmin_template.c                        |   6 +-
>  math/w_hypot.c                                |   8 +
>  math/w_hypot_compat.c                         |  13 +-
>  math/w_hypotf.c                               |   8 +
>  math/w_hypotf_compat.c                        |   6 +-
>  sysdeps/aarch64/fpu/math-use-builtins-fmax.h  |   4 +
>  sysdeps/aarch64/fpu/math-use-builtins-fmin.h  |   4 +
>  sysdeps/aarch64/fpu/s_fmax.c                  |  28 --
>  sysdeps/aarch64/fpu/s_fmaxf.c                 |  28 --
>  sysdeps/aarch64/fpu/s_fmin.c                  |  28 --
>  sysdeps/aarch64/fpu/s_fminf.c                 |  28 --
>  sysdeps/generic/math-type-macros-double.h     |   1 +
>  sysdeps/generic/math-type-macros-float.h      |   1 +
>  sysdeps/generic/math-type-macros-float128.h   |   1 +
>  sysdeps/generic/math-type-macros-ldouble.h    |   1 +
>  sysdeps/generic/math-use-builtins-fmax.h      |   4 +
>  sysdeps/generic/math-use-builtins-fmin.h      |   4 +
>  sysdeps/generic/math-use-builtins.h           |   2 +
>  sysdeps/i386/fpu/e_hypot.S                    |  75 -----
>  sysdeps/i386/fpu/e_hypot.c                    |  58 ++++
>  sysdeps/i386/fpu/e_hypotf.S                   |  64 -----
>  sysdeps/ieee754/dbl-64/e_hypot.c              | 270 ++++++++----------
>  sysdeps/ieee754/dbl-64/w_hypot.c              |   1 +
>  sysdeps/ieee754/flt-32/e_hypotf.c             |  79 ++---
>  sysdeps/ieee754/flt-32/math_config.h          |   9 +
>  sysdeps/ieee754/flt-32/w_hypotf.c             |   1 +
>  sysdeps/ieee754/ldbl-128/e_hypotl.c           | 226 +++++++--------
>  sysdeps/ieee754/ldbl-96/e_hypotl.c            | 231 +++++++--------
>  sysdeps/mach/hurd/i386/libm.abilist           |   2 +
>  sysdeps/powerpc/fpu/e_hypot.c                 |  87 ------
>  sysdeps/powerpc/fpu/e_hypotf.c                |  78 -----
>  .../powerpc32/power4/fpu/multiarch/Makefile   |   5 +-
>  .../power4/fpu/multiarch/e_hypot-power7.c     |  23 --
>  .../power4/fpu/multiarch/e_hypot-ppc32.c      |  23 --
>  .../powerpc32/power4/fpu/multiarch/e_hypot.c  |  33 ---
>  .../power4/fpu/multiarch/e_hypotf-power7.c    |  23 --
>  .../power4/fpu/multiarch/e_hypotf-ppc32.c     |  23 --
>  .../powerpc32/power4/fpu/multiarch/e_hypotf.c |  33 ---
>  sysdeps/unix/sysv/linux/aarch64/libm.abilist  |   2 +
>  sysdeps/unix/sysv/linux/alpha/libm.abilist    |   2 +
>  sysdeps/unix/sysv/linux/arm/be/libm.abilist   |   2 +
>  sysdeps/unix/sysv/linux/arm/le/libm.abilist   |   2 +
>  sysdeps/unix/sysv/linux/hppa/libm.abilist     |   2 +
>  sysdeps/unix/sysv/linux/i386/libm.abilist     |   2 +
>  .../sysv/linux/m68k/coldfire/libm.abilist     |   2 +
>  .../unix/sysv/linux/m68k/m680x0/libm.abilist  |   2 +
>  .../sysv/linux/microblaze/be/libm.abilist     |   2 +
>  .../sysv/linux/microblaze/le/libm.abilist     |   2 +
>  .../unix/sysv/linux/mips/mips32/libm.abilist  |   2 +
>  .../unix/sysv/linux/mips/mips64/libm.abilist  |   2 +
>  sysdeps/unix/sysv/linux/nios2/libm.abilist    |   2 +
>  .../linux/powerpc/powerpc32/fpu/libm.abilist  |   2 +
>  .../powerpc/powerpc32/nofpu/libm.abilist      |   2 +
>  .../linux/powerpc/powerpc64/be/libm.abilist   |   2 +
>  .../linux/powerpc/powerpc64/le/libm.abilist   |   2 +
>  .../unix/sysv/linux/s390/s390-32/libm.abilist |   2 +
>  .../unix/sysv/linux/s390/s390-64/libm.abilist |   2 +
>  sysdeps/unix/sysv/linux/sh/be/libm.abilist    |   2 +
>  sysdeps/unix/sysv/linux/sh/le/libm.abilist    |   2 +
>  .../sysv/linux/sparc/sparc32/libm.abilist     |   2 +
>  .../sysv/linux/sparc/sparc64/libm.abilist     |   2 +
>  .../unix/sysv/linux/x86_64/64/libm.abilist    |   2 +
>  .../unix/sysv/linux/x86_64/x32/libm.abilist   |   2 +
>  65 files changed, 547 insertions(+), 1029 deletions(-)
>  create mode 100644 math/w_hypot.c
>  create mode 100644 math/w_hypotf.c
>  create mode 100644 sysdeps/aarch64/fpu/math-use-builtins-fmax.h
>  create mode 100644 sysdeps/aarch64/fpu/math-use-builtins-fmin.h
>  delete mode 100644 sysdeps/aarch64/fpu/s_fmax.c
>  delete mode 100644 sysdeps/aarch64/fpu/s_fmaxf.c
>  delete mode 100644 sysdeps/aarch64/fpu/s_fmin.c
>  delete mode 100644 sysdeps/aarch64/fpu/s_fminf.c
>  create mode 100644 sysdeps/generic/math-use-builtins-fmax.h
>  create mode 100644 sysdeps/generic/math-use-builtins-fmin.h
>  delete mode 100644 sysdeps/i386/fpu/e_hypot.S
>  create mode 100644 sysdeps/i386/fpu/e_hypot.c
>  delete mode 100644 sysdeps/i386/fpu/e_hypotf.S
>  create mode 100644 sysdeps/ieee754/dbl-64/w_hypot.c
>  create mode 100644 sysdeps/ieee754/flt-32/w_hypotf.c
>  delete mode 100644 sysdeps/powerpc/fpu/e_hypot.c
>  delete mode 100644 sysdeps/powerpc/fpu/e_hypotf.c
>  delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypot-power7.c
>  delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypot-ppc32.c
>  delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypot.c
>  delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypotf-power7.c
>  delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypotf-ppc32.c
>  delete mode 100644 sysdeps/powerpc/powerpc32/power4/fpu/multiarch/e_hypotf.c
>