v11 Improves __ieee754_exp() performance by greater than 5x on sparc/x86.

  New with this version:
Adds updated sparc and x86_64 libm-test-ulps files (1 ulp for
various exp tests). Rewrites the full comment to reflect the current
state of the patch.

Summary of patch rationale

These changes will be active for all platforms that don't provide
their own exp() routines. They will also be active for ieee754
versions of ccos, ccosh, cosh, csin, csinh, sinh, exp10, gamma, and
erf.

Typical performance gains are 2x on Sparc s7 and 5x on x86_64.
The former code included a slow path, used to guarantee that no 1 ulp
errors occurred, that could be 50-200 times slower than the normal path.
Informal testing suggests that perhaps 1 in 200 values invoked
the slow path.

Using the glibc_perf tests:
      sparc (nsec)    x86 (nsec)
      old     new     old     new
max   18180   936    4863     275
min     399    96      15      15
mean   5499   419    1336      24

Glibc correctness tests for exp() and expf() were run. Within the test
suite 1 input value was found to cause a 1 ulp difference when
"FE_TONEAREST" rounding mode is set. No differences in exp()
were seen for the tested values for the other rounding modes.

When tested over a range of 10 million input values, the new code
gets a 1 ulp error approximately 1.6 times per 1000 values.
That rate was similar for all four rounding modes.
The patch uses a 64 entry scaling table. 32, 128, and 256 entry
tables were also examined with the following error rates:

Table    1 ulp/
entries  1000
  32      2.9
  64      1.6
 128      1.0
 256      0.6
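
The 1 ulp rates above come from comparing results against a reference.
The message does not say how that comparison was done, so the following
is only a minimal sketch of one common approach: map each double onto an
ordered integer scale and subtract. The names ordered_bits, ulp_distance
and mismatches_per_1000 are hypothetical, and the reference function is
assumed to return correctly rounded results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a double's bit pattern onto a signed integer scale that increases
   monotonically with the value (so -0.0 and +0.0 both map to 0).  */
static int64_t
ordered_bits (double d)
{
  int64_t i;
  memcpy (&i, &d, sizeof i);
  return i < 0 ? INT64_MIN - i : i;
}

/* Number of representable doubles between a and b (0 if equal).
   Meaningful for finite inputs only.  */
static uint64_t
ulp_distance (double a, double b)
{
  int64_t ia = ordered_bits (a), ib = ordered_bits (b);
  return ia > ib ? (uint64_t) ia - (uint64_t) ib
                 : (uint64_t) ib - (uint64_t) ia;
}

/* Hypothetical harness: sample n evenly spaced inputs in [lo, hi] and
   report how many results per 1000 differ from the reference.  */
static double
mismatches_per_1000 (double (*candidate) (double),
                     double (*reference) (double),
                     double lo, double hi, long n)
{
  assert (n > 1);
  long off = 0;
  for (long i = 0; i < n; i++)
    {
      double x = lo + (hi - lo) * (double) i / (double) (n - 1);
      if (ulp_distance (candidate (x), reference (x)) != 0)
        off++;
    }
  return 1000.0 * (double) off / (double) n;
}
```

The quality of such a measurement depends entirely on the reference:
comparing against another double-precision routine only counts
disagreements, not errors.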

Each table entry takes 16 bytes, so a 256 entry table requires
4K bytes. A table that large was thought likely to hurt overall
performance by displacing other data in an exp()
heavy application.

Further optimization is possible in the handling of rounding
modes. Using get_rounding_mode and libc_fesetround() instead of
SET_RESTORE_ROUND provides a measurable gain for Sparc.
Unfortunately, on x86, one works with the SSE FP unit rounding mode while
the other works with the x87 FP unit rounding mode.  Adding libc_fegetround,
libc_fegetroundf and libc_fegetroundl to match libc_fesetround()
should not be too large a task, but it is outside the scope of this patch.
---
 manual/probes.texi                          |   14 -
 math/Makefile                               |    2 +-
 sysdeps/generic/math_private.h              |    1 -
 sysdeps/i386/fpu/slowexp.c                  |    1 -
 sysdeps/ia64/fpu/slowexp.c                  |    1 -
 sysdeps/ieee754/dbl-64/e_exp.c              |  341 +++++++++++++--------------
 sysdeps/ieee754/dbl-64/e_pow.c              |    2 +-
 sysdeps/ieee754/dbl-64/eexp.tbl             |  255 ++++++++++++++++++++
 sysdeps/ieee754/dbl-64/slowexp.c            |   86 -------
 sysdeps/m68k/m680x0/fpu/slowexp.c           |    1 -
 sysdeps/powerpc/power4/fpu/Makefile         |    1 -
 sysdeps/sparc/fpu/libm-test-ulps            |    2 +
 sysdeps/x86_64/fpu/libm-test-ulps           |    2 +
 sysdeps/x86_64/fpu/multiarch/Makefile       |    9 +-
 sysdeps/x86_64/fpu/multiarch/e_exp-avx.c    |    1 -
 sysdeps/x86_64/fpu/multiarch/e_exp-fma.c    |    1 -
 sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c   |    1 -
 sysdeps/x86_64/fpu/multiarch/slowexp-avx.c  |    9 -
 sysdeps/x86_64/fpu/multiarch/slowexp-fma.c  |    9 -
 sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c |    9 -
 20 files changed, 430 insertions(+), 318 deletions(-)
 delete mode 100644 sysdeps/i386/fpu/slowexp.c
 delete mode 100644 sysdeps/ia64/fpu/slowexp.c
 create mode 100644 sysdeps/ieee754/dbl-64/eexp.tbl
 delete mode 100644 sysdeps/ieee754/dbl-64/slowexp.c
 delete mode 100644 sysdeps/m68k/m680x0/fpu/slowexp.c
 delete mode 100644 sysdeps/x86_64/fpu/multiarch/slowexp-avx.c
 delete mode 100644 sysdeps/x86_64/fpu/multiarch/slowexp-fma.c
 delete mode 100644 sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c
