[00/23] Add remaining CORE-MATH binary32 implementations to libm

Message ID 20241129132032.476978-1-adhemerval.zanella@linaro.org (mailing list archive)
Headers
Series Add remaining CORE-MATH binary32 implementations to libm |

Message

Adhemerval Zanella Netto Nov. 29, 2024, 1:17 p.m. UTC
  This patchset adds the optimized and correctly rounded acosf, acohf,
asinf, asinhf, atanf, atan2f, atanhf, coshf, sinhf, and tanf from
CORE-MATH [1].  Each implementation has a benchmark to evaluate the
performance improvements.

In general, the results are pretty good with just some remarks:

* acosf/asinf hits hard the branch prediction on the range x < 0x1.c2a1dcp-1
  for acosf and x < 0x1.852p+126 for asinf.

  For acosf, this is required to make it correctly rounded on non-default
  rounding modes, and for asinf, it is fast-path optimization.

  The performance profiles are wildly different depending on the chip: I see
  regressions compared to glibc implementation on AMD Zen3 targeting
  x86-64/x86-64-v2, but I also see large improvements for x86-64-v3 and also
   on aarch64 Neoverse-N1 and powerpc POWER10.

  I think it should not be a blocker for integration.

* coshf performance is the only regression compared to the current glibc.
  This is mostly due to the benchmark used (which I modeled using CORE-MATH
  input range) showing that the glibc code hotspot is on expf, an optimized
  version from ARM-optimized routines.  

  Neither current expf nor coshf is correctly rounded, and the maximum error
  is 2ulps for FE_TONEAREST and 3ulp for another rounding.

  I am not sure if this would be a blocker, and I plan to remove the old SVID
  compat wrapper in subsequent patches (which should improve the function
  performance by about ~10%).

[1] https://gitlab.inria.fr/core-math/core-math

Adhemerval Zanella (23):
  benchtests: Add acosf benchmark
  benchtests: Add acoshf benchmark
  benchtests: Add asinf benchmark
  benchtests: Add asinhf benchmark
  benchtests: Add atanf benchmark
  benchtests: Add atan2f benchmark
  benchtests: Add atanhf benchmark
  benchtests: Add coshf benchmark
  benchtests: Add sinhf benchmark
  benchtests: Add tanhf benchmark
  math: Add inf support on gen-auto-libm-tests.c
  math: Fix the expected atanf (inf) results
  math: Fix the expected atan2f (inf) results
  math: Use acosf from CORE-MATH
  math: Use acoshf from CORE-MATH
  math: Use asinf from CORE-MATH
  math: Use asinhf from CORE-MATH
  math: Use atanf from CORE-MATH
  math: Use atan2f from CORE-MATH
  math: Use atanhf from CORE-MATH
  math: Use coshf from CORE-MATH
  math: Use sinhf from CORE-MATH
  math: Use tanhf from CORE-MATH

 SHARED-FILES                                  |   40 +
 benchtests/Makefile                           |   10 +
 benchtests/acosf-inputs                       | 2710 +++++++++++++++++
 benchtests/acoshf-inputs                      | 1005 ++++++
 benchtests/asinf-inputs                       | 2710 +++++++++++++++++
 benchtests/asinhf-inputs                      | 2005 ++++++++++++
 benchtests/atan2f-inputs                      | 2005 ++++++++++++
 benchtests/atanf-inputs                       | 2005 ++++++++++++
 benchtests/atanhf-inputs                      | 2005 ++++++++++++
 benchtests/coshf-inputs                       | 2005 ++++++++++++
 benchtests/sinhf-inputs                       | 2005 ++++++++++++
 benchtests/tanhf-inputs                       | 2005 ++++++++++++
 math/auto-libm-test-in                        |   52 +
 math/auto-libm-test-out-atan                  |   50 +
 math/auto-libm-test-out-atan2                 | 2316 ++++++++++++++
 math/gen-auto-libm-tests.c                    |   23 +-
 math/libm-test-atan.inc                       |    2 -
 math/libm-test-atan2.inc                      |   56 -
 sysdeps/aarch64/libm-test-ulps                |   44 +-
 sysdeps/alpha/fpu/libm-test-ulps              |   40 -
 sysdeps/arc/fpu/libm-test-ulps                |   40 -
 sysdeps/arc/nofpu/libm-test-ulps              |   10 -
 sysdeps/arm/libm-test-ulps                    |   48 +-
 sysdeps/csky/fpu/libm-test-ulps               |   40 -
 sysdeps/csky/nofpu/libm-test-ulps             |   40 -
 sysdeps/hppa/fpu/libm-test-ulps               |   40 -
 sysdeps/i386/fpu/e_acosf.S                    |   23 -
 sysdeps/i386/fpu/e_acoshf.S                   |  101 -
 sysdeps/i386/fpu/e_asinf.S                    |   38 -
 sysdeps/i386/fpu/e_atan2f.S                   |   30 -
 sysdeps/i386/fpu/e_atanhf.S                   |  110 -
 sysdeps/i386/fpu/libm-test-ulps               |   25 -
 sysdeps/i386/fpu/s_asinhf.S                   |  139 -
 sysdeps/i386/fpu/s_atanf.S                    |   30 -
 .../i386/i686/fpu/multiarch/libm-test-ulps    |   25 -
 sysdeps/ieee754/flt-32/e_acosf.c              |  191 +-
 sysdeps/ieee754/flt-32/e_acoshf.c             |  230 +-
 sysdeps/ieee754/flt-32/e_asinf.c              |  210 +-
 sysdeps/ieee754/flt-32/e_atan2f.c             |  337 +-
 sysdeps/ieee754/flt-32/e_atanhf.c             |  210 +-
 sysdeps/ieee754/flt-32/e_coshf.c              |  156 +-
 sysdeps/ieee754/flt-32/e_sinhf.c              |  169 +-
 sysdeps/ieee754/flt-32/s_asinhf.c             |  219 +-
 sysdeps/ieee754/flt-32/s_atanf.c              |  186 +-
 sysdeps/ieee754/flt-32/s_tanhf.c              |  131 +-
 sysdeps/loongarch/lp64/libm-test-ulps         |   44 +-
 sysdeps/m68k/coldfire/fpu/libm-test-ulps      |    2 -
 sysdeps/m68k/m680x0/fpu/libm-test-ulps        |   11 -
 sysdeps/microblaze/libm-test-ulps             |   10 -
 sysdeps/mips/mips32/libm-test-ulps            |   40 -
 sysdeps/mips/mips64/libm-test-ulps            |   40 -
 sysdeps/or1k/fpu/libm-test-ulps               |   40 -
 sysdeps/or1k/nofpu/libm-test-ulps             |   40 -
 sysdeps/powerpc/fpu/libm-test-ulps            |   44 +-
 sysdeps/powerpc/nofpu/libm-test-ulps          |   40 -
 sysdeps/riscv/nofpu/libm-test-ulps            |   40 -
 sysdeps/riscv/rvd/libm-test-ulps              |   40 -
 sysdeps/s390/fpu/libm-test-ulps               |   40 -
 sysdeps/sh/libm-test-ulps                     |   20 -
 sysdeps/sparc/fpu/libm-test-ulps              |   40 -
 sysdeps/x86_64/fpu/libm-test-ulps             |   46 +-
 61 files changed, 24381 insertions(+), 2027 deletions(-)
 create mode 100644 benchtests/acosf-inputs
 create mode 100644 benchtests/acoshf-inputs
 create mode 100644 benchtests/asinf-inputs
 create mode 100644 benchtests/asinhf-inputs
 create mode 100644 benchtests/atan2f-inputs
 create mode 100644 benchtests/atanf-inputs
 create mode 100644 benchtests/atanhf-inputs
 create mode 100644 benchtests/coshf-inputs
 create mode 100644 benchtests/sinhf-inputs
 create mode 100644 benchtests/tanhf-inputs
 delete mode 100644 sysdeps/i386/fpu/e_acosf.S
 delete mode 100644 sysdeps/i386/fpu/e_acoshf.S
 delete mode 100644 sysdeps/i386/fpu/e_asinf.S
 delete mode 100644 sysdeps/i386/fpu/e_atan2f.S
 delete mode 100644 sysdeps/i386/fpu/e_atanhf.S
 delete mode 100644 sysdeps/i386/fpu/s_asinhf.S
 delete mode 100644 sysdeps/i386/fpu/s_atanf.S
  

Comments

Florian Weimer Dec. 2, 2024, 2:39 p.m. UTC | #1
* Adhemerval Zanella:

> * acosf/asinf hits hard the branch prediction on the range x < 0x1.c2a1dcp-1
>   for acosf and x < 0x1.852p+126 for asinf.
>
>   For acosf, this is required to make it correctly rounded on non-default
>   rounding modes, and for asinf, it is fast-path optimization.
>
>   The performance profiles are wildly different depending on the chip: I see
>   regressions compared to glibc implementation on AMD Zen3 targeting
>   x86-64/x86-64-v2, but I also see large improvements for x86-64-v3 and also
>    on aarch64 Neoverse-N1 and powerpc POWER10.

We can add an FMA variant for acosf/asinf as an IFUNC.  This should
avoid the performance regression for x86-64 on most machines.

According to the commit message, that would also help with coshf for
x86-64?

Thanks,
Florian
  
DJ Delorie Dec. 16, 2024, 10:32 p.m. UTC | #2
I'm currently[*] seeing ULP failures for atan, atan2, asin, and acos
when passed PI-related constants, for example:

FAIL: math/test-double-acospi
FAIL: math/test-double-asinpi
FAIL: math/test-double-atan2pi
FAIL: math/test-double-atanpi
FAIL: math/test-float-acospi
FAIL: math/test-float-asinpi
FAIL: math/test-float-atan2pi
FAIL: math/test-float-atanpi
. . .

... on s390, ppc64le, and i686.

The results are all "saw 1 expected 0 ULPs".

Could these be related to this patch set?

[*] while syncing glibc to Fedora Rawhide, at least.  The i686 ci/cd
trybot is not seeing ULP problems.
  
Florian Weimer Dec. 17, 2024, 6:34 a.m. UTC | #3
* DJ Delorie:

> I'm currently[*] seeing ULP failures for atan, atan2, asin, and acos
> when passed PI-related constants, for example:
>
> FAIL: math/test-double-acospi
> FAIL: math/test-double-asinpi
> FAIL: math/test-double-atan2pi
> FAIL: math/test-double-atanpi
> FAIL: math/test-float-acospi
> FAIL: math/test-float-asinpi
> FAIL: math/test-float-atan2pi
> FAIL: math/test-float-atanpi
> . . .
>
> ... on s390, ppc64le, and i686.
>
> The results are all "saw 1 expected 0 ULPs".
>
> Could these be related to this patch set?
>
> [*] while syncing glibc to Fedora Rawhide, at least.  The i686 ci/cd
> trybot is not seeing ULP problems.

Yes, these functions are not expected to be correctly rounded, and as
they are newly added, many architectures do not have ulps bounds for
them yet.

Thanks,
Florian
  
Paul Zimmermann Dec. 17, 2024, 6:47 a.m. UTC | #4
> > I'm currently[*] seeing ULP failures for atan, atan2, asin, and acos
> > when passed PI-related constants, for example:
> >
> > FAIL: math/test-double-acospi
> > FAIL: math/test-double-asinpi
> > FAIL: math/test-double-atan2pi
> > FAIL: math/test-double-atanpi
> > FAIL: math/test-float-acospi
> > FAIL: math/test-float-asinpi
> > FAIL: math/test-float-atan2pi
> > FAIL: math/test-float-atanpi
> > . . .
> >
> > ... on s390, ppc64le, and i686.
> >
> > The results are all "saw 1 expected 0 ULPs".
> >
> > Could these be related to this patch set?
> >
> > [*] while syncing glibc to Fedora Rawhide, at least.  The i686 ci/cd
> > trybot is not seeing ULP problems.
> 
> Yes, these functions are not expected to be correctly rounded, and as
> they are newly added, many architectures do not have ulps bounds for
> them yet.

however, this is unrelated to the "Add remaining CORE-MATH binary32 implementations"
patchset, since these functions do not yet use the CORE-MATH implementations.
This will have to wait after the 2.41 release.

Paul
  
Joseph Myers Dec. 17, 2024, 6:12 p.m. UTC | #5
On Tue, 17 Dec 2024, Paul Zimmermann wrote:

> however, this is unrelated to the "Add remaining CORE-MATH binary32 implementations"
> patchset, since these functions do not yet use the CORE-MATH implementations.
> This will have to wait after the 2.41 release.

The release freeze hasn't started yet, though it's soon (soft freeze 21 
Dec, hard freeze 4 Jan - changing which implementation is used for a new 
function like this would probably be OK until we're in hard freeze, though 
it may be hard to get review in the next couple of weeks).
  
DJ Delorie Dec. 17, 2024, 6:48 p.m. UTC | #6
Florian Weimer <fweimer@redhat.com> writes:
> and as they are newly added,

Ah, that's the key.  Thanks!
  
DJ Delorie Dec. 17, 2024, 6:52 p.m. UTC | #7
Joseph Myers <josmyers@redhat.com> writes:

> On Tue, 17 Dec 2024, Paul Zimmermann wrote:
>
>> however, this is unrelated to the "Add remaining CORE-MATH binary32 implementations"
>> patchset, since these functions do not yet use the CORE-MATH implementations.
>> This will have to wait after the 2.41 release.
>
> The release freeze hasn't started yet, though it's soon (soft freeze 21 
> Dec, hard freeze 4 Jan - changing which implementation is used for a new 
> function like this would probably be OK until we're in hard freeze, though 
> it may be hard to get review in the next couple of weeks).

Given how long it takes to review these, and the upcoming holiday break
for many of us, I think 2.41 is much more attainable.