From patchwork Thu Aug 19 04:20:01 2021
X-Patchwork-Submitter: Paul Hua
X-Patchwork-Id: 44712
Date: Thu, 19 Aug 2021 12:20:01 +0800
Subject: [PATCH 13/14] [LoongArch] Add assembly optimized sinf cosf functions
To: libc-alpha@sourceware.org
From: Paul Hua
Reply-To: Paul Hua
Cc: Xu Chenghua, huangpei@loongson.cn, caiyinyu@loongson.cn

From 13e66b2d0be38c1be530c6c4df1a57f321f6e3d2 Mon Sep 17 00:00:00 2001
From: caiyinyu
Date: Tue, 27 Jul 2021 16:21:01 +0800
Subject: [PATCH 13/14] LoongArch: Add assembly optimized sinf cosf functions.

	* sysdeps/loongarch/lp64/s_cosf.S: New file.
	* sysdeps/loongarch/lp64/s_sinf.S: Likewise.
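[Editor's note: the medium-argument path in the files below (steps 5.1/5.2 of the
algorithm comment: reduce t = |x| - j*Pi/2, then pick the sin(t) or cos(t)
polynomial and a sign from the quadrant) can be sketched in Python. This is a
minimal model for illustration only, not the patch's code: the coefficients are
plain Taylor terms rather than the tuned DP_C/DP_S constants in the .S files,
and quadrant selection is written as a dictionary instead of the table-offset
arithmetic the assembly uses.]

```python
import math

# Taylor-series stand-ins for the S0..S4 / C0..C4 coefficients (the .S
# files use tuned constants whose values are not shown in this excerpt).
S = [-1.0 / 6, 1.0 / 120, -1.0 / 5040, 1.0 / 362880, -1.0 / 39916800]
C = [-1.0 / 2, 1.0 / 24, -1.0 / 720, 1.0 / 40320, -1.0 / 3628800]

def poly(t2, coef):
    """Horner evaluation of coef[0] + t2*coef[1] + t2^2*coef[2] + ..."""
    r = 0.0
    for c in reversed(coef):
        r = r * t2 + c
    return r

def cosf_medium(x):
    """cos(x) for Pi/4 <= |x| < 9*Pi/4, following steps 5.1/5.2."""
    ax = abs(x)                    # cos is even, so drop the sign
    k = int(ax / (math.pi / 4))    # k = trunc(|x|/(Pi/4))
    j = (k + 1) >> 1               # index of nearest multiple of Pi/2
    t = ax - j * (math.pi / 2)     # reduced argument, |t| <= ~Pi/4
    t2 = t * t
    sin_t = t * (1.0 + t2 * poly(t2, S))
    cos_t = 1.0 + t2 * poly(t2, C)
    # Reconstruction: cos(j*Pi/2 + t) selected by quadrant j mod 4.
    return {0: cos_t, 1: -sin_t, 2: -cos_t, 3: sin_t}[j & 3]
```

With degree-10/11 polynomials the omitted Taylor terms are below 1e-9 over this
range, far under single-precision ulp, so the sketch tracks math.cos closely.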
---
 sysdeps/loongarch/lp64/s_cosf.S | 472 ++++++++++++++++++++++++++++++++
 sysdeps/loongarch/lp64/s_sinf.S | 451 ++++++++++++++++++++++++++++++
 2 files changed, 923 insertions(+)
 create mode 100644 sysdeps/loongarch/lp64/s_cosf.S
 create mode 100644 sysdeps/loongarch/lp64/s_sinf.S

diff --git a/sysdeps/loongarch/lp64/s_cosf.S b/sysdeps/loongarch/lp64/s_cosf.S
new file mode 100644
index 0000000000..4207fd2563
--- /dev/null
+++ b/sysdeps/loongarch/lp64/s_cosf.S
@@ -0,0 +1,472 @@
+/* Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   Contributed by Loongson Technology Corporation Limited.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include
+#include
+#include
+
+/* Short algorithm description:
+ *
+ * 1) if |x|==0:    sin(x)=x,
+ *                  cos(x)=1.
+ * 2) if |x|<2^-27: sin(x)=x-x*DP_SMALL, raising underflow only when needed,
+ *                  cos(x)=1-|x|.
+ * 3) if |x|<2^-5 : sin(x)=x+x*x^2*DP_SIN2_0+x^5*DP_SIN2_1,
+ *                  cos(x)=1+1*x^2*DP_COS2_0+x^5*DP_COS2_1
+ * 4) if |x|< Pi/4: sin(x)=x+x*x^2*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))),
+ *                  cos(x)=1+1*x^2*(C0+x^2*(C1+x^2*(C2+x^2*(C3+x^2*C4)))).
+ * 5) if |x| < 9*Pi/4:
+ *    5.1) Range reduction:
+ *         k=trunc(|x|/(Pi/4)), j=(k+1)&0x0e, n=k+1, t=|x|-j*Pi/4.
+ *    5.2) Reconstruction:
+ *       sign_sin = sign(x) * (-1.0)^((n>>2)&1)
+ *       sign_cos = (-1.0)^(((n+2)>>2)&1)
+ *       poly_sin = ((((S4*t^2 + S3)*t^2 + S2)*t^2 + S1)*t^2 + S0)*t^2*t+t
+ *       poly_cos = ((((C4*t^2 + C3)*t^2 + C2)*t^2 + C1)*t^2 + C0)*t^2*s+s
+ *       if(n&2 != 0) {
+ *          using cos(t) and sin(t) polynomials for |t|<Pi/4
+ *       }
+ * 6) if |x| < 2^23, large args:
+ *    6.1) Range reduction:
+ *         k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1, t=|x|-j*Pi/4.
+ *    6.2) Reconstruction same as (5.2).
+ * 7) if |x| >= 2^23, very large args:
+ *    7.1) Range reduction:
+ *         k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1, t=|x|-j*Pi/4.
+ *    7.2) Reconstruction same as (5.2).
+ * 8) if x is Inf, return x-x, and set errno=EDOM.
+ * 9) if x is NaN, return x-x.
+ *
+ * Special cases:
+ *    sin/cos(+-0) = +-0/1 not raising inexact/underflow,
+ *    sin/cos(subnormal) raises inexact/underflow,
+ *    sin/cos(min_normalized) raises inexact/underflow,
+ *    sin/cos(normalized) raises inexact,
+ *    sin/cos(Inf) = NaN, raises invalid, sets errno to EDOM,
+ *    sin/cos(NaN) = NaN.
+ */
+
+#define COSF __cosf
+
+#define LOADFD(rd, rs, label) \
+	la.local rs, label;\
+	fld.d rd, rs, 0
+
+#define LOADFS(rd, rs, label) \
+	la.local rs, label;\
+	fld.s rd, rs, 0
+
+#define FTOL(rd, rs, tmp) \
+	ftintrz.l.d tmp, rs;\
+	movfr2gr.d rd, tmp
+
+#define FTOW(rd, rs, tmp) \
+	ftintrz.w.d tmp, rs;\
+	movfr2gr.s rd, tmp
+
+#define WTOF(rd, rs, tmp) \
+	movgr2fr.w tmp, rs;\
+	ffint.d.w rd, tmp
+
+#define LTOF(rd, rs, tmp) \
+	movgr2fr.d tmp, rs;\
+	ffint.d.l rd, tmp
+
+LEAF(COSF)
+	.align 2
+	.align 3
+
+	/* fa0 is SP x; fa1 is DP x */
+	movfr2gr.s t0, fa0		/* Bits of x */
+	fcvt.d.s fa1, fa0		/* DP x */
+	li.w t1, 0x7fffffff
+	and t0, t0, t1			/* |x| */
+	li.w t1, 0x3f490fdb		/* const Pi/4 */
+	bltu t0, t1, L(arg_less_pio4)	/* |x| < Pi/4 branch */
+	li.w t1, 0x40e231d6		/* 9*Pi/4 */
+	la.local t4, L(DP_)		/* DP_ base addr */
+
+	/* |x| >= 9*Pi/4 branch */
+	bgeu t0, t1, L(greater_or_equal_9pio4)
+
+/* L(median_args): */
+
+/* Here if Pi/4<=|x|<9*Pi/4 */
+	fabs.d fa0, fa1			/* DP |x| */
+	fld.d fa1, t4, 56		/* 4/Pi */
+	fmul.d fa1, fa1, fa0		/* DP |x|/(Pi/4) */
+	FTOW(t0, fa1, fa1)		/* k=trunc(|x|/(Pi/4)) */
+	la.local t1, L(PIO2J)		/* base addr of PIO2J table */
+	addi.w t0, t0, 1		/* k+1 */
+	bstrpick.d t2, t0, 3, 1		/* j=n/2 */
+	alsl.d t1, t2, t1, 3
+	fld.d fa1, t1, 0		/* j*Pi/2 */
+	addi.w t0, t0, 2		/* n = k+3 */
+	fsub.d fa0, fa0, fa1		/* t = |x| - j * Pi/2 */
+
+	/* Input: t0=n fa0=t */
+L(reduced):
+
+/* Here if cos(x) calculated using cos(t) polynomial for |t|<Pi/4:
+ * s = (-1.0)^(((n+2)>>2)&1)
+ * result = s * (1.0+t^2*(C0+t^2*(C1+t^2*(C2+t^2*(C3+t^2*C4)))))
+
+ * Here if cos(x) calculated using sin(t) polynomial for |t|<Pi/4:
+ * s = sign(x) * (-1.0)^((n>>2)&1)
+ * result = s * t * (1.0+t^2*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4)))))
+ */
+
+/* TODO: what is the best order ??? */
+
+/* load-to-use latency, hardware module usage, integer pipeline & float
+ * pipeline */
+
+	/* cancel branch */
+	slli.w t0, t0, 1		/* (n << 1) */
+	andi t1, t0, 4			/* (n << 1) & 4 */
+	alsl.d t2, t1, t4, 4		/* adjust to DP_C or DP_S */
+	fld.d fa3, t2, 32		/* C4 */
+	andi t0, t0, 8			/* =====> (n << 1) & 8 */
+	fmul.d fa1, fa0, fa0		/* y=x^2 */
+	fld.d fa4, t2, 16		/* C2 */
+	fmul.d fa2, fa1, fa1		/* z=x^4 */
+	fld.d fa5, t2, 24		/* C3 */
+	la.local t3, L(DP_ONES)		/* =====> DP_ONES */
+	fld.d fa6, t2, 8		/* C1 */
+	fmadd.d fa4, fa2, fa3, fa4	/* cx = C2+z*C4 */
+	fld.d fa3, t2, 0		/* C0 */
+	fmadd.d fa5, fa2, fa5, fa6	/* cy = C1+z*C3 */
+	fld.d fa6, t3, 0		/* one */
+	fmadd.d fa4, fa2, fa4, fa3	/* cx = C0+z*cx */
+	add.d t0, t0, t3		/* =====> addr */
+	fmadd.d fa4, fa1, fa5, fa4	/* cx = cx+y*cy */
+	fld.d fa2, t0, 0		/* sign */
+	fmadd.d fa4, fa4, fa1, fa6	/* 1.0+y*cx */
+	fmul.d fa1, fa2, fa4		/* sign * cx */
+	bnez t1, L_return
+
+	/* t*s, where s = sign(x) * (-1.0)^((n>>2)&1) */
+	fmul.d fa1, fa1, fa0
+L_return:
+	fcvt.s.d fa0, fa1		/* SP result */
+	jr ra
+
+L(greater_or_equal_9pio4):
+
+/* Here if |x|>=9*Pi/4 */
+	li.w t1, 0x7f800000		/* x is Inf or NaN?
*/ + bgeu t0, t1, L(inf_or_nan) /* |x| >= Inf branch */ + +/* Here if finite |x|>=9*Pi/4 */ + li.w t1, 0x4b000000 /* 2^23 */ + + /* |x| >= 2^23 branch */ + bgeu t0, t1, L(greater_or_equal_2p23) + +/* Here if 9*Pi/4<=|x|<2^23 */ + fabs.d fa0, fa1 /* DP |x| */ + fld.d fa1, t4, 56 + fmul.d fa1, fa1, fa0 /* |x|/(Pi/4) */ + FTOW(t0, fa1, fa1) /* k=trunc(|x|/(Pi/4)) */ + addi.w t0, t0, 1 /* k+1 */ + srli.w t1, t0, 1 /* x=n/2 */ + WTOF(fa1, t1, fa1) /* DP x */ + fld.d fa2, t4, 104 /* -PIO2HI = high part of -Pi/2 */ + fld.d fa3, t4, 112 /* -PIO2LO = low part of -Pi/2 */ + fmadd.d fa0, fa2, fa1, fa0 /* |x| - x*PIO2HI */ + addi.w t0, t0, 2 /* n = k+3 */ + fmadd.d fa0, fa3, fa1, fa0 /* |x| - x*PIO2HI - x*PIO2LO */ + b L(reduced) + +L(greater_or_equal_2p23): + +/* Here if finite |x|>=2^23 */ + fabs.s fa5, fa0 /* SP |x| */ + + /* bitpos = (ix>>23) - BIAS_32; */ + /* TODO???srai.w eb = biased exponent of x */ + srli.w t0, t0, 23 + + /* bitpos = eb - 0x7f + 59, where 0x7f is exponent bias */ + addi.w t0, t0, -124 /* t0 = bitpos */ + + /* t3= j = bitpos/28 */ + /* x/28 = (x * ((0x100000000 / 28) + 1)) >> 32 */ + li.w t1, 0x924924a + mulh.wu t0, t1, t0 + fcvt.d.s fa5, fa5 /* Convert to double */ + + /* TODO: what is the best order ??? 
*/ + la.local t1, L(invpio4_table) /* t2 */ + alsl.d t1, t0, t1, 3 + fld.d fa0, t1, 0 /* invpio4_table[j] */ + fld.d fa1, t1, 8 /* invpio4_table[j+1] */ + fmul.d fa0, fa0, fa5 /* a = invpio4_table[j]*|x| */ + fld.d fa2, t1, 16 /* invpio4_table[j+2] */ + fmul.d fa1, fa1, fa5 /* b = invpio4_table[j+1]*|x| */ + fld.d fa3, t1, 24 /* invpio4_table[j+3] */ + fmul.d fa2, fa2, fa5 /* c = invpio4_table[j+2]*|x| */ + fmul.d fa3, fa3, fa5 /* d = invpio4_table[j+3]*|x| */ + + /* TODO: overflow check */ + /* uint64_t l = a; TODO: change the order */ + + FTOL(t0, fa0, fa4) + li.w t1, -8 /* 0xfffffffffffffff8 */ + and t0, t0, t1 /* l &= ~0x7; */ + LTOF(fa4, t0, fa4) /* DP l */ + fsub.d fa0, fa0, fa4 /* a -= l; */ + fadd.d fa4, fa0, fa1 /* fa4 double e = a + b; */ + + /* TODO: overflow check */ + FTOL(t0, fa4, fa4) /* uint64_t l = e; */ + andi t2, t0, 1 /* l & 1 TODO: change the order */ + LOADFD(fa5, t1, L(DP_ONES)) /* fa5 = 1.0 */ + LTOF(fa4, t0, fa4) /* fa4 DP l */ + + /* critical!!!! the order */ + fsub.d fa0, fa0, fa4 + fld.d fa4, t4, 120 /* PI_4 */ + beqz t2, L_even_integer + +/* L_odd_integer: */ + fsub.d fa0, fa0, fa5 + fadd.d fa0, fa0, fa1 + fadd.d fa2, fa2, fa3 + fadd.d fa0, fa0, fa2 + addi.d t0, t0, 3 + fmul.d fa0, fa0, fa4 + b L(reduced) +L_even_integer: + fadd.d fa0, fa0, fa1 + fadd.d fa2, fa2, fa3 + fadd.d fa0, fa0, fa2 + fcmp.sle.d $fcc0, fa0, fa5 + addi.d t0, t0, 3 + bcnez $fcc0, L_leq_one + +/* L_gt_one: */ + fld.d fa2, t1, 16 /* 2.0 */ + addi.d t0, t0, 1 + fsub.d fa0, fa0, fa2 +L_leq_one: + fmul.d fa0, fa0, fa4 + b L(reduced) + +L(arg_less_pio4): + +/* Here if |x|. */ + +#include +#include +#include + +/* Short algorithm description: + * + * 1) if |x|==0: sin(x)=x, + * cos(x)=1. + * 2) if |x|<2^-27: sin(x)=x-x*DP_SMALL, raising underflow only when needed, + * cos(x)=1-|x|. 
+ * 3) if |x|<2^-5 : sin(x)=x+x*x^2*DP_SIN2_0+x^5*DP_SIN2_1,
+ *                  cos(x)=1+1*x^2*DP_COS2_0+x^5*DP_COS2_1
+ * 4) if |x|< Pi/4: sin(x)=x+x*x^2*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))),
+ *                  cos(x)=1+1*x^2*(C0+x^2*(C1+x^2*(C2+x^2*(C3+x^2*C4)))).
+ * 5) if |x| < 9*Pi/4:
+ *    5.1) Range reduction:
+ *         k=trunc(|x|/(Pi/4)), j=(k+1)&0x0e, n=k+1, t=|x|-j*Pi/4.
+ *    5.2) Reconstruction:
+ *       sign_sin = sign(x) * (-1.0)^((n>>2)&1)
+ *       sign_cos = (-1.0)^(((n+2)>>2)&1)
+ *       poly_sin = ((((S4*t^2 + S3)*t^2 + S2)*t^2 + S1)*t^2 + S0)*t^2*t+t
+ *       poly_cos = ((((C4*t^2 + C3)*t^2 + C2)*t^2 + C1)*t^2 + C0)*t^2*s+s
+ *       if(n&2 != 0) {
+ *          using cos(t) and sin(t) polynomials for |t|<Pi/4
+ *       }
+ * 6) if |x| < 2^23, large args:
+ *    6.1) Range reduction:
+ *         k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1, t=|x|-j*Pi/4.
+ *    6.2) Reconstruction same as (5.2).
+ * 7) if |x| >= 2^23, very large args:
+ *    7.1) Range reduction:
+ *         k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1, t=|x|-j*Pi/4.
+ *    7.2) Reconstruction same as (5.2).
+ * 8) if x is Inf, return x-x, and set errno=EDOM.
+ * 9) if x is NaN, return x-x.
+ *
+ * Special cases:
+ *    sin/cos(+-0) = +-0/1 not raising inexact/underflow,
+ *    sin/cos(subnormal) raises inexact/underflow,
+ *    sin/cos(min_normalized) raises inexact/underflow,
+ *    sin/cos(normalized) raises inexact,
+ *    sin/cos(Inf) = NaN, raises invalid, sets errno to EDOM,
+ *    sin/cos(NaN) = NaN.
+ */
+
+#define SINF __sinf
+
+#define LOADFD(rd, rs, label) \
+	la.local rs, label;\
+	fld.d rd, rs, 0
+
+#define LOADFS(rd, rs, label) \
+	la.local rs, label;\
+	fld.s rd, rs, 0
+
+#define FTOL(rd, rs, tmp) \
+	ftintrz.l.d tmp, rs;\
+	movfr2gr.d rd, tmp
+
+#define FTOW(rd, rs, tmp) \
+	ftintrz.w.d tmp, rs;\
+	movfr2gr.s rd, tmp
+
+#define WTOF(rd, rs, tmp) \
+	movgr2fr.w tmp, rs;\
+	ffint.d.w rd, tmp
+
+#define LTOF(rd, rs, tmp) \
+	movgr2fr.d tmp, rs;\
+	ffint.d.l rd, tmp
+
+LEAF(SINF)
+	.align 2
+	.align 3
+
+	/* fa0 is SP x; fa1 is DP x */
+	movfr2gr.s t2, fa0		/* Bits of x */
+	fcvt.d.s fa1, fa0		/* DP x */
+	li.w t1, 0x7fffffff
+	and t0, t2, t1			/* |x| */
+	li.w t1, 0x3f490fdb		/* const Pi/4 */
+	bltu t0, t1, L(arg_less_pio4)	/* |x| < Pi/4 branch */
+	li.w t1, 0x40e231d6		/* 9*Pi/4 */
+	la.local t4, L(DP_)		/* DP_ base addr */
+	bstrpick.d t5, t2, 31, 31	/* sign of x */
+	slli.w t5, t5, 3
+
+	/* |x| >= 9*Pi/4 branch */
+	bgeu t0, t1, L(greater_or_equal_9pio4)
+
+/* L(median_args): */
+/* Here if Pi/4<=|x|<9*Pi/4 */
+	fabs.d fa0, fa1			/* DP |x| */
+	fld.d fa1, t4, 56		/* 4/Pi */
+	fmul.d fa1, fa1, fa0		/* DP |x|/(Pi/4) */
+	FTOW(t0, fa1, fa1)		/* k=trunc(|x|/(Pi/4)) */
+	la.local t1, L(PIO2J)		/* base addr of PIO2J table */
+	addi.w t0, t0, 1		/* k+1 */
+	bstrpick.d t2, t0, 3, 1		/* j=n/2 */
+	alsl.d t1, t2, t1, 3
+	fld.d fa1, t1, 0		/* j*Pi/2 */
+	fsub.d fa0, fa0, fa1		/* t = |x| - j * Pi/2 */
+
+	/* Input: t0=n fa0=t, t5=sign(x) */
+L(reduced):
+
+/* Here if sin(x) calculated using cos(t) polynomial for |t|<Pi/4:
+ * s = sign(x) * (-1.0)^((n>>2)&1)
+ * result = s * (1.0+t^2*(C0+t^2*(C1+t^2*(C2+t^2*(C3+t^2*C4)))))
+ * Here if sin(x) calculated using sin(t) polynomial for |t|<Pi/4:
+ * s = sign(x) * (-1.0)^((n>>2)&1)
+ * result = s * t * (1.0+t^2*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4)))))
+ */
+
+/* TODO: what is the best order ???
*/ + +/* load-to-use latency, hardware module usage, integer pipeline & float +pipeline */ + + /* cancel branch */ + slli.w t0, t0, 1 /* (n << 1) */ + andi t1, t0, 4 /* (n << 1) & 4 */ + alsl.d t2, t1, t4, 4 /* adjust to DP_C or DP_S */ + fld.d fa3, t2, 32 /* C4 */ + andi t0, t0, 8 /* =====> (n << 1) & 8 */ + fmul.d fa1, fa0, fa0 /* y=x^2 */ + xor t0, t0, t5 /* (-1.0)^((n>>2)&1) XOR sign(x) */ + fld.d fa4, t2, 16 /* C2 */ + fmul.d fa2, fa1, fa1 /* z=x^4 */ + fld.d fa5, t2, 24 /* C3 */ + la.local t3, L(DP_ONES) /* =====> DP_ONES */ + fld.d fa6, t2, 8 /* C1 */ + fmadd.d fa4, fa2, fa3, fa4 /* cx = C2+z*C4 */ + fld.d fa3, t2, 0 /* C0 */ + fmadd.d fa5, fa2, fa5, fa6 /* cy = C1+z*C3 */ + fld.d fa6, t3, 0 /* 1.0 */ + fmadd.d fa4, fa2, fa4, fa3 /* cx = C0+z*cx */ + add.d t0, t0, t3 /* =====> addr */ + fmadd.d fa4, fa1, fa5, fa4 /* cx = cx+y*cy */ + fld.d fa2, t0, 0 /* sign */ + fmadd.d fa4, fa4, fa1, fa6 /* 1.0+y*cx */ + fmul.d fa1, fa2, fa4 /* sign * cx */ + bnez t1, L_return + + /* t*s, where s = sign(x) * (-1.0)^((n>>2)&1) */ + fmul.d fa1, fa1, fa0 + +L_return: + fcvt.s.d fa0, fa1 /* SP result */ + jr ra + +L(greater_or_equal_9pio4): + +/* Here if |x|>=9*Pi/4 */ + li.w t1, 0x7f800000 /* x is Inf or NaN? 
*/ + bgeu t0, t1, L(inf_or_nan) /* |x| >= Inf branch */ + +/* Here if finite |x|>=9*Pi/4 */ + li.w t1, 0x4b000000 /* 2^23 */ + + /* |x| >= 2^23 branch */ + bgeu t0, t1, L(greater_or_equal_2p23) + +/* Here if 9*Pi/4<=|x|<2^23 */ + fabs.d fa0, fa1 /* DP |x| */ + fld.d fa1, t4, 56 + fmul.d fa1, fa1, fa0 /* |x|/(Pi/4) */ + FTOW(t0, fa1, fa1) /* k=trunc(|x|/(Pi/4)) */ + addi.w t0, t0, 1 /* k+1 */ + srli.w t1, t0, 1 /* x=n/2 */ + WTOF(fa1, t1, fa1) /* DP x */ + fld.d fa2, t4, 104 /* -PIO2HI = high part of -Pi/2 */ + fld.d fa3, t4, 112 /* -PIO2LO = low part of -Pi/2 */ + fmadd.d fa0, fa2, fa1, fa0 /* |x| - x*PIO2HI */ + fmadd.d fa0, fa3, fa1, fa0 /* |x| - x*PIO2HI - x*PIO2LO */ + b L(reduced) + +L(greater_or_equal_2p23): + +/* Here if finite |x|>=2^23 */ + fabs.s fa5, fa0 /* SP |x| */ + + /* bitpos = (ix>>23) - BIAS_32; */ + /* TODO???srai.w eb = biased exponent of x */ + srli.w t0, t0, 23 + + /* bitpos = eb - 0x7f + 59, where 0x7f is exponent bias */ + addi.w t0, t0, -124 /* t0 = bitpos */ + + /* t3= j = bitpos/28 */ + /* x/28 = (x * ((0x100000000 / 28) + 1)) >> 32 */ + li.w t1, 0x924924a + mulh.wu t0, t1, t0 + fcvt.d.s fa5, fa5 /* Convert to double */ + + /* TODO: what is the best order ??? 
*/ + la.local t1, L(invpio4_table)/* t2 */ + alsl.d t1, t0, t1, 3 + fld.d fa0, t1, 0 /* invpio4_table[j] */ + fld.d fa1, t1, 8 /* invpio4_table[j+1] */ + fmul.d fa0, fa0, fa5 /* a = invpio4_table[j]*|x| */ + fld.d fa2, t1, 16 /* invpio4_table[j+2] */ + fmul.d fa1, fa1, fa5 /* b = invpio4_table[j+1]*|x| */ + fld.d fa3, t1, 24 /* invpio4_table[j+3] */ + fmul.d fa2, fa2, fa5 /* c = invpio4_table[j+2]*|x| */ + fmul.d fa3, fa3, fa5 /* d = invpio4_table[j+3]*|x| */ + + /* TODO: overflow check */ + /* uint64_t l = a; TODO: change the order */ + FTOL(t0, fa0, fa4) + li.w t1, -8 /* 0xfffffffffffffff8 */ + and t0, t0, t1 /* l &= ~0x7; */ + LTOF(fa4, t0, fa4) /* DP l*/ + fsub.d fa0, fa0, fa4 /* a -= l; */ + fadd.d fa4, fa0, fa1 /* fa4 double e = a + b; */ + + /* TODO: overflow check */ + FTOL(t0, fa4, fa4) /* uint64_t l = e */ + andi t2, t0, 1 /* l & 1 TODO: change the order */ + LOADFD(fa5, t1, L(DP_ONES)) /* fa5 = 1.0 */ + LTOF(fa4, t0, fa4) /* fa4 DP l */ + + /* critical!!!! the order */ + fsub.d fa0, fa0, fa4 + fld.d fa4, t4, 120 /* PI_4 */ + beqz t2, L_even_integer + +/* L_odd_integer: */ + fsub.d fa0, fa0, fa5 + fadd.d fa0, fa0, fa1 + fadd.d fa2, fa2, fa3 + fadd.d fa0, fa0, fa2 + addi.d t0, t0, 1 + fmul.d fa0, fa0, fa4 + b L(reduced) + +L_even_integer: + fadd.d fa0, fa0, fa1 + fadd.d fa2, fa2, fa3 + fadd.d fa0, fa0, fa2 + fcmp.sle.d $fcc0, fa0, fa5 + addi.d t0, t0, 1 + bcnez $fcc0, L_leq_one + +/* L_gt_one: */ + fld.d fa2, t1, 16 /* 2.0 */ + addi.d t0, t0, 1 + fsub.d fa0, fa0, fa2 +L_leq_one: + fmul.d fa0, fa0, fa4 + b L(reduced) + +L(arg_less_pio4): + +/* Here if |x|
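[Editor's note: both files' very-large-argument path (|x| >= 2^23) avoids an
integer divide when computing j = bitpos/28. As the inline comment states,
x/28 == (x * ((0x100000000/28)+1)) >> 32, which mulh.wu evaluates in one
instruction. The following sketch only checks that arithmetic identity; it is
not the assembly's code, and the bound chosen for the exhaustive check is an
assumption comfortably above the actual range of bitpos.]

```python
# Reciprocal-multiplication divide-by-28 from the 2^23 path:
# j = bitpos / 28 is computed as the high 32 bits of bitpos * 0x0924924A,
# where 0x0924924A == floor(2**32 / 28) + 1 (the li.w constant in the .S).
RECIP28 = 0x0924924A

def div28(x):
    # mulh.wu: high word of an unsigned 32x32-bit multiply
    return (x * RECIP28) >> 32

# Exhaustively verify small inputs; bitpos itself is at most a few
# hundred (it is derived from the 8-bit float exponent), well inside
# the range where this reciprocal is exact.
for x in range(1 << 16):
    assert div28(x) == x // 28
```

The "+1" in the constant rounds the reciprocal up so truncation of the high
word never undershoots; the residual error (2^32 mod 28 = 4) only matters for
inputs vastly larger than bitpos can ever be.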