From patchwork Tue Jun 13 07:17:06 2017
X-Patchwork-Submitter: Ashwin Sekhar T K
X-Patchwork-Id: 20978
From: Ashwin Sekhar T K
To: libc-alpha@sourceware.org
Cc: Ashwin Sekhar T K
Subject: [RFC][PATCH 1/2] aarch64: Add optimized ASIMD version of sinf
Date: Tue, 13 Jun 2017 00:17:06 -0700
Message-Id: <20170613071707.43396-2-ashwin.sekhar@caviumnetworks.com>
This patch adds the optimized ASIMD version of sinf for AArch64.  The
algorithm and code flow are based on the SSE version of the sinf
implementation for x86_64 in sysdeps/x86_64/fpu/s_sinf.S.

	* sysdeps/aarch64/fpu/multiarch/Makefile: New file.
	* sysdeps/aarch64/fpu/multiarch/s_sinf.c: Likewise.
	* sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S: Likewise.
---
 sysdeps/aarch64/fpu/multiarch/Makefile       |   3 +
 sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S | 382 +++++++++++++++++++++++++++
 sysdeps/aarch64/fpu/multiarch/s_sinf.c       |  31 +++
 3 files changed, 416 insertions(+)
 create mode 100644 sysdeps/aarch64/fpu/multiarch/Makefile
 create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S
 create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf.c

diff --git a/sysdeps/aarch64/fpu/multiarch/Makefile b/sysdeps/aarch64/fpu/multiarch/Makefile
new file mode 100644
index 0000000000..2092e9a885
--- /dev/null
+++ b/sysdeps/aarch64/fpu/multiarch/Makefile
@@ -0,0 +1,3 @@
+ifeq ($(subdir),math)
+libm-sysdep_routines += s_sinf-asimd
+endif
diff --git a/sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S b/sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S
new file mode 100644
index 0000000000..83f26c0c33
--- /dev/null
+++ b/sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S
@@ -0,0 +1,382 @@
+/* Optimized ASIMD version of sinf
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#define __need_Emath
+#include <bits/errno.h>
+
+/* Short algorithm description:
+ *
+ *  1) if |x| == 0: return x.
+ *  2) if |x| < 2^-27: return x-x*DP_SMALL, raise underflow only when needed.
+ *  3) if |x| < 2^-5 : return x+x^3*DP_SIN2_0+x^5*DP_SIN2_1.
+ *  4) if |x| < Pi/4: return x+x^3*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))).
+ *  5) if |x| < 9*Pi/4:
+ *      5.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0x0e, n=k+1,
+ *           t=|x|-j*Pi/4.
+ *      5.2) Reconstruction:
+ *          s = sign(x) * (-1.0)^((n>>2)&1)
+ *          if(n&2 != 0) {
+ *              using cos(t) polynomial for |t|<Pi/4, result is
+ *              s     * (1.0+t^2*(C0+t^2*(C1+t^2*(C2+t^2*(C3+t^2*C4))))).
+ *          } else {
+ *              using sin(t) polynomial for |t|<Pi/4, result is
+ *              s * t * (1.0+t^2*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4))))).
+ *          }
+ *  6) if |x| < 2^23, large args:
+ *      6.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1,
+ *           t=|x|-j*Pi/4.
+ *      6.2) Reconstruction same as (5.2).
+ *  7) if |x| >= 2^23, very large args:
+ *      7.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1,
+ *           t=|x|-j*Pi/4.
+ *      7.2) Reconstruction same as (5.2).
+ *  8) if x is Inf, return x-x, and set errno=EDOM.
+ *  9) if x is NaN, return x-x.
+ *
+ * Special cases:
+ *  sin(+-0) = +-0 not raising inexact/underflow,
+ *  sin(subnormal) raises inexact/underflow,
+ *  sin(min_normalized) raises inexact/underflow,
+ *  sin(normalized) raises inexact,
+ *  sin(Inf) = NaN, raises invalid, sets errno to EDOM,
+ *  sin(NaN) = NaN.
+ */
+
+sp_x		.req	s0	/* SP x */
+dp_x		.req	d1	/* DP x */
+sp_abs_x	.req	s2	/* SP |x| */
+dp_abs_x	.req	d3	/* DP |x| */
+t_val		.req	d4	/* DP t */
+dp_pio4		.req	d5	/* DP Pi/4 */
+dp_zero		.req	d6	/* DP 0.0 */
+dp_one		.req	d7	/* DP 1.0 */
+
+bits_abs_x	.req	w0	/* Bits of SP |x| */
+sign_x		.req	w1	/* Sign bit of x */
+n_val		.req	w2	/* n */
+bits_x		.req	w3	/* Bits of SP x */
+
+ENTRY_ALIGN(__sinf_asimd, 6)
+	/* Input: single precision x in s0 */
+	ldr	w9, L(SP_PIO4)			/* Pi/4 */
+	fmov	bits_x, sp_x			/* Bits of x */
+	ubfx	bits_abs_x, bits_x, #0, #31	/* Bits of |x| */
+	cmp	bits_abs_x, w9			/* |x|<Pi/4? */
+	blt	L(arg_less_pio4)
+
+	/* Here if |x|>=Pi/4 */
+	ldr	w10, L(SP_9PIO4)		/* 9*Pi/4 */
+	fmov	sp_abs_x, bits_abs_x		/* SP |x| */
+	cmp	bits_abs_x, w10			/* |x|>9*Pi/4? */
+	bge	L(large_args)
+
+	/* Here if Pi/4<=|x|<9*Pi/4 */
+	ldr	s16, L(SP_INVPIO4)		/* SP 1/(Pi/4) */
+	fcvt	dp_abs_x, sp_abs_x		/* DP |x| */
+	ldr	dp_pio4, L(DP_PIO4)
+	fmul	s16, sp_abs_x, s16		/* SP |x|/(Pi/4) */
+	fcvtzu	w12, s16			/* k=trunc(|x|/(Pi/4)) */
+	add	n_val, w12, #1			/* n=k+1 */
+	and	w12, n_val, #0x0e		/* j=n&0x0e */
+	ucvtf	d16, w12			/* DP j */
+	fmsub	t_val, d16, dp_pio4, dp_abs_x	/* t=|x|-j*Pi/4 */
+
+	.p2align 3
+L(reconstruction):
+	/* Input: w2=n, d4=t, w1=sign_x */
+	tst	n_val, #2			/* n&2? */
+	fmov	dp_one, #1.0			/* DP 1.0 */
+	adr	x14, L(DP_C)			/* Cos Poly Coefficients */
+	adr	x15, L(DP_S)			/* Sin Poly Coefficients */
+	fcsel	d17, t_val, dp_one, EQ		/* q=t or 1.0 */
+	lsr	sign_x, bits_x, #31		/* Sign bit of x */
+	eor	w9, sign_x, n_val, LSR #2	/* (n>>2) XOR sign(x) */
+	lsl	x9, x9, #63			/* sign_sin */
+	fmov	d20, x9				/* sign_sin */
+	csel	x14, x15, x14, EQ		/* K=Sin or Cos Coefficients */
+	eor	v17.8b, v17.8b, v20.8b		/* r=sign_sin XOR (1.0 or t) */
+
+	.p2align 3
+L(sin_cos_poly):
+	/*
+	 * Here if sin(x) is evaluated by sin(t)/cos(t) polynomial for |t|<Pi/4:
+	 * s = sign(x) * (-1.0)^((n>>2)&1)
+	 * result = r * (1.0+t^2*(K0+t^2*(K1+t^2*(K2+t^2*(K3+t^2*K4)))))
+	 * where r=s*t, Kx=Sx (Sin Polynomial Coefficients) if n&2==0
+	 *       r=s,   Kx=Cx (Cos Polynomial Coefficients) otherwise
+	 */
+	fmul	d18, t_val, t_val		/* y=t^2 */
+	fmul	d19, d18, d18			/* z=t^4 */
+	ldr	d21, [x14, #0*8]		/* K0 */
+	ldp	d22, d23, [x14, #1*8]		/* K1,K2 */
+	ldp	d24, d25, [x14, #3*8]		/* K3,K4 */
+	fmadd	d23, d25, d19, d23		/* K2+z*K4 */
+	fmadd	d22, d24, d19, d22		/* K1+z*K3 */
+	fmadd	d21, d23, d19, d21		/* K0+z*(K2+z*K4) */
+	fmul	d22, d22, d19			/* z*(K1+z*K3) */
+	/* y*(K0+y*(K1+y*(K2+y*(K3+y*K4)))) */
+	fmadd	d22, d21, d18, d22
+	fmadd	d22, d22, d17, d17		/* DP result */
+	fcvt	s0, d22				/* SP result */
+	ret
+
+	.p2align 3
+L(large_args):
+	/* Here if |x|>=9*Pi/4 */
+	mov	w8, #0x7f8			/* InfNan>>20 */
+	cmp	bits_abs_x, w8, LSL #20		/* x is Inf or NaN? */
+	bge	L(arg_inf_or_nan)
+
+	/* Here if finite |x|>=9*Pi/4 */
+	fcvt	dp_abs_x, sp_abs_x		/* DP |x| */
+	fmov	dp_one, #1.0			/* DP 1.0 */
+	mov	w11, #0x4b0			/* 2^23>>20 */
+	cmp	bits_abs_x, w11, LSL #20	/* |x|>=2^23? */
+	bge	L(very_large_args)
+
+	/* Here if 9*Pi/4<=|x|<2^23 */
+	adr	x14, L(DP_PIO4HILO)
+	ldr	d16, L(DP_INVPIO4)		/* 1/(Pi/4) */
+	ldp	d17, d18, [x14]			/* -PIO4HI,-PIO4LO */
+	fmadd	d16, dp_abs_x, d16, dp_one	/* |x|/(Pi/4)+1.0 */
+	fcvtzu	n_val, d16			/* n=trunc(|x|/(Pi/4)+1.0) */
+	and	w10, n_val, #0xfffffffe		/* j=n&0xfffffffe */
+	ucvtf	d16, w10			/* DP j */
+	fmadd	d17, d16, d17, dp_abs_x		/* |x|-j*PIO4HI */
+	fmadd	t_val, d16, d18, d17		/* t=|x|-j*PIO4HI-j*PIO4LO */
+	b	L(reconstruction)
+
+L(very_large_args):
+	/* Here if finite |x|>=2^23 */
+	movz	x11, #0x4330, LSL #48		/* 2^52 */
+	fmov	d21, x11			/* DP 2^52 */
+	ldr	dp_pio4, L(DP_PIO4)		/* Pi/4 */
+	fmov	dp_zero, xzr			/* 0.0 */
+	adr	x14, L(_FPI)
+	lsr	w8, bits_abs_x, #23		/* eb = biased exponent of x */
+	add	w8, w8, #-0x7f+59		/* bitpos=eb-BIAS_32+59 */
+	mov	w9, #28				/* =28 */
+	udiv	w10, w8, w9			/* j=bitpos/28 */
+	mov	x11, #0xffffffff00000000	/* DP_HI_MASK */
+	add	x14, x14, x10, LSL #3
+	ldr	d16, [x14, #-2*8]		/* FPI[j-2] */
+	ldr	d17, [x14, #-1*8]		/* FPI[j-1] */
+	ldr	q18, [x14]			/* FPI[j+1]|FPI[j] */
+	mul	w10, w10, w9			/* j*28 */
+	add	w10, w10, #19			/* j*28+19 */
+	fmov	d20, x11			/* DP_HI_MASK */
+	fmul	d16, dp_abs_x, d16		/* tmp3 */
+	fmul	d17, dp_abs_x, d17		/* tmp2 */
+	fmul	v18.2d, v18.2d, v3.d[0]		/* tmp1|tmp0 */
+	cmp	w8, w10				/* bitpos>=j*28+19? */
+	and	v20.8b, v16.8b, v20.8b		/* HI(tmp3) */
+	fcsel	d20, dp_zero, d20, LT		/* d=HI(tmp3) OR 0.0 */
+	fsub	d16, d16, d20			/* tmp3=tmp3-d */
+	fadd	d22, d16, d17			/* tmp5=tmp3+tmp2 */
+	fadd	d20, d22, d21			/* tmp6=tmp5+2^52 */
+	fsub	d21, d20, d21			/* tmp4=tmp6-2^52 */
+	faddp	d18, v18.2d			/* tmp7=tmp0+tmp1 */
+	fmov	w10, s20			/* k=I64_LO(tmp6) */
+	fcmp	d21, d22			/* tmp4>tmp5? */
+	cset	w9, GT				/* c=1 or 0 */
+	fcsel	d22, dp_one, dp_zero, GT	/* d=1.0 or 0.0 */
+	sub	w10, w10, w9			/* k-=c */
+	fsub	d21, d21, d22			/* tmp4-=d */
+	and	w11, w10, #1			/* k&1 */
+	ucvtf	d20, w11			/* DP k&1 */
+	fsub	d16, d16, d21			/* tmp3-=tmp4 */
+	fmsub	d20, d20, dp_one, d16		/* t=-1.0*[k&1]+tmp3 */
+	fadd	d20, d20, d17			/* t+=tmp2 */
+	add	n_val, w10, #1			/* n=k+1 */
+	fadd	d20, d20, d18			/* t+=tmp7 */
+	fmul	t_val, d20, dp_pio4		/* t*=PI/4 */
+	b	L(reconstruction)
+
+	.p2align 3
+L(arg_less_pio4):
+	/* Here if |x|<Pi/4 */
+	fcvt	dp_x, sp_x			/* DP x */
+	mov	w10, #0x3d0			/* 2^-5>>20 */
+	cmp	bits_abs_x, w10, LSL #20	/* |x|<2^-5? */
+	blt	L(arg_less_2pn5)
+
+	/* Here if 2^-5<=|x|<Pi/4 */
+	fcvt	t_val, sp_x			/* DP t=x */
+	adr	x14, L(DP_S)			/* Sin Poly Coefficients */
+	fmov	d17, t_val			/* r=x */
+	b	L(sin_cos_poly)
+
+L(arg_less_2pn5):
+	/* Here if |x|<2^-5 */
+	mov	w11, #0x320			/* 2^-27>>20 */
+	cmp	bits_abs_x, w11, LSL #20	/* |x|<2^-27? */
+	blt	L(arg_less_2pn27)
+
+	/* Here if 2^-27<=|x|<2^-5 */
+	adr	x14, L(DP_SIN2)
+	ldp	d16, d17, [x14]			/* DP SIN2_0,SIN2_1 */
+	fmul	d18, dp_x, dp_x			/* y=x^2 */
+	fmadd	d16, d17, d18, d16		/* DP SIN2_0+x^2*SIN2_1 */
+	fmul	d16, d16, d18			/* DP x^2*SIN2_0+x^4*SIN2_1 */
+	fmadd	d16, d16, dp_x, dp_x		/* DP result */
+	fcvt	s0, d16				/* SP result */
+	ret
+
+L(arg_less_2pn27):
+	/* Here if |x|<2^-27 */
+	cbnz	bits_abs_x, L(arg_not_zero)
+	/* Here if |x|==0 */
+	ret
+
+L(arg_not_zero):
+	/* Here if 0<|x|<2^-27 */
+	/*
+	 * Special cases here:
+	 *  sin(subnormal) raises inexact/underflow
+	 *  sin(min_normalized) raises inexact/underflow
+	 *  sin(normalized) raises inexact
+	 */
+	movz	x8, #0x3cd0, LSL #48		/* DP_SMALL=2^-50 */
+	fmov	d16, x8				/* DP_SMALL */
+	fmsub	dp_x, dp_x, d16, dp_x		/* Result is x-x*DP_SMALL */
+	fcvt	s0, dp_x
+	ret
+
+	.p2align 3
+L(arg_inf_or_nan):
+	/* Here if |x| is Inf or NAN */
+	bne	L(skip_errno_setting)		/* in case of x is NaN */
+
+	/* Here if x is Inf. Set errno to EDOM. */
+	adrp	x14, :gottprel:errno
+	ldr	PTR_REG(14), [x14, #:gottprel_lo12:errno]
+	mrs	x15, tpidr_el0
+	mov	w8, #EDOM			/* EDOM */
+	str	w8, [x15, x14]			/* Store EDOM in errno */
+
+L(skip_errno_setting):
+	/* Here if |x| is Inf or NAN. Continued. */
+	fsub	s0, sp_x, sp_x			/* Result is NaN */
+	ret
+
+END(__sinf_asimd)
+
+	.section .rodata, "a"
+	.p2align 3
+L(_FPI): /* 4/Pi broken into sum of positive DP values */
+	.long	0x00000000,0x00000000
+	.long	0x6c000000,0x3ff45f30
+	.long	0x2a000000,0x3e3c9c88
+	.long	0xa8000000,0x3c54fe13
+	.long	0xd0000000,0x3aaf47d4
+	.long	0x6c000000,0x38fbb81b
+	.long	0xe0000000,0x3714acc9
+	.long	0x7c000000,0x3560e410
+	.long	0x56000000,0x33bca2c7
+	.long	0xac000000,0x31fbd778
+	.long	0xe0000000,0x300b7246
+	.long	0xe8000000,0x2e5d2126
+	.long	0x48000000,0x2c970032
+	.long	0xe8000000,0x2ad77504
+	.long	0xe0000000,0x290921cf
+	.long	0xb0000000,0x274deb1c
+	.long	0xe0000000,0x25829a73
+	.long	0xbe000000,0x23fd1046
+	.long	0x10000000,0x2224baed
+	.long	0x8e000000,0x20709d33
+	.long	0x80000000,0x1e535a2f
+	.long	0x64000000,0x1cef904e
+	.long	0x30000000,0x1b0d6398
+	.long	0x24000000,0x1964ce7d
+	.long	0x16000000,0x17b908bf
+	.type L(_FPI), @object
+	ASM_SIZE_DIRECTIVE(L(_FPI))
+
+/* Coefficients of polynomial
+   for sin(x)~=x+x^3*DP_SIN2_0+x^5*DP_SIN2_1, |x|<2^-5.  */
+	.p2align 3
+L(DP_SIN2):
+	.long	0x5543d49d,0xbfc55555	/* DP_SIN2_0 */
+	.long	0x75cec8c5,0x3f8110f4	/* DP_SIN2_1 */
+	.type L(DP_SIN2), @object
+	ASM_SIZE_DIRECTIVE(L(DP_SIN2))
+
+/* Coefficients of polynomial
+   for sin(t)~=t+t^3*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4)))), |t|<Pi/4.  */
diff --git a/sysdeps/aarch64/fpu/multiarch/s_sinf.c b/sysdeps/aarch64/fpu/multiarch/s_sinf.c
new file mode 100644
--- /dev/null
+++ b/sysdeps/aarch64/fpu/multiarch/s_sinf.c
@@ -0,0 +1,31 @@
+/* Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <math.h>
+#include <init-arch.h>
+
+extern float __sinf_asimd (float);
+extern float __sinf_aarch64 (float);
+float __sinf (float);
+
+libm_ifunc (__sinf,
+	    (GLRO (dl_hwcap) & HWCAP_ASIMD) ? __sinf_asimd : __sinf_aarch64);
+weak_alias (__sinf, sinf);
+
+#define SINF __sinf_aarch64
+#include