From patchwork Tue Jun 13 07:17:07 2017
X-Patchwork-Submitter: Ashwin Sekhar T K
X-Patchwork-Id: 20979
From: Ashwin Sekhar T K
To: libc-alpha@sourceware.org
Cc: Ashwin Sekhar T K
Subject: [RFC][PATCH 2/2] aarch64: Add optimized ASIMD version of cosf
Date: Tue, 13 Jun 2017 00:17:07 -0700
Message-Id: <20170613071707.43396-3-ashwin.sekhar@caviumnetworks.com>

This patch adds the optimized ASIMD version of cosf for AArch64.  The
algorithm and code flow are based on the SSE version of the cosf
implementation for x86_64 in sysdeps/x86_64/fpu/s_cosf.S.

	* sysdeps/aarch64/fpu/multiarch/Makefile: Add s_cosf-asimd.
	* sysdeps/aarch64/fpu/multiarch/s_cosf.c: New file.
	* sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S: Likewise.
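As an aid to review, the range reduction and reconstruction described in the
algorithm comment of s_cosf-asimd.S (steps 5.1/5.2, the Pi/4 <= |x| < 9*Pi/4
case) correspond roughly to the C sketch below.  This is only an illustration
of the math, not code from the patch: the names cosf_sketch, poly, S and C are
made up here, and the polynomial coefficient values are omitted.

/* Illustrative only: C version of steps 5.1/5.2 for Pi/4 <= |x| < 9*Pi/4.
   S[] and C[] stand for the sin/cos polynomial coefficient tables used by
   the assembly; their values are omitted here.  */
#include <math.h>

static double
poly (const double *K, double t2)
{
  /* 1 + t^2*(K0 + t^2*(K1 + t^2*(K2 + t^2*(K3 + t^2*K4))))  */
  return 1.0 + t2 * (K[0] + t2 * (K[1] + t2 * (K[2] + t2 * (K[3] + t2 * K[4]))));
}

static float
cosf_sketch (float x, const double S[5], const double C[5])
{
  double ax = fabs ((double) x);
  unsigned int k = (unsigned int) (ax / (M_PI / 4.0)); /* k = trunc(|x|/(Pi/4)) */
  unsigned int j = (k + 1) & 0x0e;                     /* even octant index */
  unsigned int n = k + 3;
  double t = ax - j * (M_PI / 4.0);                    /* |t| < Pi/4 */
  double s = ((n >> 2) & 1) ? -1.0 : 1.0;              /* sign of the result */

  if (n & 2)
    return (float) (s * poly (C, t * t));       /* cos(t) polynomial */
  else
    return (float) (s * t * poly (S, t * t));   /* sin(t) polynomial */
}

The assembly folds the sign and the t-versus-1.0 factor into one fcsel/eor
pair so that both branches share a single polynomial evaluation at
L(sin_cos_poly).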
---
 sysdeps/aarch64/fpu/multiarch/Makefile       |   2 +-
 sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S | 367 +++++++++++++++++++++++++++
 sysdeps/aarch64/fpu/multiarch/s_cosf.c       |  31 +++
 3 files changed, 399 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S
 create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf.c

diff --git a/sysdeps/aarch64/fpu/multiarch/Makefile b/sysdeps/aarch64/fpu/multiarch/Makefile
index 2092e9a885..3711b71805 100644
--- a/sysdeps/aarch64/fpu/multiarch/Makefile
+++ b/sysdeps/aarch64/fpu/multiarch/Makefile
@@ -1,3 +1,3 @@
 ifeq ($(subdir),math)
-libm-sysdep_routines += s_sinf-asimd
+libm-sysdep_routines += s_sinf-asimd s_cosf-asimd
 endif
diff --git a/sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S b/sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S
new file mode 100644
index 0000000000..d052cfad7c
--- /dev/null
+++ b/sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S
@@ -0,0 +1,367 @@
+/* Optimized ASIMD version of cosf
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#define __need_Emath
+#include <bits/errno.h>
+
+/* Short algorithm description:
+ *
+ *  1) if |x| == 0: return 1.0-|x|.
+ *  2) if |x| <  2^-27: return 1.0-|x|.
+ *  3) if |x| <  2^-5 : return 1.0+x^2*DP_COS2_0+x^4*DP_COS2_1.
+ *  4) if |x| <  Pi/4: return 1.0+x^2*(C0+x^2*(C1+x^2*(C2+x^2*(C3+x^2*C4)))).
+ *  5) if |x| < 9*Pi/4:
+ *      5.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0x0e, n=k+3,
+ *           t=|x|-j*Pi/4.
+ *      5.2) Reconstruction:
+ *          s = (-1.0)^((n>>2)&1)
+ *          if(n&2 != 0) {
+ *              using cos(t) polynomial for |t|<Pi/4, result is
+ *              s*(1.0+t^2*(C0+t^2*(C1+t^2*(C2+t^2*(C3+t^2*C4))))).
+ *          } else {
+ *              using sin(t) polynomial for |t|<Pi/4, result is
+ *              s*t*(1.0+t^2*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4))))).
+ *          }
+ *  6) if |x| < 2^23, large args:
+ *      6.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+3,
+ *           t=|x|-j*Pi/4.
+ *      6.2) Reconstruction same as (5.2).
+ *  7) if |x| >= 2^23, very large args:
+ *      7.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+3,
+ *           t=|x|-j*Pi/4.
+ *      7.2) Reconstruction same as (5.2).
+ *  8) if x is Inf, return x-x, and set errno=EDOM.
+ *  9) if x is NaN, return x-x.
+ *
+ * Special cases:
+ *  cos(+-0) = 1 not raising inexact,
+ *  cos(subnormal) raises inexact,
+ *  cos(min_normalized) raises inexact,
+ *  cos(normalized) raises inexact,
+ *  cos(Inf) = NaN, raises invalid, sets errno to EDOM,
+ *  cos(NaN) = NaN.
+ */
+
+sp_x		.req	s0	/* SP x */
+dp_x		.req	d1	/* DP x */
+sp_abs_x	.req	s2	/* SP |x| */
+dp_abs_x	.req	d3	/* DP |x| */
+t_val		.req	d4	/* DP t */
+dp_pio4		.req	d5	/* DP Pi/4 */
+dp_zero		.req	d6	/* DP 0.0 */
+dp_one		.req	d7	/* DP 1.0 */
+sp_one		.req	s7	/* SP 1.0 */
+
+bits_abs_x	.req	w0	/* Bits of SP |x| */
+n_val		.req	w1	/* n */
+bits_x		.req	w2	/* Bits of SP x */
+
+ENTRY_ALIGN(__cosf_asimd, 6)
+	/* Input: single precision x in s0 */
+	ldr	w9, L(SP_PIO4)			/* Pi/4 */
+	fmov	dp_one, #1.0			/* DP 1.0 */
+	fmov	bits_x, sp_x			/* Bits of x */
+	ubfx	bits_abs_x, bits_x, #0, #31	/* Bits of |x| */
+	cmp	bits_abs_x, w9			/* |x|=Pi/4 */
+	ldr	w10, L(SP_9PIO4)		/* 9*Pi/4 */
+	cmp	bits_abs_x, w10			/* |x|>9*Pi/4? */
+	bge	L(large_args)
+
+	/* Here if Pi/4<=|x|<9*Pi/4 */
+	ldr	s16, L(SP_INVPIO4)		/* SP 1/(Pi/4) */
+	fcvt	dp_abs_x, sp_abs_x		/* DP |x| */
+	ldr	dp_pio4, L(DP_PIO4)
+	fmul	s16, sp_abs_x, s16		/* SP |x|/(Pi/4) */
+	fcvtzu	w12, s16			/* k=trunc(|x|/(Pi/4)) */
+	add	w12, w12, #1			/* k+1 */
+	and	w12, w12, #0x0e			/* j=(k+1)&0x0e */
+	add	n_val, w12, #2			/* n=k+3 */
+	ucvtf	d16, w12			/* DP j */
+	fmsub	t_val, d16, dp_pio4, dp_abs_x	/* t=|x|-j*Pi/4 */
+
+	.p2align 3
+L(reconstruction):
+	/* Input: w1=n, d4=t */
+	tst	n_val, #2			/* n&2? */
+	adr	x14, L(DP_C)			/* Cos Poly Coefficients */
+	adr	x15, L(DP_S)			/* Sin Poly Coefficients */
+	fcsel	d17, t_val, dp_one, EQ		/* q=t or 1.0 */
+	lsr	w9, n_val, #2			/* (n>>2) */
+	lsl	x9, x9, #63			/* sign_cos=(n>>2)<<63 */
+	fmov	d20, x9				/* sign_cos */
+	csel	x14, x15, x14, EQ		/* K=Sin or Cos Coefficients */
+	eor	v17.8b, v17.8b, v20.8b		/* r=sign_cos XOR (1.0 or t) */
+
+	.p2align 3
+L(sin_cos_poly):
+	/*
+	 * Here if cos(x) is evaluated by sin(t)/cos(t) polynomial for |t|<Pi/4:
+	 * s = (-1.0)^((n>>2)&1)
+	 * result = r * (1.0+t^2*(K0+t^2*(K1+t^2*(K2+t^2*(K3+t^2*K4)))))
+	 * where r=s*t, Kx=Sx (Sin Polynomial Coefficients) if n&2==0
+	 *       r=s,   Kx=Cx (Cos Polynomial Coefficients) otherwise
+	 */
+	fmul	d18, t_val, t_val		/* y=t^2 */
+	fmul	d19, d18, d18			/* z=t^4 */
+	ldr	d21, [x14, #0*8]		/* K0 */
+	ldp	d22, d23, [x14, #1*8]		/* K1,K2 */
+	ldp	d24, d25, [x14, #3*8]		/* K3,K4 */
+	fmadd	d23, d25, d19, d23		/* K2+z*K4 */
+	fmadd	d22, d24, d19, d22		/* K1+z*K3 */
+	fmadd	d21, d23, d19, d21		/* K0+z*(K2+z*K4) */
+	fmul	d22, d22, d19			/* z*(K1+z*K3) */
+	/* y*(K0+y*(K1+y*(K2+y*(K3+y*K4)))) */
+	fmadd	d22, d21, d18, d22
+	fmadd	d22, d22, d17, d17		/* DP result */
+	fcvt	s0, d22				/* SP result */
+	ret
+
+	.p2align 3
+L(large_args):
+	/* Here if |x|>=9*Pi/4 */
+	mov	w8, #0x7f8			/* InfNan>>20 */
+	cmp	bits_abs_x, w8, LSL #20		/* x is Inf or NaN? */
+	bge	L(arg_inf_or_nan)
+
+	/* Here if finite |x|>=9*Pi/4 */
+	fcvt	dp_abs_x, sp_abs_x		/* DP |x| */
+	mov	w11, #0x4b0			/* 2^23>>20 */
+	cmp	bits_abs_x, w11, LSL #20	/* |x|>=2^23? */
+	bge	L(very_large_args)
+
+	/* Here if 9*Pi/4<=|x|<2^23 */
+	adr	x14, L(DP_PIO4HILO)
+	ldr	d16, L(DP_INVPIO4)		/* 1/(Pi/4) */
+	ldp	d17, d18, [x14]			/* -PIO4HI,-PIO4LO */
+	fmadd	d16, dp_abs_x, d16, dp_one	/* |x|/(Pi/4)+1.0 */
+	fcvtzu	w10, d16			/* k+1=trunc(|x|/(Pi/4)+1.0) */
+	and	w10, w10, #0xfffffffe		/* j=(k+1)&0xfffffffe */
+	add	n_val, w10, #2			/* n=k+3 */
+	ucvtf	d16, w10			/* DP j */
+	fmadd	d17, d16, d17, dp_abs_x		/* |x|-j*PIO4HI */
+	fmadd	t_val, d16, d18, d17		/* t=|x|-j*PIO4HI-j*PIO4LO */
+	b	L(reconstruction)
+
+L(very_large_args):
+	/* Here if finite |x|>=2^23 */
+	movz	x11, #0x4330, LSL #48		/* 2^52 */
+	fmov	d21, x11			/* DP 2^52 */
+	ldr	dp_pio4, L(DP_PIO4)		/* Pi/4 */
+	fmov	dp_zero, xzr			/* 0.0 */
+	adr	x14, L(_FPI)
+	lsr	w8, bits_abs_x, #23		/* eb = biased exponent of x */
+	add	w8, w8, #-0x7f+59		/* bitpos=eb-BIAS_32+59 */
+	mov	w9, #28				/* =28 */
+	udiv	w10, w8, w9			/* j=bitpos/28 */
+	mov	x11, #0xffffffff00000000	/* DP_HI_MASK */
+	add	x14, x14, x10, LSL #3
+	ldr	d16, [x14, #-2*8]		/* FPI[j-2] */
+	ldr	d17, [x14, #-1*8]		/* FPI[j-1] */
+	ldr	q18, [x14]			/* FPI[j+1]|FPI[j] */
+	mul	w10, w10, w9			/* j*28 */
+	add	w10, w10, #19			/* j*28+19 */
+	fmov	d20, x11			/* DP_HI_MASK */
+	fmul	d16, dp_abs_x, d16		/* tmp3 */
+	fmul	d17, dp_abs_x, d17		/* tmp2 */
+	fmul	v18.2d, v18.2d, v3.d[0]		/* tmp1|tmp0 */
+	cmp	w8, w10				/* bitpos>=j*28+19? */
+	and	v20.8b, v16.8b, v20.8b		/* HI(tmp3) */
+	fcsel	d20, dp_zero, d20, LT		/* d=0.0 OR HI(tmp3) */
+	fsub	d16, d16, d20			/* tmp3=tmp3-d */
+	fadd	d22, d16, d17			/* tmp5=tmp3+tmp2 */
+	fadd	d20, d22, d21			/* tmp6=tmp5+2^52 */
+	fsub	d21, d20, d21			/* tmp4=tmp6-2^52 */
+	faddp	d18, v18.2d			/* tmp7=tmp0+tmp1 */
+	fmov	w10, s20			/* k=I64_LO(tmp6) */
+	fcmp	d21, d22			/* tmp4>tmp5? */
+	cset	w9, GT				/* c=1 or 0 */
+	fcsel	d22, dp_one, dp_zero, GT	/* d=1.0 or 0.0 */
+	sub	w10, w10, w9			/* k-=c */
+	fsub	d21, d21, d22			/* tmp4-=d */
+	and	w11, w10, #1			/* k&1 */
+	ucvtf	d20, w11			/* DP k&1 */
+	fsub	d16, d16, d21			/* tmp3-=tmp4 */
+	fmsub	d20, d20, dp_one, d16		/* t=-1.0*[k&1]+tmp3 */
+	fadd	d20, d20, d17			/* t+=tmp2 */
+	add	n_val, w10, #3			/* n=k+3 */
+	fadd	d20, d20, d18			/* t+=tmp7 */
+	fmul	t_val, d20, dp_pio4		/* t*=PI/4 */
+	b	L(reconstruction)
+
+	.p2align 3
+L(arg_less_pio4):
+	/* Here if |x|>20 */
+	cmp	bits_abs_x, w10, LSL #20	/* |x|<2^-5? */
+	blt	L(arg_less_2pn5)
+
+	/* Here if 2^-5<=|x|>20 */
+	cmp	bits_abs_x, w11, LSL #20	/* |x|<2^-27? */
+	blt	L(arg_less_2pn27)
+
+	/* Here if 2^-27<=|x|<2^-5 */
+	adr	x14, L(DP_COS2)
+	ldp	d16, d17, [x14]			/* DP COS2_0,COS2_1 */
+	fmul	d18, dp_x, dp_x			/* y=x^2 */
+	fmadd	d16, d17, d18, d16		/* DP COS2_0+x^2*COS2_1 */
+	fmadd	d16, d16, d18, dp_one		/* DP 1+x^2*COS2_0+x^4*COS2_1 */
+	fcvt	s0, d16				/* SP result */
+	ret
+
+L(arg_less_2pn27):
+	/* Here if |x|<2^-27 */
+	fmov	sp_one, #1.0			/* 1.0 */
+	fsub	s0, sp_one, sp_abs_x		/* result is 1.0-|x| */
+	ret
+
+	.p2align 3
+L(arg_inf_or_nan):
+	/* Here if |x| is Inf or NAN */
+	bne	L(skip_errno_setting)		/* in case of x is NaN */
+
+	/* Here if x is Inf. Set errno to EDOM. */
+	adrp	x14, :gottprel:errno
+	ldr	PTR_REG(14), [x14, #:gottprel_lo12:errno]
+	mrs	x15, tpidr_el0
+	mov	w8, #EDOM			/* EDOM */
+	str	w8, [x15, x14]			/* Store EDOM in errno */
+
+L(skip_errno_setting):
+	/* Here if |x| is Inf or NAN. Continued. */
+	fsub	s0, sp_x, sp_x			/* Result is NaN */
+	ret
+
+END(__cosf_asimd)
+
+	.section .rodata, "a"
+	.p2align 3
+L(_FPI): /* 4/Pi broken into sum of positive DP values */
+	.long	0x00000000,0x00000000
+	.long	0x6c000000,0x3ff45f30
+	.long	0x2a000000,0x3e3c9c88
+	.long	0xa8000000,0x3c54fe13
+	.long	0xd0000000,0x3aaf47d4
+	.long	0x6c000000,0x38fbb81b
+	.long	0xe0000000,0x3714acc9
+	.long	0x7c000000,0x3560e410
+	.long	0x56000000,0x33bca2c7
+	.long	0xac000000,0x31fbd778
+	.long	0xe0000000,0x300b7246
+	.long	0xe8000000,0x2e5d2126
+	.long	0x48000000,0x2c970032
+	.long	0xe8000000,0x2ad77504
+	.long	0xe0000000,0x290921cf
+	.long	0xb0000000,0x274deb1c
+	.long	0xe0000000,0x25829a73
+	.long	0xbe000000,0x23fd1046
+	.long	0x10000000,0x2224baed
+	.long	0x8e000000,0x20709d33
+	.long	0x80000000,0x1e535a2f
+	.long	0x64000000,0x1cef904e
+	.long	0x30000000,0x1b0d6398
+	.long	0x24000000,0x1964ce7d
+	.long	0x16000000,0x17b908bf
+	.type L(_FPI), @object
+	ASM_SIZE_DIRECTIVE(L(_FPI))
+
+/* Coefficients of polynomial
+   for cos(x)~=1.0+x^2*DP_COS2_0+x^4*DP_COS2_1, |x|<2^-5. */
+	.p2align 3
+L(DP_COS2):
+	.long	0xff5cc6fd,0xbfdfffff	/* DP_COS2_0 */
+	.long	0xb178dac5,0x3fa55514	/* DP_COS2_1 */
+	.type L(DP_COS2), @object
+	ASM_SIZE_DIRECTIVE(L(DP_COS2))
+
+/* Coefficients of polynomial
+   for sin(t)~=t+t^3*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4)))), |t|<Pi/4. */
diff --git a/sysdeps/aarch64/fpu/multiarch/s_cosf.c b/sysdeps/aarch64/fpu/multiarch/s_cosf.c
new file mode 100644
--- /dev/null
+++ b/sysdeps/aarch64/fpu/multiarch/s_cosf.c
+   <http://www.gnu.org/licenses/>.  */
+
+#include
+#include
+
+extern float __cosf_asimd (float);
+extern float __cosf_aarch64 (float);
+float __cosf (float);
+
+libm_ifunc (__cosf,
+	    (GLRO(dl_hwcap) & HWCAP_ASIMD) ? __cosf_asimd : __cosf_aarch64);
+weak_alias (__cosf, cosf);
+
+#define COSF __cosf_aarch64
+#include <sysdeps/ieee754/flt-32/s_cosf.c>
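A note on the dispatch in s_cosf.c above: libm_ifunc makes __cosf a GNU IFUNC,
so the choice between __cosf_asimd and the generic __cosf_aarch64 is made once,
at symbol resolution time, from the ASIMD hardware-capability bit.  The
stand-alone sketch below re-creates the same selection logic with getauxval;
the my_cosf_* names are illustrative only, and inside glibc the resolver reads
GLRO(dl_hwcap) directly rather than calling getauxval.

/* Stand-alone illustration of HWCAP_ASIMD-based selection.  In glibc the
   IFUNC resolver generated by libm_ifunc performs the equivalent test once
   at relocation time; this sketch re-creates it with getauxval for clarity.  */
#include <math.h>
#include <stdio.h>
#include <sys/auxv.h>		/* getauxval, AT_HWCAP */

#ifndef HWCAP_ASIMD
# define HWCAP_ASIMD (1 << 1)	/* AArch64 Advanced SIMD hwcap bit */
#endif

/* Stand-ins for the two real implementations.  */
static float my_cosf_asimd (float x)   { return cosf (x); }
static float my_cosf_generic (float x) { return cosf (x); }

static float (*my_cosf_impl) (float);

static void
my_cosf_select (void)
{
  unsigned long hwcap = getauxval (AT_HWCAP);
  my_cosf_impl = (hwcap & HWCAP_ASIMD) ? my_cosf_asimd : my_cosf_generic;
}

int
main (void)
{
  my_cosf_select ();	/* an IFUNC resolver does this step during relocation */
  printf ("cos(1.0f) = %f (%s variant)\n", my_cosf_impl (1.0f),
	  my_cosf_impl == my_cosf_asimd ? "asimd" : "generic");
  return 0;
}

The sketch links against libm (-lm) only because the stand-ins call cosf; the
HWCAP_ASIMD fallback definition is there solely so it also builds on systems
whose headers do not define that bit.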