From patchwork Wed Dec 7 08:52:21 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 61634
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, andrey.kolesov@intel.com,
 carlos@systemhalted.org
Subject: [PATCH v1 12/27] x86/fpu: Optimize svml_s_atanf4_core_sse4.S
Date: Wed, 7 Dec 2022 00:52:21 -0800
Message-Id: <20221207085236.1424424-12-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221207085236.1424424-1-goldstein.w.n@gmail.com>
References: <20221207085236.1424424-1-goldstein.w.n@gmail.com>
List-Id: Libc-alpha mailing list
X-Patchwork-Original-From: Noah Goldstein via Libc-alpha
From: Noah Goldstein
Reply-To: Noah Goldstein

1. Clean up some missed optimizations in instruction selection /
   unnecessary repeated rodata references.
2. Remove unused rodata.
3. Use common data definitions where possible.

Code Size Change: -31 Bytes (173 - 204)

Input                                 New Time / Old Time
0F          (0x00000000)           -> 0.9446
0F          (0x0000ffff, Denorm)   -> 0.9977
.1F         (0x3dcccccd)           -> 0.9380
5F          (0x40a00000)           -> 0.9542
2315255808F (0x4f0a0000)           -> 1.0115
-NaN        (0xffffffff)           -> 0.9232
---
 .../fpu/multiarch/svml_s_atanf4_core_sse4.S   | 198 +++++++-----------
 1 file changed, 75 insertions(+), 123 deletions(-)

diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S
index 83cecb8ee5..2ab599f7a8 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S
@@ -28,136 +28,88 @@
  *
  */
 
-/* Offsets for data table __svml_satan_data_internal
- */
-#define _sSIGN_MASK	0
-#define _sABS_MASK	16
-#define _sONE	32
-#define _sPIO2	48
-#define _sPC8	64
-#define _sPC7	80
-#define _sPC6	96
-#define _sPC5	112
-#define _sPC4	128
-#define _sPC3	144
-#define _sPC2	160
-#define _sPC1	176
-#define _sPC0	192
+#define LOCAL_DATA_NAME	__svml_satan_data_internal
+#include "svml_s_common_sse4_rodata_offsets.h"
+/* Offsets for data table __svml_satan_data_internal.  */
+#define _SignMask	0
+#define _sPIO2	16
+#define _sPC7	32
+#define _sPC5	48
+#define _sPC3	64
+#define _sPC1	80
+#define _sPC8	96
+#define _sPC6	112
+#define _sPC4	128
+#define _sPC2	144
+#define _sPC0	160
 
 #include <sysdep.h>
 
 	.section .text.sse4, "ax", @progbits
 ENTRY(_ZGVbN4v_atanf_sse4)
-	/*
-	 * To use minps\maxps operations for argument reduction
-	 * uncomment _AT_USEMINMAX_ definition
-	 * Declarations
-	 * Variables
-	 * Constants
-	 */
-	movups	_sABS_MASK+__svml_satan_data_internal(%rip), %xmm2
-
-	/*
-	 * 1) If x>1, then r=-1/x, PIO2=Pi/2
-	 * 2) If -1<=x<=1, then r=x, PIO2=0
-	 * 3) If x<-1, then r=-1/x, PIO2=-Pi/2
-	 */
-	movups	_sONE+__svml_satan_data_internal(%rip), %xmm1
-	andps	%xmm0, %xmm2
-	movaps	%xmm2, %xmm9
-	movaps	%xmm1, %xmm3
-	cmpleps	%xmm1, %xmm9
-	maxps	%xmm2, %xmm3
-	minps	%xmm2, %xmm1
-	divps	%xmm3, %xmm1
-	movups	__svml_satan_data_internal(%rip), %xmm4
-	movaps	%xmm9, %xmm10
-	andps	%xmm4, %xmm0
-	andnps	%xmm4, %xmm9
-	pxor	%xmm0, %xmm9
-	pxor	%xmm1, %xmm9
-
-	/* Polynomial.  */
-	movaps	%xmm9, %xmm8
-	mulps	%xmm9, %xmm8
-	movaps	%xmm8, %xmm7
-	mulps	%xmm8, %xmm7
-	movups	_sPC8+__svml_satan_data_internal(%rip), %xmm6
-	mulps	%xmm7, %xmm6
-	movups	_sPC7+__svml_satan_data_internal(%rip), %xmm5
-	mulps	%xmm7, %xmm5
-	addps	_sPC6+__svml_satan_data_internal(%rip), %xmm6
-	mulps	%xmm7, %xmm6
-	addps	_sPC5+__svml_satan_data_internal(%rip), %xmm5
-	mulps	%xmm7, %xmm5
-	addps	_sPC4+__svml_satan_data_internal(%rip), %xmm6
-	mulps	%xmm7, %xmm6
-	addps	_sPC3+__svml_satan_data_internal(%rip), %xmm5
-	mulps	%xmm5, %xmm7
-	addps	_sPC2+__svml_satan_data_internal(%rip), %xmm6
-	mulps	%xmm8, %xmm6
-	addps	_sPC1+__svml_satan_data_internal(%rip), %xmm7
-	andnps	_sPIO2+__svml_satan_data_internal(%rip), %xmm10
-	addps	%xmm6, %xmm7
-	mulps	%xmm7, %xmm8
-	pxor	%xmm0, %xmm10
-	addps	_sPC0+__svml_satan_data_internal(%rip), %xmm8
-
-	/* Reconstruction.  */
-	mulps	%xmm8, %xmm9
-	addps	%xmm9, %xmm10
-	movaps	%xmm10, %xmm0
+	/* 1) If x>1, then r=-1/x, PIO2=Pi/2
+	   2) If -1<=x<=1, then r=x, PIO2=0
+	   3) If x<-1, then r=-1/x, PIO2=-Pi/2.  */
+	movups	COMMON_DATA(_OneF)(%rip), %xmm1
+	/* use minud\maxud operations for argument reduction.  */
+	movups	LOCAL_DATA(_SignMask)(%rip), %xmm5
+	movaps	%xmm5, %xmm6
+	andnps	%xmm0, %xmm5
+	andps	%xmm6, %xmm0
+	movaps	%xmm5, %xmm7
+
+	movaps	%xmm1, %xmm4
+	pminud	%xmm5, %xmm1
+	pmaxud	%xmm4, %xmm7
+	pcmpgtd	%xmm1, %xmm5
+	divps	%xmm7, %xmm1
+
+	andps	%xmm5, %xmm6
+	pxor	%xmm0, %xmm6
+	andps	LOCAL_DATA(_sPIO2)(%rip), %xmm5
+	pxor	%xmm0, %xmm5
+	pxor	%xmm1, %xmm6
+	/* Polynomial.  */
+	mulps	%xmm1, %xmm1
+	movaps	%xmm1, %xmm0
+	mulps	%xmm1, %xmm1
+	movups	LOCAL_DATA(_sPC7)(%rip), %xmm2
+	mulps	%xmm1, %xmm2
+	addps	LOCAL_DATA(_sPC5)(%rip), %xmm2
+	mulps	%xmm1, %xmm2
+	addps	LOCAL_DATA(_sPC3)(%rip), %xmm2
+	mulps	%xmm1, %xmm2
+	addps	LOCAL_DATA(_sPC1)(%rip), %xmm2
+	movups	LOCAL_DATA(_sPC8)(%rip), %xmm3
+	mulps	%xmm1, %xmm3
+	addps	LOCAL_DATA(_sPC6)(%rip), %xmm3
+	mulps	%xmm1, %xmm3
+	addps	LOCAL_DATA(_sPC4)(%rip), %xmm3
+	mulps	%xmm1, %xmm3
+	addps	LOCAL_DATA(_sPC2)(%rip), %xmm3
+	mulps	%xmm0, %xmm3
+	addps	%xmm3, %xmm2
+	mulps	%xmm2, %xmm0
+	addps	%xmm4, %xmm0
+	/* Reconstruction.  */
+	mulps	%xmm6, %xmm0
+	addps	%xmm5, %xmm0
 	ret
-
 END(_ZGVbN4v_atanf_sse4)
 
-	.section .rodata, "a"
+	.section .rodata.sse4, "a"
 	.align 16
+LOCAL_DATA_NAME:
+	DATA_VEC (LOCAL_DATA_NAME, _SignMask, 0x80000000)
+	DATA_VEC (LOCAL_DATA_NAME, _sPIO2, 0x3fc90fdb)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC7, 0xBC7F2631)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC5, 0xBD987629)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC3, 0xBE1161F8)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC1, 0xBEAAAA49)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC8, 0x3B322CC0)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC6, 0x3D2BC384)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC4, 0x3DD96474)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC2, 0x3E4CB79F)
 
-#ifdef __svml_satan_data_internal_typedef
-typedef unsigned int VUINT32;
-typedef struct {
-	__declspec(align(16)) VUINT32 _sSIGN_MASK[4][1];
-	__declspec(align(16)) VUINT32 _sABS_MASK[4][1];
-	__declspec(align(16)) VUINT32 _sONE[4][1];
-	__declspec(align(16)) VUINT32 _sPIO2[4][1];
-	__declspec(align(16)) VUINT32 _sPC8[4][1];
-	__declspec(align(16)) VUINT32 _sPC7[4][1];
-	__declspec(align(16)) VUINT32 _sPC6[4][1];
-	__declspec(align(16)) VUINT32 _sPC5[4][1];
-	__declspec(align(16)) VUINT32 _sPC4[4][1];
-	__declspec(align(16)) VUINT32 _sPC3[4][1];
-	__declspec(align(16)) VUINT32 _sPC2[4][1];
-	__declspec(align(16)) VUINT32 _sPC1[4][1];
-	__declspec(align(16)) VUINT32 _sPC0[4][1];
-} __svml_satan_data_internal;
-#endif
-__svml_satan_data_internal:
-	.long 0x80000000, 0x80000000, 0x80000000, 0x80000000 // _sSIGN_MASK
-	.align 16
-	.long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF // _sABS_MASK
-	.align 16
-	.long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 // _sONE
-	.align 16
-	.long 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB // _sPIO2
-	.align 16
-	.long 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0 // _sPC8
-	.align 16
-	.long 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631 // _sPC7
-	.align 16
-	.long 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384 // _sPC6
-	.align 16
-	.long 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629 // _sPC5
-	.align 16
-	.long 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474 // _sPC4
-	.align 16
-	.long 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8 // _sPC3
-	.align 16
-	.long 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F // _sPC2
-	.align 16
-	.long 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49 // _sPC1
-	.align 16
-	.long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 // _sPC0
-	.align 16
-	.type __svml_satan_data_internal, @object
-	.size __svml_satan_data_internal, .-__svml_satan_data_internal
+	.type LOCAL_DATA_NAME, @object
+	.size LOCAL_DATA_NAME, .-LOCAL_DATA_NAME
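
For reviewers who want to follow the new data flow without tracing registers, here is a rough single-lane C model of the added instruction sequence. It is only a sketch and not part of the patch: the names atanf_model, bits_of and float_of are hypothetical, the constants are copied from the DATA_VEC table above, and _OneF is assumed to hold 1.0f (0x3f800000), matching the old _sONE / _sPC0 entries. The real routine processes four lanes at once.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret a float as its bit pattern and back.  */
static uint32_t bits_of (float f) { uint32_t u; memcpy (&u, &f, sizeof u); return u; }
static float float_of (uint32_t u) { float f; memcpy (&f, &u, sizeof f); return f; }

/* Single-lane model of the new sequence (illustration only).  */
float
atanf_model (float x)
{
  const uint32_t sign_mask = 0x80000000u;	/* _SignMask */
  const uint32_t one = 0x3f800000u;		/* assumed value of _OneF */
  const uint32_t pio2 = 0x3fc90fdbu;		/* _sPIO2 */

  uint32_t ax = bits_of (x) & ~sign_mask;	/* andnps: |x| */
  uint32_t sx = bits_of (x) & sign_mask;	/* andps: sign of x */

  /* pminud/pmaxud: with the sign bit cleared, unsigned integer order
     agrees with float order, so integer min/max select the operands.  */
  uint32_t lo = ax < one ? ax : one;
  uint32_t hi = ax > one ? ax : one;
  /* pcmpgtd: all-ones when |x| > 1.  */
  uint32_t big = ax > lo ? 0xffffffffu : 0u;

  float q = float_of (lo) / float_of (hi);	/* divps: min / max */

  /* r = q with the sign of x, flipped when |x| > 1 (r = -1/x).  */
  float r = float_of (((sign_mask & big) ^ sx) ^ bits_of (q));
  /* Offset is +-Pi/2 when |x| > 1, else 0 carrying the sign of x.  */
  float off = float_of ((pio2 & big) ^ sx);

  float r2 = q * q;
  float r4 = r2 * r2;
  /* Two independent Horner chains in r4, as in the assembly.  */
  float odd = float_of (0xBC7F2631u);			/* _sPC7 */
  odd = odd * r4 + float_of (0xBD987629u);		/* _sPC5 */
  odd = odd * r4 + float_of (0xBE1161F8u);		/* _sPC3 */
  odd = odd * r4 + float_of (0xBEAAAA49u);		/* _sPC1 */
  float even = float_of (0x3B322CC0u);			/* _sPC8 */
  even = even * r4 + float_of (0x3D2BC384u);		/* _sPC6 */
  even = even * r4 + float_of (0x3DD96474u);		/* _sPC4 */
  even = even * r4 + float_of (0x3E4CB79Fu);		/* _sPC2 */
  even = even * r2;

  float poly = (odd + even) * r2 + float_of (one);	/* PC0 term is 1.0 */
  return poly * r + off;				/* Reconstruction */
}
```

Calling atanf_model (5.0f) takes the |x| > 1 path: q = 0.2, r = -0.2, and the offset is Pi/2. The model also shows why the separate _sABS_MASK, _sONE and _sPC0 loads can go away: the absolute value comes from andnps with the sign mask, and the register still holding 1.0 doubles as the PC0 term.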