From patchwork Wed Dec 7 08:52:20 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 61635
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, andrey.kolesov@intel.com, carlos@systemhalted.org
Subject: [PATCH v1 11/27] x86/fpu: Optimize svml_s_atanf16_core_avx512.S
Date: Wed, 7 Dec 2022 00:52:20 -0800
Message-Id: <20221207085236.1424424-11-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221207085236.1424424-1-goldstein.w.n@gmail.com>
References: <20221207085236.1424424-1-goldstein.w.n@gmail.com>
List-Id: Libc-alpha mailing list
X-Patchwork-Original-From: Noah Goldstein via Libc-alpha
From: Noah Goldstein
Reply-To: Noah Goldstein

1. Change the algorithm to match the avx2 implementation, which seems
   to be faster.
2. Clean up some missed optimizations in instruction selection and
   unnecessary repeated rodata references.
3. Remove unused rodata.
4. Use common data definitions where possible.

Changing the algorithm (1) causes a slight ULP error increase (the
distribution is exactly the same as the avx2 version's).

Before:
ulp:
  0: 4127324924 (0.9610)
  1:  167635550 (0.0390)
  2:       6822 (0.0000)
  3:          0 (0.0000)
  4:          0 (0.0000)

After:
ulp:
  0: 4088299128 (0.9519)
  1:  206531674 (0.0481)
  2:     136494 (0.0000)
  3:          0 (0.0000)
  4:          0 (0.0000)

Since the max ULP is the same and the distribution matches the avx2
implementation, this seems like an acceptable "regression": it doesn't
seem feasible that any application could have been relying on the
precision distribution.

Code Size Change: -79 Bytes (193 - 272)

Perf Changes:
Input                                 New Time / Old Time
0F          (0x00000000)           -> 0.7612
0F          (0x0000ffff, Denorm)   -> 1.3234
.1F         (0x3dcccccd)           -> 0.7690
5F          (0x40a00000)           -> 0.7752
2315255808F (0x4f0a0000)           -> 0.7712
-NaN        (0xffffffff)           -> 0.7824

Note the ~32% regression in the denorm case is because of additional
micro-code assists (from the algorithm shift). This generally seems
worth it for the ~22-24% perf improvement in the other cases, as
denormal inputs are almost certainly cold cases.
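
The figures above can be cross-checked with a few lines of arithmetic.
The sketch below is not part of the patch; the names ulp_before,
ulp_after, bucket_total, bucket_fraction and time_change_percent are
made up for illustration. It assumes the ULP histograms cover every
32-bit float bit pattern, which is consistent with both histograms
summing to exactly 2^32.

```c
#include <stddef.h>
#include <stdint.h>

/* ULP histograms copied from the commit message (buckets 0..4).
   All names in this block are hypothetical, for illustration only.  */
const uint64_t ulp_before[5] = { 4127324924ULL, 167635550ULL, 6822ULL, 0, 0 };
const uint64_t ulp_after[5] = { 4088299128ULL, 206531674ULL, 136494ULL, 0, 0 };

/* Sum of all buckets; 2^32 means every float bit pattern is counted.  */
uint64_t bucket_total(const uint64_t counts[5])
{
    uint64_t total = 0;
    for (size_t i = 0; i < 5; i++)
        total += counts[i];
    return total;
}

/* Share of the 2^32 possible inputs that landed in one bucket.  */
double bucket_fraction(uint64_t count)
{
    return (double)count / 4294967296.0;
}

/* Run-time change in percent for a "New Time / Old Time" ratio;
   a negative value means the new code is faster.  */
double time_change_percent(double ratio)
{
    return (ratio - 1.0) * 100.0;
}
```

For example, time_change_percent(1.3234) gives roughly +32 for the
denormal input, and time_change_percent(0.7612) gives roughly -24 for
the zero input.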
---
 .../multiarch/svml_s_atanf16_core_avx512.S    | 199 ++++++------------
 1 file changed, 67 insertions(+), 132 deletions(-)

diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S
index 88b44a989c..abb3c76209 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S
@@ -28,146 +28,81 @@
  *
  */
 
-/* Offsets for data table __svml_satan_data_internal_avx512
- */
-#define AbsMask 0
-#define Shifter 64
-#define MaxThreshold 128
-#define MOne 192
-#define One 256
-#define LargeX 320
-#define Zero 384
-#define Tbl_H 448
-#define Pi2 576
-#define coeff_1 640
-#define coeff_2 704
-#define coeff_3 768
+#define LOCAL_DATA_NAME __svml_satan_data_internal
+#include "svml_s_common_evex512_rodata_offsets.h"
+/* Offsets for data table __svml_satan_data_internal. */
+#define _sPC8 0
+#define _sPC7 64
+#define _sPC6 128
+#define _sPC5 192
+#define _sPC4 256
+#define _sPC3 320
+#define _sPC2 384
+#define _sPC1 448
+#define _sPIO2 512
 
 #include
 
 	.section .text.evex512, "ax", @progbits
 ENTRY(_ZGVeN16v_atanf_skx)
-	vandps __svml_satan_data_internal_avx512(%rip), %zmm0, %zmm7
-	vmovups MaxThreshold+__svml_satan_data_internal_avx512(%rip), %zmm3
-	vmovups One+__svml_satan_data_internal_avx512(%rip), %zmm8
-
-	/* round to 2 bits after binary point */
-	vreduceps $40, {sae}, %zmm7, %zmm5
-
-	/* saturate X range */
-	vmovups LargeX+__svml_satan_data_internal_avx512(%rip), %zmm6
-	vmovups Shifter+__svml_satan_data_internal_avx512(%rip), %zmm2
-	vcmpps $29, {sae}, %zmm3, %zmm7, %k1
-
-	/* table lookup sequence */
-	vmovups Tbl_H+__svml_satan_data_internal_avx512(%rip), %zmm3
-	vsubps {rn-sae}, %zmm5, %zmm7, %zmm4
-	vaddps {rn-sae}, %zmm2, %zmm7, %zmm1
-	vxorps %zmm0, %zmm7, %zmm0
-	vfmadd231ps {rn-sae}, %zmm7, %zmm4, %zmm8
-	vmovups coeff_2+__svml_satan_data_internal_avx512(%rip), %zmm4
-
-	/* if|X|>=MaxThreshold, set DiffX=-1 */
-	vblendmps MOne+__svml_satan_data_internal_avx512(%rip), %zmm5, %zmm9{%k1}
-	vmovups coeff_3+__svml_satan_data_internal_avx512(%rip), %zmm5
-
-	/* if|X|>=MaxThreshold, set Y=X */
-	vminps {sae}, %zmm7, %zmm6, %zmm8{%k1}
-
-	/* R+Rl = DiffX/Y */
-	vgetmantps $0, {sae}, %zmm9, %zmm12
-	vgetexpps {sae}, %zmm9, %zmm10
-	vpermt2ps Tbl_H+64+__svml_satan_data_internal_avx512(%rip), %zmm1, %zmm3
-	vgetmantps $0, {sae}, %zmm8, %zmm15
-	vgetexpps {sae}, %zmm8, %zmm11
-	vmovups coeff_1+__svml_satan_data_internal_avx512(%rip), %zmm1
-
-	/* set table value to Pi/2 for large X */
-	vblendmps Pi2+__svml_satan_data_internal_avx512(%rip), %zmm3, %zmm9{%k1}
-	vrcp14ps %zmm15, %zmm13
-	vsubps {rn-sae}, %zmm11, %zmm10, %zmm2
-	vmulps {rn-sae}, %zmm13, %zmm12, %zmm14
-	vfnmadd213ps {rn-sae}, %zmm12, %zmm14, %zmm15
-	vfmadd213ps {rn-sae}, %zmm14, %zmm13, %zmm15
-	vscalefps {rn-sae}, %zmm2, %zmm15, %zmm7
-
-	/* polynomial evaluation */
-	vmulps {rn-sae}, %zmm7, %zmm7, %zmm8
-	vmulps {rn-sae}, %zmm7, %zmm8, %zmm6
-	vfmadd231ps {rn-sae}, %zmm8, %zmm1, %zmm4
-	vfmadd213ps {rn-sae}, %zmm5, %zmm4, %zmm8
-	vfmadd213ps {rn-sae}, %zmm7, %zmm6, %zmm8
-	vaddps {rn-sae}, %zmm9, %zmm8, %zmm10
-	vxorps %zmm0, %zmm10, %zmm0
+	/* 1) If x>1, then r=-1/x, PIO2=Pi/2
+	   2) If -1<=x<=1, then r=x, PIO2=0
+	   3) If x<-1, then r=-1/x, PIO2=-Pi/2. */
+	vmovups COMMON_DATA(_OneF)(%rip), %zmm2
+	vmovups COMMON_DATA(_SignMask)(%rip), %zmm7
+
+
+	/* Use minud\maxud operations for argument reduction. */
+	vandnps %zmm0, %zmm7, %zmm3
+	vpcmpgtd %zmm2, %zmm3, %k1
+
+	vpmaxud %zmm3, %zmm2, %zmm4
+	vpminud %zmm3, %zmm2, %zmm5
+
+	vdivps %zmm4, %zmm5, %zmm4
+
+	vandps %zmm7, %zmm0, %zmm3
+	vmovdqa32 %zmm7, %zmm7{%k1}{z}
+
+	vmulps %zmm4, %zmm4, %zmm1
+	vpternlogq $0x96, %zmm3, %zmm4, %zmm7
+
+	/* Polynomial. */
+
+	vmovups LOCAL_DATA(_sPC8)(%rip), %zmm0
+	vmovups LOCAL_DATA(_sPC7)(%rip), %zmm4
+
+	vmulps %zmm1, %zmm1, %zmm5
+
+	vfmadd213ps LOCAL_DATA(_sPC6)(%rip), %zmm5, %zmm0
+	vfmadd213ps LOCAL_DATA(_sPC5)(%rip), %zmm5, %zmm4
+	vfmadd213ps LOCAL_DATA(_sPC4)(%rip), %zmm5, %zmm0
+	vfmadd213ps LOCAL_DATA(_sPC3)(%rip), %zmm5, %zmm4
+	vfmadd213ps LOCAL_DATA(_sPC2)(%rip), %zmm5, %zmm0
+	vfmadd213ps LOCAL_DATA(_sPC1)(%rip), %zmm5, %zmm4
+	vfmadd213ps %zmm4, %zmm1, %zmm0
+	vfmadd213ps %zmm2, %zmm1, %zmm0
+	vorps LOCAL_DATA(_sPIO2)(%rip), %zmm3, %zmm3{%k1}
+
+	/* Reconstruction. */
+	vfmadd213ps %zmm3, %zmm7, %zmm0
 	ret
 
 END(_ZGVeN16v_atanf_skx)
 
-	.section .rodata, "a"
+	.section .rodata.evex512, "a"
 	.align 64
 
-#ifdef __svml_satan_data_internal_avx512_typedef
-typedef unsigned int VUINT32;
-typedef struct {
-	__declspec(align(64)) VUINT32 AbsMask[16][1];
-	__declspec(align(64)) VUINT32 Shifter[16][1];
-	__declspec(align(64)) VUINT32 MaxThreshold[16][1];
-	__declspec(align(64)) VUINT32 MOne[16][1];
-	__declspec(align(64)) VUINT32 One[16][1];
-	__declspec(align(64)) VUINT32 LargeX[16][1];
-	__declspec(align(64)) VUINT32 Zero[16][1];
-	__declspec(align(64)) VUINT32 Tbl_H[32][1];
-	__declspec(align(64)) VUINT32 Pi2[16][1];
-	__declspec(align(64)) VUINT32 coeff[3][16][1];
-} __svml_satan_data_internal_avx512;
-#endif
-__svml_satan_data_internal_avx512:
-	/* AbsMask */
-	.long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff
-	/* Shifter */
-	.align 64
-	.long 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000
-	/* MaxThreshold */
-	.align 64
-	.long 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000
-	/* MOne */
-	.align 64
-	.long 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000
-	/* One */
-	.align 64
-	.long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000
-	/* LargeX */
-	.align 64
-	.long 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000
-	/* Zero */
-	.align 64
-	.long 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
-	/* Tbl_H */
-	.align 64
-	.long 0x00000000, 0x3e7adbb0
-	.long 0x3eed6338, 0x3f24bc7d
-	.long 0x3f490fdb, 0x3f6563e3
-	.long 0x3f7b985f, 0x3f869c79
-	.long 0x3f8db70d, 0x3f93877b
-	.long 0x3f985b6c, 0x3f9c6b53
-	.long 0x3f9fe0bb, 0x3fa2daa4
-	.long 0x3fa57088, 0x3fa7b46f
-	.long 0x3fa9b465, 0x3fab7b7a
-	.long 0x3fad1283, 0x3fae809e
-	.long 0x3fafcb99, 0x3fb0f836
-	.long 0x3fb20a6a, 0x3fb30581
-	.long 0x3fb3ec43, 0x3fb4c10a
-	.long 0x3fb585d7, 0x3fb63c64
-	.long 0x3fb6e62c, 0x3fb78478
-	.long 0x3fb81868, 0x3fb8a2f5
-	/* Pi2 */
-	.align 64
-	.long 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB
-	/* coeff3 */
-	.align 64
-	.long 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de
-	.long 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2
-	.long 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa
-	.align 64
-	.type __svml_satan_data_internal_avx512, @object
-	.size __svml_satan_data_internal_avx512, .-__svml_satan_data_internal_avx512
+LOCAL_DATA_NAME:
+	DATA_VEC (LOCAL_DATA_NAME, _sPC8, 0x3B322CC0)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC7, 0xBC7F2631)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC6, 0x3D2BC384)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC5, 0xBD987629)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC4, 0x3DD96474)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC3, 0xBE1161F8)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC2, 0x3E4CB79F)
+	DATA_VEC (LOCAL_DATA_NAME, _sPC1, 0xBEAAAA49)
+	DATA_VEC (LOCAL_DATA_NAME, _sPIO2, 0x3FC90FDB)
+
+	.type LOCAL_DATA_NAME, @object
+	.size LOCAL_DATA_NAME, .-LOCAL_DATA_NAME
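
For readers less used to AVX-512 assembly, the new code path can be
modelled one lane at a time in C. This is only a rough sketch and not
part of the patch: the function and helper names (atanf_lane_sketch,
f32_from_bits, f32_to_bits) are made up for illustration, and the
constants are copied from the data table in the diff. It assumes plain
multiply-add in place of the fused vfmadd213ps, so the last bit of a
result may differ from what the vector code returns.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers (not in the patch): bit casts via memcpy.  */
static float f32_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

static uint32_t f32_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* One-lane model of the new algorithm.  Plain a * b + c is used where
   the assembly uses fused multiply-add, so low bits may differ.  */
float atanf_lane_sketch(float x)
{
    const uint32_t sign_mask = 0x80000000u; /* _SignMask */
    const uint32_t one_bits = 0x3F800000u;  /* _OneF, i.e. 1.0f */
    const float c8 = f32_from_bits(0x3B322CC0u);
    const float c7 = f32_from_bits(0xBC7F2631u);
    const float c6 = f32_from_bits(0x3D2BC384u);
    const float c5 = f32_from_bits(0xBD987629u);
    const float c4 = f32_from_bits(0x3DD96474u);
    const float c3 = f32_from_bits(0xBE1161F8u);
    const float c2 = f32_from_bits(0x3E4CB79Fu);
    const float c1 = f32_from_bits(0xBEAAAA49u);
    const uint32_t pio2_bits = 0x3FC90FDBu; /* _sPIO2, Pi/2 */

    uint32_t xb = f32_to_bits(x);
    uint32_t ax = xb & ~sign_mask; /* vandnps: |x| */
    uint32_t sx = xb & sign_mask;  /* vandps: sign of x */
    /* vpcmpgtd: the sign bit is clear on both sides, so comparing the
       bit patterns as integers orders them like the floats.  */
    int big = ax > one_bits;

    /* vpmaxud / vpminud / vdivps: q = min(|x|, 1) / max(|x|, 1).  */
    uint32_t hi = big ? ax : one_bits;
    uint32_t lo = big ? one_bits : ax;
    float q = f32_from_bits(lo) / f32_from_bits(hi);

    /* vpternlogq $0x96 is a three-way XOR: r = x, or -1/x if |x| > 1.  */
    float r = f32_from_bits(f32_to_bits(q) ^ sx ^ (big ? sign_mask : 0u));

    /* Two interleaved Horner chains in r^4, then merged in r^2.  */
    float r2 = q * q;
    float r4 = r2 * r2;
    float even = c8 * r4 + c6;
    float odd = c7 * r4 + c5;
    even = even * r4 + c4;
    odd = odd * r4 + c3;
    even = even * r4 + c2;
    odd = odd * r4 + c1;
    float poly = (even * r2 + odd) * r2 + 1.0f;

    /* Masked vorps: sign(x) * Pi/2 when |x| > 1, else a signed zero.  */
    float base = f32_from_bits(big ? (sx | pio2_bits) : sx);

    /* Reconstruction.  */
    return r * poly + base;
}
```

Calling atanf_lane_sketch(1.0f) should land close to Pi/4, and inputs
with |x| > 1 go through the -1/x branch and pick up the Pi/2 offset
with the sign of x.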