From patchwork Sat Aug 21 16:36:31 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 44731
To: libc-alpha@sourceware.org
Subject: [PATCH] x86-64: Optimize load of all bits set into ZMM register [BZ #28252]
Date: Sat, 21 Aug 2021 09:36:31 -0700
Message-Id: <20210821163631.138482-1-hjl.tools@gmail.com>
List-Id: Libc-alpha mailing list
From: "H.J. Lu"
Reply-To: "H.J. Lu"

Optimize loads of all bits set into ZMM register in AVX512 SVML codes
by replacing

        vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX

and

        vmovups .L_2il0floatpacket.13(%rip), %zmmX

with

        vpternlogd $0xff, %zmmX, %zmmX, %zmmX

This fixes BZ #28252.
---
 .../x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S   |  7 +------
 .../x86_64/fpu/multiarch/svml_d_log8_core_avx512.S   |  7 +------
 .../x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S   |  7 +------
 .../fpu/multiarch/svml_d_sincos8_core_avx512.S       |  7 +------
 .../x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S |  7 +------
 .../x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S |  7 +------
 .../x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S |  7 +------
 .../x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S | 12 ++----------
 .../fpu/multiarch/svml_s_sincosf16_core_avx512.S     |  7 +------
 .../x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S |  7 +------
 10 files changed, 11 insertions(+), 64 deletions(-)

diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S
index c2cf007904..0fcb912557 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S
@@ -258,7 +258,7 @@ ENTRY (_ZGVeN8v_cos_skx)
         vmovaps %zmm0, %zmm8
 
 /* Check for large arguments path */
-        vpbroadcastq .L_2il0floatpacket.16(%rip), %zmm2
+        vpternlogd $0xff, %zmm2, %zmm2, %zmm2
 
 /*
   ARGUMENT RANGE REDUCTION:
@@ -448,8 +448,3 @@ ENTRY (_ZGVeN8v_cos_skx)
         vmovsd %xmm0, 1216(%rsp,%r15)
         jmp .LBL_2_7
 END (_ZGVeN8v_cos_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.16:
-        .long 0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.16,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S
index e9a5d00992..5596c950ce 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S
@@ -267,7 +267,7 @@ ENTRY (_ZGVeN8v_log_skx)
 
 /* preserve mantissa, set input exponent to 2^(-10) */
         vpternlogq $248, _ExpMask(%rax), %zmm3, %zmm2
-        vpbroadcastq .L_2il0floatpacket.12(%rip), %zmm1
+        vpternlogd $0xff, %zmm1, %zmm1, %zmm1
         vpsrlq $32, %zmm4, %zmm6
 
 /* reciprocal approximation good to at least 11 bits */
@@ -453,8 +453,3 @@ ENTRY (_ZGVeN8v_log_skx)
         vmovsd %xmm0, 1216(%rsp,%r15)
         jmp .LBL_2_7
 END (_ZGVeN8v_log_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.12:
-        .long 0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.12,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S
index 508da563fe..2981f1582e 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S
@@ -254,7 +254,7 @@ ENTRY (_ZGVeN8v_sin_skx)
         andq $-64, %rsp
         subq $1280, %rsp
         movq __svml_d_trig_data@GOTPCREL(%rip), %rax
-        vpbroadcastq .L_2il0floatpacket.14(%rip), %zmm14
+        vpternlogd $0xff, %zmm1, %zmm1, %zmm14
         vmovups __dAbsMask(%rax), %zmm7
         vmovups __dInvPI(%rax), %zmm2
         vmovups __dRShifter(%rax), %zmm1
@@ -450,8 +450,3 @@ ENTRY (_ZGVeN8v_sin_skx)
         vmovsd %xmm0, 1216(%rsp,%r15)
         jmp .LBL_2_7
 END (_ZGVeN8v_sin_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.14:
-        .long 0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.14,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S
index 965415f2bd..4ad366373b 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S
@@ -423,7 +423,7 @@ ENTRY (_ZGVeN8vl8l8_sincos_skx)
 
 /* SinPoly = SinR*SinPoly */
         vfmadd213pd %zmm5, %zmm5, %zmm4
-        vpbroadcastq .L_2il0floatpacket.15(%rip), %zmm3
+        vpternlogd $0xff, %zmm3, %zmm3, %zmm3
 
 /* Update Cos result's sign */
         vxorpd %zmm2, %zmm1, %zmm1
@@ -733,8 +733,3 @@ END (_ZGVeN8vvv_sincos_knl)
 ENTRY (_ZGVeN8vvv_sincos_skx)
 WRAPPER_AVX512_vvv_vl8l8 _ZGVeN8vl8l8_sincos_skx
 END (_ZGVeN8vvv_sincos_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.15:
-        .long 0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.15,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S
index cdcb16087d..b7d79efb54 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S
@@ -271,7 +271,7 @@ ENTRY (_ZGVeN16v_cosf_skx)
   X = X - Y*PI1 - Y*PI2 - Y*PI3
  */
         vmovaps %zmm0, %zmm6
-        vmovups .L_2il0floatpacket.13(%rip), %zmm12
+        vpternlogd $0xff, %zmm12, %zmm12, %zmm12
         vmovups __sRShifter(%rax), %zmm3
         vmovups __sPI1_FMA(%rax), %zmm5
         vmovups __sA9_FMA(%rax), %zmm9
@@ -445,8 +445,3 @@ ENTRY (_ZGVeN16v_cosf_skx)
         vmovss %xmm0, 1216(%rsp,%r15,8)
         jmp .LBL_2_7
 END (_ZGVeN16v_cosf_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.13:
-        .long 0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.13,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S
index 1b09909344..9f03b9b780 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S
@@ -257,7 +257,7 @@ ENTRY (_ZGVeN16v_expf_skx)
         vmovaps %zmm0, %zmm7
 
 /* compare against threshold */
-        vmovups .L_2il0floatpacket.13(%rip), %zmm3
+        vpternlogd $0xff, %zmm3, %zmm3, %zmm3
         vmovups __sInvLn2(%rax), %zmm4
         vmovups __sShifter(%rax), %zmm1
         vmovups __sLn2hi(%rax), %zmm6
@@ -432,8 +432,3 @@ ENTRY (_ZGVeN16v_expf_skx)
         jmp .LBL_2_7
 
 END (_ZGVeN16v_expf_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.13:
-        .long 0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.13,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S
index 4a7b2adbbf..2ba38b0f33 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S
@@ -228,7 +228,7 @@ ENTRY (_ZGVeN16v_logf_skx)
         andq $-64, %rsp
         subq $1280, %rsp
         movq __svml_slog_data@GOTPCREL(%rip), %rax
-        vmovups .L_2il0floatpacket.7(%rip), %zmm6
+        vpternlogd $0xff, %zmm6, %zmm6, %zmm6
         vmovups _iBrkValue(%rax), %zmm4
         vmovups _sPoly_7(%rax), %zmm8
 
@@ -401,8 +401,3 @@ ENTRY (_ZGVeN16v_logf_skx)
         jmp .LBL_2_7
 
 END (_ZGVeN16v_logf_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.7:
-        .long 0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.7,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S
index 7f906622a5..7f0272c809 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S
@@ -378,7 +378,7 @@ ENTRY (_ZGVeN16vv_powf_skx)
         vpsrlq $32, %zmm3, %zmm2
         vpmovqd %zmm2, %ymm11
         vcvtps2pd %ymm14, %zmm13
-        vmovups .L_2il0floatpacket.23(%rip), %zmm14
+        vpternlogd $0xff, %zmm14, %zmm14, %zmm14
         vmovaps %zmm14, %zmm26
         vpandd _ABSMASK(%rax), %zmm1, %zmm8
         vpcmpd $1, _INF(%rax), %zmm8, %k2
@@ -420,7 +420,7 @@ ENTRY (_ZGVeN16vv_powf_skx)
         vpmovqd %zmm11, %ymm5
         vpxord %zmm10, %zmm10, %zmm10
         vgatherdpd _Log2Rcp_lookup(%rax,%ymm4), %zmm10{%k3}
-        vpbroadcastq .L_2il0floatpacket.24(%rip), %zmm4
+        vpternlogd $0xff, %zmm4, %zmm4, %zmm4
         vpxord %zmm11, %zmm11, %zmm11
         vcvtdq2pd %ymm7, %zmm7
         vgatherdpd _Log2Rcp_lookup(%rax,%ymm5), %zmm11{%k1}
@@ -635,11 +635,3 @@ ENTRY (_ZGVeN16vv_powf_skx)
         vmovss %xmm0, 1216(%rsp,%r15,8)
         jmp .LBL_2_7
 END (_ZGVeN16vv_powf_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.23:
-        .long 0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.23,@object
-.L_2il0floatpacket.24:
-        .long 0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.24,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S
index 54cee3a537..e1d0154441 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S
@@ -310,7 +310,7 @@ ENTRY (_ZGVeN16vl4l4_sincosf_skx)
 
 /* Result sign calculations */
         vpternlogd $150, %zmm0, %zmm14, %zmm1
-        vmovups .L_2il0floatpacket.13(%rip), %zmm14
+        vpternlogd $0xff, %zmm14, %zmm14, %zmm14
 
 /* Add correction term 0.5 for cos() part */
         vaddps %zmm8, %zmm5, %zmm15
@@ -740,8 +740,3 @@ END (_ZGVeN16vvv_sincosf_knl)
 ENTRY (_ZGVeN16vvv_sincosf_skx)
 WRAPPER_AVX512_vvv_vl4l4 _ZGVeN16vl4l4_sincosf_skx
 END (_ZGVeN16vvv_sincosf_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.13:
-        .long 0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.13,@object
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S
index ec65ffdce5..bcb76ff756 100644
--- a/sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S
@@ -273,7 +273,7 @@ ENTRY (_ZGVeN16v_sinf_skx)
         movq __svml_s_trig_data@GOTPCREL(%rip), %rax
 
 /* Check for large and special values */
-        vmovups .L_2il0floatpacket.11(%rip), %zmm14
+        vpternlogd $0xff, %zmm14, %zmm14, %zmm14
         vmovups __sAbsMask(%rax), %zmm5
         vmovups __sInvPI(%rax), %zmm1
         vmovups __sRShifter(%rax), %zmm2
@@ -464,8 +464,3 @@ ENTRY (_ZGVeN16v_sinf_skx)
         vmovss %xmm0, 1216(%rsp,%r15,8)
         jmp .LBL_2_7
 END (_ZGVeN16v_sinf_skx)
-
-        .section .rodata, "a"
-.L_2il0floatpacket.11:
-        .long 0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-        .type .L_2il0floatpacket.11,@object
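
Note on why the replacement is safe: vpternlogd treats its immediate as an 8-entry truth table indexed by the three source bits at each bit position, so an immediate of 0xff selects 1 for every index and the previous register contents do not affect the result. The following is only a portable scalar sketch of one 32-bit lane, not code from this patch; the names ternlog32 and ternlog32_selfcheck are hypothetical, and the model assumes the usual index order (first source as the high index bit).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane of VPTERNLOGD.
   For each bit position, the three input bits form the index
   (a << 2 | b << 1 | c) into the 8-bit truth table imm8, and the
   selected table bit becomes the result bit.  */
static uint32_t
ternlog32 (uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
  uint32_t result = 0;
  for (int bit = 0; bit < 32; bit++)
    {
      unsigned int idx = (((a >> bit) & 1) << 2)
                         | (((b >> bit) & 1) << 1)
                         | ((c >> bit) & 1);
      result |= (uint32_t) ((imm8 >> idx) & 1) << bit;
    }
  return result;
}

/* Self-check: with imm8 == 0xff every table entry is 1, so the lane
   is all ones whatever stale value the register held.  */
static void
ternlog32_selfcheck (void)
{
  /* Arbitrary stand-ins for stale register contents.  */
  static const uint32_t stale[] = { 0u, 0xdeadbeefu, 0x12345678u, 0xffffffffu };
  for (unsigned int i = 0; i < sizeof stale / sizeof stale[0]; i++)
    assert (ternlog32 (stale[i], stale[i], stale[i], 0xff) == 0xffffffffu);

  /* 150 (0x96), used elsewhere in these files, is a three-way XOR.  */
  assert (ternlog32 (0xf0f0f0f0u, 0xccccccccu, 0xaaaaaaaau, 150)
          == (0xf0f0f0f0u ^ 0xccccccccu ^ 0xaaaaaaaau));
}
```

Calling ternlog32_selfcheck () from a test program exercises both cases. Because the 0xff table ignores its inputs, the model gives the same answer when the source operands differ from the destination, as in the svml_d_sin8 hunk.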