From patchwork Fri Dec 31 18:20:10 2021
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 49443
To: libc-alpha@sourceware.org
Subject: [PATCH] x86-64: Optimize memset for zeroing
Date: Fri, 31 Dec 2021 10:20:10 -0800
Message-Id: <20211231182010.107040-1-hjl.tools@gmail.com>
From: "H.J. Lu"
Cc: arjan@linux.intel.com

Update MEMSET_VDUP_TO_VEC0_AND_SET_RETURN to use PXOR, which has lower
latency and higher throughput than VPBROADCAST, for the zero constant.
Since the most common use of memset is to zero a block of memory, the
branch predictor makes the compare/jmp essentially free, and the PXOR
is executed almost as if it were unconditional.
---
 sysdeps/x86_64/memset.S                            | 14 ++++++++++++--
 .../x86_64/multiarch/memset-avx2-unaligned-erms.S  | 14 ++++++++++++--
 .../multiarch/memset-avx512-unaligned-erms.S       | 10 ++++++++++
 .../x86_64/multiarch/memset-evex-unaligned-erms.S  | 10 ++++++++++
 .../x86_64/multiarch/memset-vec-unaligned-erms.S   | 13 +++++++++++++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/sysdeps/x86_64/memset.S b/sysdeps/x86_64/memset.S
index 0137eba4cd..513f9c703d 100644
--- a/sysdeps/x86_64/memset.S
+++ b/sysdeps/x86_64/memset.S
@@ -29,15 +29,25 @@
 #define VMOVA     movaps
 
 #define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
-  movd d, %xmm0; \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  pxor %xmm0, %xmm0
+
+# define MEMSET_VDUP_TO_VEC0(d) \
+  movd d, %xmm0; \
   punpcklbw %xmm0, %xmm0; \
   punpcklwd %xmm0, %xmm0; \
   pshufd $0, %xmm0, %xmm0
 
 #define WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
-  movd d, %xmm0; \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  pxor %xmm0, %xmm0
+
+# define WMEMSET_VDUP_TO_VEC0(d) \
+  movd d, %xmm0; \
   pshufd $0, %xmm0, %xmm0
 
 #define SECTION(p)		p
diff --git a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
index 1af668af0a..8004a27750 100644
--- a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
@@ -11,13 +11,23 @@
 # define VMOVA     vmovdqa
 
 # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
-  vmovd d, %xmm0; \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  vpxor %xmm0, %xmm0, %xmm0
+
+# define MEMSET_VDUP_TO_VEC0(d) \
+  vmovd d, %xmm0; \
   vpbroadcastb %xmm0, %ymm0
 
 # define WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
-  vmovd d, %xmm0; \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  vpxor %xmm0, %xmm0, %xmm0
+
+# define WMEMSET_VDUP_TO_VEC0(d) \
+  vmovd d, %xmm0; \
   vpbroadcastd %xmm0, %ymm0
 
 # ifndef SECTION
diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
index f14d6f8493..61ff9ccf6f 100644
--- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
@@ -17,10 +17,20 @@
 
 # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  vpxorq %XMM0, %XMM0, %XMM0
+
+# define MEMSET_VDUP_TO_VEC0(d) \
   vpbroadcastb d, %VEC0
 
 # define WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  vpxorq %XMM0, %XMM0, %XMM0
+
+# define WMEMSET_VDUP_TO_VEC0(d) \
   vpbroadcastd d, %VEC0
 
 # define SECTION(p)		p##.evex512
diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
index 64b09e77cc..85544fb0fc 100644
--- a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
@@ -17,10 +17,20 @@
 
 # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  vpxorq %XMM0, %XMM0, %XMM0
+
+# define MEMSET_VDUP_TO_VEC0(d) \
   vpbroadcastb d, %VEC0
 
 # define WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
   movq r, %rax; \
+  testl d, d; \
+  jnz 1f; \
+  vpxorq %XMM0, %XMM0, %XMM0
+
+# define WMEMSET_VDUP_TO_VEC0(d) \
   vpbroadcastd d, %VEC0
 
 # define SECTION(p)		p##.evex
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index e723413a66..4ca34a19ba 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -112,6 +112,9 @@ ENTRY (WMEMSET_SYMBOL (__wmemset, unaligned))
 	shl	$2, %RDX_LP
 	WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN (%esi, %rdi)
 	jmp	L(entry_from_bzero)
+1:
+	WMEMSET_VDUP_TO_VEC0 (%esi)
+	jmp	L(entry_from_bzero)
 END (WMEMSET_SYMBOL (__wmemset, unaligned))
 #endif
 
@@ -124,6 +127,7 @@ END_CHK (MEMSET_CHK_SYMBOL (__memset_chk, unaligned))
 
 ENTRY (MEMSET_SYMBOL (__memset, unaligned))
 	MEMSET_VDUP_TO_VEC0_AND_SET_RETURN (%esi, %rdi)
+2:
 # ifdef __ILP32__
 	/* Clear the upper 32 bits.  */
 	mov	%edx, %edx
@@ -137,6 +141,10 @@ L(entry_from_bzero):
 	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VEC(0), (%rdi)
 	VZEROUPPER_RETURN
+
+1:
+	MEMSET_VDUP_TO_VEC0 (%esi)
+	jmp	2b
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))
 
@@ -180,6 +188,7 @@ END_CHK (MEMSET_CHK_SYMBOL (__memset_chk, unaligned_erms))
 
 ENTRY_P2ALIGN (MEMSET_SYMBOL (__memset, unaligned_erms), 6)
 	MEMSET_VDUP_TO_VEC0_AND_SET_RETURN (%esi, %rdi)
+2:
 # ifdef __ILP32__
 	/* Clear the upper 32 bits.  */
 	mov	%edx, %edx
@@ -193,6 +202,10 @@ ENTRY_P2ALIGN (MEMSET_SYMBOL (__memset, unaligned_erms), 6)
 	VMOVU	%VEC(0), (%rax)
 	VMOVU	%VEC(0), -VEC_SIZE(%rax, %rdx)
 	VZEROUPPER_RETURN
+
+1:
+	MEMSET_VDUP_TO_VEC0 (%esi)
+	jmp	2b
 #endif
 
 	.p2align 4,, 10
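
For readers who want to see the dispatch pattern outside the macro plumbing,
here is a minimal standalone sketch (not part of the patch): the zero case
falls through to a single PXOR, while a non-zero value branches to the
original movd/punpcklbw/punpcklwd/pshufd broadcast, mirroring the
sysdeps/x86_64/memset.S change above.  The function name splat_byte_sse2 and
the C prototype are hypothetical, chosen only for illustration.

/* splat_byte_sse2.S -- illustration only, not part of glibc.
   Fills %xmm0 with the byte in %edi and returns it; callable from C
   under the SysV x86-64 ABI as:  __m128i splat_byte_sse2 (int c);  */
	.text
	.globl	splat_byte_sse2
	.type	splat_byte_sse2, @function
splat_byte_sse2:
	testl	%edi, %edi	/* Zero fill value?  Like the patch, this
				   tests the full 32-bit argument.  */
	jnz	1f		/* Rare case: non-zero fill value.  */
	pxor	%xmm0, %xmm0	/* Common case: all-zero vector, no movd
				   and no broadcast needed.  */
	ret
1:	movd	%edi, %xmm0	/* Original broadcast sequence.  */
	punpcklbw %xmm0, %xmm0
	punpcklwd %xmm0, %xmm0
	pshufd	$0, %xmm0, %xmm0
	ret
	.size	splat_byte_sse2, .-splat_byte_sse2

With the test placed before the broadcast, a memset (p, 0, n) caller never
executes the movd/shuffle sequence, and because the zero case dominates in
practice the branch is highly predictable, which is the rationale given in
the commit message above.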