From patchwork Thu May 20 18:44:07 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 43519
To: libc-alpha@sourceware.org
Subject: [PATCH v1] x86: Improve memset-vec-unaligned-erms.S
Date: Thu, 20 May 2021 14:44:07 -0400
Message-Id: <20210520184404.2901975-1-goldstein.w.n@gmail.com>
From: Noah Goldstein
Reply-To: Noah Goldstein

No bug. This commit makes a few small improvements to
memset-vec-unaligned-erms.S. The changes are:

1) Only align to 64 instead of 128. Either alignment performs equally
   well in the loop, and 128 just increases the odds of having to do
   an extra loop iteration, which can be significant overhead for
   small values.
2) Align some branch targets and the loop.
3) Remove an ALU instruction from the alignment process.
4) Reorder the last 4x VEC so that they are stored after the loop.
5) Move the check for lengths <= 8x VEC to before the alignment
   process.

test-memset and test-wmemset are both passing.

Signed-off-by: Noah Goldstein
Reviewed-by: H.J. Lu
---
Tests were run on the following CPUs:

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

All times are the geometric mean of N=50. The unit of time is seconds.

"Cur" refers to the current implementation; "New" refers to this
patch's implementation.

Performance data is attached in memset-data.pdf.

Some notes on the numbers:

I only included numbers for the proper VEC_SIZE for the corresponding
CPU:

skl -> avx2
icl -> evex
tgl -> evex

The changes only affect sizes > 2 * VEC_SIZE. The performance
differences for sizes <= 2 * VEC_SIZE come from changes in alignment
after linking (i.e. ENTRY aligns to 16, but performance can be
affected by alignment % 64 or alignment % 4096) and generally affect
throughput only, not latency (i.e. with an lfence added to the
benchmark loop the deviations go away). Generally I think they can be
ignored (both the positive and negative effects).

The interesting part of the data is in the medium size range
[128, 1024], where the new implementation has a reasonable speedup.
This is especially pronounced when the more conservative alignment
saves a full loop iteration.

The only significant exception is the skylake-avx2-erms case for size
= 416, alignment = 416, where the current implementation is
meaningfully faster. I am unsure of the root cause for this.
The skylake-avx2 case only performs slightly worse in this
configuration, which makes me think part of it is code-alignment
related, though compared to the speedup in other size/alignment
configurations it is still a trough. Despite this, I still think the
numbers are overall an improvement.

As well, due to aligning the loop (and possibly slightly more
efficient DSB behavior from replacing addq $(4 * VEC_SIZE) in the
loop with subq $-(4 * VEC_SIZE)), in the non-erms cases there is
often a slight improvement to the main loop for large sizes.

 .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++--------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 08cfa49bd1..ff196844a0 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	VMOVU	%VEC(0), (%rdi)
 	VZEROUPPER_RETURN
 
+	.p2align 4
 L(stosb_more_2x_vec):
 	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
+#else
+	.p2align 4
 #endif
 L(more_2x_vec):
-	cmpq	$(VEC_SIZE * 4), %rdx
-	ja	L(loop_start)
+	/* Stores to first 2x VEC before cmp as any path forward will
+	   require it.  */
 	VMOVU	%VEC(0), (%rdi)
 	VMOVU	%VEC(0), VEC_SIZE(%rdi)
-	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
+	cmpq	$(VEC_SIZE * 4), %rdx
+	ja	L(loop_start)
 	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
+	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
 L(return):
 #if VEC_SIZE > 16
 	ZERO_UPPER_VEC_REGISTERS_RETURN
@@ -192,28 +197,29 @@ L(return):
 #endif
 
 L(loop_start):
-	leaq	(VEC_SIZE * 4)(%rdi), %rcx
-	VMOVU	%VEC(0), (%rdi)
-	andq	$-(VEC_SIZE * 4), %rcx
-	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
-	VMOVU	%VEC(0), VEC_SIZE(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
 	VMOVU	%VEC(0), (VEC_SIZE * 2)(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)
 	VMOVU	%VEC(0), (VEC_SIZE * 3)(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)
-	addq	%rdi, %rdx
-	andq	$-(VEC_SIZE * 4), %rdx
-	cmpq	%rdx, %rcx
-	je	L(return)
+	cmpq	$(VEC_SIZE * 8), %rdx
+	jbe	L(loop_end)
+	andq	$-(VEC_SIZE * 2), %rdi
+	subq	$-(VEC_SIZE * 4), %rdi
+	leaq	-(VEC_SIZE * 4)(%rax, %rdx), %rcx
+	.p2align 4
 L(loop):
-	VMOVA	%VEC(0), (%rcx)
-	VMOVA	%VEC(0), VEC_SIZE(%rcx)
-	VMOVA	%VEC(0), (VEC_SIZE * 2)(%rcx)
-	VMOVA	%VEC(0), (VEC_SIZE * 3)(%rcx)
-	addq	$(VEC_SIZE * 4), %rcx
-	cmpq	%rcx, %rdx
-	jne	L(loop)
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(0), VEC_SIZE(%rdi)
+	VMOVA	%VEC(0), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(0), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpq	%rcx, %rdi
+	jb	L(loop)
+L(loop_end):
+	/* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
+	       rdx as length is also unchanged.  */
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)
+	VMOVU	%VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)
+	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)
+	VMOVU	%VEC(0), -VEC_SIZE(%rax, %rdx)
 	VZEROUPPER_SHORT_RETURN
 
 	.p2align 4
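The effect of change 1 (aligning the loop pointer to 2x VEC instead of
4x VEC) can be checked numerically. Below is a small C model of just
the address arithmetic, assuming VEC_SIZE = 32 (the AVX2 build); the
function names are mine, and the comments quote the instructions being
modeled. It is a sketch, not the actual code.

```c
#include <assert.h>
#include <stdint.h>

enum { VEC = 32 };

/* Old scheme: leaq (VEC*4)(%rdi), %rcx; andq $-(VEC*4), %rcx; loop
   runs from the 128-byte-aligned start to the 128-byte-aligned end
   of the region, one 4x VEC chunk per iteration.  */
static uint64_t iters_old(uint64_t rdi, uint64_t rdx)
{
    uint64_t start = (rdi + 4 * VEC) & -(uint64_t)(4 * VEC);
    uint64_t end = (rdi + rdx) & -(uint64_t)(4 * VEC);
    return (end - start) / (4 * VEC);
}

/* New scheme: andq $-(VEC*2), %rdi; subq $-(VEC*4), %rdi; loop while
   below rdi + rdx - 4*VEC (the last 4x VEC are stored separately).  */
static uint64_t iters_new(uint64_t rdi, uint64_t rdx)
{
    uint64_t p = (rdi & -(uint64_t)(2 * VEC)) + 4 * VEC;
    uint64_t end = rdi + rdx - 4 * VEC;
    uint64_t n = 0;
    while (p < end) {
        p += 4 * VEC;
        n++;
    }
    return n;
}
```

For example, with ptr % 128 == 96 and length 300, the old scheme runs
two loop iterations while the new scheme runs one; for already-128-byte
aligned inputs the counts match.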
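The overall control flow of the patched more-than-2x-VEC path can be
cross-checked with a rough C model, where each vector store is modeled
as a memset of VEC_SIZE bytes. The helper name memset_model is
hypothetical and the code is only a sketch of the logic in the diff,
not the glibc implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model, assuming VEC_SIZE = 32 (AVX2 build).  */
#define VEC_SIZE 32

static void memset_model(unsigned char *buf, unsigned char c, size_t len)
{
    assert(len > 2 * VEC_SIZE);  /* L(more_2x_vec) is only reached here. */

    /* First 2x VEC are stored before any comparison, since every path
       forward needs them (change 5 in the commit message).  */
    memset(buf, c, VEC_SIZE);
    memset(buf + VEC_SIZE, c, VEC_SIZE);
    if (len <= 4 * VEC_SIZE) {
        /* Tail: last 2x VEC, addressed from the end of the buffer.  */
        memset(buf + len - 2 * VEC_SIZE, c, 2 * VEC_SIZE);
        return;
    }

    /* L(loop_start): VEC 3 and 4 stored from the front.  */
    memset(buf + 2 * VEC_SIZE, c, VEC_SIZE);
    memset(buf + 3 * VEC_SIZE, c, VEC_SIZE);
    if (len > 8 * VEC_SIZE) {
        /* Round down to 2x VEC (64-byte) alignment, then step 4x VEC
           past the head stores; loop in aligned 4x VEC chunks.  */
        uintptr_t p = ((uintptr_t)buf & -(uintptr_t)(2 * VEC_SIZE))
                      + 4 * VEC_SIZE;
        uintptr_t end = (uintptr_t)buf + len - 4 * VEC_SIZE;
        for (; p < end; p += 4 * VEC_SIZE)
            memset((unsigned char *)p, c, 4 * VEC_SIZE);
    }

    /* L(loop_end): the last 4x VEC are stored unconditionally from the
       end (change 4), covering any remainder without a scalar tail.  */
    memset(buf + len - 4 * VEC_SIZE, c, 4 * VEC_SIZE);
}
```

The model makes the key invariant easy to test: for any length greater
than 2x VEC, every byte in [0, len) is written and nothing past len is
touched, because the overlapping unaligned tail stores absorb the
remainder.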
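On the addq-to-subq swap mentioned above: x86-64 add/sub with a
sign-extended 8-bit immediate (opcode 0x83) only accepts values in
[-128, 127], so addq $128, %rdi needs the 4-byte-immediate form
(opcode 0x81, 7 bytes with the REX prefix) while the equivalent
subq $-128, %rdi fits the short form (4 bytes), which packs more of
the loop into each DSB line. A minimal sketch of the range check
(helper name fits_simm8 is mine):

```c
#include <assert.h>
#include <stdint.h>

/* True if imm can be encoded as a sign-extended 8-bit immediate in the
   x86-64 0x83 add/sub form.  VEC_SIZE * 4 = 128 with AVX2, which is
   just out of range, while -128 is just in range.  */
static int fits_simm8(int64_t imm)
{
    return imm >= -128 && imm <= 127;
}
```

Subtracting -128 is arithmetically identical to adding 128 under
two's-complement wraparound, so the swap changes only the encoding,
not the result.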