[5/5] X86-64: Optimize memmove-vec-unaligned-erms.S

No bug. This commit optimizes memmove-vec-unaligned-erms.S.

The optimizations are, in descending order of importance, to
L(less_vec), L(movsb), the 8x forward/backward loops, and various
target alignments that have minimal code size impact.

The L(less_vec) optimizations are to:

    1. Readjust the branch order to either give hotter paths a fall
    through case or have fewer branches in their way.
    2. Moderately change the size classes to make hot branches hotter
    and thus increase predictability.
    3. Try to minimize branch aliasing to avoid BPU thrashing based
    misses.
    4. 64 byte align the prior function entry. This is to avoid cases
    where seemingly unrelated changes end up having severe negative
    performance impacts.
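
For readers less familiar with how a small-size path is typically
structured, here is a rough C sketch of size-class dispatch with
overlapping head/tail accesses. It is only an illustration, not the
assembly in this patch: the function name copy_less_32, the assumption
VEC_SIZE == 32, and the particular branch order are all made up for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: memmove semantics for n < 32 bytes.  Each size
   class loads all of its data before storing any of it, so overlapping
   src/dst are handled without a direction check.  */
static void copy_less_32(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    assert(n < 32);
    if (n >= 16) {
        /* 16..31 bytes: two 16-byte chunks that may overlap.  */
        unsigned char head[16], tail[16];
        memcpy(head, src, 16);
        memcpy(tail, src + n - 16, 16);
        memcpy(dst, head, 16);
        memcpy(dst + n - 16, tail, 16);
    } else if (n >= 8) {
        /* 8..15 bytes: two 8-byte chunks.  */
        uint64_t head, tail;
        memcpy(&head, src, 8);
        memcpy(&tail, src + n - 8, 8);
        memcpy(dst, &head, 8);
        memcpy(dst + n - 8, &tail, 8);
    } else if (n >= 4) {
        /* 4..7 bytes: two 4-byte chunks.  */
        uint32_t head, tail;
        memcpy(&head, src, 4);
        memcpy(&tail, src + n - 4, 4);
        memcpy(dst, &head, 4);
        memcpy(dst + n - 4, &tail, 4);
    } else if (n >= 2) {
        /* 2..3 bytes: two 2-byte chunks.  */
        uint16_t head, tail;
        memcpy(&head, src, 2);
        memcpy(&tail, src + n - 2, 2);
        memcpy(dst, &head, 2);
        memcpy(dst + n - 2, &tail, 2);
    } else if (n == 1) {
        dst[0] = src[0];
    }
    /* n == 0: nothing to do.  */
}
```

Which size class gets the fall through slot is exactly the kind of
choice items 1 and 2 are about; the order above is arbitrary.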

The L(movsb) optimizations are to:

    1. Reduce the number of taken branches needed to determine if
    movsb should be used.
    2. 64 byte align dst if either the CPU has fsrm or dst and src
    do not 4k alias.
    3. 64 byte align src if the CPU does not have fsrm and dst and src
    do 4k alias.
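
The alignment choice in items 2 and 3 can be sketched in C roughly as
follows. This is only an illustration under stated assumptions: the
names choose_align, aliases_4k and bytes_to_align64 are hypothetical,
and the 512 byte window used to decide what counts as "4k aliasing" is
an assumption made for the example, not a claim about the exact test
in the assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum align_target { ALIGN_DST, ALIGN_SRC };

#define PAGE_SIZE_4K 4096u
#define ALIAS_WINDOW 512u   /* assumed window, for illustration only */

/* True when dst sits less than ALIAS_WINDOW bytes ahead of src modulo
   4 KiB, i.e. stores to dst can look like they hit recent loads.  */
static bool aliases_4k(uintptr_t dst, uintptr_t src)
{
    uintptr_t diff = dst - src;   /* unsigned wraparound is intended */
    return (diff & (PAGE_SIZE_4K - ALIAS_WINDOW)) == 0;
}

/* Align dst when the CPU has FSRM or there is no 4k aliasing,
   otherwise align src.  */
static enum align_target choose_align(uintptr_t dst, uintptr_t src,
                                      bool has_fsrm)
{
    if (has_fsrm || !aliases_4k(dst, src))
        return ALIGN_DST;
    return ALIGN_SRC;
}

/* Bytes to copy before address p reaches a 64 byte boundary.  */
static size_t bytes_to_align64(uintptr_t p)
{
    return (size_t)(-p & 63u);
}
```

For example, with dst = 0x1040 and src = 0x3000 the difference modulo
4 KiB is 0x40, so under the assumed window a CPU without FSRM would
align src, while a CPU with FSRM would still align dst.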

The 8x forward/backward loop optimizations are to:

    1. Reduce instructions needed for aligning to VEC_SIZE.
    2. Reduce uops and code size of the loops.

All tests in string/ passing.
---
See performance data attached.
Included benchmarks: memcpy-random, memcpy, memmove, memcpy-walk, memmove-walk, memcpy-large  

The first page is a summary with the ifunc selection version for
erms/non-erms for each computer. Then in the following 4 sheets are
all the numbers for sse2, avx for Skylake and sse2, avx2, evex, and
avx512 for Tigerlake.

Benchmark CPUs:

Skylake:
https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

Tigerlake:
https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

All times are geometric mean of N=30.

"Cur" refers to the current implementation. "New" refers to this
patch's implementation.

Score refers to new/cur (low means improvement, high means
degradation). Scores are color coded. The more green the better, the
more red the worse.

Some notes on the numbers:

In my opinion, most of the benchmarks where the src/dst alignments are
in [0, 64] have some unpredictable and unfortunate noise from
non-obvious false dependencies between stores to dst and the next
iteration's loads from src. For example, in the 8x forward case, the
store of VEC(4) will end up stalling the next iteration's load queue,
so if size was large enough that the beginning of dst was flushed from
L1, this can have a seemingly random but significant impact on the
benchmark result.

There are significant performance improvements/degradations in the [0,
VEC_SIZE] size range. I didn't treat these as important, as I think in
this size range the branch pattern indicated by the random tests is
more important. On the random tests the new implementation performs
significantly better.

I also added logic to align before L(movsb). If you look at the new
random benchmarks with fixed size, this leads to roughly a 10-20%
performance improvement for some hot sizes. I am not 100% convinced
this is needed, as larger copies that would go to movsb are generally
already aligned, but even in the fixed loop cases, especially on
Skylake without FSRM, it seems aligning before movsb pays off. Let me
know if you think this is unnecessary.

There are occasional performance degradations at odd spots throughout
the medium range sizes in the fixed memcpy benchmarks. I think
generally there is more good than harm here, and at the moment I don't
have an explanation for why these certain configurations seem to
perform worse. On the plus side, however, it also seems that there are
unexplained improvements of the same magnitude patterned with the
degradations (and both are sparse), so I ultimately believe it should
be acceptable. If this is not the case let me know.

The memmove benchmarks look a bit worse, especially for the erms
case. Part of this is from the nop cases, which I didn't treat as
important. But part of it is also because, to optimize for what I
expect to be the common case of no overlap, the overlap case has extra
branches and overhead. I think this is inevitable when implementing
memmove and memcpy in the same file, but if this is unacceptable let
me know.
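
To make the overlap cost concrete, here is a small C sketch of the
kind of check a combined memmove/memcpy needs before it can take the
plain forward path. The function name needs_backward_copy is
hypothetical and this is a generic formulation, not a transcription of
the branches in the assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A forward copy would overwrite source bytes it has not read yet
   exactly when src <= dst < src + n.  With unsigned wraparound that is
   a single compare; dst == src with n > 0 (the nop case) also lands
   here.  */
static bool needs_backward_copy(uintptr_t dst, uintptr_t src, size_t n)
{
    return dst - src < n;
}
```

For example, needs_backward_copy(110, 100, 20) is true, while
needs_backward_copy(100, 110, 20) and needs_backward_copy(200, 100, 20)
are both false. The memcpy-style callers never need the backward path,
so any branch spent on this check is pure overhead for them.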

Note: I benchmarked before two changes that made it into the final version:

-#if !defined USE_MULTIARCH || !IS_IN (libc)
-L(nop):
-       ret
-#else
+       VMOVU   %VEC(1), -VEC_SIZE(%rdi, %rdx)
        VZEROUPPER_RETURN
-#endif

And

+       testl   $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
-       andl    $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)

I don't think either of these should have any impact.

I made the former change because I think it was a bug that could cause
use of avx2 without vzeroupper, and the latter because I think it
could cause issues on multicore platforms.
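
The difference between the two instructions in the second change can
be mirrored in C. This is a sketch only: string_control stands in for
__x86_string_control, and the mask value and initial value are
hypothetical. andl with a memory destination is a read-modify-write of
the shared variable, while testl only reads it and sets flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value, for illustration only.  */
#define AVOID_SHORT_DISTANCE_REP_MOVSB 0x1

/* Stand-in for the shared __x86_string_control word.  */
static int string_control = 0x3;

/* Like "andl $MASK, mem": writes the masked value back, which clears
   every other bit and stores to memory shared by all threads.  */
static bool check_with_and(void)
{
    string_control &= AVOID_SHORT_DISTANCE_REP_MOVSB;
    return string_control != 0;
}

/* Like "testl $MASK, mem": computes the same condition but leaves the
   variable untouched.  */
static bool check_with_test(void)
{
    return (string_control & AVOID_SHORT_DISTANCE_REP_MOVSB) != 0;
}
```

Both return the same answer for the bit being checked, but only the
test form is free of stores to the shared word.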

 sysdeps/x86/sysdep.h                          |  13 +-
 .../multiarch/memmove-vec-unaligned-erms.S    | 484 +++++++++++-------
 2 files changed, 317 insertions(+), 180 deletions(-)

Message ID	20210824082753.3356637-5-goldstein.w.n@gmail.com
State	Superseded
Headers	From: Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org>
	Reply-To: Noah Goldstein <goldstein.w.n@gmail.com>
	To: libc-alpha@sourceware.org
	Subject: [PATCH 5/5] X86-64: Optimize memmove-vec-unaligned-erms.S
	Date: Tue, 24 Aug 2021 04:27:56 -0400
	Message-Id: <20210824082753.3356637-5-goldstein.w.n@gmail.com>
	In-Reply-To: <20210824082753.3356637-1-goldstein.w.n@gmail.com>
	X-Mailer: git-send-email 2.25.1
Series	[1/5] string: Make tests birdirectional test-memcpy.c
	[2/5] benchtests: Add new random cases to bench-memcpy-random.c
	[3/5] benchtests: Add partial overlap case in bench-memmove-walk.c
	[4/5] benchtests: Add additional cases to bench-memcpy.c and bench-memmove.c
	[5/5] X86-64: Optimize memmove-vec-unaligned-erms.S

Context	Check	Description
dj/TryBot-apply_patch	success	Patch applied to master at the time it was sent
dj/TryBot-32bit	success	Build for i686
