Message ID | 20210314190532.1242745-1-goldstein.w.n@gmail.com |
---|---|
State | Superseded |
Series |
[1/2] x86: Update large memcpy case in memmove-vec-unaligned-erms.S
|
|
Commit Message
Noah Goldstein
March 14, 2021, 7:05 p.m. UTC
No Bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the effects of false memory aliasing
when destination and source have a close 4k alignment and 2) in most
cases and for most DRAM units is a modestly more efficient access
pattern. These changes are a clear performance improvement for
VEC_SIZE=16/32, though more ambiguous for VEC_SIZE=64. test-memcpy,
test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
pass.
Signed-off-by: noah <goldstein.w.n@gmail.com>
---
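The access pattern described in the commit message can be pictured in C roughly as follows. This is only a sketch of one plausible reading of "memcpy on either 2 or 4 contiguous pages at once", not the assembly in the patch: the function name, PAGE_BYTES, and the use of memcpy for each vector-sized chunk are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096u /* assumed 4 KiB pages */
#define VEC_SIZE 32u     /* stand-in for one vector register */

/* Copy N bytes from SRC to DST (no overlap), walking NPAGES contiguous
   pages in lockstep: for each in-page offset, one VEC_SIZE chunk is
   copied in every page of the group before the offset advances.  A
   tail that does not fill a whole group is copied with plain memcpy.  */
static void
copy_page_interleaved (unsigned char *dst, const unsigned char *src,
                       size_t n, size_t npages)
{
  assert (npages == 2 || npages == 4);
  const size_t group = npages * PAGE_BYTES;

  while (n >= group)
    {
      for (size_t off = 0; off < PAGE_BYTES; off += VEC_SIZE)
        for (size_t p = 0; p < npages; p++)
          memcpy (dst + p * PAGE_BYTES + off,
                  src + p * PAGE_BYTES + off, VEC_SIZE);
      dst += group;
      src += group;
      n -= group;
    }
  memcpy (dst, src, n);
}
```

Because consecutive accesses land a full page apart, a load issued shortly after a store is less likely to share its low 12 address bits, which is the intuition behind point 1) above.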
This patch contains an update to memmove-vec-unaligned-erms.S, additions
to test-memmove.c and test-memcpy.c, and additions to
bench-memcpy-large.c.
Test Changes:
These changes were largely in the vein of increasing the maximum test
size, increasing the range of misalignments, and expanding the tests to
cover both forward and backward copying.
Bench Changes:
These changes increase the range of tested alignments. The relative
alignment of source and destination can have a huge impact on
performance (more below) even when there is no overlap.
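For readers unfamiliar with how such alignments are produced, here is a minimal sketch of setting up a source/destination pair at chosen offsets from page-aligned bases. It assumes align1 is the source offset and align2 the destination offset (consistent with the "destination alignment=127" wording below); struct bench_bufs and both helpers are hypothetical names, not code from bench-memcpy-large.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 4096u /* assumed 4 KiB pages */

struct bench_bufs
{
  unsigned char *src_base, *dst_base; /* page-aligned, passed to free */
  unsigned char *src, *dst;           /* bases plus requested offsets */
};

/* Allocate two page-aligned buffers able to hold LEN bytes at offsets
   ALIGN1 (source) and ALIGN2 (destination).  Returns 0 on success.  */
static int
bench_bufs_init (struct bench_bufs *b, size_t len, size_t align1,
                 size_t align2)
{
  assert (align1 < PAGE_BYTES && align2 < PAGE_BYTES);
  /* aligned_alloc wants a size that is a multiple of the alignment.  */
  size_t cap = (len + 2 * (size_t) PAGE_BYTES - 1) / PAGE_BYTES * PAGE_BYTES;
  b->src_base = aligned_alloc (PAGE_BYTES, cap);
  b->dst_base = aligned_alloc (PAGE_BYTES, cap);
  if (b->src_base == NULL || b->dst_base == NULL)
    {
      free (b->src_base);
      free (b->dst_base);
      return -1;
    }
  /* Touch every page up front so page faults stay out of the timing.  */
  memset (b->src_base, 0x5a, cap);
  memset (b->dst_base, 0, cap);
  b->src = b->src_base + align1;
  b->dst = b->dst_base + align2;
  return 0;
}

static void
bench_bufs_free (struct bench_bufs *b)
{
  free (b->src_base);
  free (b->dst_base);
}
```

With page-aligned bases, the offsets alone determine how the two pointers relate modulo 4096, which is the quantity that matters for the aliasing effect discussed below.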
Memmove Changes:
The change was benchmarked on an Icelake and Skylake CPU. See below
for CSV of data. Time is median of 25 runs of bench-memcpy-large.c in
nanoseconds. "New" is this patch, "Old" is the current implementation.
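As a reading aid for the tables, the two derived quantities can be computed as in the sketch below. The helper names are made up for illustration; the only facts taken from this mail are that each time is a median of 25 runs and that the last column is New divided by Old, expressed as a percentage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending.  */
static int
cmp_double (const void *a, const void *b)
{
  double x = *(const double *) a, y = *(const double *) b;
  return (x > y) - (x < y);
}

/* Median of N timings in nanoseconds.  Sorts T in place.  */
static double
median_ns (double *t, size_t n)
{
  assert (n > 0);
  qsort (t, n, sizeof *t, cmp_double);
  return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}

/* The "% New / Old" column: values below 100 mean New is faster.  */
static double
pct_new_over_old (double new_ns, double old_ns)
{
  return 100.0 * new_ns / old_ns;
}
```

For example, New = 173518.0 and Old = 400336.0 give about 43.3, i.e. the new code takes under half the time of the old code for that row.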
The majority of the performance changes were beneficial. The clearest
example is on Icelake, where alleviating the pressure from false
memory aliasing led to more than a 2x performance improvement for
certain alignments of VEC_SIZE=16 and a 1.5x performance improvement for
certain alignments of VEC_SIZE=32.
e.g.:
func ,size ,align1,align2,Old ,New ,% New / Old
sse2 ,1048591 ,0 ,3 ,400336.0 ,173518.0 ,43.3
avx ,1048591 ,0 ,3 ,210664.0 ,146304.0 ,69.4
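A rough way to model which alignment pairs hit the aliasing case is to look at the destination-minus-source distance modulo 4096. The predicate below is a sketch under that assumption only: ALIAS_WINDOW is a hypothetical threshold chosen for illustration, and neither it nor the function name comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* assumed 4 KiB pages */
#define ALIAS_WINDOW 256u  /* hypothetical window, not from the patch */

static_assert ((PAGE_BYTES & (PAGE_BYTES - 1)) == 0,
               "PAGE_BYTES must be a power of two");

/* True when DST sits a small positive distance ahead of SRC modulo the
   page size, so that in a forward copy later loads may share their low
   12 address bits with recently issued stores.  */
static bool
may_4k_alias (const void *dst, const void *src)
{
  uintptr_t delta = ((uintptr_t) dst - (uintptr_t) src) & (PAGE_BYTES - 1);
  return delta != 0 && delta < ALIAS_WINDOW;
}
```

With page-aligned buffers, this model flags the (align1, align2) pairs (0, 3), (3, 5) and (0, 127), and does not flag (0, 0), (3, 0), (0, 256) or (0, 4064), which matches where the large Icelake slowdowns appear in the Old column. It also flags (0, 255), where the Old column is slow for avx but not for sse2, so it should be read as an approximation rather than an exact rule.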
In addition, across the board for larger sizes (starting around size =
2^23) there was a roughly 0-10% performance improvement.
e.g.:
Skylake:
sse2 ,33554439 ,0 ,0 ,4672510.0 ,4391660.0 ,94.0
avx ,33554439 ,0 ,0 ,4849470.0 ,4398720.0 ,90.7
Icelake:
sse2 ,33554439 ,0 ,0 ,5926350.0 ,5588810.0 ,94.3
avx ,33554439 ,0 ,0 ,5582940.0 ,5313320.0 ,95.2
avx512 ,33554439 ,0 ,0 ,5531050.0 ,5292570.0 ,95.7
There were performance degradations, however: medium-large sizes
[2^20, 2^22] had roughly a 0-6% performance loss on Icelake for
VEC_SIZE=64. This degradation is worst for destination alignment=127.
e.g.:
avx512 ,1048583 ,0 ,0 ,133915.0 ,136436.0 ,101.9
avx512 ,1048576 ,0 ,127 ,142207.0 ,151144.0 ,106.3
avx512 ,2097159 ,0 ,0 ,267396.0 ,272355.0 ,101.9
avx512 ,2097152 ,0 ,127 ,284003.0 ,303094.0 ,106.7
avx512 ,4194311 ,0 ,0 ,536810.0 ,546741.0 ,101.9
avx512 ,4194304 ,0 ,127 ,570463.0 ,605906.0 ,106.2
Around 2^23 the change becomes neutral to advantageous:
avx512 ,8388615 ,0 ,0 ,1136350.0 ,1125880.0 ,99.1
avx512 ,8388608 ,0 ,127 ,1220480.0 ,1225000.0 ,100.4
Across the board, aside from the address aliasing case, the
performance difference is roughly in the range of [-6%, 12%]. The
aliasing cases are extreme outliers of roughly [150%, 200%] and are
heavily dependent on alignment.
It's possible these changes should only be made for VEC_SIZE=16/32, or
that the original forward memcpy should be kept for sizes [2^20, 2^22]
when there is no address aliasing. Please let me know what you think.
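To make the two alternatives concrete, here they are as small C selection functions. This is purely an illustration of the proposal: the enum, the function names, and the idea of a boolean aliasing flag are assumptions, and the patch itself does its dispatch in assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum large_copy_kind
{
  COPY_FORWARD_ORIGINAL, /* existing forward large-copy loop */
  COPY_PAGE_INTERLEAVED  /* new 2-page or 4-page loop */
};

/* Alternative 1: use the new loop only for VEC_SIZE=16/32.  */
static enum large_copy_kind
pick_by_vec_size (unsigned vec_size)
{
  assert (vec_size == 16 || vec_size == 32 || vec_size == 64);
  return vec_size == 64 ? COPY_FORWARD_ORIGINAL : COPY_PAGE_INTERLEAVED;
}

/* Alternative 2: keep the original loop for sizes in [2^20, 2^22]
   unless the source and destination are expected to alias.  */
static enum large_copy_kind
pick_by_size (size_t size, bool aliasing)
{
  const size_t lo = (size_t) 1 << 20, hi = (size_t) 1 << 22;
  if (size >= lo && size <= hi && !aliasing)
    return COPY_FORWARD_ORIGINAL;
  return COPY_PAGE_INTERLEAVED;
}
```

Alternative 1 avoids the 0-6% loss seen for avx512 but also forgoes the gains avx512 shows at 2^24 and above. Alternative 2 keeps those gains at the cost of an extra size and aliasing check on the large-copy path.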
Performance Numbers (Skylake Numbers Below):
Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
func ,size ,align1,align2,Old ,New ,% New / Old
sse2 ,1048583 ,0 ,0 ,147297.0 ,146234.0 ,99.3
sse2 ,1048591 ,0 ,3 ,400336.0 ,173518.0 ,43.3
sse2 ,1048607 ,3 ,0 ,151488.0 ,150773.0 ,99.5
sse2 ,1048639 ,3 ,5 ,399842.0 ,174222.0 ,43.6
sse2 ,1048576 ,0 ,127 ,356326.0 ,171422.0 ,48.1
sse2 ,1048576 ,0 ,255 ,144145.0 ,152123.0 ,105.5
sse2 ,1048576 ,0 ,256 ,147605.0 ,148005.0 ,100.3
sse2 ,1048576 ,0 ,4064 ,146929.0 ,147812.0 ,100.6
sse2 ,2097159 ,0 ,0 ,293910.0 ,291403.0 ,99.1
sse2 ,2097167 ,0 ,3 ,798920.0 ,346694.0 ,43.4
sse2 ,2097183 ,3 ,0 ,301171.0 ,299606.0 ,99.5
sse2 ,2097215 ,3 ,5 ,799129.0 ,346597.0 ,43.4
sse2 ,2097152 ,0 ,127 ,710256.0 ,341110.0 ,48.0
sse2 ,2097152 ,0 ,255 ,286370.0 ,302553.0 ,105.7
sse2 ,2097152 ,0 ,256 ,293691.0 ,294825.0 ,100.4
sse2 ,2097152 ,0 ,4064 ,292920.0 ,294180.0 ,100.4
sse2 ,4194311 ,0 ,0 ,587894.0 ,586827.0 ,99.8
sse2 ,4194319 ,0 ,3 ,1596340.0 ,694200.0 ,43.5
sse2 ,4194335 ,3 ,0 ,601996.0 ,601342.0 ,99.9
sse2 ,4194367 ,3 ,5 ,1596870.0 ,694562.0 ,43.5
sse2 ,4194304 ,0 ,127 ,1414140.0 ,682856.0 ,48.3
sse2 ,4194304 ,0 ,255 ,573752.0 ,607024.0 ,105.8
sse2 ,4194304 ,0 ,256 ,586961.0 ,591899.0 ,100.8
sse2 ,4194304 ,0 ,4064 ,586618.0 ,591267.0 ,100.8
sse2 ,8388615 ,0 ,0 ,1267450.0 ,1213660.0 ,95.8
sse2 ,8388623 ,0 ,3 ,3204280.0 ,1404460.0 ,43.8
sse2 ,8388639 ,3 ,0 ,1298940.0 ,1245790.0 ,95.9
sse2 ,8388671 ,3 ,5 ,3200790.0 ,1404540.0 ,43.9
sse2 ,8388608 ,0 ,127 ,2843880.0 ,1380490.0 ,48.5
sse2 ,8388608 ,0 ,255 ,1261040.0 ,1259110.0 ,99.8
sse2 ,8388608 ,0 ,256 ,1301120.0 ,1228890.0 ,94.4
sse2 ,8388608 ,0 ,4064 ,1263930.0 ,1233400.0 ,97.6
sse2 ,16777223 ,0 ,0 ,2845260.0 ,2690490.0 ,94.6
sse2 ,16777231 ,0 ,3 ,6424220.0 ,2999980.0 ,46.7
sse2 ,16777247 ,3 ,0 ,2902290.0 ,2764350.0 ,95.2
sse2 ,16777279 ,3 ,5 ,6413600.0 ,2999310.0 ,46.8
sse2 ,16777216 ,0 ,127 ,5704050.0 ,2986650.0 ,52.4
sse2 ,16777216 ,0 ,255 ,2823440.0 ,2790510.0 ,98.8
sse2 ,16777216 ,0 ,256 ,2926150.0 ,2711540.0 ,92.7
sse2 ,16777216 ,0 ,4064 ,2836530.0 ,2738850.0 ,96.6
sse2 ,33554439 ,0 ,0 ,5926350.0 ,5588810.0 ,94.3
sse2 ,33554447 ,0 ,3 ,12850900.0 ,6171500.0 ,48.0
sse2 ,33554463 ,3 ,0 ,6041090.0 ,5731480.0 ,94.9
sse2 ,33554495 ,3 ,5 ,12851100.0 ,6179870.0 ,48.1
sse2 ,33554432 ,0 ,127 ,11381900.0 ,6134130.0 ,53.9
sse2 ,33554432 ,0 ,255 ,5899320.0 ,5792680.0 ,98.2
sse2 ,33554432 ,0 ,256 ,6066220.0 ,5636270.0 ,92.9
sse2 ,33554432 ,0 ,4064 ,5915210.0 ,5688830.0 ,96.2
avx ,1048583 ,0 ,0 ,134392.0 ,136494.0 ,101.6
avx ,1048591 ,0 ,3 ,210664.0 ,146304.0 ,69.4
avx ,1048607 ,3 ,0 ,138559.0 ,138887.0 ,100.2
avx ,1048639 ,3 ,5 ,210655.0 ,146690.0 ,69.6
avx ,1048576 ,0 ,127 ,219819.0 ,155758.0 ,70.9
avx ,1048576 ,0 ,255 ,180740.0 ,146392.0 ,81.0
avx ,1048576 ,0 ,256 ,138448.0 ,142813.0 ,103.2
avx ,1048576 ,0 ,4064 ,133067.0 ,136384.0 ,102.5
avx ,2097159 ,0 ,0 ,268811.0 ,272810.0 ,101.5
avx ,2097167 ,0 ,3 ,419724.0 ,292730.0 ,69.7
avx ,2097183 ,3 ,0 ,277358.0 ,277789.0 ,100.2
avx ,2097215 ,3 ,5 ,421091.0 ,292907.0 ,69.6
avx ,2097152 ,0 ,127 ,439166.0 ,311969.0 ,71.0
avx ,2097152 ,0 ,255 ,359858.0 ,293484.0 ,81.6
avx ,2097152 ,0 ,256 ,276467.0 ,285067.0 ,103.1
avx ,2097152 ,0 ,4064 ,266145.0 ,273049.0 ,102.6
avx ,4194311 ,0 ,0 ,538566.0 ,547454.0 ,101.7
avx ,4194319 ,0 ,3 ,841884.0 ,586111.0 ,69.6
avx ,4194335 ,3 ,0 ,555930.0 ,557857.0 ,100.3
avx ,4194367 ,3 ,5 ,841146.0 ,586329.0 ,69.7
avx ,4194304 ,0 ,127 ,879711.0 ,625865.0 ,71.1
avx ,4194304 ,0 ,255 ,718131.0 ,588442.0 ,81.9
avx ,4194304 ,0 ,256 ,553593.0 ,571956.0 ,103.3
avx ,4194304 ,0 ,4064 ,534461.0 ,547903.0 ,102.5
avx ,8388615 ,0 ,0 ,1145460.0 ,1127430.0 ,98.4
avx ,8388623 ,0 ,3 ,1704200.0 ,1185410.0 ,69.6
avx ,8388639 ,3 ,0 ,1179600.0 ,1145670.0 ,97.1
avx ,8388671 ,3 ,5 ,1702480.0 ,1183410.0 ,69.5
avx ,8388608 ,0 ,127 ,1773750.0 ,1264360.0 ,71.3
avx ,8388608 ,0 ,255 ,1450840.0 ,1189310.0 ,82.0
avx ,8388608 ,0 ,256 ,1179160.0 ,1157490.0 ,98.2
avx ,8388608 ,0 ,4064 ,1135990.0 ,1128150.0 ,99.3
avx ,16777223 ,0 ,0 ,2630160.0 ,2553770.0 ,97.1
avx ,16777231 ,0 ,3 ,3539370.0 ,2667050.0 ,75.4
avx ,16777247 ,3 ,0 ,2671830.0 ,2585550.0 ,96.8
avx ,16777279 ,3 ,5 ,3537460.0 ,2664080.0 ,75.3
avx ,16777216 ,0 ,127 ,3598350.0 ,2784810.0 ,77.4
avx ,16777216 ,0 ,255 ,3012890.0 ,2650420.0 ,88.0
avx ,16777216 ,0 ,256 ,2690480.0 ,2605640.0 ,96.8
avx ,16777216 ,0 ,4064 ,2607870.0 ,2537450.0 ,97.3
avx ,33554439 ,0 ,0 ,5582940.0 ,5313320.0 ,95.2
avx ,33554447 ,0 ,3 ,7208430.0 ,5541330.0 ,76.9
avx ,33554463 ,3 ,0 ,5613760.0 ,5399880.0 ,96.2
avx ,33554495 ,3 ,5 ,7202140.0 ,5547470.0 ,77.0
avx ,33554432 ,0 ,127 ,7287570.0 ,5784590.0 ,79.4
avx ,33554432 ,0 ,255 ,6156640.0 ,5508630.0 ,89.5
avx ,33554432 ,0 ,256 ,5700530.0 ,5441950.0 ,95.5
avx ,33554432 ,0 ,4064 ,5531820.0 ,5302580.0 ,95.9
avx512 ,1048583 ,0 ,0 ,133915.0 ,136436.0 ,101.9
avx512 ,1048591 ,0 ,3 ,142372.0 ,146319.0 ,102.8
avx512 ,1048607 ,3 ,0 ,134629.0 ,139098.0 ,103.3
avx512 ,1048639 ,3 ,5 ,142362.0 ,146405.0 ,102.8
avx512 ,1048576 ,0 ,127 ,142207.0 ,151144.0 ,106.3
avx512 ,1048576 ,0 ,255 ,143736.0 ,147800.0 ,102.8
avx512 ,1048576 ,0 ,256 ,139937.0 ,142958.0 ,102.2
avx512 ,1048576 ,0 ,4064 ,134730.0 ,139222.0 ,103.3
avx512 ,2097159 ,0 ,0 ,267396.0 ,272355.0 ,101.9
avx512 ,2097167 ,0 ,3 ,284152.0 ,293076.0 ,103.1
avx512 ,2097183 ,3 ,0 ,269656.0 ,278215.0 ,103.2
avx512 ,2097215 ,3 ,5 ,284422.0 ,293030.0 ,103.0
avx512 ,2097152 ,0 ,127 ,284003.0 ,303094.0 ,106.7
avx512 ,2097152 ,0 ,255 ,287381.0 ,295503.0 ,102.8
avx512 ,2097152 ,0 ,256 ,280224.0 ,286054.0 ,102.1
avx512 ,2097152 ,0 ,4064 ,270038.0 ,277907.0 ,102.9
avx512 ,4194311 ,0 ,0 ,536810.0 ,546741.0 ,101.9
avx512 ,4194319 ,0 ,3 ,570476.0 ,584715.0 ,102.5
avx512 ,4194335 ,3 ,0 ,539745.0 ,556838.0 ,103.2
avx512 ,4194367 ,3 ,5 ,570148.0 ,586154.0 ,102.8
avx512 ,4194304 ,0 ,127 ,570463.0 ,605906.0 ,106.2
avx512 ,4194304 ,0 ,255 ,576014.0 ,590627.0 ,102.5
avx512 ,4194304 ,0 ,256 ,560921.0 ,572248.0 ,102.0
avx512 ,4194304 ,0 ,4064 ,540550.0 ,557613.0 ,103.2
avx512 ,8388615 ,0 ,0 ,1136350.0 ,1125880.0 ,99.1
avx512 ,8388623 ,0 ,3 ,1218350.0 ,1192400.0 ,97.9
avx512 ,8388639 ,3 ,0 ,1139420.0 ,1144530.0 ,100.4
avx512 ,8388671 ,3 ,5 ,1219760.0 ,1191420.0 ,97.7
avx512 ,8388608 ,0 ,127 ,1220480.0 ,1225000.0 ,100.4
avx512 ,8388608 ,0 ,255 ,1222290.0 ,1190400.0 ,97.4
avx512 ,8388608 ,0 ,256 ,1194810.0 ,1154410.0 ,96.6
avx512 ,8388608 ,0 ,4064 ,1138850.0 ,1147750.0 ,100.8
avx512 ,16777223 ,0 ,0 ,2601040.0 ,2535500.0 ,97.5
avx512 ,16777231 ,0 ,3 ,2759350.0 ,2674570.0 ,96.9
avx512 ,16777247 ,3 ,0 ,2603500.0 ,2588260.0 ,99.4
avx512 ,16777279 ,3 ,5 ,2743810.0 ,2674870.0 ,97.5
avx512 ,16777216 ,0 ,127 ,2754910.0 ,2726860.0 ,99.0
avx512 ,16777216 ,0 ,255 ,2750980.0 ,2651370.0 ,96.4
avx512 ,16777216 ,0 ,256 ,2707940.0 ,2589660.0 ,95.6
avx512 ,16777216 ,0 ,4064 ,2606760.0 ,2580980.0 ,99.0
avx512 ,33554439 ,0 ,0 ,5531050.0 ,5292570.0 ,95.7
avx512 ,33554447 ,0 ,3 ,5788490.0 ,5574380.0 ,96.3
avx512 ,33554463 ,3 ,0 ,5558950.0 ,5415190.0 ,97.4
avx512 ,33554495 ,3 ,5 ,5775400.0 ,5582390.0 ,96.7
avx512 ,33554432 ,0 ,127 ,5787680.0 ,5659730.0 ,97.8
avx512 ,33554432 ,0 ,255 ,5823500.0 ,5516530.0 ,94.7
avx512 ,33554432 ,0 ,256 ,5678760.0 ,5401000.0 ,95.1
avx512 ,33554432 ,0 ,4064 ,5573540.0 ,5400460.0 ,96.9
Skylake: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
func ,size ,align1,align2,Old ,New ,% New / Old
sse2 ,1048583 ,0 ,0 ,71890.2 ,70626.8 ,98.2
sse2 ,1048591 ,0 ,3 ,72200.5 ,74263.6 ,102.9
sse2 ,1048607 ,3 ,0 ,71360.5 ,70106.5 ,98.2
sse2 ,1048639 ,3 ,5 ,71972.1 ,73468.0 ,102.1
sse2 ,1048576 ,0 ,127 ,81634.2 ,77607.6 ,95.1
sse2 ,1048576 ,0 ,255 ,71575.2 ,71951.5 ,100.5
sse2 ,1048576 ,0 ,256 ,72383.2 ,69610.8 ,96.2
sse2 ,1048576 ,0 ,4064 ,71996.6 ,70941.0 ,98.5
sse2 ,2097159 ,0 ,0 ,143835.0 ,140186.0 ,97.5
sse2 ,2097167 ,0 ,3 ,146347.0 ,147984.0 ,101.1
sse2 ,2097183 ,3 ,0 ,145740.0 ,140317.0 ,96.3
sse2 ,2097215 ,3 ,5 ,147099.0 ,147066.0 ,100.0
sse2 ,2097152 ,0 ,127 ,163712.0 ,157386.0 ,96.1
sse2 ,2097152 ,0 ,255 ,145048.0 ,144970.0 ,99.9
sse2 ,2097152 ,0 ,256 ,144545.0 ,139948.0 ,96.8
sse2 ,2097152 ,0 ,4064 ,143519.0 ,140975.0 ,98.2
sse2 ,4194311 ,0 ,0 ,293848.0 ,283531.0 ,96.5
sse2 ,4194319 ,0 ,3 ,305127.0 ,295478.0 ,96.8
sse2 ,4194335 ,3 ,0 ,299170.0 ,283950.0 ,94.9
sse2 ,4194367 ,3 ,5 ,307419.0 ,293175.0 ,95.4
sse2 ,4194304 ,0 ,127 ,332567.0 ,318276.0 ,95.7
sse2 ,4194304 ,0 ,255 ,304897.0 ,300309.0 ,98.5
sse2 ,4194304 ,0 ,256 ,298929.0 ,284008.0 ,95.0
sse2 ,4194304 ,0 ,4064 ,296282.0 ,286087.0 ,96.6
sse2 ,8388615 ,0 ,0 ,751380.0 ,724191.0 ,96.4
sse2 ,8388623 ,0 ,3 ,775657.0 ,734942.0 ,94.8
sse2 ,8388639 ,3 ,0 ,756674.0 ,712934.0 ,94.2
sse2 ,8388671 ,3 ,5 ,774934.0 ,736895.0 ,95.1
sse2 ,8388608 ,0 ,127 ,781242.0 ,741475.0 ,94.9
sse2 ,8388608 ,0 ,255 ,762849.0 ,725086.0 ,95.0
sse2 ,8388608 ,0 ,256 ,758465.0 ,711665.0 ,93.8
sse2 ,8388608 ,0 ,4064 ,755243.0 ,738092.0 ,97.7
sse2 ,16777223 ,0 ,0 ,2104730.0 ,1954140.0 ,92.8
sse2 ,16777231 ,0 ,3 ,2129590.0 ,1951410.0 ,91.6
sse2 ,16777247 ,3 ,0 ,2102950.0 ,1952530.0 ,92.8
sse2 ,16777279 ,3 ,5 ,2126250.0 ,1952410.0 ,91.8
sse2 ,16777216 ,0 ,127 ,2074290.0 ,1932070.0 ,93.1
sse2 ,16777216 ,0 ,255 ,2060610.0 ,1941860.0 ,94.2
sse2 ,16777216 ,0 ,256 ,2106430.0 ,1952060.0 ,92.7
sse2 ,16777216 ,0 ,4064 ,2100660.0 ,1945610.0 ,92.6
sse2 ,33554439 ,0 ,0 ,4672510.0 ,4391660.0 ,94.0
sse2 ,33554447 ,0 ,3 ,4687860.0 ,4387680.0 ,93.6
sse2 ,33554463 ,3 ,0 ,4655420.0 ,4402580.0 ,94.6
sse2 ,33554495 ,3 ,5 ,4692800.0 ,4386350.0 ,93.5
sse2 ,33554432 ,0 ,127 ,4558620.0 ,4341510.0 ,95.2
sse2 ,33554432 ,0 ,255 ,4545130.0 ,4374230.0 ,96.2
sse2 ,33554432 ,0 ,256 ,4665000.0 ,4390850.0 ,94.1
sse2 ,33554432 ,0 ,4064 ,4666350.0 ,4374400.0 ,93.7
avx ,1048583 ,0 ,0 ,105460.0 ,104097.0 ,98.7
avx ,1048591 ,0 ,3 ,66369.2 ,67306.4 ,101.4
avx ,1048607 ,3 ,0 ,66625.8 ,64741.2 ,97.2
avx ,1048639 ,3 ,5 ,66757.7 ,65796.3 ,98.6
avx ,1048576 ,0 ,127 ,65272.4 ,65130.6 ,99.8
avx ,1048576 ,0 ,255 ,65632.1 ,65678.6 ,100.1
avx ,1048576 ,0 ,256 ,67530.1 ,64841.5 ,96.0
avx ,1048576 ,0 ,4064 ,65955.1 ,66194.8 ,100.4
avx ,2097159 ,0 ,0 ,132883.0 ,131644.0 ,99.1
avx ,2097167 ,0 ,3 ,133825.0 ,132308.0 ,98.9
avx ,2097183 ,3 ,0 ,133567.0 ,129040.0 ,96.6
avx ,2097215 ,3 ,5 ,133856.0 ,132735.0 ,99.2
avx ,2097152 ,0 ,127 ,131219.0 ,129983.0 ,99.1
avx ,2097152 ,0 ,255 ,131450.0 ,131755.0 ,100.2
avx ,2097152 ,0 ,256 ,135219.0 ,132616.0 ,98.1
avx ,2097152 ,0 ,4064 ,131692.0 ,132351.0 ,100.5
avx ,4194311 ,0 ,0 ,278494.0 ,265144.0 ,95.2
avx ,4194319 ,0 ,3 ,282868.0 ,267499.0 ,94.6
avx ,4194335 ,3 ,0 ,275956.0 ,262626.0 ,95.2
avx ,4194367 ,3 ,5 ,283080.0 ,266712.0 ,94.2
avx ,4194304 ,0 ,127 ,270912.0 ,266153.0 ,98.2
avx ,4194304 ,0 ,255 ,266650.0 ,267640.0 ,100.4
avx ,4194304 ,0 ,256 ,276224.0 ,264929.0 ,95.9
avx ,4194304 ,0 ,4064 ,274156.0 ,265264.0 ,96.8
avx ,8388615 ,0 ,0 ,820710.0 ,799313.0 ,97.4
avx ,8388623 ,0 ,3 ,881478.0 ,816087.0 ,92.6
avx ,8388639 ,3 ,0 ,881138.0 ,788571.0 ,89.5
avx ,8388671 ,3 ,5 ,883555.0 ,820020.0 ,92.8
avx ,8388608 ,0 ,127 ,799727.0 ,785502.0 ,98.2
avx ,8388608 ,0 ,255 ,785782.0 ,800006.0 ,101.8
avx ,8388608 ,0 ,256 ,876745.0 ,809691.0 ,92.4
avx ,8388608 ,0 ,4064 ,895120.0 ,809204.0 ,90.4
avx ,16777223 ,0 ,0 ,2138420.0 ,1955110.0 ,91.4
avx ,16777231 ,0 ,3 ,2208590.0 ,1966590.0 ,89.0
avx ,16777247 ,3 ,0 ,2209190.0 ,1968980.0 ,89.1
avx ,16777279 ,3 ,5 ,2207120.0 ,1964830.0 ,89.0
avx ,16777216 ,0 ,127 ,2123460.0 ,1942180.0 ,91.5
avx ,16777216 ,0 ,255 ,2120500.0 ,1951910.0 ,92.0
avx ,16777216 ,0 ,256 ,2193680.0 ,1963540.0 ,89.5
avx ,16777216 ,0 ,4064 ,2196110.0 ,1970050.0 ,89.7
avx ,33554439 ,0 ,0 ,4849470.0 ,4398720.0 ,90.7
avx ,33554447 ,0 ,3 ,4855270.0 ,4402670.0 ,90.7
avx ,33554463 ,3 ,0 ,4877600.0 ,4405480.0 ,90.3
avx ,33554495 ,3 ,5 ,4851190.0 ,4401330.0 ,90.7
avx ,33554432 ,0 ,127 ,4699810.0 ,4324860.0 ,92.0
avx ,33554432 ,0 ,255 ,4676570.0 ,4363830.0 ,93.3
avx ,33554432 ,0 ,256 ,4846720.0 ,4376970.0 ,90.3
avx ,33554432 ,0 ,4064 ,4839810.0 ,4400570.0 ,90.9
.../multiarch/memmove-vec-unaligned-erms.S | 326 ++++++++++++++----
1 file changed, 254 insertions(+), 72 deletions(-)
Comments
On Sun, Mar 14, 2021 at 12:05 PM noah <goldstein.w.n@gmail.com> wrote: > > No Bug. This commit updates the large memcpy case (no overlap). The > update is to perform memcpy on either 2 or 4 contiguous pages at > once. This 1) helps to alleviate the affects of false memory aliasing > when destination and source have a close 4k alignment and 2) In most > cases and for most DRAM units is a modestly more efficient access > pattern. These changes are a clear performance improvement for > VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy, > test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all > pass. > > Signed-off-by: noah <goldstein.w.n@gmail.com> > --- > In this patch is an update to memmove-vec-unaligned-erms.S, additions > to test-memmove.c and test-memcp.c, and additions to > bench-memcpy-large.c. > > Test Changes: > These changes where largely in the vein of increasing the maximum test > size, increasing the range of misalignments, and expanding the to > cover both forward/backward copying. > > Bench Changes: > These changes where to increase the range of tested > alignments. Relative alignment and source and destination can make a > huge impact on performance (more below) even when the there is no > overlap. > > Memmove Changes: > The change was benchmarked on an Icelake and Skylake CPU. See below > for CSV of data. Time is median of 25 runs of bench-memcpy-large.c in > nanoseconds. "New" is this patch, "Old" is the current implementation. > > The majority of changes in performance where beneficial. The most > clear example is on icelake where alleviating the pressure on false > memory aliasing lead to more than a 2x performance improvement for > certain alignments of VEC_SIZE=16 and 1.5x performance improvement for > certain alignments of VEC_SIZE=32. 
> i.e: > func ,size ,align1,align2,Old ,New ,% New / Old > sse2 ,1048591 ,0 ,3 ,400336.0 ,173518.0 ,43.3 > avx ,1048591 ,0 ,3 ,210664.0 ,146304.0 ,69.4 > > As well across the board for larger sizes (starting around size = > 2^23) there was a roughly 0-10% performance improvement. > > i.e: > Skylake: > sse2 ,33554439 ,0 ,0 ,4672510.0 ,4391660.0 ,94.0 > avx ,33554439 ,0 ,0 ,4849470.0 ,4398720.0 ,90.7 > > Icelake: > sse2 ,33554439 ,0 ,0 ,5926350.0 ,5588810.0 ,94.3 > avx ,33554439 ,0 ,0 ,5582940.0 ,5313320.0 ,95.2 > avx512 ,33554439 ,0 ,0 ,5531050.0 ,5292570.0 ,95.7 > > There where performance degregations, however: Medium large sizes > [2^20, 2^22] had roughly a 0-6% performance loss on Icelake for > VEC_SIZE=64. This degregation is worst for destination alignment=127. > i.e: > avx512 ,1048583 ,0 ,0 ,133915.0 ,136436.0 ,101.9 > avx512 ,1048576 ,0 ,127 ,142207.0 ,151144.0 ,106.3 > avx512 ,2097159 ,0 ,0 ,267396.0 ,272355.0 ,101.9 > avx512 ,2097152 ,0 ,127 ,284003.0 ,303094.0 ,106.7 > avx512 ,4194311 ,0 ,0 ,536810.0 ,546741.0 ,101.9 > avx512 ,4194304 ,0 ,127 ,570463.0 ,605906.0 ,106.2 > > Around 2^23 the change becomes neutral - advantageous: > avx512 ,8388615 ,0 ,0 ,1136350.0 ,1125880.0 ,99.1 > avx512 ,8388608 ,0 ,127 ,1220480.0 ,1225000.0 ,100.4 > > Across the board, aside from the address aliasing case, the > performance difference is roughly in the range of [-6%, 12%] with some > extreme [150%, 200%] cases that are heavily dependent on alignment. > > Its possible these changes should only be made for VEC_SIZE=16/32 or > to keep the original forward memcpy for sizes [2^20, 2^22] in the case > that there is no address aliasing. Please let me know what you think. 
> > Performance Numbers (Skylake Numbers Below): > > Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz > func ,size ,align1,align2,Old ,New ,% New / Old > sse2 ,1048583 ,0 ,0 ,147297.0 ,146234.0 ,99.3 > sse2 ,1048591 ,0 ,3 ,400336.0 ,173518.0 ,43.3 > sse2 ,1048607 ,3 ,0 ,151488.0 ,150773.0 ,99.5 > sse2 ,1048639 ,3 ,5 ,399842.0 ,174222.0 ,43.6 > sse2 ,1048576 ,0 ,127 ,356326.0 ,171422.0 ,48.1 > sse2 ,1048576 ,0 ,255 ,144145.0 ,152123.0 ,105.5 > sse2 ,1048576 ,0 ,256 ,147605.0 ,148005.0 ,100.3 > sse2 ,1048576 ,0 ,4064 ,146929.0 ,147812.0 ,100.6 > sse2 ,2097159 ,0 ,0 ,293910.0 ,291403.0 ,99.1 > sse2 ,2097167 ,0 ,3 ,798920.0 ,346694.0 ,43.4 > sse2 ,2097183 ,3 ,0 ,301171.0 ,299606.0 ,99.5 > sse2 ,2097215 ,3 ,5 ,799129.0 ,346597.0 ,43.4 > sse2 ,2097152 ,0 ,127 ,710256.0 ,341110.0 ,48.0 > sse2 ,2097152 ,0 ,255 ,286370.0 ,302553.0 ,105.7 > sse2 ,2097152 ,0 ,256 ,293691.0 ,294825.0 ,100.4 > sse2 ,2097152 ,0 ,4064 ,292920.0 ,294180.0 ,100.4 > sse2 ,4194311 ,0 ,0 ,587894.0 ,586827.0 ,99.8 > sse2 ,4194319 ,0 ,3 ,1596340.0 ,694200.0 ,43.5 > sse2 ,4194335 ,3 ,0 ,601996.0 ,601342.0 ,99.9 > sse2 ,4194367 ,3 ,5 ,1596870.0 ,694562.0 ,43.5 > sse2 ,4194304 ,0 ,127 ,1414140.0 ,682856.0 ,48.3 > sse2 ,4194304 ,0 ,255 ,573752.0 ,607024.0 ,105.8 > sse2 ,4194304 ,0 ,256 ,586961.0 ,591899.0 ,100.8 > sse2 ,4194304 ,0 ,4064 ,586618.0 ,591267.0 ,100.8 > sse2 ,8388615 ,0 ,0 ,1267450.0 ,1213660.0 ,95.8 > sse2 ,8388623 ,0 ,3 ,3204280.0 ,1404460.0 ,43.8 > sse2 ,8388639 ,3 ,0 ,1298940.0 ,1245790.0 ,95.9 > sse2 ,8388671 ,3 ,5 ,3200790.0 ,1404540.0 ,43.9 > sse2 ,8388608 ,0 ,127 ,2843880.0 ,1380490.0 ,48.5 > sse2 ,8388608 ,0 ,255 ,1261040.0 ,1259110.0 ,99.8 > sse2 ,8388608 ,0 ,256 ,1301120.0 ,1228890.0 ,94.4 > sse2 ,8388608 ,0 ,4064 ,1263930.0 ,1233400.0 ,97.6 > sse2 ,16777223 ,0 ,0 ,2845260.0 ,2690490.0 ,94.6 > sse2 ,16777231 ,0 ,3 ,6424220.0 ,2999980.0 ,46.7 > sse2 ,16777247 ,3 ,0 ,2902290.0 ,2764350.0 ,95.2 > sse2 ,16777279 ,3 ,5 ,6413600.0 ,2999310.0 ,46.8 > sse2 ,16777216 ,0 ,127 ,5704050.0 
,2986650.0 ,52.4 > sse2 ,16777216 ,0 ,255 ,2823440.0 ,2790510.0 ,98.8 > sse2 ,16777216 ,0 ,256 ,2926150.0 ,2711540.0 ,92.7 > sse2 ,16777216 ,0 ,4064 ,2836530.0 ,2738850.0 ,96.6 > sse2 ,33554439 ,0 ,0 ,5926350.0 ,5588810.0 ,94.3 > sse2 ,33554447 ,0 ,3 ,12850900.0 ,6171500.0 ,48.0 > sse2 ,33554463 ,3 ,0 ,6041090.0 ,5731480.0 ,94.9 > sse2 ,33554495 ,3 ,5 ,12851100.0 ,6179870.0 ,48.1 > sse2 ,33554432 ,0 ,127 ,11381900.0 ,6134130.0 ,53.9 > sse2 ,33554432 ,0 ,255 ,5899320.0 ,5792680.0 ,98.2 > sse2 ,33554432 ,0 ,256 ,6066220.0 ,5636270.0 ,92.9 > sse2 ,33554432 ,0 ,4064 ,5915210.0 ,5688830.0 ,96.2 > avx ,1048583 ,0 ,0 ,134392.0 ,136494.0 ,101.6 > avx ,1048591 ,0 ,3 ,210664.0 ,146304.0 ,69.4 > avx ,1048607 ,3 ,0 ,138559.0 ,138887.0 ,100.2 > avx ,1048639 ,3 ,5 ,210655.0 ,146690.0 ,69.6 > avx ,1048576 ,0 ,127 ,219819.0 ,155758.0 ,70.9 > avx ,1048576 ,0 ,255 ,180740.0 ,146392.0 ,81.0 > avx ,1048576 ,0 ,256 ,138448.0 ,142813.0 ,103.2 > avx ,1048576 ,0 ,4064 ,133067.0 ,136384.0 ,102.5 > avx ,2097159 ,0 ,0 ,268811.0 ,272810.0 ,101.5 > avx ,2097167 ,0 ,3 ,419724.0 ,292730.0 ,69.7 > avx ,2097183 ,3 ,0 ,277358.0 ,277789.0 ,100.2 > avx ,2097215 ,3 ,5 ,421091.0 ,292907.0 ,69.6 > avx ,2097152 ,0 ,127 ,439166.0 ,311969.0 ,71.0 > avx ,2097152 ,0 ,255 ,359858.0 ,293484.0 ,81.6 > avx ,2097152 ,0 ,256 ,276467.0 ,285067.0 ,103.1 > avx ,2097152 ,0 ,4064 ,266145.0 ,273049.0 ,102.6 > avx ,4194311 ,0 ,0 ,538566.0 ,547454.0 ,101.7 > avx ,4194319 ,0 ,3 ,841884.0 ,586111.0 ,69.6 > avx ,4194335 ,3 ,0 ,555930.0 ,557857.0 ,100.3 > avx ,4194367 ,3 ,5 ,841146.0 ,586329.0 ,69.7 > avx ,4194304 ,0 ,127 ,879711.0 ,625865.0 ,71.1 > avx ,4194304 ,0 ,255 ,718131.0 ,588442.0 ,81.9 > avx ,4194304 ,0 ,256 ,553593.0 ,571956.0 ,103.3 > avx ,4194304 ,0 ,4064 ,534461.0 ,547903.0 ,102.5 > avx ,8388615 ,0 ,0 ,1145460.0 ,1127430.0 ,98.4 > avx ,8388623 ,0 ,3 ,1704200.0 ,1185410.0 ,69.6 > avx ,8388639 ,3 ,0 ,1179600.0 ,1145670.0 ,97.1 > avx ,8388671 ,3 ,5 ,1702480.0 ,1183410.0 ,69.5 > avx ,8388608 ,0 ,127 ,1773750.0 
,1264360.0 ,71.3 > avx ,8388608 ,0 ,255 ,1450840.0 ,1189310.0 ,82.0 > avx ,8388608 ,0 ,256 ,1179160.0 ,1157490.0 ,98.2 > avx ,8388608 ,0 ,4064 ,1135990.0 ,1128150.0 ,99.3 > avx ,16777223 ,0 ,0 ,2630160.0 ,2553770.0 ,97.1 > avx ,16777231 ,0 ,3 ,3539370.0 ,2667050.0 ,75.4 > avx ,16777247 ,3 ,0 ,2671830.0 ,2585550.0 ,96.8 > avx ,16777279 ,3 ,5 ,3537460.0 ,2664080.0 ,75.3 > avx ,16777216 ,0 ,127 ,3598350.0 ,2784810.0 ,77.4 > avx ,16777216 ,0 ,255 ,3012890.0 ,2650420.0 ,88.0 > avx ,16777216 ,0 ,256 ,2690480.0 ,2605640.0 ,96.8 > avx ,16777216 ,0 ,4064 ,2607870.0 ,2537450.0 ,97.3 > avx ,33554439 ,0 ,0 ,5582940.0 ,5313320.0 ,95.2 > avx ,33554447 ,0 ,3 ,7208430.0 ,5541330.0 ,76.9 > avx ,33554463 ,3 ,0 ,5613760.0 ,5399880.0 ,96.2 > avx ,33554495 ,3 ,5 ,7202140.0 ,5547470.0 ,77.0 > avx ,33554432 ,0 ,127 ,7287570.0 ,5784590.0 ,79.4 > avx ,33554432 ,0 ,255 ,6156640.0 ,5508630.0 ,89.5 > avx ,33554432 ,0 ,256 ,5700530.0 ,5441950.0 ,95.5 > avx ,33554432 ,0 ,4064 ,5531820.0 ,5302580.0 ,95.9 > avx512 ,1048583 ,0 ,0 ,133915.0 ,136436.0 ,101.9 > avx512 ,1048591 ,0 ,3 ,142372.0 ,146319.0 ,102.8 > avx512 ,1048607 ,3 ,0 ,134629.0 ,139098.0 ,103.3 > avx512 ,1048639 ,3 ,5 ,142362.0 ,146405.0 ,102.8 > avx512 ,1048576 ,0 ,127 ,142207.0 ,151144.0 ,106.3 > avx512 ,1048576 ,0 ,255 ,143736.0 ,147800.0 ,102.8 > avx512 ,1048576 ,0 ,256 ,139937.0 ,142958.0 ,102.2 > avx512 ,1048576 ,0 ,4064 ,134730.0 ,139222.0 ,103.3 > avx512 ,2097159 ,0 ,0 ,267396.0 ,272355.0 ,101.9 > avx512 ,2097167 ,0 ,3 ,284152.0 ,293076.0 ,103.1 > avx512 ,2097183 ,3 ,0 ,269656.0 ,278215.0 ,103.2 > avx512 ,2097215 ,3 ,5 ,284422.0 ,293030.0 ,103.0 > avx512 ,2097152 ,0 ,127 ,284003.0 ,303094.0 ,106.7 > avx512 ,2097152 ,0 ,255 ,287381.0 ,295503.0 ,102.8 > avx512 ,2097152 ,0 ,256 ,280224.0 ,286054.0 ,102.1 > avx512 ,2097152 ,0 ,4064 ,270038.0 ,277907.0 ,102.9 > avx512 ,4194311 ,0 ,0 ,536810.0 ,546741.0 ,101.9 > avx512 ,4194319 ,0 ,3 ,570476.0 ,584715.0 ,102.5 > avx512 ,4194335 ,3 ,0 ,539745.0 ,556838.0 ,103.2 > avx512 ,4194367 ,3 ,5 
,570148.0 ,586154.0 ,102.8
> avx512 ,4194304 ,0 ,127 ,570463.0 ,605906.0 ,106.2
> avx512 ,4194304 ,0 ,255 ,576014.0 ,590627.0 ,102.5
> avx512 ,4194304 ,0 ,256 ,560921.0 ,572248.0 ,102.0
> avx512 ,4194304 ,0 ,4064 ,540550.0 ,557613.0 ,103.2
> avx512 ,8388615 ,0 ,0 ,1136350.0 ,1125880.0 ,99.1
> avx512 ,8388623 ,0 ,3 ,1218350.0 ,1192400.0 ,97.9
> avx512 ,8388639 ,3 ,0 ,1139420.0 ,1144530.0 ,100.4
> avx512 ,8388671 ,3 ,5 ,1219760.0 ,1191420.0 ,97.7
> avx512 ,8388608 ,0 ,127 ,1220480.0 ,1225000.0 ,100.4
> avx512 ,8388608 ,0 ,255 ,1222290.0 ,1190400.0 ,97.4
> avx512 ,8388608 ,0 ,256 ,1194810.0 ,1154410.0 ,96.6
> avx512 ,8388608 ,0 ,4064 ,1138850.0 ,1147750.0 ,100.8
> avx512 ,16777223 ,0 ,0 ,2601040.0 ,2535500.0 ,97.5
> avx512 ,16777231 ,0 ,3 ,2759350.0 ,2674570.0 ,96.9
> avx512 ,16777247 ,3 ,0 ,2603500.0 ,2588260.0 ,99.4
> avx512 ,16777279 ,3 ,5 ,2743810.0 ,2674870.0 ,97.5
> avx512 ,16777216 ,0 ,127 ,2754910.0 ,2726860.0 ,99.0
> avx512 ,16777216 ,0 ,255 ,2750980.0 ,2651370.0 ,96.4
> avx512 ,16777216 ,0 ,256 ,2707940.0 ,2589660.0 ,95.6
> avx512 ,16777216 ,0 ,4064 ,2606760.0 ,2580980.0 ,99.0
> avx512 ,33554439 ,0 ,0 ,5531050.0 ,5292570.0 ,95.7
> avx512 ,33554447 ,0 ,3 ,5788490.0 ,5574380.0 ,96.3
> avx512 ,33554463 ,3 ,0 ,5558950.0 ,5415190.0 ,97.4
> avx512 ,33554495 ,3 ,5 ,5775400.0 ,5582390.0 ,96.7
> avx512 ,33554432 ,0 ,127 ,5787680.0 ,5659730.0 ,97.8
> avx512 ,33554432 ,0 ,255 ,5823500.0 ,5516530.0 ,94.7
> avx512 ,33554432 ,0 ,256 ,5678760.0 ,5401000.0 ,95.1
> avx512 ,33554432 ,0 ,4064 ,5573540.0 ,5400460.0 ,96.9
>
> Skylake: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
>
> func ,size ,align1,align2,Old ,New ,% New / Old
> sse2 ,1048583 ,0 ,0 ,71890.2 ,70626.8 ,98.2
> sse2 ,1048591 ,0 ,3 ,72200.5 ,74263.6 ,102.9
> sse2 ,1048607 ,3 ,0 ,71360.5 ,70106.5 ,98.2
> sse2 ,1048639 ,3 ,5 ,71972.1 ,73468.0 ,102.1
> sse2 ,1048576 ,0 ,127 ,81634.2 ,77607.6 ,95.1
> sse2 ,1048576 ,0 ,255 ,71575.2 ,71951.5 ,100.5
> sse2 ,1048576 ,0 ,256 ,72383.2 ,69610.8 ,96.2
> sse2 ,1048576 ,0 ,4064 ,71996.6 ,70941.0 ,98.5
> sse2 ,2097159 ,0 ,0 ,143835.0 ,140186.0 ,97.5
> sse2 ,2097167 ,0 ,3 ,146347.0 ,147984.0 ,101.1
> sse2 ,2097183 ,3 ,0 ,145740.0 ,140317.0 ,96.3
> sse2 ,2097215 ,3 ,5 ,147099.0 ,147066.0 ,100.0
> sse2 ,2097152 ,0 ,127 ,163712.0 ,157386.0 ,96.1
> sse2 ,2097152 ,0 ,255 ,145048.0 ,144970.0 ,99.9
> sse2 ,2097152 ,0 ,256 ,144545.0 ,139948.0 ,96.8
> sse2 ,2097152 ,0 ,4064 ,143519.0 ,140975.0 ,98.2
> sse2 ,4194311 ,0 ,0 ,293848.0 ,283531.0 ,96.5
> sse2 ,4194319 ,0 ,3 ,305127.0 ,295478.0 ,96.8
> sse2 ,4194335 ,3 ,0 ,299170.0 ,283950.0 ,94.9
> sse2 ,4194367 ,3 ,5 ,307419.0 ,293175.0 ,95.4
> sse2 ,4194304 ,0 ,127 ,332567.0 ,318276.0 ,95.7
> sse2 ,4194304 ,0 ,255 ,304897.0 ,300309.0 ,98.5
> sse2 ,4194304 ,0 ,256 ,298929.0 ,284008.0 ,95.0
> sse2 ,4194304 ,0 ,4064 ,296282.0 ,286087.0 ,96.6
> sse2 ,8388615 ,0 ,0 ,751380.0 ,724191.0 ,96.4
> sse2 ,8388623 ,0 ,3 ,775657.0 ,734942.0 ,94.8
> sse2 ,8388639 ,3 ,0 ,756674.0 ,712934.0 ,94.2
> sse2 ,8388671 ,3 ,5 ,774934.0 ,736895.0 ,95.1
> sse2 ,8388608 ,0 ,127 ,781242.0 ,741475.0 ,94.9
> sse2 ,8388608 ,0 ,255 ,762849.0 ,725086.0 ,95.0
> sse2 ,8388608 ,0 ,256 ,758465.0 ,711665.0 ,93.8
> sse2 ,8388608 ,0 ,4064 ,755243.0 ,738092.0 ,97.7
> sse2 ,16777223 ,0 ,0 ,2104730.0 ,1954140.0 ,92.8
> sse2 ,16777231 ,0 ,3 ,2129590.0 ,1951410.0 ,91.6
> sse2 ,16777247 ,3 ,0 ,2102950.0 ,1952530.0 ,92.8
> sse2 ,16777279 ,3 ,5 ,2126250.0 ,1952410.0 ,91.8
> sse2 ,16777216 ,0 ,127 ,2074290.0 ,1932070.0 ,93.1
> sse2 ,16777216 ,0 ,255 ,2060610.0 ,1941860.0 ,94.2
> sse2 ,16777216 ,0 ,256 ,2106430.0 ,1952060.0 ,92.7
> sse2 ,16777216 ,0 ,4064 ,2100660.0 ,1945610.0 ,92.6
> sse2 ,33554439 ,0 ,0 ,4672510.0 ,4391660.0 ,94.0
> sse2 ,33554447 ,0 ,3 ,4687860.0 ,4387680.0 ,93.6
> sse2 ,33554463 ,3 ,0 ,4655420.0 ,4402580.0 ,94.6
> sse2 ,33554495 ,3 ,5 ,4692800.0 ,4386350.0 ,93.5
> sse2 ,33554432 ,0 ,127 ,4558620.0 ,4341510.0 ,95.2
> sse2 ,33554432 ,0 ,255 ,4545130.0 ,4374230.0 ,96.2
> sse2 ,33554432 ,0 ,256 ,4665000.0 ,4390850.0 ,94.1
> sse2 ,33554432 ,0 ,4064 ,4666350.0 ,4374400.0 ,93.7
> avx ,1048583 ,0 ,0 ,105460.0 ,104097.0 ,98.7
> avx ,1048591 ,0 ,3 ,66369.2 ,67306.4 ,101.4
> avx ,1048607 ,3 ,0 ,66625.8 ,64741.2 ,97.2
> avx ,1048639 ,3 ,5 ,66757.7 ,65796.3 ,98.6
> avx ,1048576 ,0 ,127 ,65272.4 ,65130.6 ,99.8
> avx ,1048576 ,0 ,255 ,65632.1 ,65678.6 ,100.1
> avx ,1048576 ,0 ,256 ,67530.1 ,64841.5 ,96.0
> avx ,1048576 ,0 ,4064 ,65955.1 ,66194.8 ,100.4
> avx ,2097159 ,0 ,0 ,132883.0 ,131644.0 ,99.1
> avx ,2097167 ,0 ,3 ,133825.0 ,132308.0 ,98.9
> avx ,2097183 ,3 ,0 ,133567.0 ,129040.0 ,96.6
> avx ,2097215 ,3 ,5 ,133856.0 ,132735.0 ,99.2
> avx ,2097152 ,0 ,127 ,131219.0 ,129983.0 ,99.1
> avx ,2097152 ,0 ,255 ,131450.0 ,131755.0 ,100.2
> avx ,2097152 ,0 ,256 ,135219.0 ,132616.0 ,98.1
> avx ,2097152 ,0 ,4064 ,131692.0 ,132351.0 ,100.5
> avx ,4194311 ,0 ,0 ,278494.0 ,265144.0 ,95.2
> avx ,4194319 ,0 ,3 ,282868.0 ,267499.0 ,94.6
> avx ,4194335 ,3 ,0 ,275956.0 ,262626.0 ,95.2
> avx ,4194367 ,3 ,5 ,283080.0 ,266712.0 ,94.2
> avx ,4194304 ,0 ,127 ,270912.0 ,266153.0 ,98.2
> avx ,4194304 ,0 ,255 ,266650.0 ,267640.0 ,100.4
> avx ,4194304 ,0 ,256 ,276224.0 ,264929.0 ,95.9
> avx ,4194304 ,0 ,4064 ,274156.0 ,265264.0 ,96.8
> avx ,8388615 ,0 ,0 ,820710.0 ,799313.0 ,97.4
> avx ,8388623 ,0 ,3 ,881478.0 ,816087.0 ,92.6
> avx ,8388639 ,3 ,0 ,881138.0 ,788571.0 ,89.5
> avx ,8388671 ,3 ,5 ,883555.0 ,820020.0 ,92.8
> avx ,8388608 ,0 ,127 ,799727.0 ,785502.0 ,98.2
> avx ,8388608 ,0 ,255 ,785782.0 ,800006.0 ,101.8
> avx ,8388608 ,0 ,256 ,876745.0 ,809691.0 ,92.4
> avx ,8388608 ,0 ,4064 ,895120.0 ,809204.0 ,90.4
> avx ,16777223 ,0 ,0 ,2138420.0 ,1955110.0 ,91.4
> avx ,16777231 ,0 ,3 ,2208590.0 ,1966590.0 ,89.0
> avx ,16777247 ,3 ,0 ,2209190.0 ,1968980.0 ,89.1
> avx ,16777279 ,3 ,5 ,2207120.0 ,1964830.0 ,89.0
> avx ,16777216 ,0 ,127 ,2123460.0 ,1942180.0 ,91.5
> avx ,16777216 ,0 ,255 ,2120500.0 ,1951910.0 ,92.0
> avx ,16777216 ,0 ,256 ,2193680.0 ,1963540.0 ,89.5
> avx ,16777216 ,0 ,4064 ,2196110.0 ,1970050.0 ,89.7
> avx ,33554439 ,0 ,0 ,4849470.0 ,4398720.0 ,90.7
> avx ,33554447 ,0 ,3 ,4855270.0 ,4402670.0 ,90.7
> avx ,33554463 ,3 ,0 ,4877600.0 ,4405480.0 ,90.3
> avx ,33554495 ,3 ,5 ,4851190.0 ,4401330.0 ,90.7
> avx ,33554432 ,0 ,127 ,4699810.0 ,4324860.0 ,92.0
> avx ,33554432 ,0 ,255 ,4676570.0 ,4363830.0 ,93.3
> avx ,33554432 ,0 ,256 ,4846720.0 ,4376970.0 ,90.3
> avx ,33554432 ,0 ,4064 ,4839810.0 ,4400570.0 ,90.9
>
> .../multiarch/memmove-vec-unaligned-erms.S | 326 ++++++++++++++----
> 1 file changed, 254 insertions(+), 72 deletions(-)

My patch set on users/hjl/pr27457/master branch:

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/master

which is under review, changes the same file. I prefer my patch set
going in first.
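For readers checking the tables in this thread, the "% New / Old" column can be recomputed from the Old and New timings. The following is a minimal sketch in C, not part of the patch: the struct and function names are hypothetical, and the two sample rows are the Icelake rows for size 1048591 with align1 = 0 and align2 = 3 as quoted in the thread.

```c
/* One benchmark row: median time in nanoseconds before and after the patch.
   The type and field names are illustrative, not taken from glibc.  */
struct bench_row
{
  const char *func;
  unsigned long size;
  double old_ns;
  double new_ns;
};

/* "% New / Old" as reported in the tables: below 100 means the patch is
   faster, above 100 means it is slower.  */
static double
pct_new_over_old (const struct bench_row *r)
{
  return 100.0 * r->new_ns / r->old_ns;
}

/* The same comparison expressed as a speedup factor (Old / New): above 1.0
   means the patch is faster.  */
static double
speedup (const struct bench_row *r)
{
  return r->old_ns / r->new_ns;
}

/* Two Icelake rows copied from the thread
   (size 1048591, align1 0, align2 3).  */
static const struct bench_row sample_rows[] =
{
  { "sse2", 1048591UL, 400336.0, 173518.0 },
  { "avx", 1048591UL, 210664.0, 146304.0 },
};
```

With these inputs the sse2 row works out to roughly 43.3% (a speedup of about 2.3x) and the avx row to roughly 69.4% (about 1.44x), which is in line with the "more than 2x" and approximately "1.5x" figures in the patch summary.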
On Sun, Mar 14, 2021 at 3:20 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Mar 14, 2021 at 12:05 PM noah <goldstein.w.n@gmail.com> wrote:
> >
> > [...]
>
> My patch set on users/hjl/pr27457/master branch:
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/master
>
> which is under review, changes the same file. I prefer my patch set
> going in first.
>

Makes sense. Any chance you could add definitions for vec[8, 16] -> zmm[24,31]?

> --
> H.J.
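As a reading aid for the access pattern described in the patch summary (copying from 2 or 4 contiguous pages at once for large, non-overlapping copies), here is a rough C sketch of the 2-page case. It is an illustration under stated assumptions, not the assembly in memmove-vec-unaligned-erms.S: the function name, the 4096-byte page size, and the 64-byte step (standing in for a few vector loads and stores) are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* Assumed page size.  */
#define SKETCH_STEP 64		/* Assumed bytes moved per page per step.  */

/* Illustrative only: copy N bytes between non-overlapping buffers, touching
   two contiguous pages on every inner-loop step instead of streaming
   through one page at a time.  */
static void
copy_two_pages_at_once (unsigned char *dst, const unsigned char *src,
			size_t n)
{
  size_t done = 0;

  /* Main loop: handle one block of two pages per iteration.  */
  while (n - done >= 2 * SKETCH_PAGE_SIZE)
    {
      for (size_t off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_STEP)
	{
	  /* Same offset in the first and in the second page of the block.  */
	  memcpy (dst + done + off, src + done + off, SKETCH_STEP);
	  memcpy (dst + done + SKETCH_PAGE_SIZE + off,
		  src + done + SKETCH_PAGE_SIZE + off, SKETCH_STEP);
	}
      done += 2 * SKETCH_PAGE_SIZE;
    }

  /* Tail: whatever is left after the last full two-page block.  */
  memcpy (dst + done, src + done, n - done);
}
```

The sketch only shows the order in which addresses are visited. Whether that order helps on a given CPU is exactly what the benchmark tables in this thread measure.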
On Sun, Mar 14, 2021 at 12:48 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Sun, Mar 14, 2021 at 3:20 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sun, Mar 14, 2021 at 12:05 PM noah <goldstein.w.n@gmail.com> wrote:
> > >
> > > [...]
> >
> > My patch set on users/hjl/pr27457/master branch:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/master
> >
> > which is under review, changes the same file. I prefer my patch set
> > going in first.
> >
>
> Makes sense. Any chance you could add definitions for vec[8, 16] -> zmm[24,31]?

Done.

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 50bb1fccb2..d7a46a025e 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -51,6 +51,27 @@
 # define MEMMOVE_CHK_SYMBOL(p,s) MEMMOVE_SYMBOL(p, s)
 #endif
 
+#ifndef PAGE_SIZE
+# define PAGE_SIZE 4096
+#endif
+
+#if PAGE_SIZE != 4096
+# error Unsupported PAGE_SIZE
+#endif
+
+#ifndef LOG_PAGE_SIZE
+# define LOG_PAGE_SIZE 12
+#endif
+
+#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
+# error Invalid LOG_PAGE_SIZE
+#endif
+
+/* Amount to shift rdx by to compare for memcpy_large_4x. */
+#ifndef LOG_4X_MEMCPY_THRESH
+# define LOG_4X_MEMCPY_THRESH 4
+#endif
+
 #ifndef VZEROUPPER
 # if VEC_SIZE > 16
 # define VZEROUPPER vzeroupper
@@ -59,6 +80,13 @@
 # endif
 #endif
 
+/* Byte per page for large_memcpy inner loop. */
+#if VEC_SIZE == 64
+# define LARGE_LOAD_SIZE (VEC_SIZE * 2)
+#else
+# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
+#endif
+
 /* Avoid short distance rep movsb only with non-SSE vector. */
 #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
 # define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
@@ -75,7 +103,7 @@
 # define PREFETCH_SIZE 64
 #endif
 
-#define PREFETCHED_LOAD_SIZE (VEC_SIZE * 4)
+#define PREFETCHED_LOAD_SIZE LARGE_LOAD_SIZE
 
 #if PREFETCH_SIZE == 64
 # if PREFETCHED_LOAD_SIZE == PREFETCH_SIZE
@@ -97,7 +125,29 @@
 #else
 # error Unsupported PREFETCH_SIZE!
 #endif
-
+
+#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \
+	VMOVU (offset)base, vec0; \
+	VMOVU ((offset) + VEC_SIZE)base, vec1;
+# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \
+	VMOVNT vec0, (offset)base; \
+	VMOVNT vec1, ((offset) + VEC_SIZE)base;
+#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
+	VMOVU (offset)base, vec0; \
+	VMOVU ((offset) + VEC_SIZE)base, vec1; \
+	VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \
+	VMOVU ((offset) + VEC_SIZE * 3)base, vec3;
+# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
+	VMOVNT vec0, (offset)base; \
+	VMOVNT vec1, ((offset) + VEC_SIZE)base; \
+	VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \
+	VMOVNT vec3, ((offset) + VEC_SIZE * 3)base;
+#else
+# error Invalid LARGE_LOAD_SIZE
+#endif
+
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -384,6 +434,15 @@ L(last_4x_vec):
 	ret
 
 L(more_8x_vec):
+	/* Check if non-temporal move candidate. */
+#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+	/* Check non-temporal store threshold. */
+	cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	ja L(large_memcpy_2x)
+#endif
+	/* Entry if rdx is greater than non-temporal threshold but there
+	   is overlap. */
+L(more_8x_vec_check):
 	cmpq %rsi, %rdi
 	ja L(more_8x_vec_backward)
 	/* Source == destination is less common. */
@@ -410,11 +469,6 @@ L(more_8x_vec):
 	subq %r8, %rdi
 	/* Adjust length. */
 	addq %r8, %rdx
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-	/* Check non-temporal store threshold. */
-	cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja L(large_forward)
-#endif
 L(loop_4x_vec_forward):
 	/* Copy 4 * VEC a time forward. */
 	VMOVU (%rsi), %VEC(0)
@@ -462,11 +516,6 @@ L(more_8x_vec_backward):
 	subq %r8, %r9
 	/* Adjust length. */
 	subq %r8, %rdx
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-	/* Check non-temporal store threshold. */
-	cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja L(large_backward)
-#endif
 L(loop_4x_vec_backward):
 	/* Copy 4 * VEC a time backward. */
 	VMOVU (%rcx), %VEC(0)
@@ -493,73 +542,206 @@ L(loop_4x_vec_backward):
 	ret
 
 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-L(large_forward):
-	/* Don't use non-temporal store if there is overlap between
-	   destination and source since destination may be in cache
-	   when source is loaded. */
-	leaq (%rdi, %rdx), %r10
-	cmpq %r10, %rsi
-	jb L(loop_4x_vec_forward)
-L(loop_large_forward):
-	/* Copy 4 * VEC a time forward with non-temporal stores. */
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
+L(large_memcpy_2x):
+	/* Compute absolute value of difference between source and
+	   destination. */
+	movq %rdi, %r9
+	subq %rsi, %r9
+	movq %r9, %r8
+	leaq -1(%r9), %rcx
+	sarq $63, %r8
+	xorq %r8, %r9
+	subq %r8, %r9
+	/* Don't use non-temporal store if there is overlap between
+	   destination and source since destination may be in cache when
+	   source is loaded. */
+	cmpq %r9, %rdx
+	ja L(more_8x_vec_check)
+
+	/* Cache align destination. First store the first 64 bytes then
+	   adjust alignments. */
+	VMOVU (%rsi), %VEC(8)
+#if VEC_SIZE < 64
+	VMOVU VEC_SIZE(%rsi), %VEC(9)
+#if VEC_SIZE < 32
+	VMOVU (VEC_SIZE * 2)(%rsi), %VEC(10)
+	VMOVU (VEC_SIZE * 3)(%rsi), %VEC(11)
+#endif
+#endif
+	VMOVU %VEC(8), (%rdi)
+#if VEC_SIZE < 64
+	VMOVU %VEC(9), VEC_SIZE(%rdi)
+#if VEC_SIZE < 32
+	VMOVU %VEC(10), (VEC_SIZE * 2)(%rdi)
+	VMOVU %VEC(11), (VEC_SIZE * 3)(%rdi)
+#endif
+#endif
+	/* Adjust source, destination, and size. */
+	MOVQ %rdi, %r8
+	andq $63, %r8
+	/* Get the negative of offset for alignment. */
+	subq $64, %r8
+	/* Adjust source. */
+	subq %r8, %rsi
+	/* Adjust destination which should be aligned now. */
+	subq %r8, %rdi
+	/* Adjust length. */
+	addq %r8, %rdx
+
+	/* Test if source and destination addresses will alias. If they do
+	   the larger pipeline in large_memcpy_4x alleviated the
+	   performance drop. */
+	testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
+	jz L(large_memcpy_4x)
+
+	movq %rdx, %r10
+	shrq $LOG_4X_MEMCPY_THRESH, %r10
+	cmp __x86_shared_non_temporal_threshold(%rip), %r10
+	jae L(large_memcpy_4x)
+
+	/* edx will store remainder size for copying tail. */
+	andl $(PAGE_SIZE * 2 - 1), %edx
+	/* r10 stores outer loop counter. */
+	shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	/* Copy 4x VEC at a time from 2 pages. */
+	.p2align 5
+L(loop_large_memcpy_2x_outer):
+	/* ecx stores inner loop counter. */
+	movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_2x_inner):
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
+	/* Load vectors from rsi. */
+	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	addq $LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi. */
+	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	addq $LARGE_LOAD_SIZE, %rdi
+	decl %ecx
+	jnz L(loop_large_memcpy_2x_inner)
+	addq $PAGE_SIZE, %rdi
+	addq $PAGE_SIZE, %rsi
+	decq %r10
+	jne L(loop_large_memcpy_2x_outer)
+	sfence
+
+	/* Check if only last 4 loads are needed. */
+	cmpl $(VEC_SIZE * 4), %edx
+	jbe L(large_memcpy_2x_end)
+
+	/* Handle the last 2 * PAGE_SIZE bytes. Use temporal stores
+	   here. The region will fit in cache and it should fit user
+	   expectations for the tail of the memcpy region to be hot. */
+	.p2align 4
+L(loop_large_memcpy_2x_tail):
+	/* Copy 4 * VEC a time forward with temporal stores. */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
 	VMOVU (%rsi), %VEC(0)
 	VMOVU VEC_SIZE(%rsi), %VEC(1)
 	VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
 	VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
-	addq $PREFETCHED_LOAD_SIZE, %rsi
-	subq $PREFETCHED_LOAD_SIZE, %rdx
-	VMOVNT %VEC(0), (%rdi)
-	VMOVNT %VEC(1), VEC_SIZE(%rdi)
-	VMOVNT %VEC(2), (VEC_SIZE * 2)(%rdi)
-	VMOVNT %VEC(3), (VEC_SIZE * 3)(%rdi)
-	addq $PREFETCHED_LOAD_SIZE, %rdi
-	cmpq $PREFETCHED_LOAD_SIZE, %rdx
-	ja L(loop_large_forward)
-	sfence
+	addq $(VEC_SIZE * 4), %rsi
+	subl $(VEC_SIZE * 4), %edx
+	VMOVA %VEC(0), (%rdi)
+	VMOVA %VEC(1), VEC_SIZE(%rdi)
+	VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi)
+	addq $(VEC_SIZE * 4), %rdi
+	cmpl $(VEC_SIZE * 4), %edx
+	ja L(loop_large_memcpy_2x_tail)
+
+L(large_memcpy_2x_end):
 	/* Store the last 4 * VEC. */
-	VMOVU %VEC(5), (%rcx)
-	VMOVU %VEC(6), -VEC_SIZE(%rcx)
-	VMOVU %VEC(7), -(VEC_SIZE * 2)(%rcx)
-	VMOVU %VEC(8), -(VEC_SIZE * 3)(%rcx)
-	/* Store the first VEC. */
-	VMOVU %VEC(4), (%r11)
+	VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
+	VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
+	VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
+	VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(3)
+
+	VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU %VEC(3), -VEC_SIZE(%rdi, %rdx)
 	VZEROUPPER
 	ret
-
-L(large_backward):
-	/* Don't use non-temporal store if there is overlap between
-	   destination and source since destination may be in cache
-	   when source is loaded. */
-	leaq (%rcx, %rdx), %r10
-	cmpq %r10, %r9
-	jb L(loop_4x_vec_backward)
-L(loop_large_backward):
-	/* Copy 4 * VEC a time backward with non-temporal stores. */
-	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3)
-	VMOVU (%rcx), %VEC(0)
-	VMOVU -VEC_SIZE(%rcx), %VEC(1)
-	VMOVU -(VEC_SIZE * 2)(%rcx), %VEC(2)
-	VMOVU -(VEC_SIZE * 3)(%rcx), %VEC(3)
-	subq $PREFETCHED_LOAD_SIZE, %rcx
-	subq $PREFETCHED_LOAD_SIZE, %rdx
-	VMOVNT %VEC(0), (%r9)
-	VMOVNT %VEC(1), -VEC_SIZE(%r9)
-	VMOVNT %VEC(2), -(VEC_SIZE * 2)(%r9)
-	VMOVNT %VEC(3), -(VEC_SIZE * 3)(%r9)
-	subq $PREFETCHED_LOAD_SIZE, %r9
-	cmpq $PREFETCHED_LOAD_SIZE, %rdx
-	ja L(loop_large_backward)
-	sfence
-	/* Store the first 4 * VEC. */
-	VMOVU %VEC(4), (%rdi)
-	VMOVU %VEC(5), VEC_SIZE(%rdi)
-	VMOVU %VEC(6), (VEC_SIZE * 2)(%rdi)
-	VMOVU %VEC(7), (VEC_SIZE * 3)(%rdi)
-	/* Store the last VEC. */
-	VMOVU %VEC(8), (%r11)
+
+L(large_memcpy_4x):
+	movq %rdx, %r10
+	/* edx will store remainder size for copying tail. */
+	andl $(PAGE_SIZE * 4 - 1), %edx
+	/* r10 stores outer loop counter. */
+	shrq $(LOG_PAGE_SIZE + 2), %r10
+	/* Copy 4x VEC at a time from 4 pages. */
+	.p2align 5
+L(loop_large_memcpy_4x_outer):
+	/* ecx stores inner loop counter. */
+	movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_4x_inner):
+	/* Only one prefetch set per page as doing 4 pages give more time
+	   for prefetcher to keep up. */
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
+	/* Load vectors from rsi. */
+	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
+	addq $LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi. */
+	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
+	STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
+	addq $LARGE_LOAD_SIZE, %rdi
+	decl %ecx
+	jnz L(loop_large_memcpy_4x_inner)
+	addq $(PAGE_SIZE * 3), %rdi
+	addq $(PAGE_SIZE * 3), %rsi
+	decq %r10
+	jne L(loop_large_memcpy_4x_outer)
+	sfence
+
+	/* Check if only last 4 loads are needed. */
+	cmpl $(VEC_SIZE * 4), %edx
+	jbe L(large_memcpy_4x_end)
+
+	/* Handle the last 4 * PAGE_SIZE bytes. */
+	.p2align 4
+L(loop_large_memcpy_4x_tail):
+	/* Copy 4 * VEC a time forward with temporal stores. */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
+	VMOVU (%rsi), %VEC(0)
+	VMOVU VEC_SIZE(%rsi), %VEC(1)
+	VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
+	VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
+	addq $(VEC_SIZE * 4), %rsi
+	subl $(VEC_SIZE * 4), %edx
+	VMOVA %VEC(0), (%rdi)
+	VMOVA %VEC(1), VEC_SIZE(%rdi)
+	VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi)
+	addq $(VEC_SIZE * 4), %rdi
+	cmpl $(VEC_SIZE * 4), %edx
+	ja L(loop_large_memcpy_4x_tail)
+
+L(large_memcpy_4x_end):
+	/* Store the last 4 * VEC. */
+	VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
+	VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
+	VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
+	VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(3)
+
+	VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU %VEC(3), -VEC_SIZE(%rdi, %rdx)
 	VZEROUPPER
 	ret
 #endif
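
For anyone reviewing the branch structure around L(more_8x_vec) and L(large_memcpy_2x), here is a rough C model of how the patch appears to pick between the regular 4x VEC loops, the 2-page loop and the 4-page loop. It is only a sketch read off the assembly above: choose_path, copy_path, the PATH_* and MODEL_* names and the nt_threshold parameter (standing in for __x86_shared_non_temporal_threshold) are made up for illustration and are not glibc identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Defaults the patch defines for PAGE_SIZE and LOG_4X_MEMCPY_THRESH,
   renamed here to avoid clashing with system headers.  */
enum { MODEL_PAGE_SIZE = 4096, MODEL_LOG_4X_MEMCPY_THRESH = 4 };

/* Hypothetical names for the three outcomes; not glibc identifiers.  */
typedef enum { PATH_REGULAR, PATH_LARGE_2X, PATH_LARGE_4X } copy_path;

static copy_path
choose_path (uintptr_t dst, uintptr_t src, size_t len,
             size_t nt_threshold, unsigned int vec_size)
{
  assert (vec_size == 16 || vec_size == 32 || vec_size == 64);

  /* L(more_8x_vec): only sizes above the threshold are candidates.  */
  if (len <= nt_threshold)
    return PATH_REGULAR;

  /* r9 = |dst - src|; rcx keeps (dst - src) - 1 for the later testl.  */
  uintptr_t diff = dst - src;
  uintptr_t abs_diff = (intptr_t) diff < 0 ? (uintptr_t) 0 - diff : diff;

  /* cmpq %r9, %rdx; ja L(more_8x_vec_check): the buffers overlap, so
     fall back to the regular forward/backward loops.  */
  if (len > abs_diff)
    return PATH_REGULAR;

  /* The first 64 bytes are stored up front and the length shrinks by
     the amount needed to align dst to 64 bytes.  */
  len -= 64 - (dst & 63);

  /* testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx; jz L(large_memcpy_4x).  */
  if (((uint32_t) (diff - 1) & (MODEL_PAGE_SIZE - vec_size * 8)) == 0)
    return PATH_LARGE_4X;

  /* shrq / cmp / jae: len / 16 at or above the threshold also selects
     the 4-page loop.  */
  if ((len >> MODEL_LOG_4X_MEMCPY_THRESH) >= nt_threshold)
    return PATH_LARGE_4X;

  return PATH_LARGE_2X;
}
```

With a 1 MiB threshold and VEC_SIZE of 32, a 2 MiB copy between distant buffers lands on the 2-page loop in this model, unless (dst - src - 1) has none of the bits in PAGE_SIZE - VEC_SIZE * 8 set, in which case it goes to the 4-page loop. The model ignores the #if guard on USE_MULTIARCH / IS_IN (libc).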