From patchwork Sat Apr 3 08:12:15 2021
To: libc-alpha@sourceware.org
Subject: [PATCH v8 1/2] x86: Update large memcpy case in memmove-vec-unaligned-erms.S
Date: Sat, 3 Apr 2021 04:12:15 -0400
Message-Id: <20210403081215.2309505-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

No bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the effects of false memory aliasing
when destination and source have a close 4k alignment and 2) is, in
most cases and for most DRAM units, a modestly more efficient access
pattern.
These changes are a clear performance improvement for VEC_SIZE = 16/32,
though more ambiguous for VEC_SIZE = 64. test-memcpy, test-memccpy,
test-mempcpy, test-memmove, and tst-memmove-overflow all pass.

Signed-off-by: Noah Goldstein

---
The issue was alignment related AFAICT. I added `.p2align 4` in front
of the loops and no longer see any meaningful regression. I also added
back the temporal stores for the tail; I saw a regression in these
tests without them.

Two tables below give the Skylake and Icelake numbers for the areas
around where you saw the regression. Below that is all the data from
the tests. N = 10.

Skylake:
Len ,align1 ,align2 ,new mean ,old mean
4103 ,0 ,64 ,84.5 ,88.6
4111 ,0 ,3 ,99.0 ,99.9
4127 ,3 ,0 ,102.1 ,102.3
4159 ,3 ,7 ,88.7 ,90.9
4223 ,9 ,5 ,88.1 ,87.4
8199 ,0 ,64 ,146.7 ,150.2
8207 ,0 ,3 ,167.9 ,168.5
8223 ,3 ,0 ,168.5 ,168.1
8255 ,3 ,7 ,157.0 ,159.2
8319 ,9 ,5 ,155.5 ,155.7
16391 ,0 ,64 ,286.2 ,288.8
16399 ,0 ,3 ,307.0 ,308.7
16415 ,3 ,0 ,307.4 ,307.6
16447 ,3 ,7 ,294.6 ,295.5
16511 ,9 ,5 ,291.5 ,462.1
32775 ,0 ,64 ,603.4 ,601.5
32783 ,0 ,3 ,604.8 ,606.4
32799 ,3 ,0 ,603.0 ,604.1
32831 ,3 ,7 ,600.2 ,737.3
32895 ,9 ,5 ,604.4 ,599.5
65543 ,0 ,64 ,1873.5 ,1854.3
65551 ,0 ,3 ,1862.9 ,1846.6
65567 ,3 ,0 ,1885.5 ,1966.0
65599 ,3 ,7 ,1833.2 ,1833.1
65663 ,9 ,5 ,1884.9 ,1887.4
131079 ,0 ,64 ,3944.3 ,3949.4
131087 ,0 ,3 ,3927.3 ,3913.3
131103 ,3 ,0 ,4415.8 ,4169.4
131135 ,3 ,7 ,4224.5 ,4157.6
131199 ,9 ,5 ,5974.0 ,4983.8
262151 ,0 ,64 ,11050.2 ,10620.6
262159 ,0 ,3 ,9932.8 ,10037.3
262175 ,3 ,0 ,10188.8 ,9206.6
262207 ,3 ,7 ,9633.3 ,9216.7
262271 ,9 ,5 ,9732.7 ,9345.3
524295 ,0 ,64 ,24823.9 ,24880.7
524303 ,0 ,3 ,24514.0 ,24556.7
524319 ,3 ,0 ,23974.4 ,24219.9
524351 ,3 ,7 ,24159.7 ,24207.0
524415 ,9 ,5 ,23946.5 ,24142.8

Icelake:
Len ,align1 ,align2 ,new mean ,old mean
4103 ,0 ,64 ,50.2 ,63.7
4111 ,0 ,3 ,63.7 ,65.1
4127 ,3 ,0 ,68.2 ,69.4
4159 ,3 ,7 ,59.6 ,68.0
4223 ,9 ,5 ,68.2 ,66.8
8199 ,0 ,64 ,92.1 ,89.9
8207 ,0 ,3 ,119.7 ,118.3
8223 ,3 ,0 ,119.1 ,120.9
8255 ,3 ,7 ,122.9 ,123.7
8319 ,9 ,5 ,122.1 ,121.8
16391 ,0 ,64 ,162.7 ,158.0
16399 ,0 ,3 ,227.6 ,234.1
16415 ,3 ,0 ,230.8 ,232.7
16447 ,3 ,7 ,226.8 ,232.6
16511 ,9 ,5 ,233.4 ,233.8
32775 ,0 ,64 ,312.2 ,301.8
32783 ,0 ,3 ,449.7 ,450.0
32799 ,3 ,0 ,452.7 ,455.9
32831 ,3 ,7 ,449.8 ,458.0
32895 ,9 ,5 ,456.3 ,459.4
65543 ,0 ,64 ,1460.6 ,1463.9
65551 ,0 ,3 ,1462.0 ,1465.4
65567 ,3 ,0 ,1466.6 ,1480.4
65599 ,3 ,7 ,1488.0 ,1488.9
65663 ,9 ,5 ,1680.8 ,1499.5
131079 ,0 ,64 ,2988.5 ,3010.1
131087 ,0 ,3 ,2995.5 ,2996.4
131103 ,3 ,0 ,3006.2 ,3000.5
131135 ,3 ,7 ,3032.4 ,3073.7
131199 ,9 ,5 ,3010.4 ,3027.4
262151 ,0 ,64 ,6143.2 ,6079.1
262159 ,0 ,3 ,6085.1 ,6075.8
262175 ,3 ,0 ,6088.0 ,6064.9
262207 ,3 ,7 ,6018.7 ,6023.5
262271 ,9 ,5 ,6019.8 ,5959.2
524295 ,0 ,64 ,14464.2 ,14095.1
524303 ,0 ,3 ,14761.6 ,14050.2
524319 ,3 ,0 ,14534.1 ,14087.5
524351 ,3 ,7 ,14147.7 ,13903.8
524415 ,9 ,5 ,14157.0 ,13982.9

cpu ,version ,Len ,align1 ,align2 ,new mean ,old mean
skylake ,avx ,4103 ,0 ,64 ,84.5 ,88.6
skylake ,avx ,4111 ,0 ,3 ,99.0 ,99.9
skylake ,avx ,4127 ,3 ,0 ,102.1 ,102.3
skylake ,avx ,4159 ,3 ,7 ,88.7 ,90.9
skylake ,avx ,4223 ,9 ,5 ,88.1 ,87.4
skylake ,avx ,8199 ,0 ,64 ,146.7 ,150.2
skylake ,avx ,8207 ,0 ,3 ,167.9 ,168.5
skylake ,avx ,8223 ,3 ,0 ,168.5 ,168.1
skylake ,avx ,8255 ,3 ,7 ,157.0 ,159.2
skylake ,avx ,8319 ,9 ,5 ,155.5 ,155.7
skylake ,avx ,16391 ,0 ,64 ,286.2 ,288.8
skylake ,avx ,16399 ,0 ,3 ,307.0 ,308.7
skylake ,avx ,16415 ,3 ,0 ,307.4 ,307.6
skylake ,avx ,16447 ,3 ,7 ,294.6 ,295.5
skylake ,avx ,16511 ,9 ,5 ,291.5 ,462.1
skylake ,avx ,32775 ,0 ,64 ,603.4 ,601.5
skylake ,avx ,32783 ,0 ,3 ,604.8 ,606.4
skylake ,avx ,32799 ,3 ,0 ,603.0 ,604.1
skylake ,avx ,32831 ,3 ,7 ,600.2 ,737.3
skylake ,avx ,32895 ,9 ,5 ,604.4 ,599.5
skylake ,avx ,65543 ,0 ,64 ,1873.5 ,1854.3
skylake ,avx ,65551 ,0 ,3 ,1862.9 ,1846.6
skylake ,avx ,65567 ,3 ,0 ,1885.5 ,1966.0
skylake ,avx ,65599 ,3 ,7 ,1833.2 ,1833.1
skylake ,avx ,65663 ,9 ,5 ,1884.9 ,1887.4
skylake ,avx ,131079 ,0 ,64 ,3944.3 ,3949.4
skylake ,avx ,131087 ,0 ,3 ,3927.3 ,3913.3
skylake ,avx ,131103 ,3 ,0 ,4415.8 ,4169.4
skylake ,avx ,131135 ,3 ,7 ,4224.5 ,4157.6
skylake ,avx ,131199 ,9 ,5 ,5974.0 ,4983.8
skylake ,avx ,262151 ,0 ,64 ,11050.2 ,10620.6
skylake ,avx ,262159 ,0 ,3 ,9932.8 ,10037.3
skylake ,avx ,262175 ,3 ,0 ,10188.8 ,9206.6
skylake ,avx ,262207 ,3 ,7 ,9633.3 ,9216.7
skylake ,avx ,262271 ,9 ,5 ,9732.7 ,9345.3
skylake ,avx ,524295 ,0 ,64 ,24823.9 ,24880.7
skylake ,avx ,524303 ,0 ,3 ,24514.0 ,24556.7
skylake ,avx ,524319 ,3 ,0 ,23974.4 ,24219.9
skylake ,avx ,524351 ,3 ,7 ,24159.7 ,24207.0
skylake ,avx ,524415 ,9 ,5 ,23946.5 ,24142.8
skylake ,avx ,1048583 ,0 ,64 ,49163.9 ,49454.6
skylake ,avx ,1048591 ,0 ,3 ,49879.3 ,49400.8
skylake ,avx ,1048607 ,3 ,0 ,49738.0 ,48864.6
skylake ,avx ,1048639 ,3 ,7 ,48804.0 ,47588.5
skylake ,avx ,1048703 ,9 ,5 ,49629.4 ,49796.3
skylake ,avx ,2097159 ,0 ,64 ,98271.7 ,96330.6
skylake ,avx ,2097167 ,0 ,3 ,97801.8 ,98638.1
skylake ,avx ,2097183 ,3 ,0 ,98041.1 ,99287.6
skylake ,avx ,2097215 ,3 ,7 ,96629.5 ,96521.9
skylake ,avx ,2097279 ,9 ,5 ,98961.8 ,98909.8
skylake ,avx ,4194311 ,0 ,64 ,194667.7 ,195377.1
skylake ,avx ,4194319 ,0 ,3 ,194919.5 ,198576.2
skylake ,avx ,4194335 ,3 ,0 ,192949.8 ,194584.7
skylake ,avx ,4194367 ,3 ,7 ,189943.5 ,189177.9
skylake ,avx ,4194431 ,9 ,5 ,192479.1 ,196494.2
skylake ,avx ,8388615 ,0 ,64 ,588671.6 ,587215.4
skylake ,avx ,8388623 ,0 ,3 ,581640.7 ,582812.5
skylake ,avx ,8388639 ,3 ,0 ,549811.9 ,544697.6
skylake ,avx ,8388671 ,3 ,7 ,591155.0 ,577951.8
skylake ,avx ,8388735 ,9 ,5 ,547583.2 ,545133.3
skylake ,avx ,16777223 ,0 ,64 ,1787503.0 ,1811146.0
skylake ,avx ,16777231 ,0 ,3 ,1758671.0 ,1756343.0
skylake ,avx ,16777247 ,3 ,0 ,1691781.0 ,1694661.0
skylake ,avx ,16777279 ,3 ,7 ,1768150.0 ,1754785.0
skylake ,avx ,16777343 ,9 ,5 ,1695179.0 ,1710794.0
skylake ,sse2 ,4103 ,0 ,64 ,150.8 ,150.5
skylake ,sse2 ,4111 ,0 ,3 ,156.8 ,158.4
skylake ,sse2 ,4127 ,3 ,0 ,99.7 ,99.4
skylake ,sse2 ,4159 ,3 ,7 ,154.8 ,154.5
skylake ,sse2 ,4223 ,9 ,5 ,137.3 ,137.2
skylake ,sse2 ,8199 ,0 ,64 ,284.8 ,285.5
skylake ,sse2 ,8207 ,0 ,3 ,296.0 ,296.1
skylake ,sse2 ,8223 ,3 ,0 ,168.0 ,168.2
skylake ,sse2 ,8255 ,3 ,7 ,293.0 ,292.4
skylake ,sse2 ,8319 ,9 ,5 ,251.3 ,250.7
skylake ,sse2 ,16391 ,0 ,64 ,561.3 ,608.3
skylake ,sse2 ,16399 ,0 ,3 ,571.0 ,574.8
skylake ,sse2 ,16415 ,3 ,0 ,305.4 ,305.0
skylake ,sse2 ,16447 ,3 ,7 ,563.2 ,565.0
skylake ,sse2 ,16511 ,9 ,5 ,477.1 ,475.1
skylake ,sse2 ,32775 ,0 ,64 ,1128.2 ,1131.7
skylake ,sse2 ,32783 ,0 ,3 ,1126.6 ,1131.0
skylake ,sse2 ,32799 ,3 ,0 ,587.6 ,590.8
skylake ,sse2 ,32831 ,3 ,7 ,1130.6 ,1126.2
skylake ,sse2 ,32895 ,9 ,5 ,957.6 ,953.0
skylake ,sse2 ,65543 ,0 ,64 ,2718.9 ,2704.2
skylake ,sse2 ,65551 ,0 ,3 ,2724.1 ,2725.0
skylake ,sse2 ,65567 ,3 ,0 ,1888.4 ,1914.3
skylake ,sse2 ,65599 ,3 ,7 ,2787.6 ,2748.7
skylake ,sse2 ,65663 ,9 ,5 ,2400.5 ,2369.4
skylake ,sse2 ,131079 ,0 ,64 ,5603.3 ,5654.9
skylake ,sse2 ,131087 ,0 ,3 ,5939.3 ,5871.4
skylake ,sse2 ,131103 ,3 ,0 ,4272.4 ,4190.0
skylake ,sse2 ,131135 ,3 ,7 ,7601.4 ,7524.6
skylake ,sse2 ,131199 ,9 ,5 ,7022.1 ,6864.7
skylake ,sse2 ,262151 ,0 ,64 ,13736.2 ,14030.0
skylake ,sse2 ,262159 ,0 ,3 ,12407.3 ,12334.1
skylake ,sse2 ,262175 ,3 ,0 ,9661.1 ,9249.4
skylake ,sse2 ,262207 ,3 ,7 ,12850.2 ,12351.6
skylake ,sse2 ,262271 ,9 ,5 ,10792.6 ,10435.8
skylake ,sse2 ,524295 ,0 ,64 ,27754.5 ,28177.7
skylake ,sse2 ,524303 ,0 ,3 ,27766.2 ,28152.0
skylake ,sse2 ,524319 ,3 ,0 ,24030.9 ,24438.3
skylake ,sse2 ,524351 ,3 ,7 ,27787.5 ,27933.0
skylake ,sse2 ,524415 ,9 ,5 ,24263.2 ,25249.1
skylake ,sse2 ,1048583 ,0 ,64 ,56199.9 ,56039.8
skylake ,sse2 ,1048591 ,0 ,3 ,56750.2 ,58889.7
skylake ,sse2 ,1048607 ,3 ,0 ,56394.0 ,55115.3
skylake ,sse2 ,1048639 ,3 ,7 ,57233.1 ,57473.8
skylake ,sse2 ,1048703 ,9 ,5 ,56324.3 ,55917.9
skylake ,sse2 ,2097159 ,0 ,64 ,113234.8 ,114346.4
skylake ,sse2 ,2097167 ,0 ,3 ,114373.1 ,115522.5
skylake ,sse2 ,2097183 ,3 ,0 ,108113.3 ,108513.3
skylake ,sse2 ,2097215 ,3 ,7 ,116863.6 ,116549.9
skylake ,sse2 ,2097279 ,9 ,5 ,108945.1 ,108843.7
skylake ,sse2 ,4194311 ,0 ,64 ,230250.1 ,232350.0
skylake ,sse2 ,4194319 ,0 ,3 ,231895.3 ,235055.6
skylake ,sse2 ,4194335 ,3 ,0 ,218442.8 ,219199.8
skylake ,sse2 ,4194367 ,3 ,7 ,242564.2 ,235587.7
skylake ,sse2 ,4194431 ,9 ,5 ,224167.4 ,215261.8
skylake ,sse2 ,8388615 ,0 ,64 ,679801.8 ,674832.0
skylake ,sse2 ,8388623 ,0 ,3 ,684913.2 ,685238.7
skylake ,sse2 ,8388639 ,3 ,0 ,644865.4 ,631388.6
skylake ,sse2 ,8388671 ,3 ,7 ,698700.9 ,689316.1
skylake ,sse2 ,8388735 ,9 ,5 ,644820.2 ,631366.8
skylake ,sse2 ,16777223 ,0 ,64 ,1877984.0 ,1876437.0
skylake ,sse2 ,16777231 ,0 ,3 ,1898086.0 ,1913053.0
skylake ,sse2 ,16777247 ,3 ,0 ,1857018.0 ,1866949.0
skylake ,sse2 ,16777279 ,3 ,7 ,1914905.0 ,1897134.0
skylake ,sse2 ,16777343 ,9 ,5 ,1859937.0 ,1881939.0
icelake ,avx512 ,4103 ,0 ,64 ,75.2 ,75.8
icelake ,avx512 ,4111 ,0 ,3 ,56.9 ,56.4
icelake ,avx512 ,4127 ,3 ,0 ,59.1 ,59.6
icelake ,avx512 ,4159 ,3 ,7 ,50.7 ,51.3
icelake ,avx512 ,4223 ,9 ,5 ,59.2 ,58.9
icelake ,avx512 ,8199 ,0 ,64 ,67.8 ,63.9
icelake ,avx512 ,8207 ,0 ,3 ,89.0 ,89.9
icelake ,avx512 ,8223 ,3 ,0 ,90.2 ,90.1
icelake ,avx512 ,8255 ,3 ,7 ,82.6 ,84.9
icelake ,avx512 ,8319 ,9 ,5 ,91.5 ,92.8
icelake ,avx512 ,16391 ,0 ,64 ,118.0 ,117.6
icelake ,avx512 ,16399 ,0 ,3 ,156.5 ,157.0
icelake ,avx512 ,16415 ,3 ,0 ,157.4 ,157.3
icelake ,avx512 ,16447 ,3 ,7 ,151.0 ,151.6
icelake ,avx512 ,16511 ,9 ,5 ,159.1 ,159.6
icelake ,avx512 ,32775 ,0 ,64 ,231.8 ,230.8
icelake ,avx512 ,32783 ,0 ,3 ,297.8 ,299.3
icelake ,avx512 ,32799 ,3 ,0 ,299.1 ,299.0
icelake ,avx512 ,32831 ,3 ,7 ,293.5 ,295.4
icelake ,avx512 ,32895 ,9 ,5 ,300.3 ,302.5
icelake ,avx512 ,65543 ,0 ,64 ,1473.4 ,1479.2
icelake ,avx512 ,65551 ,0 ,3 ,1438.2 ,1445.3
icelake ,avx512 ,65567 ,3 ,0 ,1450.3 ,1463.8
icelake ,avx512 ,65599 ,3 ,7 ,1469.0 ,1473.8
icelake ,avx512 ,65663 ,9 ,5 ,1480.0 ,1483.5
icelake ,avx512 ,131079 ,0 ,64 ,3015.1 ,3037.5
icelake ,avx512 ,131087 ,0 ,3 ,2952.3 ,2960.4
icelake ,avx512 ,131103 ,3 ,0 ,2966.2 ,2964.4
icelake ,avx512 ,131135 ,3 ,7 ,2961.6 ,3047.9
icelake ,avx512 ,131199 ,9 ,5 ,2967.4 ,3183.8
icelake ,avx512 ,262151 ,0 ,64 ,6206.0 ,6141.5
icelake ,avx512 ,262159 ,0 ,3 ,5990.8 ,5959.2
icelake ,avx512 ,262175 ,3 ,0 ,5976.7 ,5963.8
icelake ,avx512 ,262207 ,3 ,7 ,5939.5 ,5924.3
icelake ,avx512 ,262271 ,9 ,5 ,5944.6 ,5990.3
icelake ,avx512 ,524295 ,0 ,64 ,14726.7 ,14307.0
icelake ,avx512 ,524303 ,0 ,3 ,14344.2 ,14040.5
icelake ,avx512 ,524319 ,3 ,0 ,14175.0 ,13862.2
icelake ,avx512 ,524351 ,3 ,7 ,14261.4 ,13821.5
icelake ,avx512 ,524415 ,9 ,5 ,14266.5 ,14064.7
icelake ,avx512 ,1048583 ,0 ,64 ,35211.4 ,35414.6
icelake ,avx512 ,1048591 ,0 ,3 ,35156.8 ,35591.2
icelake ,avx512 ,1048607 ,3 ,0 ,35273.1 ,35503.3
icelake ,avx512 ,1048639 ,3 ,7 ,35255.8 ,35725.0
icelake ,avx512 ,1048703 ,9 ,5 ,35703.6 ,36289.9
icelake ,avx512 ,2097159 ,0 ,64 ,72613.9 ,72063.2
icelake ,avx512 ,2097167 ,0 ,3 ,72301.6 ,73504.2
icelake ,avx512 ,2097183 ,3 ,0 ,73448.8 ,72133.6
icelake ,avx512 ,2097215 ,3 ,7 ,73762.9 ,72825.8
icelake ,avx512 ,2097279 ,9 ,5 ,72097.3 ,72914.6
icelake ,avx512 ,4194311 ,0 ,64 ,144793.4 ,144182.1
icelake ,avx512 ,4194319 ,0 ,3 ,143710.3 ,145063.3
icelake ,avx512 ,4194335 ,3 ,0 ,146722.1 ,144046.4
icelake ,avx512 ,4194367 ,3 ,7 ,144267.0 ,144874.6
icelake ,avx512 ,4194431 ,9 ,5 ,143808.2 ,144560.0
icelake ,avx512 ,8388615 ,0 ,64 ,427993.4 ,424521.5
icelake ,avx512 ,8388623 ,0 ,3 ,470267.1 ,473290.8
icelake ,avx512 ,8388639 ,3 ,0 ,457179.7 ,461797.7
icelake ,avx512 ,8388671 ,3 ,7 ,472507.9 ,481561.4
icelake ,avx512 ,8388735 ,9 ,5 ,463611.9 ,467388.7
icelake ,avx512 ,16777223 ,0 ,64 ,1490426.0 ,1526996.0
icelake ,avx512 ,16777231 ,0 ,3 ,1516687.0 ,1517095.0
icelake ,avx512 ,16777247 ,3 ,0 ,1497688.0 ,1512766.0
icelake ,avx512 ,16777279 ,3 ,7 ,1512331.0 ,1524317.0
icelake ,avx512 ,16777343 ,9 ,5 ,1498908.0 ,1500526.0
icelake ,avx ,4103 ,0 ,64 ,50.2 ,63.7
icelake ,avx ,4111 ,0 ,3 ,63.7 ,65.1
icelake ,avx ,4127 ,3 ,0 ,68.2 ,69.4
icelake ,avx ,4159 ,3 ,7 ,59.6 ,68.0
icelake ,avx ,4223 ,9 ,5 ,68.2 ,66.8
icelake ,avx ,8199 ,0 ,64 ,92.1 ,89.9
icelake ,avx ,8207 ,0 ,3 ,119.7 ,118.3
icelake ,avx ,8223 ,3 ,0 ,119.1 ,120.9
icelake ,avx ,8255 ,3 ,7 ,122.9 ,123.7
icelake ,avx ,8319 ,9 ,5 ,122.1 ,121.8
icelake ,avx ,16391 ,0 ,64 ,162.7 ,158.0
icelake ,avx ,16399 ,0 ,3 ,227.6 ,234.1
icelake ,avx ,16415 ,3 ,0 ,230.8 ,232.7
icelake ,avx ,16447 ,3 ,7 ,226.8 ,232.6
icelake ,avx ,16511 ,9 ,5 ,233.4 ,233.8
icelake ,avx ,32775 ,0 ,64 ,312.2 ,301.8
icelake ,avx ,32783 ,0 ,3 ,449.7 ,450.0
icelake ,avx ,32799 ,3 ,0 ,452.7 ,455.9
icelake ,avx ,32831 ,3 ,7 ,449.8 ,458.0
icelake ,avx ,32895 ,9 ,5 ,456.3 ,459.4
icelake ,avx ,65543 ,0 ,64 ,1460.6 ,1463.9
icelake ,avx ,65551 ,0 ,3 ,1462.0 ,1465.4
icelake ,avx ,65567 ,3 ,0 ,1466.6 ,1480.4
icelake ,avx ,65599 ,3 ,7 ,1488.0 ,1488.9
icelake ,avx ,65663 ,9 ,5 ,1680.8 ,1499.5
icelake ,avx ,131079 ,0 ,64 ,2988.5 ,3010.1
icelake ,avx ,131087 ,0 ,3 ,2995.5 ,2996.4
icelake ,avx ,131103 ,3 ,0 ,3006.2 ,3000.5
icelake ,avx ,131135 ,3 ,7 ,3032.4 ,3073.7
icelake ,avx ,131199 ,9 ,5 ,3010.4 ,3027.4
icelake ,avx ,262151 ,0 ,64 ,6143.2 ,6079.1
icelake ,avx ,262159 ,0 ,3 ,6085.1 ,6075.8
icelake ,avx ,262175 ,3 ,0 ,6088.0 ,6064.9
icelake ,avx ,262207 ,3 ,7 ,6018.7 ,6023.5
icelake ,avx ,262271 ,9 ,5 ,6019.8 ,5959.2
icelake ,avx ,524295 ,0 ,64 ,14464.2 ,14095.1
icelake ,avx ,524303 ,0 ,3 ,14761.6 ,14050.2
icelake ,avx ,524319 ,3 ,0 ,14534.1 ,14087.5
icelake ,avx ,524351 ,3 ,7 ,14147.7 ,13903.8
icelake ,avx ,524415 ,9 ,5 ,14157.0 ,13982.9
icelake ,avx ,1048583 ,0 ,64 ,36599.0 ,37461.4
icelake ,avx ,1048591 ,0 ,3 ,36717.8 ,37454.9
icelake ,avx ,1048607 ,3 ,0 ,36821.2 ,37343.3
icelake ,avx ,1048639 ,3 ,7 ,36958.0 ,37507.2
icelake ,avx ,1048703 ,9 ,5 ,36869.2 ,37413.1
icelake ,avx ,2097159 ,0 ,64 ,74765.8 ,75330.9
icelake ,avx ,2097167 ,0 ,3 ,75175.4 ,74891.9
icelake ,avx ,2097183 ,3 ,0 ,75451.4 ,74787.7
icelake ,avx ,2097215 ,3 ,7 ,75394.8 ,75839.1
icelake ,avx ,2097279 ,9 ,5 ,75099.2 ,75421.2
icelake ,avx ,4194311 ,0 ,64 ,146809.6 ,146619.4
icelake ,avx ,4194319 ,0 ,3 ,148866.4 ,149898.2
icelake ,avx ,4194335 ,3 ,0 ,148719.7 ,150165.4
icelake ,avx ,4194367 ,3 ,7 ,150600.1 ,150925.9
icelake ,avx ,4194431 ,9 ,5 ,149457.3 ,150519.2
icelake ,avx ,8388615 ,0 ,64 ,412709.8 ,423666.1
icelake ,avx ,8388623 ,0 ,3 ,423717.4 ,424418.2
icelake ,avx ,8388639 ,3 ,0 ,414387.5 ,413445.6
icelake ,avx ,8388671 ,3 ,7 ,449010.7 ,417553.5
icelake ,avx ,8388735 ,9 ,5 ,414128.6 ,411815.3
icelake ,avx ,16777223 ,0 ,64 ,1490032.0 ,1510004.0
icelake ,avx ,16777231 ,0 ,3 ,1379638.0 ,1422097.0
icelake ,avx ,16777247 ,3 ,0 ,1418930.0 ,1367557.0
icelake ,avx ,16777279 ,3 ,7 ,1515152.0 ,1500176.0
icelake ,avx ,16777343 ,9 ,5 ,1344117.0 ,1411795.0
icelake ,sse2 ,4103 ,0 ,64 ,113.2 ,114.6
icelake ,sse2 ,4111 ,0 ,3 ,121.5 ,120.4
icelake ,sse2 ,4127 ,3 ,0 ,1700.5 ,1771.5
icelake ,sse2 ,4159 ,3 ,7 ,119.3 ,118.8
icelake ,sse2 ,4223 ,9 ,5 ,1739.7 ,1735.2
icelake ,sse2 ,8199 ,0 ,64 ,207.0 ,203.9
icelake ,sse2 ,8207 ,0 ,3 ,225.5 ,220.8
icelake ,sse2 ,8223 ,3 ,0 ,3444.3 ,3743.5
icelake ,sse2 ,8255 ,3 ,7 ,219.9 ,216.8
icelake ,sse2 ,8319 ,9 ,5 ,4117.1 ,3487.3
icelake ,sse2 ,16391 ,0 ,64 ,397.1 ,394.3
icelake ,sse2 ,16399 ,0 ,3 ,439.6 ,428.6
icelake ,sse2 ,16415 ,3 ,0 ,6997.0 ,7031.2
icelake ,sse2 ,16447 ,3 ,7 ,426.8 ,421.8
icelake ,sse2 ,16511 ,9 ,5 ,7037.6 ,7038.3
icelake ,sse2 ,32775 ,0 ,64 ,790.9 ,779.0
icelake ,sse2 ,32783 ,0 ,3 ,863.1 ,849.6
icelake ,sse2 ,32799 ,3 ,0 ,14043.0 ,14390.9
icelake ,sse2 ,32831 ,3 ,7 ,841.6 ,833.1
icelake ,sse2 ,32895 ,9 ,5 ,14277.6 ,14344.2
icelake ,sse2 ,65543 ,0 ,64 ,1897.0 ,1897.3
icelake ,sse2 ,65551 ,0 ,3 ,1927.1 ,1955.4
icelake ,sse2 ,65567 ,3 ,0 ,28834.7 ,28727.8
icelake ,sse2 ,65599 ,3 ,7 ,1961.4 ,1969.7
icelake ,sse2 ,65663 ,9 ,5 ,28867.6 ,29019.8
icelake ,sse2 ,131079 ,0 ,64 ,3879.3 ,3872.6
icelake ,sse2 ,131087 ,0 ,3 ,3955.3 ,3990.7
icelake ,sse2 ,131103 ,3 ,0 ,58001.8 ,60567.9
icelake ,sse2 ,131135 ,3 ,7 ,3951.5 ,4002.6
icelake ,sse2 ,131199 ,9 ,5 ,57886.7 ,58391.4
icelake ,sse2 ,262151 ,0 ,64 ,7851.4 ,7894.7
icelake ,sse2 ,262159 ,0 ,3 ,7947.5 ,8016.2
icelake ,sse2 ,262175 ,3 ,0 ,115036.2 ,115968.6
icelake ,sse2 ,262207 ,3 ,7 ,7883.9 ,7814.1
icelake ,sse2 ,262271 ,9 ,5 ,113776.4 ,119733.6
icelake ,sse2 ,524295 ,0 ,64 ,17198.1 ,16974.9
icelake ,sse2 ,524303 ,0 ,3 ,17402.2 ,17096.3
icelake ,sse2 ,524319 ,3 ,0 ,223980.4 ,225889.9
icelake ,sse2 ,524351 ,3 ,7 ,17034.9 ,16910.3
icelake ,sse2 ,524415 ,9 ,5 ,224027.7 ,224962.5
icelake ,sse2 ,1048583 ,0 ,64 ,38822.3 ,39178.6
icelake ,sse2 ,1048591 ,0 ,3 ,41686.7 ,40247.4
icelake ,sse2 ,1048607 ,3 ,0 ,38814.8 ,39323.3
icelake ,sse2 ,1048639 ,3 ,7 ,39568.3 ,41325.7
icelake ,sse2 ,1048703 ,9 ,5 ,39354.2 ,39637.9
icelake ,sse2 ,2097159 ,0 ,64 ,84074.7 ,84543.1
icelake ,sse2 ,2097167 ,0 ,3 ,83665.7 ,82358.2
icelake ,sse2 ,2097183 ,3 ,0 ,81817.8 ,79638.9
icelake ,sse2 ,2097215 ,3 ,7 ,83649.1 ,83497.6
icelake ,sse2 ,2097279 ,9 ,5 ,80287.6 ,79980.9
icelake ,sse2 ,4194311 ,0 ,64 ,165409.8 ,168343.1
icelake ,sse2 ,4194319 ,0 ,3 ,165216.7 ,177632.0
icelake ,sse2 ,4194335 ,3 ,0 ,158718.7 ,160342.2
icelake ,sse2 ,4194367 ,3 ,7 ,167944.9 ,167204.4
icelake ,sse2 ,4194431 ,9 ,5 ,161530.1 ,164839.7
icelake ,sse2 ,8388615 ,0 ,64 ,626504.3 ,629858.5
icelake ,sse2 ,8388623 ,0 ,3 ,623969.5 ,631509.1
icelake ,sse2 ,8388639 ,3 ,0 ,599366.7 ,600016.0
icelake ,sse2 ,8388671 ,3 ,7 ,619964.2 ,619113.2
icelake ,sse2 ,8388735 ,9 ,5 ,595338.1 ,604172.4
icelake ,sse2 ,16777223 ,0 ,64 ,1709597.0 ,1725184.0
icelake ,sse2 ,16777231 ,0 ,3 ,1725452.0 ,1719746.0
icelake ,sse2 ,16777247 ,3 ,0 ,1614269.0 ,1607164.0
icelake ,sse2 ,16777279 ,3 ,7 ,1705295.0 ,1733018.0
icelake ,sse2 ,16777343 ,9 ,5 ,1604197.0 ,1595690.0

 .../multiarch/memmove-vec-unaligned-erms.S | 338 ++++++++++++++----
 1 file changed, 265 insertions(+), 73 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 897a3d9762..5e4a071f16 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -35,7 +35,16 @@
       __x86_rep_movsb_stop_threshold, then REP MOVSB will be used.
    7. If size >= __x86_shared_non_temporal_threshold and there is no
       overlap between destination and source, use non-temporal store
-      instead of aligned store.  */
+      instead of aligned store copying from either 2 or 4 pages at
+      once.
+   8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold
+      and source and destination do not page alias, copy from 2 pages
+      at once using non-temporal stores. Page aliasing in this case is
+      considered true if destination's page alignment - sources' page
+      alignment is less than 8 * VEC_SIZE.
+   9. If size >= 16 * __x86_shared_non_temporal_threshold or source
+      and destination do page alias copy from 4 pages at once using
+      non-temporal stores.  */

 #include
@@ -67,6 +76,34 @@
 # endif
 #endif

+#ifndef PAGE_SIZE
+# define PAGE_SIZE 4096
+#endif
+
+#if PAGE_SIZE != 4096
+# error Unsupported PAGE_SIZE
+#endif
+
+#ifndef LOG_PAGE_SIZE
+# define LOG_PAGE_SIZE 12
+#endif
+
+#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
+# error Invalid LOG_PAGE_SIZE
+#endif
+
+/* Byte per page for large_memcpy inner loop.  */
+#if VEC_SIZE == 64
+# define LARGE_LOAD_SIZE (VEC_SIZE * 2)
+#else
+# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
+#endif
+
+/* Amount to shift rdx by to compare for memcpy_large_4x.  */
+#ifndef LOG_4X_MEMCPY_THRESH
+# define LOG_4X_MEMCPY_THRESH 4
+#endif
+
 /* Avoid short distance rep movsb only with non-SSE vector.  */
 #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
 # define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
@@ -106,6 +143,28 @@
 # error Unsupported PREFETCH_SIZE!
 #endif

+#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \
+	VMOVU	(offset)base, vec0; \
+	VMOVU	((offset) + VEC_SIZE)base, vec1;
+# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \
+	VMOVNT	vec0, (offset)base; \
+	VMOVNT	vec1, ((offset) + VEC_SIZE)base;
+#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
+	VMOVU	(offset)base, vec0; \
+	VMOVU	((offset) + VEC_SIZE)base, vec1; \
+	VMOVU	((offset) + VEC_SIZE * 2)base, vec2; \
+	VMOVU	((offset) + VEC_SIZE * 3)base, vec3;
+# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
+	VMOVNT	vec0, (offset)base; \
+	VMOVNT	vec1, ((offset) + VEC_SIZE)base; \
+	VMOVNT	vec2, ((offset) + VEC_SIZE * 2)base; \
+	VMOVNT	vec3, ((offset) + VEC_SIZE * 3)base;
+#else
+# error Invalid LARGE_LOAD_SIZE
+#endif
+
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -393,6 +452,15 @@ L(last_4x_vec):
 	VZEROUPPER_RETURN

 L(more_8x_vec):
+	/* Check if non-temporal move candidate.  */
+#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+	/* Check non-temporal store threshold.  */
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	ja	L(large_memcpy_2x)
+#endif
+	/* Entry if rdx is greater than non-temporal threshold but there
+	   is overlap.  */
+L(more_8x_vec_check):
 	cmpq	%rsi, %rdi
 	ja	L(more_8x_vec_backward)
 	/* Source == destination is less common.  */
@@ -419,24 +487,21 @@ L(more_8x_vec):
 	subq	%r8, %rdi
 	/* Adjust length.  */
 	addq	%r8, %rdx
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-	/* Check non-temporal store threshold.  */
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja	L(large_forward)
-#endif
+
+	.p2align 4
 L(loop_4x_vec_forward):
 	/* Copy 4 * VEC a time forward.  */
 	VMOVU	(%rsi), %VEC(0)
 	VMOVU	VEC_SIZE(%rsi), %VEC(1)
 	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(2)
 	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(3)
-	addq	$(VEC_SIZE * 4), %rsi
-	subq	$(VEC_SIZE * 4), %rdx
+	subq	$-(VEC_SIZE * 4), %rsi
+	addq	$-(VEC_SIZE * 4), %rdx
 	VMOVA	%VEC(0), (%rdi)
 	VMOVA	%VEC(1), VEC_SIZE(%rdi)
 	VMOVA	%VEC(2), (VEC_SIZE * 2)(%rdi)
 	VMOVA	%VEC(3), (VEC_SIZE * 3)(%rdi)
-	addq	$(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rdi
 	cmpq	$(VEC_SIZE * 4), %rdx
 	ja	L(loop_4x_vec_forward)
 	/* Store the last 4 * VEC.  */
@@ -470,24 +535,21 @@ L(more_8x_vec_backward):
 	subq	%r8, %r9
 	/* Adjust length.  */
 	subq	%r8, %rdx
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-	/* Check non-temporal store threshold.  */
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja	L(large_backward)
-#endif
+
+	.p2align 4
 L(loop_4x_vec_backward):
 	/* Copy 4 * VEC a time backward.  */
 	VMOVU	(%rcx), %VEC(0)
 	VMOVU	-VEC_SIZE(%rcx), %VEC(1)
 	VMOVU	-(VEC_SIZE * 2)(%rcx), %VEC(2)
 	VMOVU	-(VEC_SIZE * 3)(%rcx), %VEC(3)
-	subq	$(VEC_SIZE * 4), %rcx
-	subq	$(VEC_SIZE * 4), %rdx
+	addq	$-(VEC_SIZE * 4), %rcx
+	addq	$-(VEC_SIZE * 4), %rdx
 	VMOVA	%VEC(0), (%r9)
 	VMOVA	%VEC(1), -VEC_SIZE(%r9)
 	VMOVA	%VEC(2), -(VEC_SIZE * 2)(%r9)
 	VMOVA	%VEC(3), -(VEC_SIZE * 3)(%r9)
-	subq	$(VEC_SIZE * 4), %r9
+	addq	$-(VEC_SIZE * 4), %r9
 	cmpq	$(VEC_SIZE * 4), %rdx
 	ja	L(loop_4x_vec_backward)
 	/* Store the first 4 * VEC.  */
@@ -500,72 +562,202 @@ L(loop_4x_vec_backward):
 	VZEROUPPER_RETURN

 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-L(large_forward):
+	.p2align 4
+L(large_memcpy_2x):
+	/* Compute absolute value of difference between source and
+	   destination.  */
+	movq	%rdi, %r9
+	subq	%rsi, %r9
+	movq	%r9, %r8
+	leaq	-1(%r9), %rcx
+	sarq	$63, %r8
+	xorq	%r8, %r9
+	subq	%r8, %r9
 	/* Don't use non-temporal store if there is overlap between
-	   destination and source since destination may be in cache
-	   when source is loaded.  */
-	leaq	(%rdi, %rdx), %r10
-	cmpq	%r10, %rsi
-	jb	L(loop_4x_vec_forward)
-L(loop_large_forward):
+	   destination and source since destination may be in cache when
+	   source is loaded.  */
+	cmpq	%r9, %rdx
+	ja	L(more_8x_vec_check)
+
+	/* Cache align destination. First store the first 64 bytes then
+	   adjust alignments.  */
+	VMOVU	(%rsi), %VEC(8)
+#if VEC_SIZE < 64
+	VMOVU	VEC_SIZE(%rsi), %VEC(9)
+#if VEC_SIZE < 32
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(10)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(11)
+#endif
+#endif
+	VMOVU	%VEC(8), (%rdi)
+#if VEC_SIZE < 64
+	VMOVU	%VEC(9), VEC_SIZE(%rdi)
+#if VEC_SIZE < 32
+	VMOVU	%VEC(10), (VEC_SIZE * 2)(%rdi)
+	VMOVU	%VEC(11), (VEC_SIZE * 3)(%rdi)
+#endif
+#endif
+	/* Adjust source, destination, and size.  */
+	movq	%rdi, %r8
+	andq	$63, %r8
+	/* Get the negative of offset for alignment.  */
+	subq	$64, %r8
+	/* Adjust source.  */
+	subq	%r8, %rsi
+	/* Adjust destination which should be aligned now.  */
+	subq	%r8, %rdi
+	/* Adjust length.  */
+	addq	%r8, %rdx
+
+	/* Test if source and destination addresses will alias. If they do
+	   the larger pipeline in large_memcpy_4x alleviated the
+	   performance drop.  */
+	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
+	jz	L(large_memcpy_4x)
+
+	movq	%rdx, %r10
+	shrq	$LOG_4X_MEMCPY_THRESH, %r10
+	cmp	__x86_shared_non_temporal_threshold(%rip), %r10
+	jae	L(large_memcpy_4x)
+
+	/* edx will store remainder size for copying tail.  */
+	andl	$(PAGE_SIZE * 2 - 1), %edx
+	/* r10 stores outer loop counter.  */
+	shrq	$((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	/* Copy 4x VEC at a time from 2 pages.  */
+	.p2align 4
+L(loop_large_memcpy_2x_outer):
+	/* ecx stores inner loop counter.  */
+	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_2x_inner):
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
+	/* Load vectors from rsi.  */
+	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	subq	$-LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi.  */
+	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	subq	$-LARGE_LOAD_SIZE, %rdi
+	decl	%ecx
+	jnz	L(loop_large_memcpy_2x_inner)
+	addq	$PAGE_SIZE, %rdi
+	addq	$PAGE_SIZE, %rsi
+	decq	%r10
+	jne	L(loop_large_memcpy_2x_outer)
+	sfence
+
+	/* Check if only last 4 loads are needed.  */
+	cmpl	$(VEC_SIZE * 4), %edx
+	jbe	L(large_memcpy_2x_end)
+
+	/* Handle the last 2 * PAGE_SIZE bytes.  */
+L(loop_large_memcpy_2x_tail):
 	/* Copy 4 * VEC a time forward with non-temporal stores.  */
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
 	VMOVU	(%rsi), %VEC(0)
 	VMOVU	VEC_SIZE(%rsi), %VEC(1)
 	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(2)
 	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(3)
-	addq	$PREFETCHED_LOAD_SIZE, %rsi
-	subq	$PREFETCHED_LOAD_SIZE, %rdx
-	VMOVNT	%VEC(0), (%rdi)
-	VMOVNT	%VEC(1), VEC_SIZE(%rdi)
-	VMOVNT	%VEC(2), (VEC_SIZE * 2)(%rdi)
-	VMOVNT	%VEC(3), (VEC_SIZE * 3)(%rdi)
-	addq	$PREFETCHED_LOAD_SIZE, %rdi
-	cmpq	$PREFETCHED_LOAD_SIZE, %rdx
-	ja	L(loop_large_forward)
-	sfence
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$-(VEC_SIZE * 4), %edx
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(1), VEC_SIZE(%rdi)
+	VMOVA	%VEC(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(3), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpl	$(VEC_SIZE * 4), %edx
+	ja	L(loop_large_memcpy_2x_tail)
+
+L(large_memcpy_2x_end):
 	/* Store the last 4 * VEC.  */
-	VMOVU	%VEC(5), (%rcx)
-	VMOVU	%VEC(6), -VEC_SIZE(%rcx)
-	VMOVU	%VEC(7), -(VEC_SIZE * 2)(%rcx)
-	VMOVU	%VEC(8), -(VEC_SIZE * 3)(%rcx)
-	/* Store the first VEC.  */
-	VMOVU	%VEC(4), (%r11)
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
+	VMOVU	-VEC_SIZE(%rsi, %rdx), %VEC(3)
+
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU	%VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU	%VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU	%VEC(3), -VEC_SIZE(%rdi, %rdx)
 	VZEROUPPER_RETURN

-L(large_backward):
-	/* Don't use non-temporal store if there is overlap between
-	   destination and source since destination may be in cache
-	   when source is loaded.  */
-	leaq	(%rcx, %rdx), %r10
-	cmpq	%r10, %r9
-	jb	L(loop_4x_vec_backward)
-L(loop_large_backward):
-	/* Copy 4 * VEC a time backward with non-temporal stores.  */
-	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3)
-	VMOVU	(%rcx), %VEC(0)
-	VMOVU	-VEC_SIZE(%rcx), %VEC(1)
-	VMOVU	-(VEC_SIZE * 2)(%rcx), %VEC(2)
-	VMOVU	-(VEC_SIZE * 3)(%rcx), %VEC(3)
-	subq	$PREFETCHED_LOAD_SIZE, %rcx
-	subq	$PREFETCHED_LOAD_SIZE, %rdx
-	VMOVNT	%VEC(0), (%r9)
-	VMOVNT	%VEC(1), -VEC_SIZE(%r9)
-	VMOVNT	%VEC(2), -(VEC_SIZE * 2)(%r9)
-	VMOVNT	%VEC(3), -(VEC_SIZE * 3)(%r9)
-	subq	$PREFETCHED_LOAD_SIZE, %r9
-	cmpq	$PREFETCHED_LOAD_SIZE, %rdx
-	ja	L(loop_large_backward)
+	.p2align 4
+L(large_memcpy_4x):
+	movq	%rdx, %r10
+	/* edx will store remainder size for copying tail.  */
+	andl	$(PAGE_SIZE * 4 - 1), %edx
+	/* r10 stores outer loop counter.  */
+	shrq	$(LOG_PAGE_SIZE + 2), %r10
+	/* Copy 4x VEC at a time from 4 pages.  */
+	.p2align 4
+L(loop_large_memcpy_4x_outer):
+	/* ecx stores inner loop counter.  */
+	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_4x_inner):
+	/* Only one prefetch set per page as doing 4 pages give more time
+	   for prefetcher to keep up.  */
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
+	/* Load vectors from rsi.  */
+	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
+	subq	$-LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi.  */
+	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
+	STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
+	subq	$-LARGE_LOAD_SIZE, %rdi
+	decl	%ecx
+	jnz	L(loop_large_memcpy_4x_inner)
+	addq	$(PAGE_SIZE * 3), %rdi
+	addq	$(PAGE_SIZE * 3), %rsi
+	decq	%r10
+	jne	L(loop_large_memcpy_4x_outer)
 	sfence
-	/* Store the first 4 * VEC.  */
-	VMOVU	%VEC(4), (%rdi)
-	VMOVU	%VEC(5), VEC_SIZE(%rdi)
-	VMOVU	%VEC(6), (VEC_SIZE * 2)(%rdi)
-	VMOVU	%VEC(7), (VEC_SIZE * 3)(%rdi)
-	/* Store the last VEC.  */
-	VMOVU	%VEC(8), (%r11)
+	/* Check if only last 4 loads are needed.  */
+	cmpl	$(VEC_SIZE * 4), %edx
+	jbe	L(large_memcpy_4x_end)
+
+	/* Handle the last 4 * PAGE_SIZE bytes.  */
+L(loop_large_memcpy_4x_tail):
+	/* Copy 4 * VEC a time forward with non-temporal stores.  */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
+	VMOVU	(%rsi), %VEC(0)
+	VMOVU	VEC_SIZE(%rsi), %VEC(1)
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(2)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(3)
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$-(VEC_SIZE * 4), %edx
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(1), VEC_SIZE(%rdi)
+	VMOVA	%VEC(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(3), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpl	$(VEC_SIZE * 4), %edx
+	ja	L(loop_large_memcpy_4x_tail)
+
+L(large_memcpy_4x_end):
+	/* Store the last 4 * VEC.  */
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
+	VMOVU	-VEC_SIZE(%rsi, %rdx), %VEC(3)
+
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU	%VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU	%VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU	%VEC(3), -VEC_SIZE(%rdi, %rdx)
 	VZEROUPPER_RETURN
 #endif
 END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))