From patchwork Tue Jan 6 14:29:39 2015
X-Patchwork-Submitter: Ondrej Bilka
X-Patchwork-Id: 4522
Date: Tue, 6 Jan 2015 15:29:39 +0100
From: Ondřej Bílka
To: hjl.tools@gmail.com
Cc: libc-alpha@sourceware.org
Subject: [PATCH][BZ #17801] Fix memcpy regression (five times slower on Bulldozer)
Message-ID: <20150106142939.GB5835@domone>

H.J., a performance regression in the following commit slipped through review:

commit 05f3633da4f9df870d04dd77336e793746e57ed4
Author: Ling Ma
Date:   Mon Jul 14 00:02:52 2014 -0400

    Improve 64bit memcpy performance for Haswell CPU with AVX instruction

I seem to recall mentioning that avx was a typo and should have been avx2, but I did not look into it further. As I assumed it was AVX2-only, I was fine with that and with the Haswell-specific optimizations like using rep movsq.

However, the ifunc checks for AVX, which is bad, as we already know that AVX loads/stores are slow on Sandy Bridge. Testing on the affected architectures would also have revealed that. It is especially bad on AMD Bulldozer, where it is five times slower in the 2kB-16kB range because movsb is slow, see
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/memcpy_profile_avx/results_rand/result.html
On Sandy Bridge it is only a 20% regression in the same range:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_avx/results_rand/result.html

Also, the AVX loop for 128-2024 bytes is slower there, so there is no point in using it.

What about the following change?

	* sysdeps/x86_64/multiarch/memcpy.S: Fix performance regression.

diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 992e40d..27f89e4 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -32,10 +32,13 @@ ENTRY(__new_memcpy)
 	cmpl	$0, KIND_OFFSET+__cpu_features(%rip)
 	jne	1f
 	call	__init_cpu_features
+#ifdef HAVE_AVX2_SUPPORT
 1:	leaq	__memcpy_avx_unaligned(%rip), %rax
-	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
+	testl	$bit_AVX2_Usable, __cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
+	jz	1f
 	ret
+#endif
 1:	leaq	__memcpy_sse2(%rip), %rax
 	testl	$bit_Slow_BSF, __cpu_features+FEATURE_OFFSET+index_Slow_BSF(%rip)
 	jnz	2f
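
For illustration only, and not part of the patch: the same selection policy
can be sketched as a C ifunc resolver built on GCC's __builtin_cpu_supports.
The variant and function names below are invented for the example; glibc's
real dispatch is the assembly resolver patched above, which tests
bit_AVX2_Usable in __cpu_features.

#include <string.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for the real implementations; names are hypothetical.  */
static void *
memcpy_avx_unaligned (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *
memcpy_sse2 (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Resolver: hand out the AVX variant only on CPUs that report AVX2
   (Haswell and later), so Sandy Bridge and Bulldozer keep the SSE2
   path, mirroring what the patch above does in assembly.  */
static memcpy_fn
resolve_memcpy (void)
{
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    return memcpy_avx_unaligned;
  return memcpy_sse2;
}

void *my_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_memcpy")));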