From patchwork Tue Jan 6 14:29:39 2015
X-Patchwork-Submitter: Ondrej Bilka
X-Patchwork-Id: 4522
Date: Tue, 6 Jan 2015 15:29:39 +0100
From: Ondřej Bílka
To: hjl.tools@gmail.com
Cc: libc-alpha@sourceware.org
Subject: [PATCH][BZ #17801] Fix memcpy regression (five times slower on Bulldozer)
Message-ID: <20150106142939.GB5835@domone>

H.J., a performance regression in the following commit slipped through review:

commit 05f3633da4f9df870d04dd77336e793746e57ed4
Author: Ling Ma
Date:   Mon Jul 14 00:02:52 2014 -0400

    Improve 64bit memcpy performance for Haswell CPU with AVX instruction

I seem to recall mentioning that avx was a typo and should have been avx2, but I did not look into it further. As I assumed it was AVX2-only, I was fine with that and with the Haswell-specific optimizations like using rep movsq.

However, the ifunc checks for AVX, which is bad, as we already know that AVX loads/stores are slow on Sandy Bridge. Testing on the affected architectures would also have revealed that. It is especially bad on AMD Bulldozer, where it is five times slower in the 2kB-16kB range because movsb is slow, see
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/memcpy_profile_avx/results_rand/result.html
On Sandy Bridge it is only a 20% regression in the same range:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_avx/results_rand/result.html

Also, the AVX loop for 128-2024 bytes is slower there, so there is no point in using it.

What about the following change?

	* sysdeps/x86_64/multiarch/memcpy.S: Fix performance regression.

diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 992e40d..27f89e4 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -32,10 +32,13 @@ ENTRY(__new_memcpy)
 	cmpl	$0, KIND_OFFSET+__cpu_features(%rip)
 	jne	1f
 	call	__init_cpu_features
+#ifdef HAVE_AVX2_SUPPORT
 1:	leaq	__memcpy_avx_unaligned(%rip), %rax
-	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
+	testl	$bit_AVX2_Usable, __cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
+	jz	1f
 	ret
+#endif
 1:	leaq	__memcpy_sse2(%rip), %rax
 	testl	$bit_Slow_BSF, __cpu_features+FEATURE_OFFSET+index_Slow_BSF(%rip)
 	jnz	2f
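
For illustration only, and not part of the patch: the same selection policy
can be sketched as a C ifunc resolver built on GCC's __builtin_cpu_supports.
The variant and function names below are invented for the example; glibc's
real dispatch is the assembly resolver patched above, which tests
bit_AVX2_Usable in __cpu_features.

#include <string.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for the real implementations; names are hypothetical.  */
static void *
memcpy_avx_unaligned (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *
memcpy_sse2 (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Resolver: hand out the AVX variant only on CPUs that report AVX2
   (Haswell and later), so Sandy Bridge and Bulldozer keep the SSE2
   path, mirroring what the patch above does in assembly.  */
static memcpy_fn
resolve_memcpy (void)
{
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    return memcpy_avx_unaligned;
  return memcpy_sse2;
}

void *my_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_memcpy")));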