[BZ #17801] Fix memcpy regression (five times slower on Bulldozer).
Commit Message
H. J., a performance regression slipped through review in this commit.
commit 05f3633da4f9df870d04dd77336e793746e57ed4
Author: Ling Ma <ling.ml@alibaba-inc.com>
Date: Mon Jul 14 00:02:52 2014 -0400
Improve 64bit memcpy performance for Haswell CPU with AVX
instruction
I seem to recall mentioning that avx looked like a typo for avx2, but I did not look into it further.
As I assumed it was avx2-only, I was OK with that and with the Haswell-specific
optimizations like using rep movsq. However, the ifunc checks for avx, which
is bad, as we already know that avx loads/stores are slow on Sandy
Bridge.
Testing on the affected architectures would also have revealed this, especially on AMD
Bulldozer, where it is five times slower on the 2kb-16kb range because movsb is slow; see
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/memcpy_profile_avx/results_rand/result.html
On Sandy Bridge it is only a 20% regression on the same range:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_avx/results_rand/result.html
The avx loop for 128-2024 bytes is also slower there, so there is no point
in using it.
What about the following change?
* sysdeps/x86_64/multiarch/memcpy.S: Fix performance regression.
Comments
On Tue, Jan 6, 2015 at 6:29 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> [...]
> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
> index 992e40d..27f89e4 100644
> --- a/sysdeps/x86_64/multiarch/memcpy.S
> +++ b/sysdeps/x86_64/multiarch/memcpy.S
> @@ -32,10 +32,13 @@ ENTRY(__new_memcpy)
> cmpl $0, KIND_OFFSET+__cpu_features(%rip)
> jne 1f
> call __init_cpu_features
> +#ifdef HAVE_AVX2_SUPPORT
> 1: leaq __memcpy_avx_unaligned(%rip), %rax
> - testl $bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
> + testl $bit_AVX2_Usable, __cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
> +
> jz 1f
> ret
> +#endif
> 1: leaq __memcpy_sse2(%rip), %rax
> testl $bit_Slow_BSF, __cpu_features+FEATURE_OFFSET+index_Slow_BSF(%rip)
> jnz 2f
Please add a new feature bit, bit_Fast_AVX_Unaligned_Load, and turn it
on together
with bit_AVX2_Usable.
Thanks.
---
H.J.