[2/1,neleai/string-x64] Microoptimize strcmp-sse2-unaligned more.
On Sat, Jun 20, 2015 at 10:35:25AM +0200, Ondřej Bílka wrote:
>
> Hi,
>
> When I read strcmp again to improve strncmp and add an avx2 strcmp,
> I found that I made several mistakes, mainly caused by first optimizing
> the C template and then fixing the assembly.
>
> The first was my idea to simplify the cross-page check by oring
> src and dest. I recall that I first did complex cross-page handling
> where false positives were cheap. Then I found that, due to its size,
> it had big overhead and a simple loop was faster when testing with
> firefox. That turned the original decision into a bad one.
>
> The second is to reorganize the loop instructions so that after the
> loop ends I can find the last byte without recalculating much, using
> the trick that the last 16-bit mask can be ored with the previous
> three, as it is relevant only when the previous three are zero.
>
> The third is that gcc generates bad loops with regard to where pointer
> increments are placed. They should go after the loads that use them,
> not at the start of the loop as gcc does. That change is responsible
> for a 10% improvement for large sizes.
>
> Finally, there are microoptimizations that save a few bytes without
> measurable performance impact, like using eax instead of rax to save a
> byte, or removing zeroing instructions when they are not needed.
>
> Profile data are here; avx2 for haswell, which I will submit next,
> will follow shortly.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile.html
>
> OK to commit this?
>
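[Editor's note: the or-based cross-page check described above can be sketched in C. This is a hypothetical illustration, not the glibc source; the helper name `crosses_page` and the 16-byte load width are assumptions.]

```c
#include <stdint.h>

/* Hypothetical sketch of the or-based cross-page test.  Oring both
   addresses before masking gives a conservative check: it may report a
   false positive, but never misses a real crossing, because
   (a | b) & 4095 is >= both (a & 4095) and (b & 4095). */
static int crosses_page(const char *s1, const char *s2)
{
    uintptr_t combined = ((uintptr_t)s1 | (uintptr_t)s2) & 4095;
    return combined > 4096 - 16; /* a 16-byte load could cross a page */
}
```

The appeal is cost: one or, one and, and one compare replace two separate per-pointer checks, at the price of the occasional false positive the quoted message mentions.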
I missed a few microoptimizations. These save a few bytes, with no
measurable performance impact.
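[Editor's note: the mask trick from the quoted message can be illustrated in C. A hedged sketch — the helper name is invented and `__builtin_ctzll` assumes GCC/Clang: the four 16-bit pmovmskb-style masks from a 64-byte loop body are combined into one 64-bit word, so one count-trailing-zeros locates the first mismatching byte.]

```c
#include <stdint.h>

/* Hypothetical sketch: each 16-byte block of the 64-byte loop body
   yields a 16-bit mismatch mask (as from pmovmskb).  Combining them
   into one 64-bit word lets a single trailing-zero count find the
   first mismatching byte.  The caller guarantees at least one mask is
   nonzero, since the loop only exits when the or of all four is. */
static int first_diff_index(uint16_t m0, uint16_t m1,
                            uint16_t m2, uint16_t m3)
{
    uint64_t mask = (uint64_t)m0
                  | ((uint64_t)m1 << 16)
                  | ((uint64_t)m2 << 32)
                  | ((uint64_t)m3 << 48);
    return __builtin_ctzll(mask); /* offset within the 64-byte block */
}
```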
* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
(__strcmp_sse2_unaligned): Add several microoptimizations.
@@ -76,19 +76,17 @@ L(return):
subl %edx, %eax
ret
-
L(main_loop_header):
leaq 64(%rdi), %rdx
- movl $4096, %ecx
andq $-64, %rdx
subq %rdi, %rdx
leaq (%rdi, %rdx), %rax
addq %rsi, %rdx
- movq %rdx, %rsi
- andl $4095, %esi
- subq %rsi, %rcx
- shrq $6, %rcx
- movq %rcx, %rsi
+ movl $4096, %esi
+ mov %edx, %ecx
+ andl $4095, %ecx
+ sub %ecx, %esi
+ shr $6, %esi
.p2align 4
L(loop):
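[Editor's note: the rewritten counter setup in the hunk above computes how many 64-byte iterations fit before %rdx reaches the next page boundary. A hypothetical C equivalent, not the actual source:]

```c
#include <stdint.h>

/* Hypothetical C equivalent of the new counter setup:
   esi = (4096 - (edx & 4095)) >> 6, i.e. the number of full 64-byte
   loop iterations left before the pointer crosses into the next page. */
static unsigned iters_to_page_end(uintptr_t p)
{
    return (4096u - (unsigned)(p & 4095)) >> 6;
}
```

The rewrite accumulates directly in %esi instead of bouncing the value through %rcx, and the 32-bit forms shave prefix bytes off the encodings.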
@@ -140,10 +138,9 @@ L(back_to_loop):
.p2align 4
L(loop_cross_page):
- xor %ecx, %ecx
- movq %rdx, %r9
- and $63, %r9
- subq %r9, %rcx
+ mov %edx, %ecx
+ and $63, %ecx
+ neg %rcx
movdqa (%rdx, %rcx), %xmm0
movdqa 16(%rdx, %rcx), %xmm1
@@ -178,7 +175,7 @@ L(loop_cross_page):
salq $48, %rsi
orq %rsi, %rdi
movq %r9, %rcx
- movq $63, %rsi
+ mov $63, %esi
shrq %cl, %rdi
test %rdi, %rdi
je L(back_to_loop)
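[Editor's note: the neg-based rewrite in the L(loop_cross_page) hunk computes the offset back to the previous 64-byte boundary with one register and one instruction fewer than the old xor/mov/and/sub sequence. A hypothetical C rendering:]

```c
#include <stdint.h>

/* Hypothetical sketch of the neg trick: rcx = -(rdx & 63), so that
   rdx + rcx is the start of the 64-byte block containing rdx, ready
   for the aligned movdqa loads that follow. */
static intptr_t align_back_offset(uintptr_t addr)
{
    return -(intptr_t)(addr & 63);
}
```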