[neleai/string-x64] Microoptimize strcmp-sse2-unaligned.

Message ID 20150620083525.GA31992@domone
State New, archived

Commit Message

Ondrej Bilka June 20, 2015, 8:35 a.m. UTC
  Hi,

When I reread strcmp in order to improve strncmp and add an avx2 strcmp,
I found that I had made several mistakes, mainly caused by first optimizing
the C template and then fixing up the assembly.

The first was my idea to simplify the cross-page check by ORing src and
dest. I recall that I originally had complex cross-page handling where
false positives were cheap. Then I found that, due to its size, it had a
big overhead and a simple loop was faster when testing with Firefox,
which turned the original decision into a bad one.
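
As a sketch (abridged from the diff below; 4032 is 4096 minus the 64
bytes read ahead), the old combined check and its replacement are:

	/* Old: a single branch, but ORing the page offsets gives false
	   positives, e.g. offsets 0xf00 and 0x0c1 OR to 0xfc1 > 4032
	   although neither pointer is within 64 bytes of a page end.  */
	movl	%edi, %eax
	orl	%esi, %eax
	andl	$4095, %eax
	cmpl	$4032, %eax
	jg	L(cross_page)

	/* New: one extra compare and branch, but the slow path is taken
	   only when a 64-byte read really could cross a page.  */
	movl	%esi, %eax
	andl	$4095, %eax
	cmpl	$4032, %eax
	jg	L(cross_page)
	movl	%edi, %eax
	andl	$4095, %eax
	cmpl	$4032, %eax
	jg	L(cross_page)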

The second is to reorganize the loop instructions so that after the loop
ends I can find the terminating byte without recalculating much, using
the trick that the last 16-bit mask can be ORed with the previous three,
as it is only relevant when the previous three are zero.
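
Concretely, the new exit path looks like this (an annotated copy of what
is in the diff below):

	/* %ecx still holds the mask of the pminub-combined vector that
	   ended the loop.  A spurious bit in it can only come from a
	   terminator in one of the first three blocks, which also sets
	   a lower bit, so bsf is unaffected; and when the first three
	   blocks are clean, %ecx is exactly the mask of the fourth
	   block.  */
	pcmpeqb	%xmm7, %xmm0
	pcmpeqb	%xmm7, %xmm1
	pcmpeqb	%xmm7, %xmm5
	pmovmskb %xmm0, %edi
	pmovmskb %xmm1, %esi
	pmovmskb %xmm5, %r8d
	salq	$48, %rcx	/* combined mask stands in for block 3 */
	salq	$32, %r8
	orq	%r8, %rcx
	orq	%rdi, %rcx
	sal	$16, %esi
	orq	%rsi, %rcx
	bsfq	%rcx, %rcx	/* index of the first NUL or mismatch */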

The third is that gcc generates bad loops with respect to where the
pointer increments are placed. They should go after the loads that use
them, not at the start of the loop as gcc puts them. That change is
responsible for a 10% improvement for large sizes.
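
In the rewritten loop (abridged from the diff below) the increments sit
right after the last loads that need the old pointer values rather than
at the loop head:

L(back_to_loop):
	movdqu	(%rdx), %xmm0
	/* ... remaining loads and pcmpeqb/pminub of the 64-byte block ... */
	movdqa	48(%rax), %xmm3
	pcmpeqb	%xmm2, %xmm5
	pcmpeqb	%xmm3, %xmm6
	addq	$64, %rax	/* bump only after the last load from %rax */
	pminub	%xmm2, %xmm5
	pminub	%xmm3, %xmm6
	addq	$64, %rdx	/* likewise after the last load from %rdx */
	pminub	%xmm5, %xmm6
	pminub	%xmm1, %xmm6
	pminub	%xmm0, %xmm6
	pcmpeqb	%xmm7, %xmm6
	pmovmskb %xmm6, %ecx
	testl	%ecx, %ecx
	je	L(loop)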

Finally, there are microoptimizations that save a few bytes without
measurable performance impact, such as using eax instead of rax to save
a byte, or dropping zeroing instructions where they are not needed.
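
For instance, dropping the REX.W prefix saves one byte per instruction
where the upper 32 bits cannot matter:

	testq	%rax, %rax	/* 48 85 c0 - three bytes */
	test	%eax, %eax	/* 85 c0    - two bytes; same flags here,
				   since pmovmskb only sets the low 16 bits */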

Profile data are here; it will shortly also include the avx2 variant for
Haswell that I will submit next.

http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile.html

OK to commit this?

	* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
	(__strcmp_sse2_unaligned): Add several microoptimizations.
  

Comments

Ondrej Bilka June 20, 2015, 11:16 a.m. UTC | #1
I updated the page above to contain avx2 data; the profiler used is here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile200615.tar.bz2

I also found that there is a possible regression on core2, see:
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strcmp_profile/results_gcc/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strcmp_profile/results_rand/result.html

The problem is that for larger sizes the ssse3 variant is still faster,
but overall it is worse due to its high startup cost. Its big instruction
cache footprint is also a problem: when it is in the icache it becomes
beneficial for strings larger than 128 bytes, but when it is cold the
icache misses push that point to around 400 bytes:

http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strcmp_profile/results_rand_noicache/result.html

So how much do we care about core2? I know several functions that could
be improved there, but I put them in my backlog because they aren't that
important for me to optimize, and it is also nontrivial to do the switch
without harming the smaller sizes that are the hot path, rather than this
cold one.

So how should we proceed?
  

Patch

diff --git a/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
index 20b65fa..03d1b11 100644
--- a/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
@@ -19,10 +19,13 @@ 
 #include "sysdep.h"
 
 ENTRY ( __strcmp_sse2_unaligned)
-	movl	%edi, %eax
-	xorl	%edx, %edx
 	pxor	%xmm7, %xmm7
-	orl	%esi, %eax
+	movl	%esi, %eax
+	andl	$4095, %eax
+	cmpl	$4032, %eax
+	jg	L(cross_page)
+
+	movl	%edi, %eax
 	andl	$4095, %eax
 	cmpl	$4032, %eax
 	jg	L(cross_page)
@@ -30,13 +33,11 @@  ENTRY ( __strcmp_sse2_unaligned)
 	movdqu	(%rsi), %xmm0
 	pcmpeqb	%xmm1, %xmm0
 	pminub	%xmm1, %xmm0
-	pxor	%xmm1, %xmm1
-	pcmpeqb	%xmm1, %xmm0
-	pmovmskb	%xmm0, %eax
-	testq	%rax, %rax
+	pcmpeqb	%xmm7, %xmm0
+	pmovmskb %xmm0, %eax
+	test	%eax, %eax
 	je	L(next_48_bytes)
-L(return):
-	bsfq	%rax, %rdx
+	bsf	%eax, %edx
 	movzbl	(%rdi, %rdx), %eax
 	movzbl	(%rsi, %rdx), %edx
 	subl	%edx, %eax
@@ -50,29 +51,35 @@  L(next_48_bytes):
 	pcmpeqb	%xmm6, %xmm3
 	movdqu	32(%rsi), %xmm2
 	pminub	%xmm6, %xmm3
-	pcmpeqb	%xmm1, %xmm3
+	pcmpeqb	%xmm7, %xmm3
 	movdqu	48(%rdi), %xmm4
 	pcmpeqb	%xmm5, %xmm2
-	pmovmskb	%xmm3, %edx
+	pmovmskb %xmm3, %edx
 	movdqu	48(%rsi), %xmm0
 	pminub	%xmm5, %xmm2
-	pcmpeqb	%xmm1, %xmm2
+	pcmpeqb	%xmm7, %xmm2
 	pcmpeqb	%xmm4, %xmm0
-	pmovmskb	%xmm2, %eax
-	salq	$16, %rdx
+	pmovmskb %xmm2, %eax
+	sal	$16, %edx
 	pminub	%xmm4, %xmm0
-	pcmpeqb	%xmm1, %xmm0
+	pcmpeqb	%xmm7, %xmm0
 	salq	$32, %rax
 	orq	%rdx, %rax
-	pmovmskb	%xmm0, %ecx
-	movq	%rcx, %rdx
-	salq	$48, %rdx
-	orq	%rdx, %rax
-	jne	L(return)
+	pmovmskb %xmm0, %ecx
+	salq	$48, %rcx
+	orq	%rcx, %rax
+	je	L(main_loop_header)
+L(return):
+	bsf	%rax, %rdx
+	movzbl	(%rdi, %rdx), %eax
+	movzbl	(%rsi, %rdx), %edx
+	subl	%edx, %eax
+	ret
+
+
 L(main_loop_header):
 	leaq	64(%rdi), %rdx
 	movl	$4096, %ecx
-	pxor	%xmm9, %xmm9
 	andq	$-64, %rdx
 	subq	%rdi, %rdx
 	leaq	(%rdi, %rdx), %rax
@@ -82,16 +89,11 @@  L(main_loop_header):
 	subq	%rsi, %rcx
 	shrq	$6, %rcx
 	movq	%rcx, %rsi
-	jmp	L(loop_start)
 
 	.p2align 4
 L(loop):
-	addq	$64, %rax
-	addq	$64, %rdx
-L(loop_start):
-	testq	%rsi, %rsi
-	leaq	-1(%rsi), %rsi
-	je	L(loop_cross_page)
+	add	$-1, %rsi
+	ja	L(loop_cross_page)
 L(back_to_loop):
 	movdqu	(%rdx), %xmm0
 	movdqu	16(%rdx), %xmm1
@@ -104,61 +106,57 @@  L(back_to_loop):
 	movdqu	48(%rdx), %xmm6
 	pminub	%xmm3, %xmm1
 	movdqa	32(%rax), %xmm2
-	pminub	%xmm1, %xmm0
 	movdqa	48(%rax), %xmm3
 	pcmpeqb	%xmm2, %xmm5
 	pcmpeqb	%xmm3, %xmm6
+	addq	$64, %rax
 	pminub	%xmm2, %xmm5
 	pminub	%xmm3, %xmm6
-	pminub	%xmm5, %xmm0
-	pminub	%xmm6, %xmm0
-	pcmpeqb	%xmm7, %xmm0
-	pmovmskb	%xmm0, %ecx
+	addq	$64, %rdx
+	pminub	%xmm5, %xmm6
+	pminub	%xmm1, %xmm6
+	pminub	%xmm0, %xmm6
+	pcmpeqb	%xmm7, %xmm6
+	pmovmskb %xmm6, %ecx
 	testl	%ecx, %ecx
 	je	L(loop)
-	pcmpeqb	%xmm7, %xmm5
-	movdqu	(%rdx), %xmm0
-	pcmpeqb	%xmm7, %xmm1
-	movdqa	(%rax), %xmm2
-	pcmpeqb	%xmm2, %xmm0
-	pminub	%xmm2, %xmm0
-	pcmpeqb	%xmm7, %xmm6
 	pcmpeqb	%xmm7, %xmm0
-	pmovmskb	%xmm1, %ecx
-	pmovmskb	%xmm5, %r8d
-	pmovmskb	%xmm0, %edi
-	salq	$16, %rcx
+	pcmpeqb	%xmm7, %xmm1
+	pcmpeqb	%xmm7, %xmm5
+	pmovmskb %xmm0, %edi
+	pmovmskb %xmm1, %esi
+	pmovmskb %xmm5, %r8d
+	salq	$48, %rcx
 	salq	$32, %r8
-	pmovmskb	%xmm6, %esi
 	orq	%r8, %rcx
 	orq	%rdi, %rcx
-	salq	$48, %rsi
+	sal	$16, %esi
 	orq	%rsi, %rcx
 	bsfq	%rcx, %rcx
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
+	movzbl	-64(%rax, %rcx), %eax
+	movzbl	-64(%rdx, %rcx), %edx
 	subl	%edx, %eax
 	ret
 
 	.p2align 4
 L(loop_cross_page):
-	xor	%r10, %r10
+	xor	%ecx, %ecx
 	movq	%rdx, %r9
 	and	$63, %r9
-	subq	%r9, %r10
+	subq	%r9, %rcx
 
-	movdqa	(%rdx, %r10), %xmm0
-	movdqa	16(%rdx, %r10), %xmm1
-	movdqu	(%rax, %r10), %xmm2
-	movdqu	16(%rax, %r10), %xmm3
+	movdqa	(%rdx, %rcx), %xmm0
+	movdqa	16(%rdx, %rcx), %xmm1
+	movdqu	(%rax, %rcx), %xmm2
+	movdqu	16(%rax, %rcx), %xmm3
 	pcmpeqb	%xmm2, %xmm0
-	movdqa	32(%rdx, %r10), %xmm5
+	movdqa	32(%rdx, %rcx), %xmm5
 	pcmpeqb	%xmm3, %xmm1
 	pminub	%xmm2, %xmm0
-	movdqa	48(%rdx, %r10), %xmm6
+	movdqa	48(%rdx, %rcx), %xmm6
 	pminub	%xmm3, %xmm1
-	movdqu	32(%rax, %r10), %xmm2
-	movdqu	48(%rax, %r10), %xmm3
+	movdqu	32(%rax, %rcx), %xmm2
+	movdqu	48(%rax, %rcx), %xmm3
 	pcmpeqb	%xmm2, %xmm5
 	pcmpeqb	%xmm3, %xmm6
 	pminub	%xmm2, %xmm5
@@ -169,12 +167,12 @@  L(loop_cross_page):
 	pcmpeqb	%xmm7, %xmm5
 	pcmpeqb	%xmm7, %xmm6
 
-	pmovmskb	%xmm1, %ecx
-	pmovmskb	%xmm5, %r8d
-	pmovmskb	%xmm0, %edi
-	salq	$16, %rcx
+	pmovmskb %xmm1, %ecx
+	pmovmskb %xmm5, %r8d
+	pmovmskb %xmm0, %edi
+	sal	$16, %ecx
 	salq	$32, %r8
-	pmovmskb	%xmm6, %esi
+	pmovmskb %xmm6, %esi
 	orq	%r8, %rdi
 	orq	%rcx, %rdi
 	salq	$48, %rsi
@@ -190,20 +188,21 @@  L(loop_cross_page):
 	subl	%edx, %eax
 	ret
 
+L(cross_page):
+	xorl	%edx, %edx
+	jmp	L(cross_page_loop_start)
 	.p2align 4
 L(cross_page_loop):
-	cmpb	%cl, %al
-	jne	L(different)
-	addq	$1, %rdx
-	cmpq	$64, %rdx
+	add	$1, %edx
+	cmp	$64, %edx
 	je	L(main_loop_header)
-L(cross_page):
+L(cross_page_loop_start):
 	movzbl	(%rdi, %rdx), %eax
 	movzbl	(%rsi, %rdx), %ecx
-	testb	%al, %al
+	subl	%ecx, %eax
+	jne	L(different)
+	test	%ecx, %ecx
 	jne	L(cross_page_loop)
-	xorl	%eax, %eax
 L(different):
-	subl	%ecx, %eax
 	ret
 END (__strcmp_sse2_unaligned)