[2/1,v2,neleai/string-x64] Microoptimize strcmp-sse2-unaligned more.
Commit Message
On Sat, Jun 20, 2015 at 12:22:56PM +0200, Ondřej Bílka wrote:
> On Sat, Jun 20, 2015 at 10:35:25AM +0200, Ondřej Bílka wrote:
> >
> > Hi,
> >
> > When I read strcmp again in order to improve strncmp and to add an
> > avx2 strcmp, I found that I had made several mistakes, mainly caused
> > by first optimizing the C template and then fixing up the assembly.
> >
> > The first was my idea to simplify the cross-page check by oring src
> > and dest. I recall that I originally did complex cross-page handling
> > where false positives were cheap. Then I found that, due to its size,
> > it had a big overhead and that a simple loop was faster when testing
> > with Firefox. That turned the original decision into a bad one.
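> >
> > Roughly, the oring trick reduces the check to something like the
> > following (a schematic fragment, not the exact code in the file):
> >
> >     movl    %edi, %eax      # low bits of src
> >     orl     %esi, %eax      # or in low bits of dest
> >     andl    $4095, %eax     # pessimistic page offset for both pointers
> >     cmpl    $4032, %eax     # 4096 - 64: could a 64-byte read cross a page?
> >     ja      L(cross_page)
> >
> > It gives a false positive when only one of the pointers is near a
> > page end; that was fine while the cross-page path handled false
> > positives cheaply, but not once that path became a simple loop.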
> >
> > The second is to reorganize the loop instructions so that after the
> > loop ends I can find the byte that ended it without recalculating
> > much, using the trick that the last 16-bit mask can simply be ored
> > with the previous three, as it is only relevant when the previous
> > three were zero.
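> >
> > Schematically, the end-of-loop handling builds a 64-bit mask out of
> > the four 16-bit pmovmskb results, assuming %xmm0..%xmm3 already hold
> > the byte-wise compare results (0xff marking a mismatch or nul) for
> > the four 16-byte chunks; the register allocation here is made up:
> >
> >     pmovmskb %xmm0, %edi    # mask of bytes 0..15
> >     pmovmskb %xmm1, %ecx    # mask of bytes 16..31
> >     pmovmskb %xmm2, %r8d    # mask of bytes 32..47
> >     pmovmskb %xmm3, %esi    # mask of bytes 48..63
> >     salq    $16, %rcx
> >     salq    $32, %r8
> >     orq     %rcx, %rdi
> >     orq     %r8, %rdi
> >     salq    $48, %rsi
> >     orq     %rsi, %rdi      # last mask just gets ored on top
> >     bsfq    %rdi, %rdi      # index of first mismatch or nul in the 64 bytes
> >
> > The bsf reaches the top 16 bits only when the lower 48 are zero, so
> > the last mask needs no special handling.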
> >
> > The third is that gcc generates bad loops with respect to where it
> > increments the pointers. The increments should be placed after the
> > loads that use them, not at the start of the loop as gcc does. That
> > change is responsible for a 10% improvement for large sizes.
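> >
> > For the loop itself that means roughly this shape (a schematic
> > 16-bytes-per-iteration sketch; the real loop handles 64 bytes per
> > iteration and keeps %xmm7 zeroed):
> >
> > L(loop):
> >     movdqu  (%rdx), %xmm0   # loads use the not-yet-incremented pointers
> >     movdqa  (%rax), %xmm1
> >     addq    $16, %rdx       # increments go after the loads that use them,
> >     addq    $16, %rax       # not at the top of the loop
> >     pcmpeqb %xmm1, %xmm0    # 0xff where the bytes are equal
> >     pminub  %xmm1, %xmm0    # 0x00 where they differ or the byte is nul
> >     pcmpeqb %xmm7, %xmm0    # 0xff now marks a mismatch or nul
> >     pmovmskb %xmm0, %ecx
> >     testl   %ecx, %ecx
> >     je      L(loop)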
> >
> > Finally, there are microoptimizations that save a few bytes with no
> > measurable performance impact, like using eax instead of rax to save
> > a byte, or dropping zeroing instructions where they are not needed.
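> >
> > For example, the 32-bit form of an immediate move is two bytes
> > shorter (no REX.W prefix and a shorter opcode form), and a write to
> > a 32-bit register zeroes the upper half, so %rsi ends up with the
> > same value:
> >
> >     movq    $63, %rsi       # 7 bytes
> >     mov     $63, %esi       # 5 bytes, same result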
> >
> > Profile data are here; they will shortly also cover the avx2 variant
> > for Haswell that I will submit next.
> >
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile.html
> >
> > OK to commit this?
> >
> I missed a few microoptimizations. These save a few bytes, with no
> measurable performance impact.
>
> * sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
> (__strcmp_sse2_unaligned): Add several microoptimizations.
>
This one.
@@ -76,19 +76,17 @@ L(return):
subl %edx, %eax
ret
-
L(main_loop_header):
leaq 64(%rdi), %rdx
- movl $4096, %ecx
andq $-64, %rdx
subq %rdi, %rdx
leaq (%rdi, %rdx), %rax
addq %rsi, %rdx
- movq %rdx, %rsi
- andl $4095, %esi
- subq %rsi, %rcx
- shrq $6, %rcx
- movq %rcx, %rsi
+ movl $4096, %esi
+ mov %edx, %ecx
+ andl $4095, %ecx
+ sub %ecx, %esi
+ shr $6, %esi
.p2align 4
L(loop):
@@ -140,10 +138,9 @@ L(back_to_loop):
.p2align 4
L(loop_cross_page):
- xor %ecx, %ecx
- movq %rdx, %r9
- and $63, %r9
- subq %r9, %rcx
+ mov %edx, %ecx
+ and $63, %ecx
+ neg %rcx
movdqa (%rdx, %rcx), %xmm0
movdqa 16(%rdx, %rcx), %xmm1
@@ -177,8 +174,8 @@ L(loop_cross_page):
orq %rcx, %rdi
salq $48, %rsi
orq %rsi, %rdi
- movq %r9, %rcx
- movq $63, %rsi
+ mov %edx, %ecx
+ mov $63, %esi
shrq %cl, %rdi
test %rdi, %rdi
je L(back_to_loop)