From patchwork Sat Jun 20 10:35:48 2015
X-Patchwork-Submitter: Ondrej Bilka
X-Patchwork-Id: 7268
Date: Sat, 20 Jun 2015 12:35:48 +0200
From: Ondřej Bílka
To: libc-alpha@sourceware.org
Subject: Re: [PATCH 2/1 v2 neleai/string-x64] Microoptimize strcmp-sse2-unaligned more.
Message-ID: <20150620103548.GA21670@domone>
References: <20150620083525.GA31992@domone> <20150620102256.GA16801@domone>
In-Reply-To: <20150620102256.GA16801@domone>

On Sat, Jun 20, 2015 at 12:22:56PM +0200, Ondřej Bílka wrote:
> On Sat, Jun 20, 2015 at 10:35:25AM +0200, Ondřej Bílka wrote:
> > 
> > Hi,
> > 
> > When I read strcmp again to improve strncmp and add an AVX2 strcmp,
> > I found that I had made several mistakes, mainly caused by first
> > optimizing the C template and then fixing the assembly.
> > 
> > The first was my idea to simplify the cross-page check by ORing the
> > source and destination pointers. I recall that I originally did
> > complex cross-page handling where false positives were cheap. Then I
> > found that, due to its size, it had a big overhead, and a simple
> > loop was faster when testing with firefox. That turned the original
> > decision into a bad one.
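As an aside, the OR-based page check described above can be sketched in C. This is an illustration only, not the glibc code: the function name and the 64-byte read size are assumptions here.

```c
#include <stdint.h>

/* Hypothetical sketch of the "OR src and dest" cross-page test: OR
 * the two addresses and ask whether a 64-byte load starting at either
 * one could cross a 4096-byte page.  Since (a | b) >= a and
 * (a | b) >= b numerically, a false negative is impossible; a false
 * positive merely takes the slow path, which is exactly the
 * cheap-check trade-off described above. */
static int may_cross_page(uintptr_t s1, uintptr_t s2)
{
    return ((s1 | s2) & 4095) > 4096 - 64;
}
```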
> > 
> > The second was to reorganize the loop instructions so that after
> > the loop ends I can find the last byte without recalculating much,
> > using the trick that the last 16-bit mask can be ORed with the
> > previous three, as it is relevant only when the previous three
> > were zero.
> > 
> > The third is that gcc generates bad loops with respect to where
> > pointers are incremented. You should place the increments after the
> > loads that use them, not at the start of the loop as gcc does. That
> > change is responsible for a 10% improvement for large sizes.
> > 
> > Last are microoptimizations that save a few bytes without
> > measurable performance impact, like using eax instead of rax to
> > save a byte, or moving zeroing instructions out of paths where
> > they are not needed.
> > 
> > Profile data are here; shortly with AVX2 for Haswell, which I will
> > submit next.
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile.html
> > 
> > OK to commit this?
> > 
> I missed a few microoptimizations. These save a few bytes, with no
> measurable impact.
> 
> 	* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
> 	(__strcmp_sse2_unaligned): Add several microoptimizations.
> 
This one.
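For readers following along, the mask-combining idea from the quoted text can be sketched portably in C. This is a scalar emulation, an assumption on my part: the real code builds each 16-bit chunk mask with pcmpeqb + pmovmskb and uses a subtler shift-free OR for the last mask, but the principle of merging four chunk masks so one bit scan locates the byte is the same.

```c
#include <stdint.h>

/* Scalar stand-in for pcmpeqb+pmovmskb against zero: bit i is set iff
 * byte i of the 16-byte chunk is zero. */
static uint64_t zero_mask16(const unsigned char *p)
{
    uint64_t m = 0;
    for (int i = 0; i < 16; i++)
        if (p[i] == 0)
            m |= (uint64_t)1 << i;
    return m;
}

/* Combine the four 16-bit chunk masks of a 64-byte block into one
 * 64-bit mask, so a single bit scan (bsf/tzcnt) finds the first zero
 * byte; returns -1 if the block contains none.  Uses the GCC/Clang
 * builtin __builtin_ctzll for the scan. */
static int first_zero_in_64(const unsigned char *p)
{
    uint64_t mask = zero_mask16(p)
                  | zero_mask16(p + 16) << 16
                  | zero_mask16(p + 32) << 32
                  | zero_mask16(p + 48) << 48;
    if (mask == 0)
        return -1;
    return __builtin_ctzll(mask);
}
```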
diff --git a/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
index 03d1b11..9a8f685 100644
--- a/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
@@ -76,19 +76,17 @@ L(return):
 	subl	%edx, %eax
 	ret
 
-
 L(main_loop_header):
 	leaq	64(%rdi), %rdx
-	movl	$4096, %ecx
 	andq	$-64, %rdx
 	subq	%rdi, %rdx
 	leaq	(%rdi, %rdx), %rax
 	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$4095, %esi
-	subq	%rsi, %rcx
-	shrq	$6, %rcx
-	movq	%rcx, %rsi
+	movl	$4096, %esi
+	mov	%edx, %ecx
+	andl	$4095, %ecx
+	sub	%ecx, %esi
+	shr	$6, %esi
 
	.p2align 4
 L(loop):
@@ -140,10 +138,9 @@ L(back_to_loop):
 
 	.p2align 4
 L(loop_cross_page):
-	xor	%ecx, %ecx
-	movq	%rdx, %r9
-	and	$63, %r9
-	subq	%r9, %rcx
+	mov	%edx, %ecx
+	and	$63, %ecx
+	neg	%rcx
 	movdqa	(%rdx, %rcx), %xmm0
 	movdqa	16(%rdx, %rcx), %xmm1
 
@@ -177,8 +174,8 @@ L(loop_cross_page):
 	orq	%rcx, %rdi
 	salq	$48, %rsi
 	orq	%rsi, %rdi
-	movq	%r9, %rcx
-	movq	$63, %rsi
+	mov	%edx, %ecx
+	mov	$63, %esi
 	shrq	%cl, %rdi
 	test	%rdi, %rdi
 	je	L(back_to_loop)
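For reference, what the patched L(loop_cross_page) prologue computes can be written as a small C sketch. Names here are illustrative: the point is that after `and $63, %ecx; neg %rcx` the register holds the negated in-block offset, so that adding it to the address lands on the 64-byte-aligned block start, and the spare register (%r9) that the old code used to keep the offset is no longer needed (the offset is later re-derived from %edx for the shift count).

```c
#include <stdint.h>

/* Back an arbitrary address up to the start of its enclosing 64-byte
 * block via a negated offset, mirroring the patched sequence
 * `mov %edx, %ecx; and $63, %ecx; neg %rcx`. */
static uintptr_t block_start(uintptr_t addr)
{
    uintptr_t neg_off = -(addr & 63); /* what `neg %rcx` leaves behind */
    return addr + neg_off;            /* 64-byte-aligned block base   */
}
```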