From patchwork Sat Jun 20 10:35:48 2015
X-Patchwork-Submitter: Ondrej Bilka
X-Patchwork-Id: 7268
Date: Sat, 20 Jun 2015 12:35:48 +0200
From: Ondřej Bílka
To: libc-alpha@sourceware.org
Subject: Re: [PATCH 2/1 v2 neleai/string-x64] Microoptimize strcmp-sse2-unaligned more.
Message-ID: <20150620103548.GA21670@domone>
References: <20150620083525.GA31992@domone> <20150620102256.GA16801@domone>
In-Reply-To: <20150620102256.GA16801@domone>

On Sat, Jun 20, 2015 at 12:22:56PM +0200, Ondřej Bílka wrote:
> On Sat, Jun 20, 2015 at 10:35:25AM +0200, Ondřej Bílka wrote:
> > 
> > Hi,
> > 
> > When I read strcmp again to improve strncmp and add an AVX2 strcmp,
> > I found that I had made several mistakes, mainly caused by first
> > optimizing the C template and then fixing the assembly.
> > 
> > The first was my idea to simplify the cross-page check by ORing the
> > source and destination pointers. I recall that I originally did
> > complex cross-page handling where false positives were cheap. Then I
> > found that, due to its size, it had a big overhead, and a simple
> > loop was faster when testing with firefox. That turned the original
> > decision into a bad one.
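As an aside, the OR-based page check described above can be sketched in C. This is an illustration only, not the glibc code: the function name and the 64-byte read size are assumptions here.

```c
#include <stdint.h>

/* Hypothetical sketch of the "OR src and dest" cross-page test: OR
 * the two addresses and ask whether a 64-byte load starting at either
 * one could cross a 4096-byte page.  Since (a | b) >= a and
 * (a | b) >= b numerically, a false negative is impossible; a false
 * positive merely takes the slow path, which is exactly the
 * cheap-check trade-off described above. */
static int may_cross_page(uintptr_t s1, uintptr_t s2)
{
    return ((s1 | s2) & 4095) > 4096 - 64;
}
```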
> > 
> > The second was to reorganize the loop instructions so that after
> > the loop ends I can find the last byte without recalculating much,
> > using the trick that the last 16-bit mask can be ORed with the
> > previous three, as it is relevant only when the previous three
> > were zero.
> > 
> > The third is that gcc generates bad loops with respect to where
> > pointers are incremented. You should place the increments after the
> > loads that use them, not at the start of the loop as gcc does. That
> > change is responsible for a 10% improvement for large sizes.
> > 
> > Last are microoptimizations that save a few bytes without
> > measurable performance impact, like using eax instead of rax to
> > save a byte, or moving zeroing instructions out of paths where
> > they are not needed.
> > 
> > Profile data are here; shortly with AVX2 for Haswell, which I will
> > submit next.
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile.html
> > 
> > OK to commit this?
> > 
> I missed a few microoptimizations. These save a few bytes, with no
> measurable impact.
> 
> 	* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
> 	(__strcmp_sse2_unaligned): Add several microoptimizations.
> 
This one.
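For readers following along, the mask-combining idea from the quoted text can be sketched portably in C. This is a scalar emulation, an assumption on my part: the real code builds each 16-bit chunk mask with pcmpeqb + pmovmskb and uses a subtler shift-free OR for the last mask, but the principle of merging four chunk masks so one bit scan locates the byte is the same.

```c
#include <stdint.h>

/* Scalar stand-in for pcmpeqb+pmovmskb against zero: bit i is set iff
 * byte i of the 16-byte chunk is zero. */
static uint64_t zero_mask16(const unsigned char *p)
{
    uint64_t m = 0;
    for (int i = 0; i < 16; i++)
        if (p[i] == 0)
            m |= (uint64_t)1 << i;
    return m;
}

/* Combine the four 16-bit chunk masks of a 64-byte block into one
 * 64-bit mask, so a single bit scan (bsf/tzcnt) finds the first zero
 * byte; returns -1 if the block contains none.  Uses the GCC/Clang
 * builtin __builtin_ctzll for the scan. */
static int first_zero_in_64(const unsigned char *p)
{
    uint64_t mask = zero_mask16(p)
                  | zero_mask16(p + 16) << 16
                  | zero_mask16(p + 32) << 32
                  | zero_mask16(p + 48) << 48;
    if (mask == 0)
        return -1;
    return __builtin_ctzll(mask);
}
```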
diff --git a/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
index 03d1b11..9a8f685 100644
--- a/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S
@@ -76,19 +76,17 @@ L(return):
 	subl	%edx, %eax
 	ret
 
-
 L(main_loop_header):
 	leaq	64(%rdi), %rdx
-	movl	$4096, %ecx
 	andq	$-64, %rdx
 	subq	%rdi, %rdx
 	leaq	(%rdi, %rdx), %rax
 	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$4095, %esi
-	subq	%rsi, %rcx
-	shrq	$6, %rcx
-	movq	%rcx, %rsi
+	movl	$4096, %esi
+	mov	%edx, %ecx
+	andl	$4095, %ecx
+	sub	%ecx, %esi
+	shr	$6, %esi
 
	.p2align 4
 L(loop):
@@ -140,10 +138,9 @@ L(back_to_loop):
 
 	.p2align 4
 L(loop_cross_page):
-	xor	%ecx, %ecx
-	movq	%rdx, %r9
-	and	$63, %r9
-	subq	%r9, %rcx
+	mov	%edx, %ecx
+	and	$63, %ecx
+	neg	%rcx
 	movdqa	(%rdx, %rcx), %xmm0
 	movdqa	16(%rdx, %rcx), %xmm1
 
@@ -177,8 +174,8 @@ L(loop_cross_page):
 	orq	%rcx, %rdi
 	salq	$48, %rsi
 	orq	%rsi, %rdi
-	movq	%r9, %rcx
-	movq	$63, %rsi
+	mov	%edx, %ecx
+	mov	$63, %esi
 	shrq	%cl, %rdi
 	test	%rdi, %rdi
 	je	L(back_to_loop)
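For reference, what the patched L(loop_cross_page) prologue computes can be written as a small C sketch. Names here are illustrative: the point is that after `and $63, %ecx; neg %rcx` the register holds the negated in-block offset, so that adding it to the address lands on the 64-byte-aligned block start, and the spare register (%r9) that the old code used to keep the offset is no longer needed (the offset is later re-derived from %edx for the shift count).

```c
#include <stdint.h>

/* Back an arbitrary address up to the start of its enclosing 64-byte
 * block via a negated offset, mirroring the patched sequence
 * `mov %edx, %ecx; and $63, %ecx; neg %rcx`. */
static uintptr_t block_start(uintptr_t addr)
{
    uintptr_t neg_off = -(addr & 63); /* what `neg %rcx` leaves behind */
    return addr + neg_off;            /* 64-byte-aligned block base   */
}
```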