x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]

Message ID CAMe9rOqwAAwAtDEWEsdcafSfQBRQXqPjEw9gfeVAVvp9ciVryQ@mail.gmail.com

Commit Message

H.J. Lu June 23, 2017, 6:33 p.m. UTC
  On Fri, Jun 23, 2017 at 9:42 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/23/2017 06:38 PM, Carlos O'Donell wrote:
>
>> I assume that this catches the regression by ensuring that the high
>> input values make the subtraction underflow, which results in a
>> positive value and a wrong answer?
>
> Yes, I thought I said so in the commit message.
>
>> Was this comment ever accurate? movzwl is not a BE load.
>
> We used bswap, so the register contents before the comparison are in
> big-endian format.
>
>>> +    orl     %edi, %eax
>>> +    orl     %esi, %ecx
>>> +    /* Subtraction is okay because the upper 8 bits a zero.  */
>>
>> s/a zero/are zero/g
>
> Okay, I'll fix this typo in a follow-up commit.

How about this patch to turn

movzbl -1(%rdi, %rdx), %edi
movzbl -1(%rsi, %rdx), %esi
orl %edi, %eax
orl %esi, %ecx

into

movb -1(%rdi, %rdx), %al
movb -1(%rsi, %rdx), %cl


H.J.
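
To spell out why the "upper 8 bits are zero" comment makes the final subl
safe (and what BZ #21662 was about), here is a small C model of the
2-/3-byte path.  It is only an illustration: the helper names and test
values are mine, and the word-load half of the sequence is reconstructed
from the diff context rather than quoted from glibc.

/* C model of the L(between_2_3) path (illustration only).  */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pack the n (2 or 3) bytes at P into the low 24 bits of a 32-bit value,
   most significant byte first: this is what the 2-byte load, shll, bswap
   and the final byte merge compute.  */
static uint32_t
pack_be24 (const unsigned char *p, size_t n)
{
  /* movzwl (%rdi), %eax; shll $8, %eax; bswap %eax
     (word load as reconstructed from the diff context).  */
  uint32_t v = ((uint32_t) p[0] << 16) | ((uint32_t) p[1] << 8);
  /* movb -1(%rdi, %rdx), %al (previously movzbl + orl).  */
  v |= p[n - 1];
  return v;   /* Upper 8 bits are zero, so v < (1 << 24).  */
}

static int
memcmp_2_3 (const void *s1, const void *s2, size_t n)
{
  uint32_t a = pack_be24 (s1, n);
  uint32_t b = pack_be24 (s2, n);
  /* Both operands are below 2^24, so the signed difference cannot
     overflow; its sign is the memcmp ordering.  With full 32-bit
     operands the difference could wrap, which was the BZ #21662 bug.  */
  return (int) a - (int) b;
}

int
main (void)
{
  unsigned char x[3] = { 0x01, 0xff, 0x00 };
  unsigned char y[3] = { 0x01, 0x00, 0xff };
  printf ("%d %d\n", memcmp_2_3 (x, y, 3) > 0, memcmp (x, y, 3) > 0);
  return 0;
}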
  

Comments

H.J. Lu June 23, 2017, 7:12 p.m. UTC | #1
On Fri, Jun 23, 2017 at 11:33 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jun 23, 2017 at 9:42 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/23/2017 06:38 PM, Carlos O'Donell wrote:
>>
>>> I assume that this catches the regression by ensuring that the high
>>> input values make the subtraction underflow, which results in a
>>> positive value and a wrong answer?
>>
>> Yes, I thought I said so in the commit message.
>>
>>> Was this comment ever accurate? movzwl is not a BE load.
>>
>> We used bswap, so the register contents before the comparison are in
>> big-endian format.
>>
>>>> +    orl     %edi, %eax
>>>> +    orl     %esi, %ecx
>>>> +    /* Subtraction is okay because the upper 8 bits a zero.  */
>>>
>>> s/a zero/are zero/g
>>
>> Okay, I'll fix this typo in a follow-up commit.
>
> How about this patch to turn
>
> movzbl -1(%rdi, %rdx), %edi
> movzbl -1(%rsi, %rdx), %esi
> orl %edi, %eax
> orl %esi, %ecx
>
> into
>
> movb -1(%rdi, %rdx), %al
> movb -1(%rsi, %rdx), %cl

Here is the benchmark result on Haswell.

[hjl@gnu-6 glibc-test]$ make
./test
movb  : 19937666
movzbl: 21518186
[hjl@gnu-6 glibc-test]$
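
The glibc-test harness itself is not posted in this thread, so the
following is only a sketch of the sort of loop that could produce numbers
of this shape.  The iteration count, the rdtsc timing, the dummy buffers
and the reconstruction of the instructions around the hunk are all my
guesses, not H.J.'s benchmark.

/* Sketch of a possible microbenchmark (not the actual glibc-test).  */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc */

#define ITERS 100000000ULL

static unsigned char buf1[4] = { 1, 2, 3, 4 };
static unsigned char buf2[4] = { 1, 2, 3, 5 };

/* Tail compare for length 3 using movzbl + orl, as before the patch.  */
static uint64_t
bench_movzbl (void)
{
  uint64_t start = __rdtsc ();
  for (uint64_t i = 0; i < ITERS; i++)
    asm volatile ("movzwl (%[a]), %%eax\n\t"
                  "movzwl (%[b]), %%ecx\n\t"
                  "shll $8, %%eax\n\t"
                  "shll $8, %%ecx\n\t"
                  "bswap %%eax\n\t"
                  "bswap %%ecx\n\t"
                  "movzbl -1(%[a], %[n]), %%edi\n\t"
                  "movzbl -1(%[b], %[n]), %%esi\n\t"
                  "orl %%edi, %%eax\n\t"
                  "orl %%esi, %%ecx\n\t"
                  "subl %%ecx, %%eax"
                  :
                  : [a] "r" (buf1), [b] "r" (buf2), [n] "r" ((uint64_t) 3)
                  : "eax", "ecx", "edi", "esi", "cc", "memory");
  return __rdtsc () - start;
}

/* Tail compare for length 3 using movb, as after the patch.  */
static uint64_t
bench_movb (void)
{
  uint64_t start = __rdtsc ();
  for (uint64_t i = 0; i < ITERS; i++)
    asm volatile ("movzwl (%[a]), %%eax\n\t"
                  "movzwl (%[b]), %%ecx\n\t"
                  "shll $8, %%eax\n\t"
                  "shll $8, %%ecx\n\t"
                  "bswap %%eax\n\t"
                  "bswap %%ecx\n\t"
                  "movb -1(%[a], %[n]), %%al\n\t"
                  "movb -1(%[b], %[n]), %%cl\n\t"
                  "subl %%ecx, %%eax"
                  :
                  : [a] "r" (buf1), [b] "r" (buf2), [n] "r" ((uint64_t) 3)
                  : "eax", "ecx", "cc", "memory");
  return __rdtsc () - start;
}

int
main (void)
{
  printf ("movb  : %llu\n", (unsigned long long) bench_movb ());
  printf ("movzbl: %llu\n", (unsigned long long) bench_movzbl ());
  return 0;
}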
  
Florian Weimer June 23, 2017, 7:27 p.m. UTC | #2
On 06/23/2017 09:12 PM, H.J. Lu wrote:

>> movzbl -1(%rdi, %rdx), %edi
>> movzbl -1(%rsi, %rdx), %esi
>> orl %edi, %eax
>> orl %esi, %ecx
>>
>> into
>>
>> movb -1(%rdi, %rdx), %al
>> movb -1(%rsi, %rdx), %cl
> 
> Here is the benchmark result on Haswell.
> 
> [hjl@gnu-6 glibc-test]$ make
> ./test
> movb  : 19937666
> movzbl: 21518186
> [hjl@gnu-6 glibc-test]$

Interesting.  So there isn't a steep penalty for partial register writes
anymore?  Your patch is a nice improvement then.

Thanks,
Florian
  
H.J. Lu June 23, 2017, 7:40 p.m. UTC | #3
On Fri, Jun 23, 2017 at 12:27 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/23/2017 09:12 PM, H.J. Lu wrote:
>
>>> movzbl -1(%rdi, %rdx), %edi
>>> movzbl -1(%rsi, %rdx), %esi
>>> orl %edi, %eax
>>> orl %esi, %ecx
>>>
>>> into
>>>
>>> movb -1(%rdi, %rdx), %al
>>> movb -1(%rsi, %rdx), %cl
>>
>> Here is the benchmark result on Haswell.
>>
>> [hjl@gnu-6 glibc-test]$ make
>> ./test
>> movb  : 19937666
>> movzbl: 21518186
>> [hjl@gnu-6 glibc-test]$
>
> Interesting.  So there isn't a steep penalty for partial register writes
> anymore?  Your patch is a nice improvement then.

The Intel Optimization Guide says:

  In Intel microarchitecture code name Sandy Bridge, partial register
  access is handled in hardware by inserting a micro-op that merges the
  partial register with the full register in the following cases:

  • After a write to one of the registers AH, BH, CH or DH and before a
    following read of the 2-, 4- or 8-byte form of the same register.  In
    these cases a merge micro-op is inserted.  The insertion consumes a
    full allocation cycle in which other micro-ops cannot be allocated.

  • After a micro-op with a destination register of 1 or 2 bytes, which
    is not a source of the instruction (or the register's bigger form),
    and before a following read of a 2-, 4- or 8-byte form of the same
    register.  In these cases the merge micro-op is part of the flow.

Neither case applies here on Haswell and Skylake.

I am checking in my patch now.
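
To make the two quoted cases concrete, here are two tiny inline-asm
fragments, one per case.  They only illustrate the write-partial-then-
read-full register patterns the guide describes; they are not glibc code
and say nothing about what any particular core charges for them.

/* Two fragments matching the quoted cases (illustration only).  */
#include <stdint.h>
#include <stdio.h>

/* Case 1: write AH, then read the 4-byte form of the same register.  */
static uint32_t
case_high_byte (uint32_t x)
{
  asm ("movb $0x12, %%ah" : "+a" (x));  /* partial write to %ah...  */
  return x;                             /* ...then %eax is read.  */
}

/* Case 2: a 1-byte destination that is not a source of the instruction,
   then a read of the 4-byte form.  The movb lines in the patch below
   have this shape: %al/%cl are written, then subl reads %eax and %ecx.  */
static uint32_t
case_low_byte (uint32_t x, const unsigned char *p)
{
  asm ("movb (%1), %%al" : "+a" (x) : "r" (p));
  return x;
}

int
main (void)
{
  unsigned char byte = 0x34;
  printf ("%#x %#x\n", (unsigned int) case_high_byte (0xaabbccdd),
          (unsigned int) case_low_byte (0xaabbccdd, &byte));
  return 0;
}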
  

Patch

From acecb3f7de4892b68ec1b464a576ee84b3f97527 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 23 Jun 2017 11:29:38 -0700
Subject: [PATCH] x86-64: Optimize L(between_2_3) in memcmp-avx2-movbe.S

Turn

	movzbl	-1(%rdi, %rdx), %edi
	movzbl	-1(%rsi, %rdx), %esi
	orl	%edi, %eax
	orl	%esi, %ecx

into

	movb	-1(%rdi, %rdx), %al
	movb	-1(%rsi, %rdx), %cl

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
	Replace movzbl and orl with movb.
---
 sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
index 9d19210..abcc61c 100644
--- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
@@ -144,11 +144,9 @@  L(between_2_3):
 	shll	$8, %ecx
 	bswap	%eax
 	bswap	%ecx
-	movzbl	-1(%rdi, %rdx), %edi
-	movzbl	-1(%rsi, %rdx), %esi
-	orl	%edi, %eax
-	orl	%esi, %ecx
-	/* Subtraction is okay because the upper 8 bits a zero.  */
+	movb	-1(%rdi, %rdx), %al
+	movb	-1(%rsi, %rdx), %cl
+	/* Subtraction is okay because the upper 8 bits are zero.  */
 	subl	%ecx, %eax
 	ret
 
-- 
2.9.4