x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]

Message ID 6fec374c-177f-b8e8-d7a3-ab10d7dab136@redhat.com
State Committed
Headers

Commit Message

Florian Weimer June 23, 2017, 1:51 p.m. UTC
  I think this revised patch is an improvement because it avoids adding a
branch.

Thanks,
Florian
  

Comments

H.J. Lu June 23, 2017, 3:08 p.m. UTC | #1
On Fri, Jun 23, 2017 at 6:51 AM, Florian Weimer <fweimer@redhat.com> wrote:
> I think this revised patch is an improvement because it avoids adding a
> branch.
>
> Thanks,
> Florian

The memcmp-avx2-movbe.S part is OK.  For

(check_result, do_random_tests): Write error messages to standard
output.

any particular reason to replace error with printf?

Thanks.
  
Florian Weimer June 23, 2017, 3:12 p.m. UTC | #2
On 06/23/2017 05:08 PM, H.J. Lu wrote:
> On Fri, Jun 23, 2017 at 6:51 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> I think this revised patch is an improvement because it avoids adding a
>> branch.
>>
>> Thanks,
>> Florian
> 
> The memcmp-avx2-movbe.S part is OK.

Thanks, I will commit it shortly.

> For
> 
> (check_result, do_random_tests): Write error messages to standard
> output.
> 
> any particular reason to replace error with printf?

The error function writes to standard error, so the test harness does
not capture it output in the .out file.

Thanks,
Florian
  
H.J. Lu June 23, 2017, 3:43 p.m. UTC | #3
On Fri, Jun 23, 2017 at 8:12 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/23/2017 05:08 PM, H.J. Lu wrote:
>> On Fri, Jun 23, 2017 at 6:51 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> I think this revised patch is an improvement because it avoids adding a
>>> branch.
>>>
>>> Thanks,
>>> Florian
>>
>> The memcmp-avx2-movbe.S part is OK.
>
> Thanks, I will commit it shortly.
>
>> For
>>
>> (check_result, do_random_tests): Write error messages to standard
>> output.
>>
>> any particular reason to replace error with printf?
>
> The error function writes to standard error, so the test harness does
> not capture it output in the .out file.

All string tests are like this.  We still got

FAIL: string/test-memcmp

[hjl@gnu-6 build-x86_64-linux]$ string/test-memcmp
string/test-memcmp: Wrong result in function __memcmp_avx2_movbe 1488148660 -168
string/test-memcmp: Wrong result in function __memcmp_avx2_movbe 1488237565 -168
string/test-memcmp: Wrong result in function __memcmp_avx2_movbe -1275218947 180
string/test-memcmp: Wrong result in function __memcmp_avx2_movbe -1275265792 180
                        simple_memcmp __memcmp_avx2_movbe
__memcmp_sse4_1__memcmp_ssse3 __memcmp_sse2
[hjl@gnu-6 build-x86_64-linux]$

If we want to change this,  it should be in a separate patch.

Thanks.
  
Florian Weimer June 23, 2017, 3:47 p.m. UTC | #4
On 06/23/2017 05:43 PM, H.J. Lu wrote:
> On Fri, Jun 23, 2017 at 8:12 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/23/2017 05:08 PM, H.J. Lu wrote:
>>> On Fri, Jun 23, 2017 at 6:51 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>>> I think this revised patch is an improvement because it avoids adding a
>>>> branch.
>>>>
>>>> Thanks,
>>>> Florian
>>>
>>> The memcmp-avx2-movbe.S part is OK.
>>
>> Thanks, I will commit it shortly.
>>
>>> For
>>>
>>> (check_result, do_random_tests): Write error messages to standard
>>> output.
>>>
>>> any particular reason to replace error with printf?
>>
>> The error function writes to standard error, so the test harness does
>> not capture it output in the .out file.
> 
> All string tests are like this.

Right.

> We still got
> 
> FAIL: string/test-memcmp
> 
> [hjl@gnu-6 build-x86_64-linux]$ string/test-memcmp
> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe 1488148660 -168
> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe 1488237565 -168
> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe -1275218947 180
> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe -1275265792 180
>                         simple_memcmp __memcmp_avx2_movbe
> __memcmp_sse4_1__memcmp_ssse3 __memcmp_sse2
> [hjl@gnu-6 build-x86_64-linux]$
> 
> If we want to change this,  it should be in a separate patch.

Agreed, and we should fix all the instances in one go.  I'm not going to
work on this immediately.

Florian
  
H.J. Lu June 23, 2017, 3:48 p.m. UTC | #5
On Fri, Jun 23, 2017 at 8:47 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/23/2017 05:43 PM, H.J. Lu wrote:
>> On Fri, Jun 23, 2017 at 8:12 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> On 06/23/2017 05:08 PM, H.J. Lu wrote:
>>>> On Fri, Jun 23, 2017 at 6:51 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>>>> I think this revised patch is an improvement because it avoids adding a
>>>>> branch.
>>>>>
>>>>> Thanks,
>>>>> Florian
>>>>
>>>> The memcmp-avx2-movbe.S part is OK.
>>>
>>> Thanks, I will commit it shortly.
>>>
>>>> For
>>>>
>>>> (check_result, do_random_tests): Write error messages to standard
>>>> output.
>>>>
>>>> any particular reason to replace error with printf?
>>>
>>> The error function writes to standard error, so the test harness does
>>> not capture it output in the .out file.
>>
>> All string tests are like this.
>
> Right.
>
>> We still got
>>
>> FAIL: string/test-memcmp
>>
>> [hjl@gnu-6 build-x86_64-linux]$ string/test-memcmp
>> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe 1488148660 -168
>> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe 1488237565 -168
>> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe -1275218947 180
>> string/test-memcmp: Wrong result in function __memcmp_avx2_movbe -1275265792 180
>>                         simple_memcmp __memcmp_avx2_movbe
>> __memcmp_sse4_1__memcmp_ssse3 __memcmp_sse2
>> [hjl@gnu-6 build-x86_64-linux]$
>>
>> If we want to change this,  it should be in a separate patch.
>
> Agreed, and we should fix all the instances in one go.  I'm not going to
> work on this immediately.
>
> Florian

OK without the error -> printf change.

Thanks.
  
Carlos O'Donell June 23, 2017, 4:38 p.m. UTC | #6
On 06/23/2017 09:51 AM, Florian Weimer wrote:
> I think this revised patch is an improvement because it avoids adding a
> branch.

OK on the following conditions:

- Split the printf changes out and do that in another pass.
- Clarify if the new testsuite change did catch the regression.
- Adjust comment for L(between_2_3):.
- Fix grammar in assembly comment.
 
> Thanks,
> Florian
> 
> 
> memcmp.patch
> 
> 
> x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]
> 
> This code:
> 
> L(between_2_3):
> 	/* Load as big endian with overlapping loads and bswap to avoid
> 	   branches.  */
> 	movzwl	-2(%rdi, %rdx), %eax
> 	movzwl	-2(%rsi, %rdx), %ecx
> 	shll	$16, %eax
> 	shll	$16, %ecx
> 	movzwl	(%rdi), %edi
> 	movzwl	(%rsi), %esi
> 	orl	%edi, %eax
> 	orl	%esi, %ecx
> 	bswap	%eax
> 	bswap	%ecx
> 	subl	%ecx, %eax
> 	ret
> 
> needs a saturating subtract because the full register is used.
> With this commit, only the lower 24 bits of the register are used,
> so a regular subtraction suffices.
> 
> The test case change adds coverage for these kinds of bugs.

Does the change to test-memcmp.c actually catch the bug reported by
the users?

> 2017-06-23  Florian Weimer  <fweimer@redhat.com>
> 
> 	[BZ #21662]
> 	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
> 	Peform saturating subtraction.
> 	* string/test-memcmp.c (check1): Check with different lengths.
> 	(check_result, do_random_tests): Write error messages to standard
> 	output.
> 
> diff --git a/string/test-memcmp.c b/string/test-memcmp.c
> index a7969ed..93bc972 100644
> --- a/string/test-memcmp.c
> +++ b/string/test-memcmp.c
> @@ -85,8 +85,8 @@ check_result (impl_t *impl, const CHAR *s1, const CHAR *s2, size_t len,
>        || (exp_result < 0 && result >= 0)
>        || (exp_result > 0 && result <= 0))
>      {
> -      error (0, 0, "Wrong result in function %s %d %d", impl->name,
> -	     result, exp_result);
> +      printf ("error: wrong result in function %s %d %d", impl->name,
> +	      result, exp_result);
>        ret = 1;
>        return -1;
>      }
> @@ -192,8 +192,10 @@ do_random_tests (void)
>  	      || (r < 0 && result >= 0)
>  	      || (r > 0 && result <= 0))
>  	    {
> -	      error (0, 0, "Iteration %zd - wrong result in function %s (%zd, %zd, %zd, %zd) %ld != %d, p1 %p p2 %p",
> -		     n, impl->name, align1 * CHARBYTES & 63,  align2 * CHARBYTES & 63, len, pos, r, result, p1, p2);
> +	      printf ("error: Iteration %zd - wrong result in function %s"
> +		      " (%zd, %zd, %zd, %zd) %ld != %d, p1 %p p2 %p",
> +		      n, impl->name, align1 * CHARBYTES & 63,
> +		      align2 * CHARBYTES & 63, len, pos, r, result, p1, p2);

Outputting informative messages should be done to stdout not stderr.

However, in the future please don't mix changes like this with a bug fix.

Please commit the above two hunks right away since I can't see anyone
objecting.

>  	      ret = 1;
>  	    }
>  	}
> @@ -441,11 +443,12 @@ check1 (void)
>  
>    n = 116;
>    for (size_t i = 0; i < n; i++)
> -    {
> -      exp_result = SIMPLE_MEMCMP (s1 + i, s2 + i, n - i);
> -      FOR_EACH_IMPL (impl, 0)
> -	check_result (impl, s1 + i, s2 + i, n - i, exp_result);
> -    }
> +    for (size_t len = 0; len <= n - i; ++len)
> +      {
> +	exp_result = SIMPLE_MEMCMP (s1 + i, s2 + i, len);
> +	FOR_EACH_IMPL (impl, 0)
> +	  check_result (impl, s1 + i, s2 + i, len, exp_result);
> +      }

OK, so you've added more iterations here over the length of the remaining
substring to compare.

I assume that this catches the regression by ensuring the high values of
the subtraction result in an underflow which results in a positive value
of the subtraction and a wrong answer?

>  }
>  
>  /* This test checks that memcmp doesn't overrun buffers.  */
> diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
> index 47630dd..4cc9386 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
> @@ -139,16 +139,17 @@ L(exit):
>  L(between_2_3):
>  	/* Load as big endian with overlapping loads and bswap to avoid
>  	   branches.  */

Was this comment ever accurate? mobzwl is not a BE load.

Given that you changed the code, please adjust the comment.

> -	movzwl	-2(%rdi, %rdx), %eax
> -	movzwl	-2(%rsi, %rdx), %ecx
> -	shll	$16, %eax
> -	shll	$16, %ecx
> -	movzwl	(%rdi), %edi
> -	movzwl	(%rsi), %esi
> -	orl	%edi, %eax
> -	orl	%esi, %ecx
> +	movzwl	(%rdi), %eax
> +	movzwl	(%rsi), %ecx

OK, move 32-bit with zero-extend to full register.

1,2,0,0

> +	shll	$8, %eax
> +	shll	$8, %ecx

OK, shift out the part we don't care about.

0,1,2,0

>  	bswap	%eax
>  	bswap	%ecx

OK, swap, leaving either:

0,2,1,0

> +	movzbl	-1(%rdi, %rdx), %edi
> +	movzbl	-1(%rsi, %rdx), %esi

OK, byte 3 or 2 loaded with everything else set to 0 (24 bits).

3,0,0,0

or

2,0,0,0

> +	orl	%edi, %eax
> +	orl	%esi, %ecx
> +	/* Subtraction is okay because the upper 8 bits a zero.  */

s/a zero/are zero/g

We either have:

3,2,1,0

or

2,2,1,0

>  	subl	%ecx, %eax

OK, must use 'l' to subtract full DWORD.

>  	ret
>
  
Florian Weimer June 23, 2017, 4:42 p.m. UTC | #7
On 06/23/2017 06:38 PM, Carlos O'Donell wrote:

> I assume that this catches the regression by ensuring the high values of
> the subtraction result in an underflow which results in a positive value
> of the subtraction and a wrong answer?

Yes, I thought I said so in the commit message.

> Was this comment ever accurate? mobzwl is not a BE load.

We used bswap, so the register contents before the comparison is in
big-endian format.

>> +	orl	%edi, %eax
>> +	orl	%esi, %ecx
>> +	/* Subtraction is okay because the upper 8 bits a zero.  */
> 
> s/a zero/are zero/g

Okay, I'll fix this typo in a follow-up commit.

I won't work on the error → printf cleanup for now.  HJ asked that they
should be put into a separate commit.

Thanks,
Florian
  
Richard Henderson June 23, 2017, 8:48 p.m. UTC | #8
On 06/23/2017 06:51 AM, Florian Weimer wrote:
> +	shll	$8, %eax
> +	shll	$8, %ecx
>   	bswap	%eax
>   	bswap	%ecx
> +	movzbl	-1(%rdi, %rdx), %edi
> +	movzbl	-1(%rsi, %rdx), %esi
> +	orl	%edi, %eax
> +	orl	%esi, %ecx

Do you really need to re-load and merge the final byte?
It would appear you could leave the low byte zero.


r~
  
Florian Weimer June 23, 2017, 8:52 p.m. UTC | #9
On 06/23/2017 10:48 PM, Richard Henderson wrote:
> On 06/23/2017 06:51 AM, Florian Weimer wrote:
>> +    shll    $8, %eax
>> +    shll    $8, %ecx
>>       bswap    %eax
>>       bswap    %ecx
>> +    movzbl    -1(%rdi, %rdx), %edi
>> +    movzbl    -1(%rsi, %rdx), %esi
>> +    orl    %edi, %eax
>> +    orl    %esi, %ecx
> 
> Do you really need to re-load and merge the final byte?
> It would appear you could leave the low byte zero.

Sorry, I don't understand.  Is this the same issue as H.J. raised and
improved?

The final/third byte certainly can affect the outcome of the comparison.

Thanks,
Florian
  
Richard Henderson June 23, 2017, 9:05 p.m. UTC | #10
On 06/23/2017 01:52 PM, Florian Weimer wrote:
> On 06/23/2017 10:48 PM, Richard Henderson wrote:
>> On 06/23/2017 06:51 AM, Florian Weimer wrote:
>>> +    shll    $8, %eax
>>> +    shll    $8, %ecx
>>>        bswap    %eax
>>>        bswap    %ecx
>>> +    movzbl    -1(%rdi, %rdx), %edi
>>> +    movzbl    -1(%rsi, %rdx), %esi
>>> +    orl    %edi, %eax
>>> +    orl    %esi, %ecx
>>
>> Do you really need to re-load and merge the final byte?
>> It would appear you could leave the low byte zero.
> 
> Sorry, I don't understand.  Is this the same issue as H.J. raised and
> improved?
> 
> The final/third byte certainly can affect the outcome of the comparison.

Obviously I didn't intend to imply that one can completely ignore the final 
byte.  From the larger context that I didn't quote, it appeared that the last 
byte has already been loaded.

That said, I take it all back -- I mis-read the diff.


r~
  

Patch

x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]

This code:

L(between_2_3):
	/* Load as big endian with overlapping loads and bswap to avoid
	   branches.  */
	movzwl	-2(%rdi, %rdx), %eax
	movzwl	-2(%rsi, %rdx), %ecx
	shll	$16, %eax
	shll	$16, %ecx
	movzwl	(%rdi), %edi
	movzwl	(%rsi), %esi
	orl	%edi, %eax
	orl	%esi, %ecx
	bswap	%eax
	bswap	%ecx
	subl	%ecx, %eax
	ret

needs a saturating subtract because the full register is used.
With this commit, only the lower 24 bits of the register are used,
so a regular subtraction suffices.

The test case change adds coverage for these kinds of bugs.

2017-06-23  Florian Weimer  <fweimer@redhat.com>

	[BZ #21662]
	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
	Peform saturating subtraction.
	* string/test-memcmp.c (check1): Check with different lengths.
	(check_result, do_random_tests): Write error messages to standard
	output.

diff --git a/string/test-memcmp.c b/string/test-memcmp.c
index a7969ed..93bc972 100644
--- a/string/test-memcmp.c
+++ b/string/test-memcmp.c
@@ -85,8 +85,8 @@  check_result (impl_t *impl, const CHAR *s1, const CHAR *s2, size_t len,
       || (exp_result < 0 && result >= 0)
       || (exp_result > 0 && result <= 0))
     {
-      error (0, 0, "Wrong result in function %s %d %d", impl->name,
-	     result, exp_result);
+      printf ("error: wrong result in function %s %d %d", impl->name,
+	      result, exp_result);
       ret = 1;
       return -1;
     }
@@ -192,8 +192,10 @@  do_random_tests (void)
 	      || (r < 0 && result >= 0)
 	      || (r > 0 && result <= 0))
 	    {
-	      error (0, 0, "Iteration %zd - wrong result in function %s (%zd, %zd, %zd, %zd) %ld != %d, p1 %p p2 %p",
-		     n, impl->name, align1 * CHARBYTES & 63,  align2 * CHARBYTES & 63, len, pos, r, result, p1, p2);
+	      printf ("error: Iteration %zd - wrong result in function %s"
+		      " (%zd, %zd, %zd, %zd) %ld != %d, p1 %p p2 %p",
+		      n, impl->name, align1 * CHARBYTES & 63,
+		      align2 * CHARBYTES & 63, len, pos, r, result, p1, p2);
 	      ret = 1;
 	    }
 	}
@@ -441,11 +443,12 @@  check1 (void)
 
   n = 116;
   for (size_t i = 0; i < n; i++)
-    {
-      exp_result = SIMPLE_MEMCMP (s1 + i, s2 + i, n - i);
-      FOR_EACH_IMPL (impl, 0)
-	check_result (impl, s1 + i, s2 + i, n - i, exp_result);
-    }
+    for (size_t len = 0; len <= n - i; ++len)
+      {
+	exp_result = SIMPLE_MEMCMP (s1 + i, s2 + i, len);
+	FOR_EACH_IMPL (impl, 0)
+	  check_result (impl, s1 + i, s2 + i, len, exp_result);
+      }
 }
 
 /* This test checks that memcmp doesn't overrun buffers.  */
diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
index 47630dd..4cc9386 100644
--- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
@@ -139,16 +139,17 @@  L(exit):
 L(between_2_3):
 	/* Load as big endian with overlapping loads and bswap to avoid
 	   branches.  */
-	movzwl	-2(%rdi, %rdx), %eax
-	movzwl	-2(%rsi, %rdx), %ecx
-	shll	$16, %eax
-	shll	$16, %ecx
-	movzwl	(%rdi), %edi
-	movzwl	(%rsi), %esi
-	orl	%edi, %eax
-	orl	%esi, %ecx
+	movzwl	(%rdi), %eax
+	movzwl	(%rsi), %ecx
+	shll	$8, %eax
+	shll	$8, %ecx
 	bswap	%eax
 	bswap	%ecx
+	movzbl	-1(%rdi, %rdx), %edi
+	movzbl	-1(%rsi, %rdx), %esi
+	orl	%edi, %eax
+	orl	%esi, %ecx
+	/* Subtraction is okay because the upper 8 bits a zero.  */
 	subl	%ecx, %eax
 	ret