x86-64: Add memcmp/wmemcmp optimized with AVX2
Commit Message
On Tue, Jun 20, 2017 at 11:16 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sat, Jun 17, 2017 at 3:44 AM, Andrew Senkevich
> <andrew.n.senkevich@gmail.com> wrote:
>> 2017-06-16 4:15 GMT+02:00 H.J. Lu <hjl.tools@gmail.com>:
>>> On Thu, Jun 15, 2017 at 5:34 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>>> On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
>>>>> Optimize x86-64 memcmp/wmemcmp with AVX2. It uses vector compare as
>>>>> much as possible. It is as fast as SSE4 memcmp for size <= 16 bytes
>>>>> and up to 2X faster for size > 16 bytes on Haswell and Skylake. Select
>>>>> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
>>>>> AVX unaligned load is fast.
>>>>>
>>>>> Key features:
>>>>>
>>>>> 1. Use overlapping compare to avoid branch.
>>>>> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>>>>> bytes for wmemcmp.
>>>>> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
>>>>> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
>>>>> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
>>>>> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
>>>>> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
>>>>>
>>>>> Any comments?
>>>>>
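The overlapping compare in item 1 and the two-vector case in item 5 above can be sketched outside the assembly. Below is a minimal C illustration with AVX2 intrinsics, assuming VEC_SIZE is 32 and a size between 32 and 64 bytes; it only reports whether the buffers differ (the ordered result is computed separately, as in the bsf path discussed later in the thread), and the function name is hypothetical:

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical sketch: compare N bytes, 32 <= N <= 64, with two
   32-byte loads that may overlap, so no branch on the exact size is
   needed.  Returns nonzero if the buffers differ.  */
static int
differs_32_to_64 (const unsigned char *s1, const unsigned char *s2, size_t n)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  /* Last 32 bytes; overlaps the first load whenever n < 64.  */
  __m256i c = _mm256_loadu_si256 ((const __m256i *) (s1 + n - 32));
  __m256i d = _mm256_loadu_si256 ((const __m256i *) (s2 + n - 32));
  unsigned int m1 = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
  unsigned int m2 = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (c, d));
  /* 0xffffffff means every byte pair matched.  */
  return (m1 & m2) != 0xffffffffu;
}

Because the trailing load starts at n - 32, any bytes covered twice are simply compared twice, which is harmless and avoids a branch on whether n equals 32.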
>>>> I have some comments; it's similar to one of my previous patches.
>>>>
>>>>> + cmpq $(VEC_SIZE * 2), %rdx
>>>>> + ja L(more_2x_vec)
>>>>> +
>>>> This is an unnecessary branch; it's likely that there is a difference in the
>>>> first 16 bytes regardless of size. Move the size test...
>>>>> +L(last_2x_vec):
>>>>> + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */
>>>>> + vmovdqu (%rsi), %ymm2
>>>>> + VPCMPEQ (%rdi), %ymm2, %ymm2
>>>>> + vpmovmskb %ymm2, %eax
>>>>> + subl $VEC_MASK, %eax
>>>>> + jnz L(first_vec)
>>>> here.
>>>>
>>>
>>> If we do that, the size check becomes redundant in
>>>
>>> /* Less than 4 * VEC. */
>>> cmpq $VEC_SIZE, %rdx
>>> jbe L(last_vec)
>>> cmpq $(VEC_SIZE * 2), %rdx
>>> jbe L(last_2x_vec)
>>>
>>> L(last_4x_vec):
>>>
>>> Of course, we can duplicate these blocks to avoid the size check.
>>>
>>>>
>>>>> +L(first_vec):
>>>>> + /* A byte or int32 is different within 16 or 32 bytes. */
>>>>> + bsfl %eax, %ecx
>>>>> +# ifdef USE_AS_WMEMCMP
>>>>> + xorl %eax, %eax
>>>>> + movl (%rdi, %rcx), %edx
>>>>> + cmpl (%rsi, %rcx), %edx
>>>>> +L(wmemcmp_return):
>>>>> + setl %al
>>>>> + negl %eax
>>>>> + orl $1, %eax
>>>>> +# else
>>>>> + movzbl (%rdi, %rcx), %eax
>>>>> + movzbl (%rsi, %rcx), %edx
>>>>> + sub %edx, %eax
>>>>> +# endif
>>>>> + VZEROUPPER
>>>>> + ret
>>>>> +
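As an aside, the setl/negl/orl $1 sequence in the wmemcmp path above turns the signed int32 comparison into a branch-free -1/+1 result, and the memcmp path simply subtracts the two zero-extended bytes. A rough C equivalent of both return paths (an editorial sketch, not part of the patch):

/* Sketch of the wmemcmp return path: called only when a != b.
   setl gives 0 or 1, negl gives 0 or -1, orl $1 gives 1 or -1.  */
static int
wmemcmp_return (int a, int b)
{
  int lt = a < b;   /* setl %al     */
  lt = -lt;         /* negl %eax    */
  return lt | 1;    /* orl $1, %eax */
}

/* Sketch of the memcmp return path: movzbl/movzbl/sub.  */
static int
memcmp_return (unsigned char a, unsigned char b)
{
  return (int) a - (int) b;
}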
>>>>
>>>> Loading bytes depending on the result of bsf is slow; an alternative is to
>>>> find that from vector tests. I could avoid it with tests like this, but I
>>>> haven't measured performance or tested it yet.
>>>>
>>>> vmovdqu (%rdi), %ymm3
>>>>
>>>> VPCMPGTQ %ymm2, %ymm3, %ymm4
>>>> VPCMPGTQ %ymm3, %ymm2, %ymm5
>>>> vpmovmskb %ymm4, %eax
>>>> vpmovmskb %ymm5, %edx
>>>> neg %eax
>>>> neg %edx
>>>> lzcnt %eax, %eax
>>>> lzcnt %edx, %edx
>>>> sub %edx, %eax
>>>> ret
>>>
>>> Andrew, can you give it a try?
>>
>> Hi Ondrej, could you send a patch with your proposal?
>> I tried it with the following change and got many wrong results from test-memcmp:
>
> We can't use VPCMPGT for memcmp since it performs a signed
> comparison, but memcmp requires an unsigned comparison.
>
> H.J.
Here is a patch. Any comments?
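The signed/unsigned point is easy to demonstrate: memcmp compares bytes as unsigned char, so 0x80 must compare greater than 0x01, while a signed element compare such as VPCMPGT orders them the other way. A small stand-alone check (an editorial example, not from the thread):

#include <assert.h>
#include <string.h>

int
main (void)
{
  unsigned char a[1] = { 0x80 };
  unsigned char b[1] = { 0x01 };

  /* memcmp compares as unsigned char: 0x80 > 0x01.  */
  assert (memcmp (a, b, 1) > 0);

  /* A signed compare, which is what VPCMPGT performs, sees
     (signed char) 0x80 as -128 and orders the bytes the other way.  */
  assert ((signed char) a[0] < (signed char) b[0]);
  return 0;
}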
Comments
On Fri, Jun 23, 2017 at 2:00 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> Here is a patch. Any comments?
>
I will check it in this week.
> --
> H.J.
> ----
> Check the first 32 bytes before checking size when size >= 32 bytes
> to avoid an unnecessary branch if the difference is in the first 32 bytes.
> Replace vpmovmskb/subl/jnz with vptest/jnc.
>
> On Haswell, the new version is as fast as the previous one. On Skylake,
> the new version is a little bit faster.
>
> * sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
> the first 32 bytes before checking size when size >= 32 bytes.
> Replace vpmovmskb/subl/jnz with vptest/jnc.
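For context on the vptest/jnc change: vptest against an all-ones register (%ymm0, produced by VPCMPEQ %ymm0, %ymm0, %ymm0 in the patch below) sets the carry flag only when the combined VPCMPEQ mask is all ones, so jnc branches exactly when some byte pair differed. A hedged C sketch of the equivalent test using AVX2 intrinsics (editorial, not taken from the patch):

#include <immintrin.h>

/* Editorial sketch: return nonzero when the 32 bytes at p1 and p2
   differ, testing the compare mask the way "vptest %ymm0, %ymm5"
   does (CF is set only when the mask is all ones).  */
static int
differs_32 (const void *p1, const void *p2)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) p1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) p2);
  __m256i eq = _mm256_cmpeq_epi8 (a, b);
  __m256i ones = _mm256_set1_epi8 (-1);  /* like VPCMPEQ %ymm0, %ymm0, %ymm0 */
  /* _mm256_testc_si256 (eq, ones) returns CF: 1 iff ~eq & ones is zero,
     i.e. every byte pair matched.  */
  return !_mm256_testc_si256 (eq, ones);
}

The compare mask never has to be moved to a general-purpose register just to be tested, which is plausibly where the small Skylake improvement reported above comes from.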
From ebd5d69692463adc3733716fa1d608970d037bd6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Tue, 20 Jun 2017 04:05:28 -0700
Subject: [PATCH] x86-64: Optimize memcmp-avx2-movbe.S for short difference
Check the first 32 bytes before checking size when size >= 32 bytes
to avoid an unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.
On Haswell, the new version is as fast as the previous one. On Skylake,
the new version is a little bit faster.
* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
the first 32 bytes before checking size when size >= 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.
---
sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S | 118 ++++++++++++++-------------
1 file changed, 62 insertions(+), 56 deletions(-)
@@ -62,9 +62,68 @@ ENTRY (MEMCMP)
# endif
cmpq $VEC_SIZE, %rdx
jb L(less_vec)
+
+ /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */
+ vmovdqu (%rsi), %ymm2
+ VPCMPEQ (%rdi), %ymm2, %ymm2
+ vpmovmskb %ymm2, %eax
+ subl $VEC_MASK, %eax
+ jnz L(first_vec)
+
cmpq $(VEC_SIZE * 2), %rdx
- ja L(more_2x_vec)
+ jbe L(last_vec)
+
+ VPCMPEQ %ymm0, %ymm0, %ymm0
+ /* More than 2 * VEC. */
+ cmpq $(VEC_SIZE * 8), %rdx
+ ja L(more_8x_vec)
+ cmpq $(VEC_SIZE * 4), %rdx
+ jb L(last_4x_vec)
+
+ /* From 4 * VEC to 8 * VEC, inclusively. */
+ vmovdqu (%rsi), %ymm1
+ VPCMPEQ (%rdi), %ymm1, %ymm1
+
+ vmovdqu VEC_SIZE(%rsi), %ymm2
+ VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
+
+ vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
+ VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+
+ vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
+ VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+
+ vpand %ymm1, %ymm2, %ymm5
+ vpand %ymm3, %ymm4, %ymm6
+ vpand %ymm5, %ymm6, %ymm5
+
+ vptest %ymm0, %ymm5
+ jnc L(4x_vec_end)
+
+ leaq -(4 * VEC_SIZE)(%rdi, %rdx), %rdi
+ leaq -(4 * VEC_SIZE)(%rsi, %rdx), %rsi
+ vmovdqu (%rsi), %ymm1
+ VPCMPEQ (%rdi), %ymm1, %ymm1
+
+ vmovdqu VEC_SIZE(%rsi), %ymm2
+ VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
+ vpand %ymm2, %ymm1, %ymm5
+
+ vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
+ VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+ vpand %ymm3, %ymm5, %ymm5
+ vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
+ VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+ vpand %ymm4, %ymm5, %ymm5
+
+ vptest %ymm0, %ymm5
+ jnc L(4x_vec_end)
+ xorl %eax, %eax
+ VZEROUPPER
+ ret
+
+ .p2align 4
L(last_2x_vec):
/* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */
vmovdqu (%rsi), %ymm2
@@ -219,58 +278,6 @@ L(between_16_31):
ret
.p2align 4
-L(more_2x_vec):
- /* More than 2 * VEC. */
- cmpq $(VEC_SIZE * 8), %rdx
- ja L(more_8x_vec)
- cmpq $(VEC_SIZE * 4), %rdx
- jb L(last_4x_vec)
-
- /* From 4 * VEC to 8 * VEC, inclusively. */
- vmovdqu (%rsi), %ymm1
- VPCMPEQ (%rdi), %ymm1, %ymm1
-
- vmovdqu VEC_SIZE(%rsi), %ymm2
- VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
-
- vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
- VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
-
- vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
- VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
-
- vpand %ymm1, %ymm2, %ymm5
- vpand %ymm3, %ymm4, %ymm6
- vpand %ymm5, %ymm6, %ymm5
-
- vpmovmskb %ymm5, %eax
- subl $VEC_MASK, %eax
- jnz L(4x_vec_end)
-
- leaq -(4 * VEC_SIZE)(%rdi, %rdx), %rdi
- leaq -(4 * VEC_SIZE)(%rsi, %rdx), %rsi
- vmovdqu (%rsi), %ymm1
- VPCMPEQ (%rdi), %ymm1, %ymm1
-
- vmovdqu VEC_SIZE(%rsi), %ymm2
- VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
- vpand %ymm2, %ymm1, %ymm5
-
- vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
- VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
- vpand %ymm3, %ymm5, %ymm5
-
- vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
- VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
- vpand %ymm4, %ymm5, %ymm5
-
- vpmovmskb %ymm5, %eax
- subl $VEC_MASK, %eax
- jnz L(4x_vec_end)
- VZEROUPPER
- ret
-
- .p2align 4
L(more_8x_vec):
/* More than 8 * VEC. Check the first VEC. */
vmovdqu (%rsi), %ymm2
@@ -309,9 +316,8 @@ L(loop_4x_vec):
VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
vpand %ymm4, %ymm5, %ymm5
- vpmovmskb %ymm5, %eax
- subl $VEC_MASK, %eax
- jnz L(4x_vec_end)
+ vptest %ymm0, %ymm5
+ jnc L(4x_vec_end)
addq $(VEC_SIZE * 4), %rdi
addq $(VEC_SIZE * 4), %rsi
--
2.9.4