[5/5] x86_64: Add evex optimized bcmp implementation in bcmp-evex.S

Message ID 20210913230506.546749-5-goldstein.w.n@gmail.com
State Superseded
Delegated to: Carlos O'Donell
Series [1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex

Checks

Context                  Check    Description
dj/TryBot-apply_patch    success  Patch applied to master at the time it was sent
dj/TryBot-32bit          fail     Patch series failed to build

Commit Message

Noah Goldstein Sept. 13, 2021, 11:05 p.m. UTC
No bug. This commit adds a new optimized bcmp implementation for evex.

The primary optimizations are 1) skipping the logic to find the
difference of the first mismatched byte and 2) not updating the src/dst
addresses, as the not-equal logic does not need to be reused by
different areas.

The entry alignment has been fixed at 64. In throughput-sensitive
functions, which bcmp can potentially be, frontend loop performance is
important to optimize for. This is impossible/difficult to do/maintain
with only a 16-byte fixed alignment.

test-memcmp, test-bcmp, and test-wmemcmp are all passing.
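
To make optimization 1) concrete: memcmp has to report the ordering of the
first mismatching byte, whereas bcmp only has to report whether the buffers
differ at all. A minimal C sketch of the two contracts (an illustration
only, not the glibc code):

```c
#include <stddef.h>

/* memcmp contract: the sign of the result must reflect the first
   mismatching byte.  */
int memcmp_contract (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      return a[i] - b[i];     /* ordering information required */
  return 0;
}

/* bcmp contract: any nonzero value means "not equal".  */
int bcmp_contract (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      return 1;               /* no need to say which byte or which order */
  return 0;
}
```

Dropping the "which byte, and in what order" requirement is what lets the
assembly below test a whole vector of XOR results for zero instead of
locating the first set bit.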
---
 sysdeps/x86_64/multiarch/bcmp-evex.S       | 305 ++++++++++++++++++++-
 sysdeps/x86_64/multiarch/ifunc-bcmp.h      |   3 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c |   1 -
 3 files changed, 302 insertions(+), 7 deletions(-)
  

Comments

Carlos O'Donell Sept. 14, 2021, 1:18 a.m. UTC | #1
On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
> No bug. This commit adds a new optimized bcmp implementation for evex.
> 
> The primary optimizations are 1) skipping the logic to find the
> difference of the first mismatched byte and 2) not updating the src/dst
> addresses, as the not-equal logic does not need to be reused by
> different areas.
> 
> The entry alignment has been fixed at 64. In throughput-sensitive
> functions, which bcmp can potentially be, frontend loop performance is
> important to optimize for. This is impossible/difficult to do/maintain
> with only a 16-byte fixed alignment.
> 
> test-memcmp, test-bcmp, and test-wmemcmp are all passing.

This series fails in the containerized 32-bit x86 CI/CD regression tester.
https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/
  
Noah Goldstein Sept. 14, 2021, 2:05 a.m. UTC | #2
On Mon, Sep 13, 2021 at 8:18 PM Carlos O'Donell <carlos@redhat.com> wrote:

> On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
> > No bug. This commit adds a new optimized bcmp implementation for evex.
> >
> > The primary optimizations are 1) skipping the logic to find the
> > difference of the first mismatched byte and 2) not updating the src/dst
> > addresses, as the not-equal logic does not need to be reused by
> > different areas.
> >
> > The entry alignment has been fixed at 64. In throughput-sensitive
> > functions, which bcmp can potentially be, frontend loop performance is
> > important to optimize for. This is impossible/difficult to do/maintain
> > with only a 16-byte fixed alignment.
> >
> > test-memcmp, test-bcmp, and test-wmemcmp are all passing.
>
> This series fails in the containerized 32-bit x86 CI/CD regression tester.
>
> https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/


Shoot.

AFAICT the first error is:
*** No rule to make target '/build/string/stamp.os', needed by
'/build/libc_pic.a'.

I saw that issue earlier when I was working on just supporting bcmp for the
first commit:

[PATCH 1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex

So I think I missed/messed up something there regarding the necessary changes
to the Makefile/build infrastructure to support the change.

While it doesn't appear to be an issue on my local machine, I left the
redirect in string/memcmp.c:

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memcmp.c;h=9b46d7a905c8b7886f046b7660f63df10dc4573c;hb=HEAD#l360

But it was one area where I didn't really know the right answer.

Does anyone know if there is anything special that needs to be done for the
32-bit build when adding a new implementation?

Also, does anyone know what make/configure commands I need to reproduce this
on an x86_64 Linux machine? The build log doesn't appear to have the command.

For my completely fresh build / testing I ran:

rm -rf /path/to/build/glibc; mkdir -p /path/to/build/glibc; (cd
/path/to/build/glibc/; unset LD_LIBRARY_PATH; /path/to/src/glibc/configure
--prefix=/usr; make --silent; make xcheck; make -r -C
/path/to/src/glibc/string/ objdir=`pwd` check; make -r -C
/path/to/src/glibc/wcsmbs/ objdir=`pwd` check)

which doesn't appear to have cut it.


>
> --
> Cheers,
> Carlos.
>
>
  
Carlos O'Donell Sept. 14, 2021, 2:35 a.m. UTC | #3
On 9/13/21 10:05 PM, Noah Goldstein wrote:
> On Mon, Sep 13, 2021 at 8:18 PM Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
>>> No bug. This commit adds a new optimized bcmp implementation for evex.
>>>
>>> The primary optimizations are 1) skipping the logic to find the
>>> difference of the first mismatched byte and 2) not updating the src/dst
>>> addresses, as the not-equal logic does not need to be reused by
>>> different areas.
>>>
>>> The entry alignment has been fixed at 64. In throughput-sensitive
>>> functions, which bcmp can potentially be, frontend loop performance is
>>> important to optimize for. This is impossible/difficult to do/maintain
>>> with only a 16-byte fixed alignment.
>>>
>>> test-memcmp, test-bcmp, and test-wmemcmp are all passing.
>>
>> This series fails in the containerized 32-bit x86 CI/CD regression tester.
>>
>> https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/
> 
> 
> Shoot.

No worries! That's what the CI/CD system is there for :-)
 
> AFAICT the first error is:
> *** No rule to make target '/build/string/stamp.os', needed by
> '/build/libc_pic.a'.
 
I think a normal 32-bit x86 build should show this issue.

You need a gcc that accepts -m32.

I minimally set:
export CC="gcc -m32 -Wl,--build-id=none"
export CXX="g++ -m32 -Wl,--build-id=none"
export CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
export CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
export CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"

Then build with --host.

e.g.

/home/carlos/src/glibc-work/configure --host i686-pc-linux-gnu CC="gcc -m32 -Wl,--build-id=none" CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" CXX="g++ -m32 -Wl,--build-id=none" CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" --prefix=/usr --with-headers=/home/carlos/build/glibc-headers-work-i686/include --with-selinux --disable-nss-crypt --enable-bind-now --enable-static-pie --enable-systemtap --enable-hardcoded-path-in-tests --enable-tunables=yes --enable-add-ons

> Also, does anyone know what make/configure commands I need to reproduce
> this on an x86_64 Linux machine? The build log doesn't appear to have the command.

DJ, Should the trybot log the configure step?
  
DJ Delorie Sept. 14, 2021, 2:55 a.m. UTC | #4
"Carlos O'Donell" <carlos@redhat.com> writes:
>> Also, does anyone know what make/configure commands I need to reproduce
>> this on an x86_64 Linux machine? The build log doesn't appear to have the command.
>
> DJ, Should the trybot log the configure step?

Perhaps.  It's in the stdout that gets added to the trybot's general log
file, rather than a per-series log (and in the git repo's sample script
;).  It's:

/glibc/configure CC="gcc -m32" CXX="g++ -m32" --prefix=/usr \
   --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu

However, this doesn't smell like a 64-vs-32 bug, but an x86-64 vs
anything-else bug.

(It's also in build-many-glibcs.py)
  
Noah Goldstein Sept. 14, 2021, 3:24 a.m. UTC | #5
On Mon, Sep 13, 2021 at 9:55 PM DJ Delorie <dj@redhat.com> wrote:

> "Carlos O'Donell" <carlos@redhat.com> writes:
> >> Also, does anyone know what make/configure commands I need to reproduce
> >> this on an x86_64 Linux machine? The build log doesn't appear to have
> >> the command.
> >
> > DJ, Should the trybot log the configure step?
>
> Perhaps.  It's in the stdout that gets added to the trybot's general log
> file, rather than a per-series log (and in the git repo's sample script
> ;).  It's:
>
> /glibc/configure CC="gcc -m32" CXX="g++ -m32" --prefix=/usr \
>    --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu
>

Thanks, I was able to reproduce the bug with that!

>
> However, this doesn't smell like a 64-vs-32 bug, but an x86-64 vs
> anything-else bug.
>

That makes sense.


>
> (It's also in build-many-glibcs.py)
>

Thanks!
  
Noah Goldstein Sept. 14, 2021, 3:40 a.m. UTC | #6
On Mon, Sep 13, 2021 at 9:35 PM Carlos O'Donell <carlos@redhat.com> wrote:

> On 9/13/21 10:05 PM, Noah Goldstein wrote:
> > On Mon, Sep 13, 2021 at 8:18 PM Carlos O'Donell <carlos@redhat.com>
> wrote:
> >
> >> On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
> >>> No bug. This commit adds a new optimized bcmp implementation for evex.
> >>>
> >>> The primary optimizations are 1) skipping the logic to find the
> >>> difference of the first mismatched byte and 2) not updating the src/dst
> >>> addresses, as the not-equal logic does not need to be reused by
> >>> different areas.
> >>>
> >>> The entry alignment has been fixed at 64. In throughput-sensitive
> >>> functions, which bcmp can potentially be, frontend loop performance is
> >>> important to optimize for. This is impossible/difficult to do/maintain
> >>> with only a 16-byte fixed alignment.
> >>>
> >>> test-memcmp, test-bcmp, and test-wmemcmp are all passing.
> >>
> >> This series fails in the containerized 32-bit x86 CI/CD regression
> tester.
> >>
> >>
> https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/
> >
> >
> > Shoot.
>
> No worries! That's what the CI/CD system is there for :-)
>
> > AFAICT the first error is:
> > *** No rule to make target '/build/string/stamp.os', needed by
> > '/build/libc_pic.a'.
>
> I think a normal 32-bit x86 build should show this issue.
>
> You need a gcc that accepts -m32.
>

Was able to get it with DJ's command.

>
> I minimally set:
> export CC="gcc -m32 -Wl,--build-id=none"
> export CXX="g++ -m32 -Wl,--build-id=none"
> export CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
> export CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
> export CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
>
> Then build with --host.
>
> e.g.
>
> /home/carlos/src/glibc-work/configure --host i686-pc-linux-gnu CC=gcc -m32
> -Wl,--build-id=none CFLAGS=-g -O2 -march=i686 -Wl,--build-id=none
> CPPFLAGS=-g -O2 -march=i686 -Wl,--build-id=none CXX=g++ -m32
> -Wl,--build-id=none CXXFLAGS=-g -O2 -march=i686 -Wl,--build-id=none
> --prefix=/usr
> --with-headers=/home/carlos/build/glibc-headers-work-i686/include
> --with-selinux --disable-nss-crypt --enable-bind-now --enable-static-pie
> --enable-systemtap --enable-hardcoded-path-in-tests --enable-tunables=yes
> --enable-add-ons


Thanks for the help!


>


> > Also, does anyone know what make/configure commands I need to reproduce
> > this on an x86_64 Linux machine? The build log doesn't appear to have the
> > command.
>
> DJ, Should the trybot log the configure step?
>
>
So I think I was able to fix the build by making a new file in
glibc/string/bcmp.c and just having bcmp call memcmp.

Is there another/better way to fix the build?  I don't think it's really
fair that every arch other than x86_64 should have to pay an extra
function call cost to use bcmp.


> --
> Cheers,
> Carlos.
>
>
  
DJ Delorie Sept. 14, 2021, 4:21 a.m. UTC | #7
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> So I think I was able to fix the build by making a new file in glibc/string/bcmp.c
> and just having bcmp call memcmp
>
> Is there another/better way to fix the build?  I don't think it's really fair that every 
> arch other than x86_64 should have to pay an extra function call cost to use bcmp. 

There are at least three...

First, note that bcmp is a weak alias to memcmp already - see
string/memcmp.c - which avoids the extra call you mention.

So, you could either move that weak alias into bcmp.c, or arrange for
bcmp.c to not be needed by the Makefile for non-x86_64 platforms.
Lastly, an empty bcmp.c wouldn't override the alias in memcmp.c.  I
think the first would be easiest, although it may be tricky to compile a
source file that seems to do "nothing".  Also, I suspect liberal use of
comments would be beneficial for the unsuspecting reader ;-)

Alternately, you could change your patch to provide alternate versions
of memcmp() instead of bcmp(), as glibc's bcmp *is* memcmp.  This is
what other arches (and x86_64) do:

$ find . -name 'memcmp*' -print
  
Noah Goldstein Sept. 14, 2021, 5:29 a.m. UTC | #8
On Mon, Sep 13, 2021 at 11:21 PM DJ Delorie <dj@redhat.com> wrote:

> Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > So I think I was able to fix the build by making a new file in
> glibc/string/bcmp.c
> > and just having bcmp call memcmp
> >
> > Is there another/better way to fix the build?  I don't think it's really
> fair that every
> > arch other than x86_64 should have to pay an extra function call cost to
> use bcmp.
>
> There are at least three...
>
> First, note that bcmp is a weak alias to memcmp already - see
> string/memcmp.c - which avoids the extra call you mention.
>
> So, you could either move that weak alias into bcmp.c, or arrange for
> bcmp.c to not be needed by the Makefile for non-x86_64 platforms.
> Lastly, an empty bcmp.c wouldn't override the alias in memcmp.c.  I
> think the first would be easiest, although it may be tricky to compile a
> source file that seems to do "nothing".  Also, I suspect liberal use of
> comments would be beneficial for the unsuspecting reader ;-)
>
>
I see.

I was able to get it working with just an empty bcmp.c file but was not able
to move the weak_alias from memcmp.c to bcmp.c.

Adding:
```
#ifdef weak_alias
# undef bcmp
weak_alias (memcmp, bcmp)
#endif
```

to bcmp.c gets me the following compiler error:

```
bcmp.c:24:21: error: ‘bcmp’ aliased to undefined symbol ‘memcmp’
```

irrespective of the ifdef/undef and whether I include string.h or manually
put in a prototype of memcmp.
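
The error is most likely GCC's same-translation-unit rule for aliases:
weak_alias expands to __attribute__ ((weak, alias ("memcmp"))), and GCC only
accepts an alias whose target is *defined* in the current translation unit,
so a bcmp.c that merely declares memcmp cannot carry the alias. A small
standalone sketch of the constraint (hypothetical names, not glibc code):

```c
#include <stddef.h>
#include <string.h>

/* Works: the alias target is defined right here, in the same file.  */
int cmp3 (const void *a, const void *b)
{
  return memcmp (a, b, 3);
}
int cmp3_alias (const void *, const void *)
  __attribute__ ((weak, alias ("cmp3")));

#if 0
/* Fails like the bcmp.c attempt above: memcmp is only declared in this
   file (its definition lives in another object), so GCC reports
   "aliased to undefined symbol 'memcmp'".  */
int my_bcmp (const void *, const void *, size_t)
  __attribute__ ((weak, alias ("memcmp")));
#endif
```

That would explain why an empty bcmp.c works (the alias stays next to the
definition in memcmp.c) while moving just the alias does not.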

Sorry for the hassle. Build infrastructure, especially in a project as
complex as this, is a bit out of my domain.


> Alternately, you could change your patch to provide alternate versions
> of memcmp() instead of bcmp(), as glibc's bcmp *is* memcmp.  This is
> what other arches (and x86_64) do:
>

I'm not 100% sure what you mean? memcmp can correctly implement bcmp
but not vice versa.


>
> $ find . -name 'memcmp*' -print
>
>
  
DJ Delorie Sept. 14, 2021, 5:42 a.m. UTC | #9
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> I'm not 100% sure what you mean? memcmp can correctly implement bcmp
> but not vice versa.

glibc does not have a separate implementation of bcmp().  Any calls to
bcmp() end up calling memcmp() (through that weak alias).  So your patch
is not *optimizing* bcmp, it is *adding* bcmp.  The new version you are
adding is no longer using the optimized versions of memcmp, so you'd
have to either (1) be very careful to not introduce a performance
regression, or (2) optimize the existing memcmp()s further instead.
  
Noah Goldstein Sept. 14, 2021, 5:55 a.m. UTC | #10
On Tue, Sep 14, 2021 at 12:42 AM DJ Delorie <dj@redhat.com> wrote:

> Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > I'm not 100% sure what you mean? memcmp can correctly implement bcmp
> > but not vice versa.
>
> glibc does not have a separate implementation of bcmp().  Any calls to
> bcmp() end up calling memcmp() (through that weak alias).  So your patch
> is not *optimizing* bcmp, it is *adding* bcmp.  The new version you are
> adding is no longer using the optimized versions of memcmp, so you'd
> have to either (1) be very careful to not introduce a performance
> regression, or (2) optimize the existing memcmp()s further instead.
>

Ah, got it.

In the first patch of the set:
[PATCH 1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex

I have some performance numbers. It seems to be an improvement for avx2/evex.
The sse2/sse4 stuff is a bit more iffy; I don't really have the hardware to
properly test those versions.

Thank you for all the help!
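
For readers who want the high-level idea before the assembly, here is a
rough, portable C rendering of the strategy described in the comment block
at the top of the new bcmp-evex.S below (illustrative only; the real code
uses 32-byte EVEX loads, mask-register compares, and vpternlogd, and handles
small and page-crossing sizes separately):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 32   /* stands in for VEC_SIZE */

int bcmp_sketch (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;
  uint64_t diff = 0;

  /* Compare CHUNK bytes at a time.  Only track *whether* anything
     differed, never *where* -- the simplification the commit message
     describes relative to memcmp.  */
  while (n >= CHUNK)
    {
      uint64_t x, y;
      for (size_t i = 0; i < CHUNK; i += sizeof (uint64_t))
        {
          memcpy (&x, a + i, sizeof x);
          memcpy (&y, b + i, sizeof y);
          diff |= x ^ y;        /* OR of XORs, like the vpternlogd combines */
        }
      a += CHUNK;
      b += CHUNK;
      n -= CHUNK;
    }

  /* Tail: byte by byte for simplicity (the assembly instead reloads the
     last, possibly overlapping, vector).  */
  for (size_t i = 0; i < n; i++)
    diff |= (uint64_t) (a[i] ^ b[i]);

  return diff != 0;             /* nonzero iff any byte differed */
}
```

Because the return value only needs to be zero/nonzero, the assembly can OR
several vectors of XOR results together and test the combined result once,
instead of locating the first set bit.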
  

Patch

diff --git a/sysdeps/x86_64/multiarch/bcmp-evex.S b/sysdeps/x86_64/multiarch/bcmp-evex.S
index ade52e8c68..1bfe824eb4 100644
--- a/sysdeps/x86_64/multiarch/bcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/bcmp-evex.S
@@ -16,8 +16,305 @@ 
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#ifndef MEMCMP
-# define MEMCMP	__bcmp_evex
-#endif
+#if IS_IN (libc)
+
+/* bcmp is implemented as:
+   1. Use ymm vector compares when possible. The only case where
+      vector compares are not possible is when size < VEC_SIZE
+      and loading from either s1 or s2 would cause a page cross.
+   2. Use xmm vector compare when size >= 8 bytes.
+   3. Optimistically compare up to the first 4 * VEC_SIZE one vector
+      at a time to check for early mismatches. Only do this if it is
+      guaranteed the work is not wasted.
+   4. If size is 8 * VEC_SIZE or less, unroll the loop.
+   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
+      area.
+   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
+   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
+   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
+
+# include <sysdep.h>
+
+# ifndef BCMP
+#  define BCMP	__bcmp_evex
+# endif
+
+# define VMOVU	vmovdqu64
+# define VPCMP	vpcmpub
+# define VPTEST	vptestmb
+
+# define VEC_SIZE	32
+# define PAGE_SIZE	4096
+
+# define YMM0		ymm16
+# define YMM1		ymm17
+# define YMM2		ymm18
+# define YMM3		ymm19
+# define YMM4		ymm20
+# define YMM5		ymm21
+# define YMM6		ymm22
+
+
+	.section .text.evex, "ax", @progbits
+ENTRY_P2ALIGN (BCMP, 6)
+# ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %edx
+# endif
+	cmp	$VEC_SIZE, %RDX_LP
+	jb	L(less_vec)
+
+	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
+	VMOVU	(%rsi), %YMM1
+	/* Use compare not equals to directly check for mismatch.  */
+	VPCMP	$4, (%rdi), %YMM1, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(last_1x_vec)
+
+	/* Check second VEC no matter what.  */
+	VMOVU	VEC_SIZE(%rsi), %YMM2
+	VPCMP	$4, VEC_SIZE(%rdi), %YMM2, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	/* Less than 4 * VEC.  */
+	cmpq	$(VEC_SIZE * 4), %rdx
+	jbe	L(last_2x_vec)
+
+	/* Check third and fourth VEC no matter what.  */
+	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM3
+	VPCMP	$4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM4
+	VPCMP	$4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	/* Go to 4x VEC loop.  */
+	cmpq	$(VEC_SIZE * 8), %rdx
+	ja	L(more_8x_vec)
+
+	/* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any
+	   branches.  */
+
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %YMM1
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %YMM2
+	addq	%rdx, %rdi
+
+	/* Wait to load from s1 until the address has been adjusted, due to
+	   unlamination.  */
+
+	/* vpxor will be all 0s if s1 and s2 are equal. Otherwise it will
+	   have some 1s.  */
+	vpxorq	-(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1
+	vpxorq	-(VEC_SIZE * 3)(%rdi), %YMM2, %YMM2
+
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %YMM3
+	vpxorq	-(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
+	/* Or together YMM1, YMM2, and YMM3 into YMM3.  */
+	vpternlogd $0xfe, %YMM1, %YMM2, %YMM3
 
-#include "memcmp-evex-movbe.S"
+	VMOVU	-(VEC_SIZE)(%rsi, %rdx), %YMM4
+	/* Ternary logic to xor (VEC_SIZE * 3)(%rdi) with YMM4 while oring
+	   with YMM3. Result is stored in YMM4.  */
+	vpternlogd $0xde, -(VEC_SIZE)(%rdi), %YMM3, %YMM4
+	/* Compare YMM4 with 0. If any 1s s1 and s2 don't match.  */
+	VPTEST	%YMM4, %YMM4, %k1
+	kmovd	%k1, %eax
+L(return_neq0):
+	ret
+
+	/* Fits in padding needed to .p2align 5 L(less_vec).  */
+L(last_1x_vec):
+	VMOVU	-(VEC_SIZE * 1)(%rsi, %rdx), %YMM1
+	VPCMP	$4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1
+	kmovd	%k1, %eax
+	ret
+
+	/* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte
+	   aligned.  */
+	.p2align 5
+L(less_vec):
+	/* Check if one or less char. This is necessary for size = 0 but is
+	   also faster for size = 1.  */
+	cmpl	$1, %edx
+	jbe	L(one_or_less)
+
+	/* Check if loading one VEC from either s1 or s2 could cause a page
+	   cross. This can have false positives but is by far the fastest
+	   method.  */
+	movl	%edi, %eax
+	orl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
+	jg	L(page_cross_less_vec)
+
+	/* No page cross possible.  */
+	VMOVU	(%rsi), %YMM2
+	VPCMP	$4, (%rdi), %YMM2, %k1
+	kmovd	%k1, %eax
+	/* Result will be zero if s1 and s2 match. Otherwise first set bit
+	   will be first mismatch.  */
+	bzhil	%edx, %eax, %eax
+	ret
+
+	/* Relatively cold but placing close to L(less_vec) for 2 byte jump
+	   encoding.  */
+	.p2align 4
+L(one_or_less):
+	jb	L(zero)
+	movzbl	(%rsi), %ecx
+	movzbl	(%rdi), %eax
+	subl	%ecx, %eax
+	/* No ymm register was touched.  */
+	ret
+	/* Within the same 16 byte block is L(one_or_less).  */
+L(zero):
+	xorl	%eax, %eax
+	ret
+
+	.p2align 4
+L(last_2x_vec):
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %YMM1
+	vpxorq	-(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1
+	VMOVU	-(VEC_SIZE * 1)(%rsi, %rdx), %YMM2
+	vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2
+	VPTEST	%YMM2, %YMM2, %k1
+	kmovd	%k1, %eax
+	ret
+
+	.p2align 4
+L(more_8x_vec):
+	/* Set end of s1 in rdx.  */
+	leaq	-(VEC_SIZE * 4)(%rdi, %rdx), %rdx
+	/* rsi stores s2 - s1. This allows loop to only update one pointer.
+	 */
+	subq	%rdi, %rsi
+	/* Align s1 pointer.  */
+	andq	$-VEC_SIZE, %rdi
+	/* Adjust because the first 4x VEC were checked already.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	.p2align 4
+L(loop_4x_vec):
+	VMOVU	(%rsi, %rdi), %YMM1
+	vpxorq	(%rdi), %YMM1, %YMM1
+
+	VMOVU	VEC_SIZE(%rsi, %rdi), %YMM2
+	vpxorq	VEC_SIZE(%rdi), %YMM2, %YMM2
+
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdi), %YMM3
+	vpxorq	(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
+	vpternlogd $0xfe, %YMM1, %YMM2, %YMM3
+
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdi), %YMM4
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rdi), %YMM3, %YMM4
+	VPTEST	%YMM4, %YMM4, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq2)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(loop_4x_vec)
+
+	subq	%rdx, %rdi
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdx), %YMM4
+	vpxorq	(VEC_SIZE * 3)(%rdx), %YMM4, %YMM4
+	/* rdi has 4 * VEC_SIZE - remaining length.  */
+	cmpl	$(VEC_SIZE * 3), %edi
+	jae	L(8x_last_1x_vec)
+	/* Load regardless of branch.  */
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdx), %YMM3
+	/* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while oring
+	   with YMM4. Result is stored in YMM4.  */
+	vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4
+	cmpl	$(VEC_SIZE * 2), %edi
+	jae	L(8x_last_2x_vec)
+
+	VMOVU	VEC_SIZE(%rsi, %rdx), %YMM2
+	vpxorq	VEC_SIZE(%rdx), %YMM2, %YMM2
+
+	VMOVU	(%rsi, %rdx), %YMM1
+	vpxorq	(%rdx), %YMM1, %YMM1
+
+	vpternlogd $0xfe, %YMM1, %YMM2, %YMM4
+L(8x_last_1x_vec):
+L(8x_last_2x_vec):
+	VPTEST	%YMM4, %YMM4, %k1
+	kmovd	%k1, %eax
+L(return_neq2):
+	ret
+
+	/* Relatively cold case as page cross are unexpected.  */
+	.p2align 4
+L(page_cross_less_vec):
+	cmpl	$16, %edx
+	jae	L(between_16_31)
+	cmpl	$8, %edx
+	ja	L(between_9_15)
+	cmpl	$4, %edx
+	jb	L(between_2_3)
+	/* From 4 to 8 bytes.  No branch when size == 4.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	subl	%ecx, %eax
+	movl	-4(%rdi, %rdx), %ecx
+	movl	-4(%rsi, %rdx), %esi
+	subl	%esi, %ecx
+	orl	%ecx, %eax
+	ret
+
+	.p2align 4,, 8
+L(between_9_15):
+	/* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
+	 */
+	vmovq	(%rdi), %xmm1
+	vmovq	(%rsi), %xmm2
+	vpcmpeqb %xmm1, %xmm2, %xmm3
+	vmovq	-8(%rdi, %rdx), %xmm1
+	vmovq	-8(%rsi, %rdx), %xmm2
+	vpcmpeqb %xmm1, %xmm2, %xmm2
+	vpand	%xmm2, %xmm3, %xmm3
+	vpmovmskb %xmm3, %eax
+	subl	$0xffff, %eax
+	/* No ymm register was touched.  */
+	ret
+
+	.p2align 4,, 8
+L(between_16_31):
+	/* From 16 to 31 bytes.  No branch when size == 16.  */
+
+	/* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
+	 */
+	vmovdqu	(%rsi), %xmm1
+	vpcmpeqb (%rdi), %xmm1, %xmm1
+	vmovdqu	-16(%rsi, %rdx), %xmm2
+	vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2
+	vpand	%xmm1, %xmm2, %xmm2
+	vpmovmskb %xmm2, %eax
+	subl	$0xffff, %eax
+	/* No ymm register was touched.  */
+	ret
+
+	.p2align 4,, 8
+L(between_2_3):
+	/* From 2 to 3 bytes.  No branch when size == 2.  */
+	movzwl	(%rdi), %eax
+	movzwl	(%rsi), %ecx
+	subl	%ecx, %eax
+	movzbl	-1(%rdi, %rdx), %edi
+	movzbl	-1(%rsi, %rdx), %esi
+	subl	%edi, %esi
+	orl	%esi, %eax
+	/* No ymm register was touched.  */
+	ret
+END (BCMP)
+#endif
diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h
index f94516e5ee..51f251d0c9 100644
--- a/sysdeps/x86_64/multiarch/ifunc-bcmp.h
+++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h
@@ -35,8 +35,7 @@  IFUNC_SELECTOR (void)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     {
       if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
-	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
-	  && CPU_FEATURE_USABLE_P (cpu_features, MOVBE))
+	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
 	return OPTIMIZE (evex);
 
       if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index cda0316928..abbb4e407f 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -52,7 +52,6 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, bcmp,
 			      (CPU_FEATURE_USABLE (AVX512VL)
 			       && CPU_FEATURE_USABLE (AVX512BW)
-                   && CPU_FEATURE_USABLE (MOVBE)
 			       && CPU_FEATURE_USABLE (BMI2)),
 			      __bcmp_evex)
 	      IFUNC_IMPL_ADD (array, i, bcmp, CPU_FEATURE_USABLE (SSE4_1),