[RFC] PowerPC - e6500 optimized memcmp function
Commit Message
Thanks, Zanella.
Please find below a patch adding an optimized implementation of 'memcmp' for the PowerPC e6500 (32-bit and 64-bit) target using Altivec instructions.
* sysdeps/powerpc/bits/hwcap.h: Add macro to identify e6500 target.
* sysdeps/powerpc/powerpc32/e6500/memcmp.S: New file.
* sysdeps/powerpc/powerpc32/e6500/multiarch/Implies: New file.
* sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c: Add
memcmp multiarch implementation.
* sysdeps/powerpc/powerpc32/power4/multiarch/Makefile: Likewise.
* sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c: Likewise.
* sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-e6500.S: New file.
* sysdeps/powerpc/powerpc64/e6500/memcmp.S: New file.
* sysdeps/powerpc/powerpc64/e6500/multiarch/Implies: New file.
* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add
memcmp multiarch implementation.
* sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/memcmp.c: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S: New file.
The patch was generated on top of the glibc v2.20 source base.
It was tested with DejaGNU and the glibc testsuite. There was one DejaGNU failure and quite a few glibc test failures, but after backporting the patch mentioned in the link below, those issues were resolved.
https://sourceware.org/ml/libc-alpha/2015-01/msg00403.html
The benchsuite results (both 32-bit and 64-bit) are attached for your reference.
Please let me know your comments.
Regards,
Rohit
Re: PowerPC e6500 optimized memcmp function - Query
* From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
* To: libc-help at sourceware dot org
* Date: Wed, 22 Jul 2015 08:53:58 -0300
* Subject: Re: PowerPC e6500 optimized memcmp function - Query
* References: <BLUPR03MB14589B0FC50FBC51FD6A2F29C2830 at BLUPR03MB1458 dot namprd03 dot prod dot outlook dot com>
Comments
Hi
Patch looks good in general; however, you need to rebase against master
(although the differences are minimal). Some comments below:
On 19-08-2015 09:51, Dharmakan Rohit Arul Raj wrote:
> Thanks Zanella.
>
> Please find below, patch for optimized implementation of 'memcmp' for PowerPC e6500 (32-bit & 64-bit) target using Altivec instructions.
>
> * sysdeps/powerpc/bits/hwcap.h: Add macro to identify e6500 target.
> * sysdeps/powerpc/powerpc32/e6500/memcmp.S: New File
> * sysdeps/powerpc/powerpc32/e6500/multiarch/Implies: New File.
> * sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c: Add
> memcmp multiarch implementation.
You need to add the function you are referring to (__libc_ifunc_impl_list).
> * sysdeps/powerpc/powerpc32/power4/multiarch/Makefile: Likewise.
> * sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c: Likewise.
> * sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-e6500.S: New File.
> * sysdeps/powerpc/powerpc64/e6500/memcmp.S: New File.
> * sysdeps/powerpc/powerpc64/e6500/multiarch/Implies: New File
> * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add
> memcmp multiarch implementation.
Same as before.
> * sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise.
> * sysdeps/powerpc/powerpc64/multiarch/memcmp.c: Likewise.
> * sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S: New File.
>
> diff -Naur glibc-2.20/sysdeps/powerpc/bits/hwcap.h glibc-2.20-e6500-mcmp/sysdeps/powerpc/bits/hwcap.h
> --- glibc-2.20/sysdeps/powerpc/bits/hwcap.h 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/bits/hwcap.h 2015-08-19 05:48:43.688000596 -0500
> @@ -64,3 +64,7 @@
> #define PPC_FEATURE2_HAS_TAR 0x04000000 /* Target Address Register */
> #define PPC_FEATURE2_HAS_VEC_CRYPTO 0x02000000 /* Target supports vector
> instruction. */
> +/* Identify Freescale Processors. */
> +#define PPC_FEATURE_E6500 (PPC_FEATURE_64 | PPC_FEATURE_BOOKE | \
> + PPC_FEATURE_HAS_ALTIVEC | \
> + PPC_FEATURE_HAS_FPU | PPC_FEATURE_HAS_MMU)
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc32/e6500/memcmp.S glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/e6500/memcmp.S
> --- glibc-2.20/sysdeps/powerpc/powerpc32/e6500/memcmp.S 1969-12-31 18:00:00.000000000 -0600
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/e6500/memcmp.S 2015-08-19 05:48:43.689000603 -0500
> @@ -0,0 +1,387 @@
> +/* Optimized memcmp implementation for 32-bit e6500 PowerPC.
> +
> + Copyright (C) 2015 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +/* int [r3] memcmp (const char *s1 [r3],
> + const char *s2 [r4],
> + size_t size [r5])
> + r3:source1 address, return equality
> + r4:source2 address
> + r5:byte count
> +
> + volatile fixed point registers usable:
> + r0, r3-r12
> +
> + volatile floating point registers usable:
> + f0-f13
> +
> + v0-v1 General use Volatile (caller save)
> + v2-v13 Parameters, general volatile (caller save)
> + v14-v19 General Volatile (caller save)
> +
> + CR0-CR1 Volatile condition code register fields
> + CR5-CR7 Volatile condition code register fields. */
> +
> +#define rTMP r0
> +#define rRTN r3
> +#define rSTR1 r3 /* first string arg. */
> +#define rSTR2 r4 /* second string arg. */
> +#define rS2OFF r7 /* second string arg + 16. */
> +#define rN r5
> +#define rWORD1 r6 /* current word in s1. */
> +#define rWORD2 r7 /* current word in s2. */
> +#define rWORD3 r8 /* next word in s1. */
> +#define rWORD4 r9 /* next word in s2. */
> +#define rWORD5 r10 /* next word in s1. */
> +#define rWORD6 r11 /* next word in s2. */
> +#define rWORD7 r5 /* next word in s1. */
> +#define rWORD8 r12 /* next word in s2. */
> +#define rCOUNT r11
> +#define rINDEX r8
> +#define rVR0 v0
> +#define rVR1 v1
> +#define rVR2 v2
> +#define rVR3 v3
> +#define rVR4 v4
> +#define rVR5 v5
> +#define rVR6 v6
> +#define rVR7 v7
> +#define rVR8 v8
> +#define rVR9 v9
> +#define rVR10 v10
> +#define rVR11 v11
> +#define rVR14 v14
> +#define rVR15 v15
> +#define rVR16 v16
> +#define rVR17 v17
> +#define rVR18 v18
> +#define rVR19 v19
> +
> +EALIGN (memcmp, 5, 0)
> + cmplwi rN, 0
> + cmplwi cr1, rN, 32
> + beq L(zero)
> + ble cr1, L(medium)
> + neg rTMP, rSTR1
> + andi. rCOUNT, rTMP, 15 /* check src1 alignment. */
> + bne L(src1_nalign)
> +L(src1align16):
> + rlwinm. rTMP, r4, 0, 28, 31 /* check src2 alignment. */
> + srwi rCOUNT, rN, 4 /* no. of bytes / 16. */
> + cmplwi cr5, rCOUNT, 0xFFF0 /* check for large data compares. */
> + rlwinm rN, rN, 0, 28, 31 /* remaining bytes. */
> + mtctr rCOUNT
> + li rINDEX, 0
> + bne L(src2_nalign)
> + bgt cr5, L(large_align)
> +L(loop_align):
> + lvx rVR14, rSTR1, rINDEX
> + lvx rVR15, rSTR2, rINDEX
> + addi rINDEX, rINDEX, 16
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(loop_align)
> + cmplwi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> +L(small):
> + srwi. rCOUNT, rN, 3
> + rlwinm rN, rN, 0, 29, 31
> + beq L(cmp_bytes)
> + lwz rWORD1, 0(rSTR1)
> + lwz rWORD2, 0(rSTR2)
> + lwz rWORD3, 4(rSTR1)
> + lwz rWORD4, 4(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 8
> + cmplwi cr5, rN, 0
> + bne cr0, L(bLcr0)
> + cmplw cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 8
> + bne cr1, L(bLcr1)
> + beq cr5, L(zero)
> + .align 4
> +L(cmp_bytes):
> + mtctr rN
> + lbz rWORD1, 0(rSTR1)
> + lbz rWORD2, 0(rSTR2)
> + bdz L(b11)
> + lbz rWORD3, 1(rSTR1)
> + lbz rWORD4, 1(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + bdz L(b12)
> + lbz rWORD5, 2(rSTR1)
> + lbz rWORD6, 2(rSTR2)
> + cmplw cr1, rWORD3, rWORD4
> + bdz L(b13)
> + lbz rWORD7, 3(rSTR1)
> + lbz rWORD8, 3(rSTR2)
> + bne cr0, L(bx11)
> + cmplw cr5, rWORD5, rWORD6
> + bdz L(b14)
> + cmplw cr6, rWORD7, rWORD8
> + lbz rWORD1, 4(rSTR1)
> + lbz rWORD2, 4(rSTR2)
> + bne cr1, L(bx12)
> + bdz L(b15)
> + lbz rWORD3, 5(rSTR1)
> + lbz rWORD4, 5(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + bne cr5, L(bx13)
> + bdz L(b16)
> + lbz rWORD5, 6(rSTR1)
> + lbz rWORD6, 6(rSTR2)
> + cmplw cr1, rWORD3, rWORD4
> + bne cr6, L(bx14)
> + bne cr0, L(bx15)
> + bne cr1, L(bx16)
> + sub rRTN, rWORD5, rWORD6
> + blr
> +L(b16):
> + bne cr6, L(bx14)
> + bne cr0, L(bx15)
> +L(bx16):
> + sub rRTN, rWORD3, rWORD4
> + blr
> +L(b15):
> + bne cr5, L(bx13)
> + bne cr6, L(bx14)
> +L(bx15):
> + sub rRTN, rWORD1, rWORD2
> + blr
> +L(b14):
> + bne cr1, L(bx12)
> + bne cr5, L(bx13)
> +L(bx14):
> + sub rRTN, rWORD7, rWORD8
> + blr
> +L(b13):
> + bne cr0, L(bx11)
> + bne cr1, L(bx12)
> +L(bx13):
> + sub rRTN, rWORD5, rWORD6
> + blr
> +L(b12):
> + bne cr0, L(bx11)
> +L(bx12):
> + sub rRTN, rWORD3, rWORD4
> + blr
> +L(b11):
> +L(bx11):
> + sub rRTN, rWORD1, rWORD2
> + blr
> +
> + .align 4
> +L(medium):
> + srwi. rCOUNT, rN, 3
> + rlwinm rN, rN, 0, 29, 31
> + beq L(cmp_bytes)
> + mtctr rCOUNT
> + cmplwi cr5, rN, 0
> + lwz rWORD1, 0(rSTR1)
> + lwz rWORD2, 0(rSTR2)
> + lwz rWORD3, 4(rSTR1)
> + lwz rWORD4, 4(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 8
> + bne cr0, L(bLcr0)
> + cmplw cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 8
> + bne cr1, L(bLcr1)
> + bdz L(check_small)
> + lwz rWORD1, 0(rSTR1)
> + lwz rWORD2, 0(rSTR2)
> + lwz rWORD3, 4(rSTR1)
> + lwz rWORD4, 4(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 8
> + bne cr0, L(bLcr0)
> + cmplw cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 8
> + bne cr1, L(bLcr1)
> + bdz L(check_small)
> + lwz rWORD1, 0(rSTR1)
> + lwz rWORD2, 0(rSTR2)
> + lwz rWORD3, 4(rSTR1)
> + lwz rWORD4, 4(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 8
> + bne cr0, L(bLcr0)
> + cmplw cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 8
> + bne cr1, L(bLcr1)
> + bdz L(check_small)
> + lwz rWORD1, 0(rSTR1)
> + lwz rWORD2, 0(rSTR2)
> + lwz rWORD3, 4(rSTR1)
> + lwz rWORD4, 4(rSTR2)
> + cmplw cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 8
> + bne cr0, L(bLcr0)
> + cmplw cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 8
> + bne cr1, L(bLcr1)
> + li rRTN, 0
> + blr
> +
> + .align 4
> +L(check_small):
> + beq cr5, L(zero)
> + b L(cmp_bytes)
> +
> + .align 4
> +L(src1_nalign):
> + lwz rWORD1, 0(rSTR1)
> + lwz rWORD2, 0(rSTR2)
> + lwz rWORD3, 4(rSTR1)
> + lwz rWORD4, 4(rSTR2)
> + subfc. rWORD1, rWORD1, rWORD2
> + bne L(Words_Differ)
> + subfc. rWORD1, rWORD3, rWORD4
> + bne L(Words_Differ)
> + lwz rWORD1, 8(rSTR1)
> + lwz rWORD2, 8(rSTR2)
> + lwz rWORD3, 12(rSTR1)
> + lwz rWORD4, 12(rSTR2)
> + subfc. rWORD1, rWORD1, rWORD2
> + bne L(Words_Differ)
> + subfc. rWORD1, rWORD3, rWORD4
> + bne L(Words_Differ)
> + subf rN, rCOUNT, rN
> + cmplwi cr7, rN, 32
> + add rSTR1, rSTR1, rCOUNT
> + add rSTR2, rSTR2, rCOUNT
> + ble cr7, L(medium)
> + b L(src1align16)
> +
> + .align 4
> +L(bLcr0):
> + li rRTN, 1
> + bgtlr cr0
> + li rRTN, -1
> + blr
> +
> + .align 4
> +L(bLcr1):
> + li rRTN, 1
> + bgtlr cr1
> + li rRTN, -1
> + blr
> +
> + .align 4
> +L(src2_nalign):
> + addi rS2OFF, rSTR2, 16
> + bgt cr5, L(large_nalign)
> +L(loop_nalign):
> + lvx rVR14, rSTR1, rINDEX
> + lvsl rVR3, 0, rSTR2 /* set permute control vector. */
> + lvx rVR4, rS2OFF, rINDEX /* load LSQ. */
> + lvx rVR2, rSTR2, rINDEX /* load MSQ. */
> + addi rINDEX, rINDEX, 16
> + vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(loop_nalign)
> + cmplwi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + b L(small)
> +
> + .align 4
> +L(large_nalign):
> + lvxl rVR14, rSTR1, rINDEX
> + lvsl rVR3, 0, rSTR2 /* set permute control vector. */
> + lvxl rVR4, rS2OFF, rINDEX /* load LSQ. */
> + lvxl rVR2, rSTR2, rINDEX /* load MSQ. */
> + addi rINDEX, rINDEX, 16
> + vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(large_nalign)
> + cmplwi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + b L(small)
> +
> + .align 4
> +L(large_align):
> + lvxl rVR14, rSTR1, rINDEX
> + lvxl rVR15, rSTR2, rINDEX
> + addi rINDEX, rINDEX, 16
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(large_align)
> + cmplwi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + b L(small)
> +
> + .align 4
> +L(Words_Differ):
> + subfe rRTN, rWORD1, rWORD1
> + nand rRTN, rRTN, rRTN
> + ori rRTN, rRTN, 1
> + blr
> +
> + .align 4
> +L(Vwords_Differ):
> + vspltisb rVR18, 1
> + vspltisb rVR1, 8
> + vslb rVR0, rVR1, rVR18
> + vslb rVR19, rVR0, rVR18
> + vslb rVR18, rVR19, rVR18
> + vxor rVR5, rVR5, rVR5
> + vsum4ubs rVR2, rVR1, rVR18
> + vsro rVR9, rVR17, rVR19
> + vsrw rVR19, rVR17, rVR1
> + vsro rVR10, rVR17, rVR18
> + vsrw rVR18, rVR17, rVR0
> + vsro rVR0, rVR17, rVR2
> + vor rVR11, rVR9, rVR10
> + vsro rVR2, rVR18, rVR1
> + vor rVR11, rVR11, rVR0
> + vcmpgtuw rVR11, rVR11, rVR5
> + vor rVR11, rVR11, rVR19
> + vor rVR11, rVR11, rVR18
> + vor rVR11, rVR11, rVR2
> + vor rVR15, rVR15, rVR11
> + vor rVR14, rVR14, rVR11
> + li rRTN, -1
> + vcmpgtub. rVR8, rVR15, rVR14
> + bnelr cr6
> + li rRTN, 1
> + blr
> +
> + .align 4
> +L(zero):
> + li rRTN, 0
> + blr
> +
> +END (memcmp)
> +libc_hidden_builtin_def (memcmp)
> +weak_alias (memcmp, bcmp)
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc32/e6500/multiarch/Implies glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/e6500/multiarch/Implies
> --- glibc-2.20/sysdeps/powerpc/powerpc32/e6500/multiarch/Implies 1969-12-31 18:00:00.000000000 -0600
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/e6500/multiarch/Implies 2015-08-19 05:48:43.689000603 -0500
> @@ -0,0 +1 @@
> +powerpc/powerpc32/power4/multiarch
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c
> --- glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c 2015-08-19 05:51:57.389000503 -0500
> @@ -34,6 +34,7 @@
> size_t i = 0;
>
> unsigned long int hwcap = GLRO(dl_hwcap);
> + unsigned long int hwcap2 = GLRO(dl_hwcap2);
> /* hwcap contains only the latest supported ISA, the code checks which is
> and fills the previous supported ones. */
> if (hwcap & PPC_FEATURE_ARCH_2_06)
> @@ -107,6 +108,10 @@
> IFUNC_IMPL (i, name, memcmp,
> IFUNC_IMPL_ADD (array, i, memcmp, hwcap & PPC_FEATURE_HAS_VSX,
> __memcmp_power7)
> + IFUNC_IMPL_ADD (array, i, memcmp,
> + (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
> + && (hwcap2 & PPC_FEATURE2_HAS_ISEL)),
> + __memcmp_e6500)
Do you really need to check for PPC_FEATURE2_HAS_ISEL? There is no 'isel' usage in
the implementation.
> IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_ppc))
>
> /* Support sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c. */
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/Makefile glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/Makefile
> --- glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/Makefile 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/Makefile 2015-08-19 05:48:43.690000617 -0500
> @@ -1,6 +1,6 @@
> ifeq ($(subdir),string)
> sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
> - memcpy-ppc32 memcmp-power7 memcmp-ppc32 memset-power7 \
> + memcpy-ppc32 memcmp-power7 memcmp-e6500 memcmp-ppc32 memset-power7 \
> memset-power6 memset-ppc32 bzero-power7 bzero-power6 \
> bzero-ppc32 mempcpy-power7 mempcpy-ppc32 memchr-power7 \
> memchr-ppc32 memrchr-power7 memrchr-ppc32 rawmemchr-power7 \
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c
> --- glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c 2015-08-19 05:48:43.690000617 -0500
> @@ -24,11 +24,15 @@
>
> extern __typeof (memcmp) __memcmp_ppc attribute_hidden;
> extern __typeof (memcmp) __memcmp_power7 attribute_hidden;
> +extern __typeof (memcmp) __memcmp_e6500 attribute_hidden;
>
> /* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
> ifunc symbol properly. */
> libc_ifunc (memcmp,
> (hwcap & PPC_FEATURE_HAS_VSX)
> - ? __memcmp_power7
> + ? __memcmp_power7 :
> + (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
> + && (hwcap2 & PPC_FEATURE2_HAS_ISEL))
> + ? __memcmp_e6500
Same comment as above.
> : __memcmp_ppc);
> #endif
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-e6500.S glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-e6500.S
> --- glibc-2.20/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-e6500.S 1969-12-31 18:00:00.000000000 -0600
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-e6500.S 2015-08-19 05:48:43.691000636 -0500
> @@ -0,0 +1,41 @@
> +/* Optimized memcmp implementation for e6500/PowerPC32.
> + Copyright (C) 2015 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +#undef EALIGN
> +#define EALIGN(name, alignt, words) \
> + .globl C_SYMBOL_NAME(__memcmp_e6500); \
> + .type C_SYMBOL_NAME(__memcmp_e6500),@function; \
> + .align ALIGNARG(alignt); \
> + EALIGN_W_##words; \
> + C_LABEL(__memcmp_e6500) \
> + cfi_startproc;
> +
> +#undef END
> +#define END(name) \
> + cfi_endproc; \
> + ASM_SIZE_DIRECTIVE(__memcmp_e6500)
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +#undef weak_alias
> +#define weak_alias(a, b)
> +
> +#include <sysdeps/powerpc/powerpc32/e6500/memcmp.S>
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc64/e6500/memcmp.S glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/e6500/memcmp.S
> --- glibc-2.20/sysdeps/powerpc/powerpc64/e6500/memcmp.S 1969-12-31 18:00:00.000000000 -0600
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/e6500/memcmp.S 2015-08-19 05:48:43.691000636 -0500
> @@ -0,0 +1,428 @@
> +/* Optimized memcmp implementation for 64-bit e6500 PowerPC.
> +
> + Copyright (C) 2015 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +/* int [r3] memcmp (const char *s1 [r3],
> + const char *s2 [r4],
> + size_t size [r5])
> + r3:source1 address, return equality
> + r4:source2 address
> + r5:byte count
> +
> + volatile fixed point registers usable:
> + r0, r3-r12
> +
> + volatile floating point registers usable:
> + f0-f13
> +
> + v0-v1 General use volatile (caller save)
> + v2-v13 Parameters, general volatile (caller save)
> + v14-v19 General Volatile (caller save)
> +
> + CR0-CR1 Volatile condition code register fields
> + CR5-CR7 Volatile condition code register fields. */
> +
> +#define rTMP r0
> +#define rRTN r3
> +#define rSTR1 r3 /* first string arg. */
> +#define rSTR2 r4 /* second string arg. */
> +#define rS2OFF r7 /* second string arg + 16. */
> +#define rN r5
> +#define rWORD1 r6 /* current word in s1. */
> +#define rWORD2 r7 /* current word in s2. */
> +#define rWORD3 r8 /* next word in s1. */
> +#define rWORD4 r9 /* next word in s2. */
> +#define rWORD5 r10 /* next word in s1. */
> +#define rWORD6 r11 /* next word in s2. */
> +#define rWORD7 r5 /* next word in s1. */
> +#define rWORD8 r12 /* next word in s2. */
> +#define rCOUNT r11
> +#define rINDEX r8
> +#define rVR0 v0
> +#define rVR1 v1
> +#define rVR2 v2
> +#define rVR3 v3
> +#define rVR4 v4
> +#define rVR5 v5
> +#define rVR6 v6
> +#define rVR7 v7
> +#define rVR8 v8
> +#define rVR9 v9
> +#define rVR10 v10
> +#define rVR11 v11
> +#define rVR14 v14
> +#define rVR15 v15
> +#define rVR16 v16
> +#define rVR17 v17
> +#define rVR18 v18
> +#define rVR19 v19
> +
> +EALIGN (memcmp, 5, 0)
> + CALL_MCOUNT 3
> + cmpldi rN, 0
> + cmpldi cr1, rN, 16
> + cmpldi cr5, rN, 64
> + beq L(zero)
> + blt cr1, L(small)
> + ble cr5, L(medium)
> + neg rTMP, rSTR1
> + andi. rCOUNT, rTMP, 15 /* check src1 alignment. */
> + bne L(src1_nalign)
> +L(src1align16):
> + rlwinm. rTMP, rSTR2, 0, 28, 31 /* check src2 alignment. */
> + bne L(src2_nalign)
> + srdi rCOUNT, rN, 4 /* nb / 16. */;
> + cmpldi cr5, rCOUNT, 0xFFF0 /* check for large data compares. */
> + rlwinm rN, rN, 0, 28, 31 /* remaining bytes. */
> + mtctr rCOUNT
> + li rINDEX, 0
> + bgt cr5, L(large_align)
> + .align 4
> +L(loop_align):
> + lvx rVR14, rSTR1, rINDEX
> + lvx rVR15, rSTR2, rINDEX
> + addi rINDEX, rINDEX, 16
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(loop_align)
> + cmpldi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + .align 4
> +L(small):
> + srdi. rCOUNT, rN, 3
> + rlwinm rN, rN, 0, 29, 31
> + beq L(cmp_bytes)
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + cmpldi cr1, rN, 0
> + addi rSTR1, rSTR1, 8
> + addi rSTR2, rSTR2, 8
> + bne cr0, L(bLcr0)
> + beq cr1, L(zero)
> + .align 4
> +L(cmp_bytes):
> + mtctr rN
> + lbz rWORD1, 0(rSTR1)
> + lbz rWORD2, 0(rSTR2)
> + bdz L(b11)
> + lbz rWORD3, 1(rSTR1)
> + lbz rWORD4, 1(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + bdz L(b12)
> + lbz rWORD5, 2(rSTR1)
> + lbz rWORD6, 2(rSTR2)
> + cmpld cr1, rWORD3, rWORD4
> + bdz L(b13)
> + lbz rWORD7, 3(rSTR1)
> + lbz rWORD8, 3(rSTR2)
> + bne cr0, L(bx11)
> + cmpld cr5, rWORD5, rWORD6
> + bdz L(b14)
> + cmpld cr6, rWORD7, rWORD8
> + lbz rWORD1, 4(rSTR1)
> + lbz rWORD2, 4(rSTR2)
> + bne cr1, L(bx12)
> + bdz L(b15)
> + lbz rWORD3, 5(rSTR1)
> + lbz rWORD4, 5(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + bne cr5, L(bx13)
> + bdz L(b16)
> + lbz rWORD5, 6(rSTR1)
> + lbz rWORD6, 6(rSTR2)
> + cmpld cr1, rWORD3, rWORD4
> + bne cr6, L(bx14)
> + bne cr0, L(bx15)
> + bne cr1, L(bx16)
> + sub rRTN, rWORD5, rWORD6
> + blr
> +L(b16):
> + bne cr6, L(bx14)
> + bne cr0, L(bx15)
> +L(bx16):
> + sub rRTN, rWORD3, rWORD4
> + blr
> +L(b15):
> + bne cr5, L(bx13)
> + bne cr6, L(bx14)
> +L(bx15):
> + sub rRTN, rWORD1, rWORD2
> + blr
> +L(b14):
> + bne cr1, L(bx12)
> + bne cr5, L(bx13)
> +L(bx14):
> + sub rRTN, rWORD7, rWORD8
> + blr
> +L(b13):
> + bne cr0, L(bx11)
> + bne cr1, L(bx12)
> +L(bx13):
> + sub rRTN, rWORD5, rWORD6
> + blr
> +L(b12):
> + bne cr0, L(bx11)
> +L(bx12):
> + sub rRTN, rWORD3, rWORD4
> + blr
> +L(b11):
> +L(bx11):
> + sub rRTN, rWORD1, rWORD2
> + blr
> +
> + .align 4
> +L(medium):
> + srwi rCOUNT, rN, 4
> + rlwinm rN, rN, 0, 28, 31
> + mtctr rCOUNT
> + cmpldi cr5, rN, 0
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + ld rWORD3, 8(rSTR1)
> + ld rWORD4, 8(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 16
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 16
> + bne cr1, L(bLcr1)
> + bdz L(check_small)
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + ld rWORD3, 8(rSTR1)
> + ld rWORD4, 8(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 16
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 16
> + bne cr1, L(bLcr1)
> + bdz L(check_small)
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + ld rWORD3, 8(rSTR1)
> + ld rWORD4, 8(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 16
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 16
> + bne cr1, L(bLcr1)
> + bdz L(check_small)
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + ld rWORD3, 8(rSTR1)
> + ld rWORD4, 8(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 16
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 16
> + bne cr1, L(bLcr1)
> + li rRTN, 0
> + blr
> +
> + .align 4
> +L(check_small):
> + beq cr5, L(zero)
> + b L(small)
> +
> + .align 4
> +L(src1_nalign):
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + ld rWORD3, 8(rSTR1)
> + ld rWORD4, 8(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + add rSTR1, rSTR1, rCOUNT
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + add rSTR2, rSTR2, rCOUNT
> + bne cr1, L(bLcr1)
> + subf rN, rCOUNT, rN
> + cmpldi cr7, rN, 64
> + ble cr7, L(medium)
> + b L(src1align16)
> +
> + .align 4
> +L(src2_nalign):
> + rlwinm. rTMP, rSTR2, 0, 29, 31
> + beq cr0, L(src2_dwalign)
> + srdi rCOUNT, rN, 4 /* n / 16. */;
> + cmpldi cr5, rCOUNT, 0xFFF0 /* check for large data compares. */
> + rlwinm rN, rN, 0, 28, 31 /* remaining bytes. */
> + mtctr rCOUNT
> + li rINDEX, 0
> + addi rS2OFF, rSTR2, 16
> + bgt cr5, L(large_nalign)
> + .align 4
> +L(loop_nalign):
> + lvx rVR14, rSTR1, rINDEX
> + lvsl rVR3, 0, rSTR2 /* set permute control vector. */
> + lvx rVR4, rS2OFF, rINDEX /* load LSQ. */
> + lvx rVR2, rSTR2, rINDEX /* load MSQ. */
> + addi rINDEX, rINDEX, 16
> + vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(loop_nalign)
> + cmpldi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + b L(small)
> +
> + .align 4
> +L(src2_dwalign):
> + srdi rCOUNT, rN, 6
> + rlwinm rN, rN, 0, 26, 31
> + mtctr rCOUNT
> + li rINDEX, 0
> + cmpldi cr5, rN, 0
> + cmpldi cr6, rN, 16
> +L(dw_loop):
> + ld rWORD1, 0(rSTR1)
> + ld rWORD2, 0(rSTR2)
> + ld rWORD3, 8(rSTR1)
> + ld rWORD4, 8(rSTR2)
> + cmpld cr0, rWORD1, rWORD2
> + ld rWORD1, 16(rSTR1)
> + ld rWORD2, 16(rSTR2)
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + ld rWORD3, 24(rSTR1)
> + ld rWORD4, 24(rSTR2)
> + bne cr1, L(bLcr1)
> + cmpld cr0, rWORD1, rWORD2
> + ld rWORD1, 32(rSTR1)
> + ld rWORD2, 32(rSTR2)
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + ld rWORD3, 40(rSTR1)
> + ld rWORD4, 40(rSTR2)
> + bne cr1, L(bLcr1)
> + cmpld cr0, rWORD1, rWORD2
> + ld rWORD1, 48(rSTR1)
> + ld rWORD2, 48(rSTR2)
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + ld rWORD3, 56(rSTR1)
> + ld rWORD4, 56(rSTR2)
> + bne cr1, L(bLcr1)
> + cmpld cr0, rWORD1, rWORD2
> + addi rSTR1, rSTR1, 64
> + bne cr0, L(bLcr0)
> + cmpld cr1, rWORD3, rWORD4
> + addi rSTR2, rSTR2, 64
> + bne cr1, L(bLcr1)
> + bdnz L(dw_loop)
> + beq cr5, L(zero)
> + blt cr6, L(small)
> + b L(medium)
> +
> + .align 4
> +L(bLcr0):
> + li rRTN, 1
> + bgtlr cr0
> + li rRTN, -1
> + blr
> +
> + .align 4
> +L(bLcr1):
> + li rRTN, 1
> + bgtlr cr1
> + li rRTN, -1
> + blr
> +
> + .align 4
> +L(large_nalign):
> + lvxl rVR14, rSTR1, rINDEX
> + lvsl rVR3, 0, rSTR2 /* set permute control vector. */
> + lvxl rVR4, rS2OFF, rINDEX /* load LSQ. */
> + lvxl rVR2, rSTR2, rINDEX /* load MSQ. */
> + addi rINDEX, rINDEX, 16
> + vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(large_nalign)
> + cmpldi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + b L(small)
> +
> + .align 4
> +L(large_align):
> + lvxl rVR14, rSTR1, rINDEX
> + lvxl rVR15, rSTR2, rINDEX
> + addi rINDEX, rINDEX, 16
> + vcmpequb. rVR16, rVR14, rVR15
> + vnor rVR17, rVR16, rVR16
> + bdnzt 4*cr6+lt, L(large_align)
> + cmpldi cr1, rN, 0
> + bge cr6, L(Vwords_Differ)
> + beq cr1, L(zero)
> + add rSTR1, rSTR1, rINDEX
> + add rSTR2, rSTR2, rINDEX
> + b L(small)
> +
> + .align 4
> +L(Vwords_Differ):
> + vspltisb rVR18, 1
> + vspltisb rVR1, 8
> + vslb rVR0, rVR1, rVR18
> + vslb rVR19, rVR0, rVR18
> + vslb rVR18, rVR19, rVR18
> + vxor rVR5, rVR5, rVR5
> + vsum4ubs rVR2, rVR1, rVR18
> + vsro rVR9, rVR17, rVR19
> + vsrw rVR19, rVR17, rVR1
> + vsro rVR10, rVR17, rVR18
> + vsrw rVR18, rVR17, rVR0
> + vsro rVR0, rVR17, rVR2
> + vor rVR11, rVR9, rVR10
> + vsro rVR2, rVR18, rVR1
> + vor rVR11, rVR11, rVR0
> + vcmpgtuw rVR11, rVR11, rVR5
> + vor rVR11, rVR11, rVR19
> + vor rVR11, rVR11, rVR18
> + vor rVR11, rVR11, rVR2
> + vor rVR15, rVR15, rVR11
> + vor rVR14, rVR14, rVR11
> + li rRTN, -1
> + vcmpgtub. rVR8, rVR15, rVR14
> + bnelr cr6
> + li rRTN, 1
> + blr
> +
> + .align 4
> +L(zero):
> + li rRTN, 0
> + blr
> +
> +END (memcmp)
> +libc_hidden_builtin_def (memcmp)
> +weak_alias (memcmp, bcmp)
I have tested this implementation on powerpc64le (with some modifications
to fix the endianness issues), and at least on power7/power8 I only see
gains in the 'small' and 'medium' paths, i.e. fewer than 64 bytes, which
I think can be fixed in the power7 memcmp.
Some time ago I tried to use VMX/VSX instructions in memcmp to get more
throughput, but I saw no gain in the aligned case nor in the unaligned
one (which was worse, due to the large number of instructions required
to isolate which byte differs, like what you do in L(Vwords_Differ)).
I do not have much knowledge of the e6500 core and its constraints, but
did you try the power7 memcmp version on it to check whether it is worth
using? It is basically the power4 version without the branch-prediction
hints.
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc64/e6500/multiarch/Implies glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/e6500/multiarch/Implies
> --- glibc-2.20/sysdeps/powerpc/powerpc64/e6500/multiarch/Implies 1969-12-31 18:00:00.000000000 -0600
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/e6500/multiarch/Implies 2015-08-19 05:48:43.691000636 -0500
> @@ -0,0 +1 @@
> +powerpc/powerpc64/multiarch
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> --- glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c 2015-08-19 05:52:26.371000238 -0500
> @@ -34,6 +34,7 @@
> size_t i = 0;
>
> unsigned long int hwcap = GLRO(dl_hwcap);
> + unsigned long int hwcap2 = GLRO(dl_hwcap2);
> /* hwcap contains only the latest supported ISA, the code checks which is
> and fills the previous supported ones. */
> if (hwcap & PPC_FEATURE_ARCH_2_06)
> @@ -130,6 +131,10 @@
> __memcmp_power7)
> IFUNC_IMPL_ADD (array, i, memcmp, hwcap & PPC_FEATURE_POWER4,
> __memcmp_power4)
> + IFUNC_IMPL_ADD (array, i, memcmp,
> + (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
> + && (hwcap2 & PPC_FEATURE2_HAS_ISEL)),
> + __memcmp_e6500)
> IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_ppc))
>
> /* Support sysdeps/powerpc/powerpc64/multiarch/bzero.c. */
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/Makefile glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/Makefile
> --- glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/Makefile 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/Makefile 2015-08-19 05:48:43.712000652 -0500
> @@ -1,7 +1,7 @@
> ifeq ($(subdir),string)
> sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
> memcpy-power4 memcpy-ppc64 memcmp-power7 memcmp-power4 \
> - memcmp-ppc64 memset-power7 memset-power6 memset-power4 \
> + memcmp-e6500 memcmp-ppc64 memset-power7 memset-power6 memset-power4 \
> memset-ppc64 bzero-power4 bzero-power6 bzero-power7 \
> mempcpy-power7 mempcpy-ppc64 memchr-power7 memchr-ppc64 \
> memrchr-power7 memrchr-ppc64 rawmemchr-power7 \
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/memcmp.c glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/memcmp.c
> --- glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/memcmp.c 2014-09-07 03:09:09.000000000 -0500
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/memcmp.c 2015-08-19 05:48:43.712000652 -0500
> @@ -25,6 +25,7 @@
> extern __typeof (memcmp) __memcmp_ppc attribute_hidden;
> extern __typeof (memcmp) __memcmp_power4 attribute_hidden;
> extern __typeof (memcmp) __memcmp_power7 attribute_hidden;
> +extern __typeof (memcmp) __memcmp_e6500 attribute_hidden;
>
> /* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
> ifunc symbol properly. */
> @@ -32,7 +33,10 @@
> (hwcap & PPC_FEATURE_HAS_VSX)
> ? __memcmp_power7 :
> (hwcap & PPC_FEATURE_POWER4)
> - ? __memcmp_power4
> + ? __memcmp_power4 :
> + (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
> + && (hwcap2 & PPC_FEATURE2_HAS_ISEL))
> + ? __memcmp_e6500
> : __memcmp_ppc);
> #else
> #include <string/memcmp.c>
> diff -Naur glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S
> --- glibc-2.20/sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S 1969-12-31 18:00:00.000000000 -0600
> +++ glibc-2.20-e6500-mcmp/sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S 2015-08-19 05:48:43.712000652 -0500
> @@ -0,0 +1,42 @@
> +/* Optimized memcmp implementation for PowerPC64/e6500.
> + Copyright (C) 2015 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +#undef EALIGN
> +#define EALIGN(name, alignt, words) \
> + .section ".text"; \
> + ENTRY_2(__memcmp_e6500) \
> + .align ALIGNARG(alignt); \
> + EALIGN_W_##words; \
> + BODY_LABEL(__memcmp_e6500): \
> + cfi_startproc; \
> + LOCALENTRY(__memcmp_e6500)
> +
> +#undef END
> +#define END(name) \
> + cfi_endproc; \
> + TRACEBACK(__memcmp_e6500) \
> + END_2(__memcmp_e6500)
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +#undef weak_alias
> +#define weak_alias(name,alias)
> +
> +#include <sysdeps/powerpc/powerpc64/e6500/memcmp.S>
>
> The patch was generated on top of glibc v2.20 source base.
>
> The patch was tested with DejaGnu and the glibc testsuite.
> There was one DejaGnu failure and quite a few glibc test failures,
> but after backporting the patch in the link below, those issues were resolved.
> https://sourceware.org/ml/libc-alpha/2015-01/msg00403.html
>
> The benchsuite (both 32-bit and 64-bit) results are attached for your reference.
>
> Please let me know your comments.
>
> Regards,
> Rohit
>
> Re: PowerPC e6500 optimized memcmp function - Query
> * From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
> * To: libc-help at sourceware dot org
> * Date: Wed, 22 Jul 2015 08:53:58 -0300
> * Subject: Re: PowerPC e6500 optimized memcmp function - Query
> * Authentication-results: sourceware.org; auth=none
> * References: <BLUPR03MB14589B0FC50FBC51FD6A2F29C2830 at BLUPR03MB1458 dot namprd03 dot prod dot outlook dot com>
> Hi,
>
> You should add the implementation at sysdeps/powerpc/powerpc64/e6500/memcmp.{S,c}
> to avoid the multiple symbol definition error. This will make glibc select this
> implementation only if you configure with --with-cpu=e6500 and --disable-multiarch.
>
> To enable it for multiarch you will need to:
>
> 1. Update sysdeps/powerpc/powerpc64/multiarch/memcmp.c to test the correct hwcap
> flag and return the e6500 function pointer.
> 2. Add a new file sysdeps/powerpc/powerpc64/multiarch/memcmp-e6500.S following the
> other multiarch implementations. Basically you will need to redefine the EALIGN/ENTRY
> and END macros for an assembly implementation, or do some C preprocessor magic if
> the new file is coded in C. The idea is to include it while changing the symbol
> name to __memcmp_e6500.
> 3. Finally add the new implementation at sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> so the testsuite can select it for testcases.
>
> Post the patch on libc-alpha along with benchsuite results for the machine. Also,
> if the patch does not use any e6500-specific instructions, you may consider
> adding this optimization as a default ISA one (for instance, if it uses Altivec
> and is compatible with ISA 2.05 you may enable it for POWER5+).
>
> On 22-07-2015 07:25, Dharmakan Rohit Arul Raj wrote:
>> Hello All,
>>
>> I have added an optimized PowerPC e6500 memcmp implementation to the glibc v2.20 sources, and I need some clarification regarding multi-arch support.
>> The patch works if glibc is configured with --disable-multi-arch, but for the default build, where multi-arch support is enabled, I get this error:
>>
>> eglibc/opt/freescale/XLMe6500-test/obj_glibc_1_scratch/libc_pic.a(memcmp-ppc64.os):(.opd+0x0): multiple definition of `__GI_memcmp'
>> eglibc/opt/freescale/XLMe6500-test/obj_glibc_1_scratch/libc_pic.a(memcmp.os):(.opd+0x0): first defined here
>>
>> As of now, we don't have any intention to add multi-arch support, so I wanted to know the criteria for submitting this patch to the FSF.
>> Can I add this flag [--disable-multi-arch] by default for our target, or should I try to resolve these multiple-definition errors?
>>
>> Please let me know your comments.
>>
>> Regards,
>> Rohit
>>
>>
>>
>
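For readers unfamiliar with the ifunc mechanism the thread keeps referring to, the hwcap-driven selection can be sketched in plain C. This is a simplified stand-in for glibc's libc_ifunc machinery — the FEAT_* bit values and stub functions below are illustrative, not the real PPC_FEATURE_* constants:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative feature bits -- NOT the real PPC_FEATURE_* values.  Like
   PPC_FEATURE_E6500 in the patch, FEAT_E6500 is a composite mask, which
   is why the resolver compares with == rather than testing for any bit.  */
#define FEAT_64      0x1
#define FEAT_ALTIVEC 0x2
#define FEAT_E6500   (FEAT_64 | FEAT_ALTIVEC)
#define FEAT2_ISEL   0x1

typedef int (*memcmp_fn) (const void *, const void *, size_t);

/* Stand-ins for __memcmp_ppc and __memcmp_e6500.  */
static int memcmp_generic (const void *a, const void *b, size_t n)
{ return memcmp (a, b, n); }
static int memcmp_e6500_stub (const void *a, const void *b, size_t n)
{ return memcmp (a, b, n); }

/* Simplified resolver: glibc's libc_ifunc runs something like this once
   at relocation time and binds memcmp to the returned pointer, so the
   dispatch cost is paid once rather than on every call.  */
static memcmp_fn
resolve_memcmp (unsigned long hwcap, unsigned long hwcap2)
{
  if ((hwcap & FEAT_E6500) == FEAT_E6500 && (hwcap2 & FEAT2_ISEL))
    return memcmp_e6500_stub;
  return memcmp_generic;
}
```

The `(hwcap & FEAT_E6500) == FEAT_E6500` shape mirrors the patch: since the e6500 "feature" is a mask of several bits, every bit must be present, so a plain nonzero `&` test would wrongly match other cores that share a subset of the bits.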
@@ -64,3 +64,7 @@
#define PPC_FEATURE2_HAS_TAR 0x04000000 /* Target Address Register */
#define PPC_FEATURE2_HAS_VEC_CRYPTO 0x02000000 /* Target supports vector
instruction. */
+/* Identify Freescale Processors. */
+#define PPC_FEATURE_E6500 (PPC_FEATURE_64 | PPC_FEATURE_BOOKE | \
+ PPC_FEATURE_HAS_ALTIVEC | \
+ PPC_FEATURE_HAS_FPU | PPC_FEATURE_HAS_MMU)
@@ -0,0 +1,387 @@
+/* Optimized memcmp implementation for 32-bit e6500 PowerPC.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+/* int [r3] memcmp (const char *s1 [r3],
+ const char *s2 [r4],
+ size_t size [r5])
+ r3:source1 address, return equality
+ r4:source2 address
+ r5:byte count
+
+ volatile fixed point registers usable:
+ r0, r3-r12
+
+ volatile floating point registers usable:
+ f0-f13
+
+ v0-v1 General use Volatile (caller save)
+ v2-v13 Parameters, general volatile (caller save)
+ v14-v19 General Volatile (caller save)
+
+ CR0-CR1 Volatile condition code register fields
+ CR5-CR7 Volatile condition code register fields. */
+
+#define rTMP r0
+#define rRTN r3
+#define rSTR1 r3 /* first string arg. */
+#define rSTR2 r4 /* second string arg. */
+#define rS2OFF r7 /* second string arg + 16. */
+#define rN r5
+#define rWORD1 r6 /* current word in s1. */
+#define rWORD2 r7 /* current word in s2. */
+#define rWORD3 r8 /* next word in s1. */
+#define rWORD4 r9 /* next word in s2. */
+#define rWORD5 r10 /* next word in s1. */
+#define rWORD6 r11 /* next word in s2. */
+#define rWORD7 r5 /* next word in s1. */
+#define rWORD8 r12 /* next word in s2. */
+#define rCOUNT r11
+#define rINDEX r8
+#define rVR0 v0
+#define rVR1 v1
+#define rVR2 v2
+#define rVR3 v3
+#define rVR4 v4
+#define rVR5 v5
+#define rVR6 v6
+#define rVR7 v7
+#define rVR8 v8
+#define rVR9 v9
+#define rVR10 v10
+#define rVR11 v11
+#define rVR14 v14
+#define rVR15 v15
+#define rVR16 v16
+#define rVR17 v17
+#define rVR18 v18
+#define rVR19 v19
+
+EALIGN (memcmp, 5, 0)
+ cmplwi rN, 0
+ cmplwi cr1, rN, 32
+ beq L(zero)
+ ble cr1, L(medium)
+ neg rTMP, rSTR1
+ andi. rCOUNT, rTMP, 15 /* check src1 alignment. */
+ bne L(src1_nalign)
+L(src1align16):
+ rlwinm. rTMP, rSTR2, 0, 28, 31 /* check src2 alignment. */
+ srwi rCOUNT, rN, 4 /* no. of bytes / 16. */
+ cmplwi cr5, rCOUNT, 0xFFF0 /* check for large data compares. */
+ rlwinm rN, rN, 0, 28, 31 /* remaining bytes. */
+ mtctr rCOUNT
+ li rINDEX, 0
+ bne L(src2_nalign)
+ bgt cr5, L(large_align)
+L(loop_align):
+ lvx rVR14, rSTR1, rINDEX
+ lvx rVR15, rSTR2, rINDEX
+ addi rINDEX, rINDEX, 16
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(loop_align)
+ cmplwi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+L(small):
+ srwi. rCOUNT, rN, 3
+ rlwinm rN, rN, 0, 29, 31
+ beq L(cmp_bytes)
+ lwz rWORD1, 0(rSTR1)
+ lwz rWORD2, 0(rSTR2)
+ lwz rWORD3, 4(rSTR1)
+ lwz rWORD4, 4(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 8
+ cmplwi cr5, rN, 0
+ bne cr0, L(bLcr0)
+ cmplw cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 8
+ bne cr1, L(bLcr1)
+ beq cr5, L(zero)
+ .align 4
+L(cmp_bytes):
+ mtctr rN
+ lbz rWORD1, 0(rSTR1)
+ lbz rWORD2, 0(rSTR2)
+ bdz L(b11)
+ lbz rWORD3, 1(rSTR1)
+ lbz rWORD4, 1(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ bdz L(b12)
+ lbz rWORD5, 2(rSTR1)
+ lbz rWORD6, 2(rSTR2)
+ cmplw cr1, rWORD3, rWORD4
+ bdz L(b13)
+ lbz rWORD7, 3(rSTR1)
+ lbz rWORD8, 3(rSTR2)
+ bne cr0, L(bx11)
+ cmplw cr5, rWORD5, rWORD6
+ bdz L(b14)
+ cmplw cr6, rWORD7, rWORD8
+ lbz rWORD1, 4(rSTR1)
+ lbz rWORD2, 4(rSTR2)
+ bne cr1, L(bx12)
+ bdz L(b15)
+ lbz rWORD3, 5(rSTR1)
+ lbz rWORD4, 5(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ bne cr5, L(bx13)
+ bdz L(b16)
+ lbz rWORD5, 6(rSTR1)
+ lbz rWORD6, 6(rSTR2)
+ cmplw cr1, rWORD3, rWORD4
+ bne cr6, L(bx14)
+ bne cr0, L(bx15)
+ bne cr1, L(bx16)
+ sub rRTN, rWORD5, rWORD6
+ blr
+L(b16):
+ bne cr6, L(bx14)
+ bne cr0, L(bx15)
+L(bx16):
+ sub rRTN, rWORD3, rWORD4
+ blr
+L(b15):
+ bne cr5, L(bx13)
+ bne cr6, L(bx14)
+L(bx15):
+ sub rRTN, rWORD1, rWORD2
+ blr
+L(b14):
+ bne cr1, L(bx12)
+ bne cr5, L(bx13)
+L(bx14):
+ sub rRTN, rWORD7, rWORD8
+ blr
+L(b13):
+ bne cr0, L(bx11)
+ bne cr1, L(bx12)
+L(bx13):
+ sub rRTN, rWORD5, rWORD6
+ blr
+L(b12):
+ bne cr0, L(bx11)
+L(bx12):
+ sub rRTN, rWORD3, rWORD4
+ blr
+L(b11):
+L(bx11):
+ sub rRTN, rWORD1, rWORD2
+ blr
+
+ .align 4
+L(medium):
+ srwi. rCOUNT, rN, 3
+ rlwinm rN, rN, 0, 29, 31
+ beq L(cmp_bytes)
+ mtctr rCOUNT
+ cmplwi cr5, rN, 0
+ lwz rWORD1, 0(rSTR1)
+ lwz rWORD2, 0(rSTR2)
+ lwz rWORD3, 4(rSTR1)
+ lwz rWORD4, 4(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 8
+ bne cr0, L(bLcr0)
+ cmplw cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 8
+ bne cr1, L(bLcr1)
+ bdz L(check_small)
+ lwz rWORD1, 0(rSTR1)
+ lwz rWORD2, 0(rSTR2)
+ lwz rWORD3, 4(rSTR1)
+ lwz rWORD4, 4(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 8
+ bne cr0, L(bLcr0)
+ cmplw cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 8
+ bne cr1, L(bLcr1)
+ bdz L(check_small)
+ lwz rWORD1, 0(rSTR1)
+ lwz rWORD2, 0(rSTR2)
+ lwz rWORD3, 4(rSTR1)
+ lwz rWORD4, 4(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 8
+ bne cr0, L(bLcr0)
+ cmplw cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 8
+ bne cr1, L(bLcr1)
+ bdz L(check_small)
+ lwz rWORD1, 0(rSTR1)
+ lwz rWORD2, 0(rSTR2)
+ lwz rWORD3, 4(rSTR1)
+ lwz rWORD4, 4(rSTR2)
+ cmplw cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 8
+ bne cr0, L(bLcr0)
+ cmplw cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 8
+ bne cr1, L(bLcr1)
+ li rRTN, 0
+ blr
+
+ .align 4
+L(check_small):
+ beq cr5, L(zero)
+ b L(cmp_bytes)
+
+ .align 4
+L(src1_nalign):
+ lwz rWORD1, 0(rSTR1)
+ lwz rWORD2, 0(rSTR2)
+ lwz rWORD3, 4(rSTR1)
+ lwz rWORD4, 4(rSTR2)
+ subfc. rWORD1, rWORD1, rWORD2
+ bne L(Words_Differ)
+ subfc. rWORD1, rWORD3, rWORD4
+ bne L(Words_Differ)
+ lwz rWORD1, 8(rSTR1)
+ lwz rWORD2, 8(rSTR2)
+ lwz rWORD3, 12(rSTR1)
+ lwz rWORD4, 12(rSTR2)
+ subfc. rWORD1, rWORD1, rWORD2
+ bne L(Words_Differ)
+ subfc. rWORD1, rWORD3, rWORD4
+ bne L(Words_Differ)
+ subf rN, rCOUNT, rN
+ cmplwi cr7, rN, 32
+ add rSTR1, rSTR1, rCOUNT
+ add rSTR2, rSTR2, rCOUNT
+ ble cr7, L(medium)
+ b L(src1align16)
+
+ .align 4
+L(bLcr0):
+ li rRTN, 1
+ bgtlr cr0
+ li rRTN, -1
+ blr
+
+ .align 4
+L(bLcr1):
+ li rRTN, 1
+ bgtlr cr1
+ li rRTN, -1
+ blr
+
+ .align 4
+L(src2_nalign):
+ addi rS2OFF, rSTR2, 16
+ bgt cr5, L(large_nalign)
+L(loop_nalign):
+ lvx rVR14, rSTR1, rINDEX
+ lvsl rVR3, 0, rSTR2 /* set permute control vector. */
+ lvx rVR4, rS2OFF, rINDEX /* load LSQ. */
+ lvx rVR2, rSTR2, rINDEX /* load MSQ. */
+ addi rINDEX, rINDEX, 16
+ vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(loop_nalign)
+ cmplwi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ b L(small)
+
+ .align 4
+L(large_nalign):
+ lvxl rVR14, rSTR1, rINDEX
+ lvsl rVR3, 0, rSTR2 /* set permute control vector. */
+ lvxl rVR4, rS2OFF, rINDEX /* load LSQ. */
+ lvxl rVR2, rSTR2, rINDEX /* load MSQ. */
+ addi rINDEX, rINDEX, 16
+ vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(large_nalign)
+ cmplwi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ b L(small)
+
+ .align 4
+L(large_align):
+ lvxl rVR14, rSTR1, rINDEX
+ lvxl rVR15, rSTR2, rINDEX
+ addi rINDEX, rINDEX, 16
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(large_align)
+ cmplwi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ b L(small)
+
+ .align 4
+L(Words_Differ):
+ subfe rRTN, rWORD1, rWORD1
+ nand rRTN, rRTN, rRTN
+ ori rRTN, rRTN, 1
+ blr
+
+ .align 4
+L(Vwords_Differ):
+ vspltisb rVR18, 1
+ vspltisb rVR1, 8
+ vslb rVR0, rVR1, rVR18
+ vslb rVR19, rVR0, rVR18
+ vslb rVR18, rVR19, rVR18
+ vxor rVR5, rVR5, rVR5
+ vsum4ubs rVR2, rVR1, rVR18
+ vsro rVR9, rVR17, rVR19
+ vsrw rVR19, rVR17, rVR1
+ vsro rVR10, rVR17, rVR18
+ vsrw rVR18, rVR17, rVR0
+ vsro rVR0, rVR17, rVR2
+ vor rVR11, rVR9, rVR10
+ vsro rVR2, rVR18, rVR1
+ vor rVR11, rVR11, rVR0
+ vcmpgtuw rVR11, rVR11, rVR5
+ vor rVR11, rVR11, rVR19
+ vor rVR11, rVR11, rVR18
+ vor rVR11, rVR11, rVR2
+ vor rVR15, rVR15, rVR11
+ vor rVR14, rVR14, rVR11
+ li rRTN, -1
+ vcmpgtub. rVR8, rVR15, rVR14
+ bnelr cr6
+ li rRTN, 1
+ blr
+
+ .align 4
+L(zero):
+ li rRTN, 0
+ blr
+
+END (memcmp)
+libc_hidden_builtin_def (memcmp)
+weak_alias (memcmp, bcmp)
@@ -0,0 +1 @@
+powerpc/powerpc32/power4/multiarch
@@ -34,6 +34,7 @@
size_t i = 0;
unsigned long int hwcap = GLRO(dl_hwcap);
+ unsigned long int hwcap2 = GLRO(dl_hwcap2);
/* hwcap contains only the latest supported ISA, the code checks which is
and fills the previous supported ones. */
if (hwcap & PPC_FEATURE_ARCH_2_06)
@@ -107,6 +108,10 @@
IFUNC_IMPL (i, name, memcmp,
IFUNC_IMPL_ADD (array, i, memcmp, hwcap & PPC_FEATURE_HAS_VSX,
__memcmp_power7)
+ IFUNC_IMPL_ADD (array, i, memcmp,
+ (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
+ && (hwcap2 & PPC_FEATURE2_HAS_ISEL)),
+ __memcmp_e6500)
IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_ppc))
/* Support sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c. */
@@ -1,6 +1,6 @@
ifeq ($(subdir),string)
sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
- memcpy-ppc32 memcmp-power7 memcmp-ppc32 memset-power7 \
+ memcpy-ppc32 memcmp-power7 memcmp-e6500 memcmp-ppc32 memset-power7 \
memset-power6 memset-ppc32 bzero-power7 bzero-power6 \
bzero-ppc32 mempcpy-power7 mempcpy-ppc32 memchr-power7 \
memchr-ppc32 memrchr-power7 memrchr-ppc32 rawmemchr-power7 \
@@ -24,11 +24,15 @@
extern __typeof (memcmp) __memcmp_ppc attribute_hidden;
extern __typeof (memcmp) __memcmp_power7 attribute_hidden;
+extern __typeof (memcmp) __memcmp_e6500 attribute_hidden;
/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
ifunc symbol properly. */
libc_ifunc (memcmp,
(hwcap & PPC_FEATURE_HAS_VSX)
- ? __memcmp_power7
+ ? __memcmp_power7 :
+ (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
+ && (hwcap2 & PPC_FEATURE2_HAS_ISEL))
+ ? __memcmp_e6500
: __memcmp_ppc);
#endif
@@ -0,0 +1,41 @@
+/* Optimized memcmp implementation for e6500/PowerPC32.
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+#undef EALIGN
+#define EALIGN(name, alignt, words) \
+ .globl C_SYMBOL_NAME(__memcmp_e6500); \
+ .type C_SYMBOL_NAME(__memcmp_e6500),@function; \
+ .align ALIGNARG(alignt); \
+ EALIGN_W_##words; \
+ C_LABEL(__memcmp_e6500) \
+ cfi_startproc;
+
+#undef END
+#define END(name) \
+ cfi_endproc; \
+ ASM_SIZE_DIRECTIVE(__memcmp_e6500)
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#undef weak_alias
+#define weak_alias(a, b)
+
+#include <sysdeps/powerpc/powerpc32/e6500/memcmp.S>
@@ -0,0 +1,428 @@
+/* Optimized memcmp implementation for 64-bit e6500 PowerPC.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+/* int [r3] memcmp (const char *s1 [r3],
+ const char *s2 [r4],
+ size_t size [r5])
+ r3:source1 address, return equality
+ r4:source2 address
+ r5:byte count
+
+ volatile fixed point registers usable:
+ r0, r3-r12
+
+ volatile floating point registers usable:
+ f0-f13
+
+ v0-v1 General use volatile (caller save)
+ v2-v13 Parameters, general volatile (caller save)
+ v14-v19 General Volatile (caller save)
+
+ CR0-CR1 Volatile condition code register fields
+ CR5-CR7 Volatile condition code register fields. */
+
+#define rTMP r0
+#define rRTN r3
+#define rSTR1 r3 /* first string arg. */
+#define rSTR2 r4 /* second string arg. */
+#define rS2OFF r7 /* second string arg + 16. */
+#define rN r5
+#define rWORD1 r6 /* current word in s1. */
+#define rWORD2 r7 /* current word in s2. */
+#define rWORD3 r8 /* next word in s1. */
+#define rWORD4 r9 /* next word in s2. */
+#define rWORD5 r10 /* next word in s1. */
+#define rWORD6 r11 /* next word in s2. */
+#define rWORD7 r5 /* next word in s1. */
+#define rWORD8 r12 /* next word in s2. */
+#define rCOUNT r11
+#define rINDEX r8
+#define rVR0 v0
+#define rVR1 v1
+#define rVR2 v2
+#define rVR3 v3
+#define rVR4 v4
+#define rVR5 v5
+#define rVR6 v6
+#define rVR7 v7
+#define rVR8 v8
+#define rVR9 v9
+#define rVR10 v10
+#define rVR11 v11
+#define rVR14 v14
+#define rVR15 v15
+#define rVR16 v16
+#define rVR17 v17
+#define rVR18 v18
+#define rVR19 v19
+
+EALIGN (memcmp, 5, 0)
+ CALL_MCOUNT 3
+ cmpldi rN, 0
+ cmpldi cr1, rN, 16
+ cmpldi cr5, rN, 64
+ beq L(zero)
+ blt cr1, L(small)
+ ble cr5, L(medium)
+ neg rTMP, rSTR1
+ andi. rCOUNT, rTMP, 15 /* check src1 alignment. */
+ bne L(src1_nalign)
+L(src1align16):
+ rlwinm. rTMP, rSTR2, 0, 28, 31 /* check src2 alignment. */
+ bne L(src2_nalign)
+ srdi rCOUNT, rN, 4 /* nb / 16. */;
+ cmpldi cr5, rCOUNT, 0xFFF0 /* check for large data compares. */
+ rlwinm rN, rN, 0, 28, 31 /* remaining bytes. */
+ mtctr rCOUNT
+ li rINDEX, 0
+ bgt cr5, L(large_align)
+ .align 4
+L(loop_align):
+ lvx rVR14, rSTR1, rINDEX
+ lvx rVR15, rSTR2, rINDEX
+ addi rINDEX, rINDEX, 16
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(loop_align)
+ cmpldi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ .align 4
+L(small):
+ srdi. rCOUNT, rN, 3
+ rlwinm rN, rN, 0, 29, 31
+ beq L(cmp_bytes)
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ cmpldi cr1, rN, 0
+ addi rSTR1, rSTR1, 8
+ addi rSTR2, rSTR2, 8
+ bne cr0, L(bLcr0)
+ beq cr1, L(zero)
+ .align 4
+L(cmp_bytes):
+ mtctr rN
+ lbz rWORD1, 0(rSTR1)
+ lbz rWORD2, 0(rSTR2)
+ bdz L(b11)
+ lbz rWORD3, 1(rSTR1)
+ lbz rWORD4, 1(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ bdz L(b12)
+ lbz rWORD5, 2(rSTR1)
+ lbz rWORD6, 2(rSTR2)
+ cmpld cr1, rWORD3, rWORD4
+ bdz L(b13)
+ lbz rWORD7, 3(rSTR1)
+ lbz rWORD8, 3(rSTR2)
+ bne cr0, L(bx11)
+ cmpld cr5, rWORD5, rWORD6
+ bdz L(b14)
+ cmpld cr6, rWORD7, rWORD8
+ lbz rWORD1, 4(rSTR1)
+ lbz rWORD2, 4(rSTR2)
+ bne cr1, L(bx12)
+ bdz L(b15)
+ lbz rWORD3, 5(rSTR1)
+ lbz rWORD4, 5(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ bne cr5, L(bx13)
+ bdz L(b16)
+ lbz rWORD5, 6(rSTR1)
+ lbz rWORD6, 6(rSTR2)
+ cmpld cr1, rWORD3, rWORD4
+ bne cr6, L(bx14)
+ bne cr0, L(bx15)
+ bne cr1, L(bx16)
+ sub rRTN, rWORD5, rWORD6
+ blr
+L(b16):
+ bne cr6, L(bx14)
+ bne cr0, L(bx15)
+L(bx16):
+ sub rRTN, rWORD3, rWORD4
+ blr
+L(b15):
+ bne cr5, L(bx13)
+ bne cr6, L(bx14)
+L(bx15):
+ sub rRTN, rWORD1, rWORD2
+ blr
+L(b14):
+ bne cr1, L(bx12)
+ bne cr5, L(bx13)
+L(bx14):
+ sub rRTN, rWORD7, rWORD8
+ blr
+L(b13):
+ bne cr0, L(bx11)
+ bne cr1, L(bx12)
+L(bx13):
+ sub rRTN, rWORD5, rWORD6
+ blr
+L(b12):
+ bne cr0, L(bx11)
+L(bx12):
+ sub rRTN, rWORD3, rWORD4
+ blr
+L(b11):
+L(bx11):
+ sub rRTN, rWORD1, rWORD2
+ blr
+
+ .align 4
+L(medium):
+ srwi rCOUNT, rN, 4
+ rlwinm rN, rN, 0, 28, 31
+ mtctr rCOUNT
+ cmpldi cr5, rN, 0
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ ld rWORD3, 8(rSTR1)
+ ld rWORD4, 8(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 16
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 16
+ bne cr1, L(bLcr1)
+ bdz L(check_small)
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ ld rWORD3, 8(rSTR1)
+ ld rWORD4, 8(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 16
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 16
+ bne cr1, L(bLcr1)
+ bdz L(check_small)
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ ld rWORD3, 8(rSTR1)
+ ld rWORD4, 8(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 16
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 16
+ bne cr1, L(bLcr1)
+ bdz L(check_small)
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ ld rWORD3, 8(rSTR1)
+ ld rWORD4, 8(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 16
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 16
+ bne cr1, L(bLcr1)
+ li rRTN, 0
+ blr
+
+ .align 4
+L(check_small):
+ beq cr5, L(zero)
+ b L(small)
+
+ .align 4
+L(src1_nalign):
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ ld rWORD3, 8(rSTR1)
+ ld rWORD4, 8(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ add rSTR1, rSTR1, rCOUNT
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ add rSTR2, rSTR2, rCOUNT
+ bne cr1, L(bLcr1)
+ subf rN, rCOUNT, rN
+ cmpldi cr7, rN, 64
+ ble cr7, L(medium)
+ b L(src1align16)
+
+ .align 4
+L(src2_nalign):
+ rlwinm. rTMP, rSTR2, 0, 29, 31
+ beq cr0, L(src2_dwalign)
+ srdi rCOUNT, rN, 4 /* n / 16. */;
+ cmpldi cr5, rCOUNT, 0xFFF0 /* check for large data compares. */
+ rlwinm rN, rN, 0, 28, 31 /* remaining bytes. */
+ mtctr rCOUNT
+ li rINDEX, 0
+ addi rS2OFF, rSTR2, 16
+ bgt cr5, L(large_nalign)
+ .align 4
+L(loop_nalign):
+ lvx rVR14, rSTR1, rINDEX
+ lvsl rVR3, 0, rSTR2 /* set permute control vector. */
+ lvx rVR4, rS2OFF, rINDEX /* load LSQ. */
+ lvx rVR2, rSTR2, rINDEX /* load MSQ. */
+ addi rINDEX, rINDEX, 16
+ vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(loop_nalign)
+ cmpldi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ b L(small)
+
+ .align 4
+L(src2_dwalign):
+ srdi rCOUNT, rN, 6
+ rlwinm rN, rN, 0, 26, 31
+ mtctr rCOUNT
+ li rINDEX, 0
+ cmpldi cr5, rN, 0
+ cmpldi cr6, rN, 16
+L(dw_loop):
+ ld rWORD1, 0(rSTR1)
+ ld rWORD2, 0(rSTR2)
+ ld rWORD3, 8(rSTR1)
+ ld rWORD4, 8(rSTR2)
+ cmpld cr0, rWORD1, rWORD2
+ ld rWORD1, 16(rSTR1)
+ ld rWORD2, 16(rSTR2)
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ ld rWORD3, 24(rSTR1)
+ ld rWORD4, 24(rSTR2)
+ bne cr1, L(bLcr1)
+ cmpld cr0, rWORD1, rWORD2
+ ld rWORD1, 32(rSTR1)
+ ld rWORD2, 32(rSTR2)
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ ld rWORD3, 40(rSTR1)
+ ld rWORD4, 40(rSTR2)
+ bne cr1, L(bLcr1)
+ cmpld cr0, rWORD1, rWORD2
+ ld rWORD1, 48(rSTR1)
+ ld rWORD2, 48(rSTR2)
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ ld rWORD3, 56(rSTR1)
+ ld rWORD4, 56(rSTR2)
+ bne cr1, L(bLcr1)
+ cmpld cr0, rWORD1, rWORD2
+ addi rSTR1, rSTR1, 64
+ bne cr0, L(bLcr0)
+ cmpld cr1, rWORD3, rWORD4
+ addi rSTR2, rSTR2, 64
+ bne cr1, L(bLcr1)
+ bdnz L(dw_loop)
+ beq cr5, L(zero)
+ blt cr6, L(small)
+ b L(medium)
+
+ .align 4
+L(bLcr0):
+ li rRTN, 1
+ bgtlr cr0
+ li rRTN, -1
+ blr
+
+ .align 4
+L(bLcr1):
+ li rRTN, 1
+ bgtlr cr1
+ li rRTN, -1
+ blr
+
+ .align 4
+L(large_nalign):
+ lvxl rVR14, rSTR1, rINDEX
+ lvsl rVR3, 0, rSTR2 /* set permute control vector. */
+ lvxl rVR4, rS2OFF, rINDEX /* load LSQ. */
+ lvxl rVR2, rSTR2, rINDEX /* load MSQ. */
+ addi rINDEX, rINDEX, 16
+ vperm rVR15, rVR2, rVR4, rVR3 /* align the data. */
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(large_nalign)
+ cmpldi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ b L(small)
+
+ .align 4
+L(large_align):
+ lvxl rVR14, rSTR1, rINDEX
+ lvxl rVR15, rSTR2, rINDEX
+ addi rINDEX, rINDEX, 16
+ vcmpequb. rVR16, rVR14, rVR15
+ vnor rVR17, rVR16, rVR16
+ bdnzt 4*cr6+lt, L(large_align)
+ cmpldi cr1, rN, 0
+ bge cr6, L(Vwords_Differ)
+ beq cr1, L(zero)
+ add rSTR1, rSTR1, rINDEX
+ add rSTR2, rSTR2, rINDEX
+ b L(small)
+
+ .align 4
+L(Vwords_Differ):
+ vspltisb rVR18, 1
+ vspltisb rVR1, 8
+ vslb rVR0, rVR1, rVR18
+ vslb rVR19, rVR0, rVR18
+ vslb rVR18, rVR19, rVR18
+ vxor rVR5, rVR5, rVR5
+ vsum4ubs rVR2, rVR1, rVR18
+ vsro rVR9, rVR17, rVR19
+ vsrw rVR19, rVR17, rVR1
+ vsro rVR10, rVR17, rVR18
+ vsrw rVR18, rVR17, rVR0
+ vsro rVR0, rVR17, rVR2
+ vor rVR11, rVR9, rVR10
+ vsro rVR2, rVR18, rVR1
+ vor rVR11, rVR11, rVR0
+ vcmpgtuw rVR11, rVR11, rVR5
+ vor rVR11, rVR11, rVR19
+ vor rVR11, rVR11, rVR18
+ vor rVR11, rVR11, rVR2
+ vor rVR15, rVR15, rVR11
+ vor rVR14, rVR14, rVR11
+ li rRTN, -1
+ vcmpgtub. rVR8, rVR15, rVR14
+ bnelr cr6
+ li rRTN, 1
+ blr
+
+ .align 4
+L(zero):
+ li rRTN, 0
+ blr
+
+END (memcmp)
+libc_hidden_builtin_def (memcmp)
+weak_alias (memcmp, bcmp)
@@ -0,0 +1 @@
+powerpc/powerpc64/multiarch
@@ -34,6 +34,7 @@
size_t i = 0;
unsigned long int hwcap = GLRO(dl_hwcap);
+ unsigned long int hwcap2 = GLRO(dl_hwcap2);
/* hwcap contains only the latest supported ISA, the code checks which is
and fills the previous supported ones. */
if (hwcap & PPC_FEATURE_ARCH_2_06)
@@ -130,6 +131,10 @@
__memcmp_power7)
IFUNC_IMPL_ADD (array, i, memcmp, hwcap & PPC_FEATURE_POWER4,
__memcmp_power4)
+ IFUNC_IMPL_ADD (array, i, memcmp,
+ (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
+ && (hwcap2 & PPC_FEATURE2_HAS_ISEL)),
+ __memcmp_e6500)
IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_ppc))
/* Support sysdeps/powerpc/powerpc64/multiarch/bzero.c. */
@@ -1,7 +1,7 @@
ifeq ($(subdir),string)
sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
memcpy-power4 memcpy-ppc64 memcmp-power7 memcmp-power4 \
- memcmp-ppc64 memset-power7 memset-power6 memset-power4 \
+ memcmp-e6500 memcmp-ppc64 memset-power7 memset-power6 memset-power4 \
memset-ppc64 bzero-power4 bzero-power6 bzero-power7 \
mempcpy-power7 mempcpy-ppc64 memchr-power7 memchr-ppc64 \
memrchr-power7 memrchr-ppc64 rawmemchr-power7 \
@@ -25,6 +25,7 @@
extern __typeof (memcmp) __memcmp_ppc attribute_hidden;
extern __typeof (memcmp) __memcmp_power4 attribute_hidden;
extern __typeof (memcmp) __memcmp_power7 attribute_hidden;
+extern __typeof (memcmp) __memcmp_e6500 attribute_hidden;
/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
ifunc symbol properly. */
@@ -32,7 +33,10 @@
(hwcap & PPC_FEATURE_HAS_VSX)
? __memcmp_power7 :
(hwcap & PPC_FEATURE_POWER4)
- ? __memcmp_power4
+ ? __memcmp_power4 :
+ (((hwcap & PPC_FEATURE_E6500) == PPC_FEATURE_E6500)
+ && (hwcap2 & PPC_FEATURE2_HAS_ISEL))
+ ? __memcmp_e6500
: __memcmp_ppc);
#else
#include <string/memcmp.c>
@@ -0,0 +1,42 @@
+/* Optimized memcmp implementation for PowerPC64/e6500.
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+#undef EALIGN
+#define EALIGN(name, alignt, words) \
+ .section ".text"; \
+ ENTRY_2(__memcmp_e6500) \
+ .align ALIGNARG(alignt); \
+ EALIGN_W_##words; \
+ BODY_LABEL(__memcmp_e6500): \
+ cfi_startproc; \
+ LOCALENTRY(__memcmp_e6500)
+
+#undef END
+#define END(name) \
+ cfi_endproc; \
+ TRACEBACK(__memcmp_e6500) \
+ END_2(__memcmp_e6500)
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+#undef weak_alias
+#define weak_alias(name,alias)
+
+#include <sysdeps/powerpc/powerpc64/e6500/memcmp.S>