From patchwork Wed Nov 15 05:21:55 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Amrita H S
X-Patchwork-Id: 79899
From: Amrita H S
To: libc-alpha@sourceware.org
Cc: rajis@linux.ibm.com, Amrita H S
Subject: [PATCHV3] powerpc: Optimized strcmp for power10
Date: Wed, 15 Nov 2023 00:21:55 -0500
Message-ID: <20231115052155.3353529-1-amritahs@linux.vnet.ibm.com>
X-Mailer: git-send-email 2.41.0
List-Id: Libc-alpha mailing list

This patch is based on __strcmp_power9 and __strlen_power10.

Improvements from __strcmp_power9:

1. Uses new POWER10 instructions
   - This code uses lxvp to decrease contention on load
     by loading 32 bytes per instruction.
   - The vextractbm is used to have a smaller tail code for
     calculating the return value.

2. Performance implication
   - This version has around 30% better performance on average.
   - Performance regressions are seen for specific combinations
     of sizes and alignments. Some of these regressions are also
     observed without the changes, while the rest may be induced
     by the patch.

Signed-off-by: Amrita H S
---
 sysdeps/powerpc/powerpc64/le/power10/strcmp.S | 174 ++++++++++++++++++
 sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
 .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
 .../powerpc64/multiarch/strcmp-power10.S      |  26 +++
 sysdeps/powerpc/powerpc64/multiarch/strcmp.c  |   4 +
 5 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strcmp.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S

diff --git a/sysdeps/powerpc/powerpc64/le/power10/strcmp.S b/sysdeps/powerpc/powerpc64/le/power10/strcmp.S
new file mode 100644
index 0000000000..885be10ae8
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power10/strcmp.S
@@ -0,0 +1,174 @@
+/* Optimized strcmp implementation for PowerPC64/POWER10.
+   Copyright (C) 2021-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+#include
+
+#ifndef STRCMP
+# define STRCMP strcmp
+#endif
+
+/* Implements the function
+   int [r3] strcmp (const char *s1 [r3], const char *s2 [r4]).  */
+
+/* TODO: Change this to actual instructions when minimum binutils is upgraded
+   to 2.27.  Macros are defined below for these newer instructions in order
+   to maintain compatibility.  */
+
+#define LXVP(xtp,dq,ra) \
+	.long(((6)<<(32-6)) \
+	| ((((xtp)-32)>>1)<<(32-10)) \
+	| ((1)<<(32-11)) \
+	| ((ra)<<(32-16)) \
+	| dq)
+
+#define COMPARE_16_S1S2_ALIGNED(vreg1, vreg2, offset) \
+	lxv	vreg1+32, offset(r3); \
+	lxv	vreg2+32, offset(r4); \
+	vcmpnezb.	v7, vreg1, vreg2; \
+	bne	cr6, L(different); \
+
+#define COMPARE_32(vreg1, vreg2, offset, label1, label2) \
+	LXVP(vreg1+32, offset, r3); \
+	LXVP(vreg2+32, offset, r4); \
+	vcmpnezb.	v7, vreg1+1, vreg2+1; \
+	bne	cr6, L(label1); \
+	vcmpnezb.	v7, vreg1, vreg2; \
+	bne	cr6, L(label2); \
+
+#define TAIL(vreg1, vreg2) \
+	vctzlsbb	r6, v7; \
+	vextubrx	r5, r6, vreg1; \
+	vextubrx	r4, r6, vreg2; \
+	subf	r3, r4, r5; \
+	blr; \
+
+#define CHECK_N_BYTES(reg1, reg2, len_reg) \
+	mr	r0, len_reg; \
+	sldi	r0, r0, 56; \
+	lxvl	32+v4, reg1, r0; \
+	lxvl	32+v5, reg2, r0; \
+	add	reg1, reg1, len_reg; \
+	add	reg2, reg2, len_reg; \
+	vcmpnezb.	v7, v4, v5; \
+	vctzlsbb	r6, v7; \
+	cmpld	cr7, r6, len_reg; \
+	blt	cr7, L(different); \
+
+	/* TODO: change this to .machine power10 when the minimum required binutils
+	   allows it.  */
+
+	.machine power9
+ENTRY_TOCLESS (STRCMP, 4)
+	li	r11, 16
+	li	r8, 0
+	vspltisb	v19,-1
+	andi.	r7, r3, 15
+	sub	r7, r11, r7	/* r7(nalign1) = 16 - (str1 & 15).  */
+	andi.	r9, r4, 15
+	sub	r5, r11, r9	/* r5(nalign2) = 16 - (str2 & 15).  */
+	cmpld	cr7, r7, r5
+	beq	cr7, L(same_aligned)
+	blt	cr7, L(nalign1_min)
+	sub	r8, r7, r5	/* r8 has max(nalign1-nalign2, 0).  */
+	mr	r7, r5
+L(nalign1_min):
+	mr	r10, r7		/* r10 has min(nalign1, nalign2).  */
+	CHECK_N_BYTES(r3, r4, r10)
+	cmpldi	r8, 0
+	ble	L(s1_aligned)
+	CHECK_N_BYTES(r3, r4, r8)
+
+L(s1_aligned):
+	li	r10, 4096
+	andi.	r7, r4, 4095
+	sub	r7, r10, r7	/* r7 is rem = page_size - (str2 & (page_size-1)).  */
+	andi.	r9, r4, 15
+	sub	r8, r10, r9	/* r8 is rem_step = page_size - (str2&15).  */
+L(L1):
+	srdi	r7, r7, 4	/* Get the count of 16B blocks, rem = rem/16.  */
+	mtctr	r7
+	cmpldi	r7, 0
+	ble	L(L3)
+L(L2):
+	CHECK_N_BYTES(r3, r4, r11)	/* Load 16B blocks using lxvl.  */
+	bdnz	L(L2)
+	/* Cross the page boundary of s2, carefully.  */
+L(L3):
+	andi.	r9, r4, 15	/* r9 is number of bytes to be read after page boundary.  */
+	sub	r5, r11, r9	/* r5 is number of bytes to be read before page boundary.  */
+	CHECK_N_BYTES(r3, r4, r5)
+	CHECK_N_BYTES(r3, r4, r9)
+	mr	r7, r8		/* Move rem_step to rem and continue the L1 loop.  */
+	b	L(L1)
+
+L(same_aligned):
+	CHECK_N_BYTES(r3, r4, r7)
+	/* Align s1 to 32B and adjust s2 address.
+	   Use lxvp only if both s1 and s2 are 32B aligned.  */
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 0)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 16)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 32)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 48)
+	addi	r3, r3, 64
+	addi	r4, r4, 64
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 0)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 16)
+
+	clrldi	r6, r3, 59
+	subfic	r5, r6, 32
+	add	r3, r3, r5
+	add	r4, r4, r5
+	andi.	r5, r4, 0x1F
+	beq	cr0, L(32B_aligned_loop)
+
+L(16B_aligned_loop):
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 0)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 16)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 32)
+	COMPARE_16_S1S2_ALIGNED(v4, v5, 48)
+	addi	r3, r3, 64
+	addi	r4, r4, 64
+	b	L(16B_aligned_loop)
+
+	/* Calculate and return the difference.  */
+L(different):
+	vctzlsbb	r6, v7
+	vextubrx	r5, r6, v4
+	vextubrx	r4, r6, v5
+	subf	r3, r4, r5
+	blr
+
+L(32B_aligned_loop):
+	COMPARE_32(v14, v16, 0, tail1, tail2)
+	COMPARE_32(v18, v20, 32, tail3, tail4)
+	COMPARE_32(v22, v24, 64, tail5, tail6)
+	COMPARE_32(v26, v28, 96, tail7, tail8)
+	addi	r3, r3, 128
+	addi	r4, r4, 128
+	b	L(32B_aligned_loop)
+
+L(tail1): TAIL(v15, v17)
+L(tail2): TAIL(v14, v16)
+L(tail3): TAIL(v19, v21)
+L(tail4): TAIL(v18, v20)
+L(tail5): TAIL(v23, v25)
+L(tail6): TAIL(v22, v24)
+L(tail7): TAIL(v27, v29)
+L(tail8): TAIL(v26, v28)
+
+END (STRCMP)
+libc_hidden_builtin_def (strcmp)
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index 27d8495503..d7824a922b 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -33,7 +33,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
 ifneq (,$(filter %le,$(config-machine)))
 sysdep_routines += memcmp-power10 memcpy-power10 memmove-power10 memset-power10 \
 		   rawmemchr-power9 rawmemchr-power10 \
-		   strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
+		   strcmp-power9 strcmp-power10 strncmp-power9 \
+		   strcpy-power9 stpcpy-power9 \
 		   strlen-power9 strncpy-power9 stpncpy-power9 strlen-power10
 endif
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index ebe9434052..ca1f57e1e2 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -376,6 +376,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   /* Support sysdeps/powerpc/powerpc64/multiarch/strcmp.c.  */
   IFUNC_IMPL (i, name, strcmp,
 #ifdef __LITTLE_ENDIAN__
+	      IFUNC_IMPL_ADD (array, i, strcmp,
+			      (hwcap2 & PPC_FEATURE2_ARCH_3_1)
+			      && (hwcap & PPC_FEATURE_HAS_VSX),
+			      __strcmp_power10)
 	      IFUNC_IMPL_ADD (array, i, strcmp,
 			      hwcap2 & PPC_FEATURE2_ARCH_3_00
 			      && hwcap & PPC_FEATURE_HAS_ALTIVEC,
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S
new file mode 100644
index 0000000000..c80067ce33
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S
@@ -0,0 +1,26 @@
+/* Optimized strcmp implementation for POWER10/PPC64.
+   Copyright (C) 2021-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+#if defined __LITTLE_ENDIAN__ && IS_IN (libc)
+#define STRCMP __strcmp_power10
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#include
+#endif /* __LITTLE_ENDIAN__ && IS_IN (libc) */
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcmp.c b/sysdeps/powerpc/powerpc64/multiarch/strcmp.c
index 31fcdee916..f1dac99b66 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strcmp.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcmp.c
@@ -29,12 +29,16 @@ extern __typeof (strcmp) __strcmp_power7 attribute_hidden;
 extern __typeof (strcmp) __strcmp_power8 attribute_hidden;
 # ifdef __LITTLE_ENDIAN__
 extern __typeof (strcmp) __strcmp_power9 attribute_hidden;
+extern __typeof (strcmp) __strcmp_power10 attribute_hidden;
 # endif
 
 # undef strcmp
 
 libc_ifunc_redirected (__redirect_strcmp, strcmp,
 # ifdef __LITTLE_ENDIAN__
+			(hwcap2 & PPC_FEATURE2_ARCH_3_1
+			 && hwcap & PPC_FEATURE_HAS_VSX)
+			? __strcmp_power10 :
 			(hwcap2 & PPC_FEATURE2_ARCH_3_00
 			 && hwcap & PPC_FEATURE_HAS_ALTIVEC)
 			? __strcmp_power9 :
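
Reviewer note (not part of the patch): the per-block logic in L(different) and the TAIL macro can be read as a scalar model. The sketch below is only an illustration in plain C under that reading. The names first_stop_index and model_strcmp are hypothetical and do not appear in the patch, and the model ignores alignment handling, page crossing, and the 32-byte lxvp path. It assumes vcmpnezb. flags every byte position where the two inputs differ or either byte is zero, and that vctzlsbb then yields the index of the first flagged byte.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16   /* bytes covered by one vector compare */

/* Scalar stand-in (hypothetical helper) for "vcmpnezb." followed by
   "vctzlsbb": return the index of the first byte where the inputs differ
   or either one is NUL, or n if no such byte exists in the block.  The
   loop stops at or before a terminator, so it never reads past the end
   of either string.  */
static size_t
first_stop_index (const unsigned char *a, const unsigned char *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i] || a[i] == 0 || b[i] == 0)
      return i;
  return n;
}

/* Block-at-a-time strcmp model: advance 16 bytes while the block is clean,
   otherwise return the difference of the two bytes at the stop index,
   which is what the vextubrx/vextubrx/subf sequence computes.  */
static int
model_strcmp (const char *s1, const char *s2)
{
  const unsigned char *a = (const unsigned char *) s1;
  const unsigned char *b = (const unsigned char *) s2;

  for (;;)
    {
      size_t i = first_stop_index (a, b, BLOCK);
      if (i < BLOCK)
        return (int) a[i] - (int) b[i];
      a += BLOCK;
      b += BLOCK;
    }
}
```

The model returns zero only when both strings reach their terminator at the same index, and otherwise returns the signed difference of the first mismatching bytes taken as unsigned char, matching the strcmp contract.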