From patchwork Tue Dec 5 16:46:09 2023
X-Patchwork-Submitter: Amrita H S
X-Patchwork-Id: 81436
From: Amrita H S
To: libc-alpha@sourceware.org
Cc: Amrita H S
Subject: [PATCH V5] powerpc: Optimized strcmp for power10
Date: Tue, 5 Dec 2023 11:46:09 -0500
Message-ID: <20231205164609.2238965-1-amritahs@linux.vnet.ibm.com>

This patch is based on __strcmp_power9 and __strlen_power10.

Improvements over __strcmp_power9:

1. Uses new POWER10 instructions

   - This code uses lxvp to decrease contention on load
     by loading 32 bytes per instruction.

2. Performance implication

   - This version has around 30% better performance on average.
   - Performance regressions are seen for specific combinations
     of sizes and alignments. Some of them are also observed
     without these changes, while the rest may be induced by the
     patch.

Signed-off-by: Amrita H S
---
 sysdeps/powerpc/powerpc64/le/power10/strcmp.S | 205 ++++++++++++++++++
 sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
 .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
 .../powerpc64/multiarch/strcmp-power10.S      |  26 +++
 sysdeps/powerpc/powerpc64/multiarch/strcmp.c  |   4 +
 5 files changed, 241 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strcmp.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S

diff --git a/sysdeps/powerpc/powerpc64/le/power10/strcmp.S b/sysdeps/powerpc/powerpc64/le/power10/strcmp.S
new file mode 100644
index 0000000000..5e21bad580
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power10/strcmp.S
@@ -0,0 +1,205 @@
+/* Optimized strcmp implementation for PowerPC64/POWER10.
+   Copyright (C) 2021-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+#include
+
+#ifndef STRCMP
+# define STRCMP strcmp
+#endif
+
+/* Implements the function
+   int [r3] strcmp (const char *s1 [r3], const char *s2 [r4]).  */
+
+/* TODO: Change this to actual instructions when minimum binutils is upgraded
+   to 2.27.  Macros are defined below for these newer instructions in order
+   to maintain compatibility.  */
+
+#define LXVP(xtp,dq,ra) \
+	.long(((6)<<(32-6)) \
+	| ((((xtp)-32)>>1)<<(32-10)) \
+	| ((1)<<(32-11)) \
+	| ((ra)<<(32-16)) \
+	| dq)
+
+#define COMPARE_16(vreg1,vreg2,offset) \
+	lxv	vreg1+32,offset(r3); \
+	lxv	vreg2+32,offset(r4); \
+	vcmpnezb.	v7,vreg1,vreg2; \
+	bne	cr6,L(different); \
+
+#define COMPARE_32(vreg1,vreg2,offset,label1,label2) \
+	LXVP(vreg1+32,offset,r3); \
+	LXVP(vreg2+32,offset,r4); \
+	vcmpnezb.	v7,vreg1+1,vreg2+1; \
+	bne	cr6,L(label1); \
+	vcmpnezb.	v7,vreg1,vreg2; \
+	bne	cr6,L(label2); \
+
+#define TAIL(vreg1,vreg2) \
+	vctzlsbb	r6,v7; \
+	vextubrx	r5,r6,vreg1; \
+	vextubrx	r4,r6,vreg2; \
+	subf	r3,r4,r5; \
+	blr; \
+
+#define CHECK_N_BYTES(reg1,reg2,len_reg) \
+	mr	r0,len_reg; \
+	sldi	r0,r0,56; \
+	lxvl	32+v4,reg1,r0; \
+	lxvl	32+v5,reg2,r0; \
+	add	reg1,reg1,len_reg; \
+	add	reg2,reg2,len_reg; \
+	vcmpnezb.	v7,v4,v5; \
+	vctzlsbb	r6,v7; \
+	cmpld	cr7,r6,len_reg; \
+	blt	cr7,L(different); \
+
+	/* TODO: change this to .machine power10 when the minimum required
+	binutils allows it.  */
+
+	.machine  power9
+ENTRY_TOCLESS (STRCMP, 4)
+	li	r11,16
+	/* eq bit of cr1 used as swap status flag to indicate if
+	source pointers were swapped.  */
+	crclr	4*cr1+eq
+	vspltisb	v19,-1
+	andi.	r7,r3,15
+	sub	r7,r11,r7	/* r7(nalign1) = 16 - (str1 & 15).  */
+	andi.	r9,r4,15
+	sub	r5,r11,r9	/* r5(nalign2) = 16 - (str2 & 15).  */
+	cmpld	cr7,r7,r5
+	beq	cr7,L(same_aligned)
+	blt	cr7,L(nalign1_min)
+	/* Swap r3 and r4, and r7 and r5 such that r3 and r7 hold the
+	pointer which is closer to the next 16B boundary so that only
+	one CHECK_N_BYTES is needed before entering the loop below.  */
+	mr	r8,r4
+	mr	r4,r3
+	mr	r3,r8
+	mr	r12,r7
+	mr	r7,r5
+	mr	r5,r12
+	crset	4*cr1+eq	/* Set bit on swapping source pointers.  */
+
+	.p2align 5
+L(nalign1_min):
+	CHECK_N_BYTES(r3,r4,r7)
+
+	.p2align 5
+L(s1_aligned):
+	/* r9 and r5 is number of bytes to be read after and before
+	 page boundary correspondingly.  */
+	sub	r5,r5,r7
+	subfic	r9,r5,16
+	/* Now let r7 hold the count of quadwords which can be
+	checked without crossing a page boundary. quadword offset is
+	(str2>>4)&0xFF.  */
+	rlwinm	r7,r4,28,0xFF
+	/* Below check is required only for first iteration. For second
+	iteration and beyond, the new loop counter is always 255.  */
+	cmpldi	r7,255
+	beq	L(L3)
+	/* Get the initial loop count by 255-((str2>>4)&0xFF).  */
+	subfic	r11,r7,255
+
+	.p2align 5
+L(L1):
+	mtctr	r11
+
+	.p2align 5
+L(L2):
+	COMPARE_16(v4,v5,0)	/* Load 16B blocks using lxv.  */
+	addi	r3,r3,16
+	addi	r4,r4,16
+	bdnz	L(L2)
+	/* Cross the page boundary of s2, carefully.  */
+
+	.p2align 5
+L(L3):
+	CHECK_N_BYTES(r3,r4,r5)
+	CHECK_N_BYTES(r3,r4,r9)
+	li	r11,255	/* Load the new loop counter.  */
+	b	L(L1)
+
+	.p2align 5
+L(same_aligned):
+	CHECK_N_BYTES(r3,r4,r7)
+	/* Align s1 to 32B and adjust s2 address.
+	Use lxvp only if both s1 and s2 are 32B aligned.  */
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	COMPARE_16(v4,v5,48)
+	addi	r3,r3,64
+	addi	r4,r4,64
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+
+	clrldi	r6,r3,59
+	subfic	r5,r6,32
+	add	r3,r3,r5
+	add	r4,r4,r5
+	andi.	r5,r4,0x1F
+	beq	cr0,L(32B_aligned_loop)
+
+	.p2align 5
+L(16B_aligned_loop):
+	COMPARE_16(v4,v5,0)
+	COMPARE_16(v4,v5,16)
+	COMPARE_16(v4,v5,32)
+	COMPARE_16(v4,v5,48)
+	addi	r3,r3,64
+	addi	r4,r4,64
+	b	L(16B_aligned_loop)
+
+	/* Calculate and return the difference.  */
+L(different):
+	vctzlsbb	r6,v7
+	vextubrx	r5,r6,v4
+	vextubrx	r4,r6,v5
+	bt	4*cr1+eq,L(swapped)
+	subf	r3,r4,r5
+	blr
+
+	/* If src pointers were swapped, then swap the
+	indices and calculate the return value.  */
+L(swapped):
+	subf	r3,r5,r4
+	blr
+
+	.p2align 5
+L(32B_aligned_loop):
+	COMPARE_32(v14,v16,0,tail1,tail2)
+	COMPARE_32(v18,v20,32,tail3,tail4)
+	COMPARE_32(v22,v24,64,tail5,tail6)
+	COMPARE_32(v26,v28,96,tail7,tail8)
+	addi	r3,r3,128
+	addi	r4,r4,128
+	b	L(32B_aligned_loop)
+
+L(tail1): TAIL(v15,v17)
+L(tail2): TAIL(v14,v16)
+L(tail3): TAIL(v19,v21)
+L(tail4): TAIL(v18,v20)
+L(tail5): TAIL(v23,v25)
+L(tail6): TAIL(v22,v24)
+L(tail7): TAIL(v27,v29)
+L(tail8): TAIL(v26,v28)
+
+END (STRCMP)
+libc_hidden_builtin_def (strcmp)
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index 27d8495503..d7824a922b 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -33,7 +33,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
 ifneq (,$(filter %le,$(config-machine)))
 sysdep_routines += memcmp-power10 memcpy-power10 memmove-power10 memset-power10 \
 		   rawmemchr-power9 rawmemchr-power10 \
-		   strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
+		   strcmp-power9 strcmp-power10 strncmp-power9 \
+		   strcpy-power9 stpcpy-power9 \
 		   strlen-power9 strncpy-power9 stpncpy-power9 strlen-power10
 endif
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index ebe9434052..ca1f57e1e2 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -376,6 +376,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   /* Support sysdeps/powerpc/powerpc64/multiarch/strcmp.c.  */
   IFUNC_IMPL (i, name, strcmp,
 #ifdef __LITTLE_ENDIAN__
+	      IFUNC_IMPL_ADD (array, i, strcmp,
+			      (hwcap2 & PPC_FEATURE2_ARCH_3_1)
+			      && (hwcap & PPC_FEATURE_HAS_VSX),
+			      __strcmp_power10)
 	      IFUNC_IMPL_ADD (array, i, strcmp,
 			      hwcap2 & PPC_FEATURE2_ARCH_3_00
 			      && hwcap & PPC_FEATURE_HAS_ALTIVEC,
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S
new file mode 100644
index 0000000000..c80067ce33
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcmp-power10.S
@@ -0,0 +1,26 @@
+/* Optimized strcmp implementation for POWER10/PPC64.
+   Copyright (C) 2021-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+#if defined __LITTLE_ENDIAN__ && IS_IN (libc)
+#define STRCMP __strcmp_power10
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#include
+#endif /* __LITTLE_ENDIAN__ && IS_IN (libc) */
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcmp.c b/sysdeps/powerpc/powerpc64/multiarch/strcmp.c
index 31fcdee916..f1dac99b66 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strcmp.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcmp.c
@@ -29,12 +29,16 @@ extern __typeof (strcmp) __strcmp_power7 attribute_hidden;
 extern __typeof (strcmp) __strcmp_power8 attribute_hidden;
 # ifdef __LITTLE_ENDIAN__
 extern __typeof (strcmp) __strcmp_power9 attribute_hidden;
+extern __typeof (strcmp) __strcmp_power10 attribute_hidden;
 # endif
 
 # undef strcmp
 
 libc_ifunc_redirected (__redirect_strcmp, strcmp,
 # ifdef __LITTLE_ENDIAN__
+			(hwcap2 & PPC_FEATURE2_ARCH_3_1
+			 && hwcap & PPC_FEATURE_HAS_VSX)
+			? __strcmp_power10 :
 			(hwcap2 & PPC_FEATURE2_ARCH_3_00
 			 && hwcap & PPC_FEATURE_HAS_ALTIVEC)
 			? __strcmp_power9 :