From patchwork Thu May 21 19:10:48 2020
From: "Paul E. Murphy"
To: libc-alpha@sourceware.org, anton@ozlabs.org
Subject: [PATCH] powerpc64le: add optimized strlen for P9
Date: Thu, 21 May 2020 14:10:48 -0500
Message-Id: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>
This is a followup to the rawmemchr/strlen work from Anton.  I missed
his original strlen patch, and likewise I wasn't happy with its 3-4%
performance drop for larger strings, which occurs around 2.5kB, where
the P8 vector loop is a bit faster.

As noted, this is up to 50% faster for small strings, and about 1%
faster for larger strings (I hazard to guess this is some uarch
difference between lxv and lvx).

I guess this is a semi-V2 of the patch.  I still need to double-check
that binutils 2.26 supports the P9 instructions used here.

---8<---

This started as a trivial change to Anton's rawmemchr.  I got
carried away.

This is a hybrid of P8's asymptotically faster 64B checks and
extremely efficient small-string checks, e.g. <64B (and sometimes a
little more depending on alignment).

The second trick is to align to 64B by running a 48B checking loop,
16B at a time, until we naturally align to 64B (i.e. checking
48/96/144 bytes per iteration depending on the alignment after the
first 5 comparisons).  This alleviates the need to check page
boundaries.

Finally, explicitly use the P7 strlen with the runtime loader when
building for P9.  We need to be cautious about vector/VSX extensions
there in P9-only builds.
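To make the page-boundary argument concrete, here is a rough C model
of the strategy.  It is illustrative only: check16 and strlen_sketch
are invented names, and the scheduling differs slightly from the asm
(the asm always probes a full 64B before deciding whether to run the
48B loop).

#include <stddef.h>
#include <stdint.h>

/* check16 stands in for a 16B vector load plus vcmpequb./vctzlsbb:
   scan one naturally aligned 16B block, returning the offset of the
   first NUL byte, or 16 if there is none.  An aligned 16B load can
   never cross a page boundary, which is why no page checks are
   needed.  */
static int
check16 (const unsigned char *p16)
{
  for (int i = 0; i < 16; i++)
    if (p16[i] == '\0')
      return i;
  return 16;
}

size_t
strlen_sketch (const char *s)
{
  /* Small-string path: align down to 16B.  The asm masks off the
     bytes before s with lvsr/vperm; here we simply skip them.  */
  const unsigned char *p
    = (const unsigned char *) ((uintptr_t) s & ~(uintptr_t) 15);
  for (int i = (int) (s - (const char *) p); i < 16; i++)
    if (p[i] == '\0')
      return (size_t) ((const char *) &p[i] - s);
  p += 16;

  /* Keep checking 16B blocks until p reaches a 64B boundary.  Every
     access stays 16B aligned, so nothing can fault past the page
     holding the string.  */
  do
    {
      int off = check16 (p);
      if (off < 16)
        return (size_t) ((const char *) p + off - s);
      p += 16;
    }
  while (((uintptr_t) p & 63) != 0);

  /* 64B main loop; the asm folds the four 16B compares into one
     vminub reduction per iteration.  */
  for (;;)
    {
      for (int i = 0; i < 4; i++)
        {
          int off = check16 (p + 16 * i);
          if (off < 16)
            return (size_t) ((const char *) p + 16 * i + off - s);
        }
      p += 64;
    }
}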
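Similarly, the L(vmx_zero) reduction in the patch may be easier to
follow in C: build a 64-bit mask with bit i set when byte i of the
64B block is NUL, then take the trailing-zero count of the mask as
the offset of the first NUL.  A sketch, assuming the vbpermq permute
maps byte 0 of the block to bit 0 of the doubleword (first_nul_in_64
is an invented name; __builtin_popcountll is the GCC/Clang builtin):

#include <stddef.h>
#include <stdint.h>

/* Illustrative model of L(vmx_zero); the asm builds MASK with
   vcmpequb/vbpermq/vsldoi/vor rather than a byte loop.  The caller
   only reaches this path after finding a NUL, so mask is nonzero.  */
static size_t
first_nul_in_64 (const unsigned char *block)
{
  uint64_t mask = 0;
  for (int i = 0; i < 64; i++)
    if (block[i] == '\0')
      mask |= (uint64_t) 1 << i;

  /* Trailing-zero count without a dedicated instruction, as in the
     asm: (mask - 1) & ~mask keeps exactly the bits below the lowest
     set bit, and the popcount of those bits is that bit's index.
     E.g. a first NUL at offset 5 gives below = 0b11111, popcount 5.  */
  uint64_t below = (mask - 1) & ~mask;
  return (size_t) __builtin_popcountll (below);
}

The final length is then (block start - s) plus that offset, which is
what the subf/add pair computes after backing r4 up by 64.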
---
 .../powerpc/powerpc64/le/power9/rtld-strlen.S |   1 +
 sysdeps/powerpc/powerpc64/le/power9/strlen.S  | 215 ++++++++++++++++++
 sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
 .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
 .../powerpc64/multiarch/strlen-power9.S       |   2 +
 sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   5 +
 6 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
 create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S

diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
new file mode 100644
index 0000000000..e9d83323ac
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
@@ -0,0 +1 @@
+#include <sysdeps/powerpc/powerpc64/power7/strlen.S>
diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
new file mode 100644
index 0000000000..084d6e31a8
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
@@ -0,0 +1,215 @@
+/* Optimized strlen implementation for PowerPC64/POWER9.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#ifndef STRLEN
+# define STRLEN __strlen
+# define DEFINE_STRLEN_HIDDEN_DEF 1
+#endif
+
+/* Implements the function
+
+   int [r3] strlen (void *s [r3])
+
+   The implementation can load bytes past a matching byte, but only
+   up to the next 16B or 64B boundary, so it never crosses a page.  */
+
+.machine power9
+ENTRY_TOCLESS (STRLEN, 4)
+        CALL_MCOUNT 2
+
+        mr r4,r3
+        vspltisb v18,0
+        vspltisb v19,-1
+
+        neg r5,r3
+        rldicl r9,r5,0,60       /* How many bytes to get source 16B aligned?  */
+
+        /* Align data and fill bytes not loaded with a non-matching char.  */
+        lvx v0,0,r4
+        lvsr v1,0,r4
+        vperm v0,v19,v0,v1
+
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        beq cr6,L(aligned)
+
+        vctzlsbb r3,v6
+        blr
+
+        /* Test the next 64B, 16B at a time.  The vector loop is costly
+           for small strings.  */
+L(aligned):
+        add r4,r4,r9
+
+        rldicl. r5,r4,60,62     /* Determine how many 48B loops we should run.  */
+
+        lxv v0+32,0(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail1)
+
+        lxv v0+32,16(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail2)
+
+        lxv v0+32,32(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail3)
+
+        lxv v0+32,48(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail4)
+        addi r4,r4,64
+
+        /* Zero r0; lvsl uses it below to generate the vbpermq constant
+           for the L(vmx_zero) reduction.  */
+        li r0,0
+
+        /* Skip the alignment loop if it is not needed.  */
+        beq L(loop_64b)
+        mtctr r5
+
+        /* Test 48B per iteration until 64B aligned.  */
+        .p2align 5
+L(loop):
+        lxv v0+32,0(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail1)
+
+        lxv v0+32,16(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail2)
+
+        lxv v0+32,32(r4)
+        vcmpequb. v6,v0,v18     /* 0xff if byte matches, 0x00 otherwise  */
+        bne cr6,L(tail3)
+
+        addi r4,r4,48
+        bdnz L(loop)
+
+        .p2align 5
+L(loop_64b):
+        lxv v1+32,0(r4)         /* Load 4 quadwords.  */
+        lxv v2+32,16(r4)
+        lxv v3+32,32(r4)
+        lxv v4+32,48(r4)
+        vminub v5,v1,v2         /* Compare and merge into one VR for speed.  */
+        vminub v6,v3,v4
+        vminub v7,v5,v6
+        vcmpequb. v7,v7,v18     /* Check for NULLs.  */
+        addi r4,r4,64           /* Adjust address for the next iteration.  */
+        bne cr6,L(vmx_zero)
+
+        lxv v1+32,0(r4)         /* Load 4 quadwords.  */
+        lxv v2+32,16(r4)
+        lxv v3+32,32(r4)
+        lxv v4+32,48(r4)
+        vminub v5,v1,v2         /* Compare and merge into one VR for speed.  */
+        vminub v6,v3,v4
+        vminub v7,v5,v6
+        vcmpequb. v7,v7,v18     /* Check for NULLs.  */
+        addi r4,r4,64           /* Adjust address for the next iteration.  */
+        bne cr6,L(vmx_zero)
+
+        lxv v1+32,0(r4)         /* Load 4 quadwords.  */
+        lxv v2+32,16(r4)
+        lxv v3+32,32(r4)
+        lxv v4+32,48(r4)
+        vminub v5,v1,v2         /* Compare and merge into one VR for speed.  */
+        vminub v6,v3,v4
+        vminub v7,v5,v6
+        vcmpequb. v7,v7,v18     /* Check for NULLs.  */
+        addi r4,r4,64           /* Adjust address for the next iteration.  */
+        beq cr6,L(loop_64b)
+
+L(vmx_zero):
+        /* OK, we found a null byte.  Let's look for it in the current
+           64-byte block and mark it in its corresponding VR.  */
+        vcmpequb v1,v1,v18
+        vcmpequb v2,v2,v18
+        vcmpequb v3,v3,v18
+        vcmpequb v4,v4,v18
+
+        /* We will now 'compress' the result into a single doubleword,
+           so it can be moved to a GPR for the final calculation.
+           First, we generate an appropriate mask for vbpermq, so we
+           can permute bits into the first halfword.  */
+        vspltisb v10,3
+        lvsl v11,r0,r0
+        vslb v10,v11,v10
+
+        /* Permute the first bit of each byte into bits 48-63.  */
+        vbpermq v1,v1,v10
+        vbpermq v2,v2,v10
+        vbpermq v3,v3,v10
+        vbpermq v4,v4,v10
+
+        /* Shift each component into its correct position for merging.  */
+        vsldoi v2,v2,v2,2
+        vsldoi v3,v3,v3,4
+        vsldoi v4,v4,v4,6
+
+        /* Merge the results and move to a GPR.  */
+        vor v1,v2,v1
+        vor v2,v3,v4
+        vor v4,v1,v2
+        mfvrd r10,v4
+
+        /* Adjust address to the beginning of the current 64-byte block.  */
+        addi r4,r4,-64
+
+        addi r9,r10,-1          /* Form a mask from trailing zeros.  */
+        andc r9,r9,r10
+        popcntd r0,r9           /* Count the bits in the mask.  */
+        subf r5,r3,r4
+        add r3,r5,r0            /* Compute final length.  */
+        blr
+
+L(tail1):
+        vctzlsbb r0,v6
+        add r4,r4,r0
+        subf r3,r3,r4
+        blr
+
+L(tail2):
+        vctzlsbb r0,v6
+        add r4,r4,r0
+        addi r4,r4,16
+        subf r3,r3,r4
+        blr
+
+L(tail3):
+        vctzlsbb r0,v6
+        add r4,r4,r0
+        addi r4,r4,32
+        subf r3,r3,r4
+        blr
+
+L(tail4):
+        vctzlsbb r0,v6
+        add r4,r4,r0
+        addi r4,r4,48
+        subf r3,r3,r4
+        blr
+
+END (STRLEN)
+
+#ifdef DEFINE_STRLEN_HIDDEN_DEF
+weak_alias (__strlen, strlen)
+libc_hidden_builtin_def (strlen)
+#endif
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index fc2268f6b5..19acb6c64a 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -33,7 +33,7 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
 
 ifneq (,$(filter %le,$(config-machine)))
 sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
-		   rawmemchr-power9
+		   rawmemchr-power9 strlen-power9
 endif
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
 CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 59a227ee22..ea10b00417 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -111,6 +111,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c.  */
   IFUNC_IMPL (i, name, strlen,
+#ifdef __LITTLE_ENDIAN__
+	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
+			      __strlen_power9)
+#endif
 	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_2_07,
 			      __strlen_power8)
 	      IFUNC_IMPL_ADD (array, i, strlen, hwcap & PPC_FEATURE_HAS_VSX,
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
new file mode 100644
index 0000000000..68c8d54b5f
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
@@ -0,0 +1,2 @@
+#define STRLEN __strlen_power9
+#include <sysdeps/powerpc/powerpc64/le/power9/strlen.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
index e587554221..cd9dc78a7c 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
@@ -30,8 +30,13 @@ extern __typeof (__redirect_strlen) __libc_strlen;
 extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
 
 libc_ifunc (__libc_strlen,
+# ifdef __LITTLE_ENDIAN__
+	    (hwcap2 & PPC_FEATURE2_ARCH_3_00)
+	    ? __strlen_power9 :
+# endif
 	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
 	    ? __strlen_power8 :
 	      (hwcap & PPC_FEATURE_HAS_VSX)