[v2] powerpc: Add optimized strlen for POWER10

Message ID 20210422122911.27758-1-msc@linux.ibm.com
State Committed
Series [v2] powerpc: Add optimized strlen for POWER10

Commit Message

Matheus Castanho April 22, 2021, 12:29 p.m. UTC
  Improvements compared to the POWER9 version:

1. Take into account first 16B comparison for aligned strings

   The previous version compares the first 16B and increments r4 by the number
   of bytes until the address is 16B-aligned, then starts doing aligned loads at
   that address. For aligned strings, this causes the first 16B to be compared
   twice, because the increment is 0. Here we calculate the next 16B-aligned
   address differently, which avoids that issue.
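
   Roughly, in C (a hypothetical model, not code from the patch;
   next_block is an illustrative name and s plays the role of r3):

      #include <stdint.h>

      /* Hypothetical sketch of the 16B-alignment computation.  */
      static uintptr_t
      next_block (uintptr_t s)
      {
        /* Previous version: the adjustment is 0 when s is already
           16B-aligned, so the first 16B were compared twice:
             return s + ((16 - (s & 15)) & 15);  */

        /* This version (addi r5,r3,16; clrrdi r5,r5,4): always the
           next 16B boundary, so aligned strings skip the block that
           was already checked.  */
        return (s + 16) & ~(uintptr_t) 15;
      }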

2. Use simple comparisons for the first ~192 bytes

   The main loop is good for big strings, but comparing 16B at a time is
   better for smaller strings.  So after aligning the address to 16
   bytes, we check another 176B in 16B chunks.  There may be some overlap
   with the main loop for unaligned strings, but we avoid using the more
   aggressive strategy too soon, and also allow the loop to start at a
   64B-aligned address.  This greatly benefits smaller strings and avoids
   overlapping checks if the string is already aligned at a 64B boundary.
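
   A hypothetical C model of this phase (the real code is the CHECK16
   macro in the patch; check176 and its memchr-based test are
   illustrative only):

      #include <string.h>

      /* aligned is the next 16B boundary after s; the 64B-aligned loop
         start is computed up front as (s + 192) & ~63.  */
      static size_t
      check176 (const char *s, const char *aligned)
      {
        for (int off = 0; off < 176; off += 16)   /* 11 x CHECK16 */
          {
            const char *nul = memchr (aligned + off, '\0', 16);
            if (nul)                              /* vcmpequb. + bne cr6 */
              return (size_t) (nul - s);          /* like the TAIL macro */
          }
        return (size_t) -1;  /* no NUL yet: enter the 64B main loop */
      }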

3. Reduce dependencies between load blocks caused by address calculation in the loop

   Precise timing traces of the code showed that many loads in the loop
   were stalled waiting for updates to r4 from previous code blocks.
   This implementation avoids that as much as possible by using 2
   registers (r4 and r5) to hold the addresses used by different parts
   of the code.

   Also, the previous code aligned the address to 16B, then to 64B by
   running a few 48B iterations (if needed) until the address was
   aligned.  The main loop could not start until those 48B iterations
   had finished and r4 was updated with the current address.  Here we
   calculate the address used by the loop very early, so it can start
   sooner.

   The main loop now uses 2 pointers 128B apart to make pointer updates less
   frequent, and also unrolls 1 iteration to guarantee there is enough time
   between iterations to update the pointers, reducing stalled cycles.
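
   A hypothetical C model of the loop structure (memchr stands in for
   the CHECK64 macro's lxvp/vminub/vcmpequb. sequence, and
   find_nul_block is an illustrative name):

      #include <string.h>

      static const char *
      find_nul_block (const char *loop_start)
      {
        const char *p = loop_start;          /* r4 */
        const char *q = loop_start + 128;    /* r5 */
        for (;;)
          {
            /* Each pointer is advanced well before its next use.  */
            if (memchr (p, '\0', 64))        /* CHECK64(0,r4,...)  */
              return p;
            if (memchr (p + 64, '\0', 64))   /* CHECK64(64,r4,...) */
              return p + 64;
            p += 256;
            if (memchr (q, '\0', 64))        /* CHECK64(0,r5,...)  */
              return q;
            if (memchr (q + 64, '\0', 64))   /* CHECK64(64,r5,...) */
              return q + 64;
            q += 256;
          }
      }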

4. Use new P10 instructions

   lxvp is used to load 32B with a single instruction, reducing contention in
   the load queue.

   vextractbm allows simplifying the tail code for the loop, replacing
   vbpermq and avoiding having to generate a permute control vector.
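
   In the tail, the four compare results become four 16-bit masks (one
   bit per byte) that are shifted and OR-ed into a single 64-bit word;
   counting its trailing zeros gives the NUL's offset within the 64B
   block.  A hypothetical C model (nul_mask16 stands in for the
   vcmpequb + vextractbm pair):

      #include <stdint.h>

      /* Bit i is set iff byte i of the 16B chunk is zero.  */
      static uint64_t
      nul_mask16 (const unsigned char *p)
      {
        uint64_t m = 0;
        for (int i = 0; i < 16; i++)
          m |= (uint64_t) (p[i] == 0) << i;
        return m;
      }

      /* Offset of the first NUL in a 64B block known to contain one
         (mirrors the VEXTRACTBM/sldi/or/cnttzd sequence).  */
      static unsigned int
      nul_offset64 (const unsigned char *blk)
      {
        uint64_t m = nul_mask16 (blk)
                     | (nul_mask16 (blk + 16) << 16)
                     | (nul_mask16 (blk + 32) << 32)
                     | (nul_mask16 (blk + 48) << 48);
        return __builtin_ctzll (m);   /* m != 0 since a NUL was found */
      }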

Output of bench-strlen from 'make USE_CLOCK_GETTIME=1 BENCHSET="string-benchset"',
using a slightly different set of inputs than the default:

$ ./compare_strings.py --functions __strlen_power9,__strlen_power10
                       -a length,alignment -s benchout_strings.schema.json
                       -i bench-strlen.out

Function: strlen
Variant:
                                    __strlen_power10	__strlen_power9
================================================================================
               length=1, alignment=0:         2.50	        2.50 (  0.00%)
               length=1, alignment=1:         2.50	        2.50 (  0.00%)
               length=2, alignment=0:         2.50	        2.50 (  0.00%)
               length=2, alignment=2:         2.50	        2.50 (  0.00%)
               length=3, alignment=0:         2.50	        2.50 (  0.00%)
               length=3, alignment=3:         2.50	        2.50 (  0.00%)
               length=4, alignment=0:         2.50	        2.50 (  0.00%)
               length=4, alignment=4:         2.50	        2.50 (  0.00%)
               length=5, alignment=0:         2.50	        2.50 (  0.00%)
               length=5, alignment=5:         2.50	        2.50 (  0.00%)
               length=6, alignment=0:         2.50	        2.50 (  0.00%)
               length=6, alignment=6:         2.50	        2.50 (  0.00%)
               length=7, alignment=0:         2.50	        2.50 (  0.00%)
               length=7, alignment=7:         2.50	        2.50 (  0.00%)
               length=8, alignment=0:         2.50	        2.50 (  0.00%)
               length=8, alignment=8:         3.12	        3.12 (  0.00%)
               length=9, alignment=0:         2.50	        2.50 (  0.00%)
               length=9, alignment=9:         3.12	        3.12 (  0.00%)
             length=10, alignment=10:         3.12	        3.12 (  0.00%)
              length=16, alignment=0:         3.12	        3.40 ( -9.09%)
              length=16, alignment=4:         3.12	        3.12 (  0.00%)
              length=16, alignment=7:         3.12	        3.12 (  0.00%)
              length=21, alignment=0:         3.12	        3.40 ( -9.09%)
              length=21, alignment=5:         3.12	        3.12 (  0.00%)
              length=32, alignment=0:         3.12	        3.40 ( -9.09%)
              length=32, alignment=7:         3.12	        3.40 ( -9.09%)
              length=42, alignment=0:         3.12	        3.40 ( -9.09%)
              length=42, alignment=7:         3.42	        3.40 (  0.51%)
              length=48, alignment=0:         3.43	        3.74 ( -9.13%)
              length=48, alignment=7:         3.40	        3.40 (  0.17%)
              length=64, alignment=0:         3.40	        5.21 (-53.34%)
              length=64, alignment=7:         3.40	        3.74 (-10.00%)
              length=80, alignment=0:         3.74	        5.21 (-39.43%)
              length=80, alignment=7:         3.74	        4.01 ( -7.14%)
              length=85, alignment=0:         3.74	        5.21 (-39.42%)
              length=85, alignment=7:         3.74	        4.01 ( -7.14%)
              length=96, alignment=0:         3.74	        5.21 (-39.40%)
              length=96, alignment=7:         3.74	        4.01 ( -7.14%)
             length=112, alignment=0:         3.74	        5.21 (-39.39%)
             length=112, alignment=7:         3.74	        4.88 (-30.43%)
             length=128, alignment=0:         4.01	        5.91 (-47.59%)
             length=128, alignment=7:         4.01	        6.15 (-53.59%)
            length=128, alignment=16:         4.01	        6.16 (-53.78%)
            length=128, alignment=23:         4.01	        5.17 (-29.08%)
             length=160, alignment=0:         4.01	        5.92 (-47.75%)
             length=160, alignment=7:         4.01	        6.16 (-53.72%)
            length=160, alignment=16:         4.01	        6.14 (-53.29%)
            length=160, alignment=23:         4.01	        6.05 (-50.98%)
             length=192, alignment=0:         5.93	        6.84 (-15.44%)
             length=192, alignment=7:         5.93	        6.90 (-16.35%)
             length=256, alignment=0:         6.61	        7.73 (-17.02%)
             length=256, alignment=7:         6.61	        7.85 (-18.79%)
             length=320, alignment=0:         7.26	        8.65 (-19.12%)
             length=320, alignment=7:         7.26	        8.76 (-20.70%)
             length=384, alignment=0:         7.95	        9.62 (-20.98%)
             length=384, alignment=7:         7.95	        9.49 (-19.37%)
             length=448, alignment=0:         8.73	       10.39 (-19.06%)
             length=448, alignment=7:         8.73	       10.51 (-20.40%)
             length=512, alignment=0:         9.44	       11.13 (-17.87%)
             length=512, alignment=7:         9.45	       11.32 (-19.85%)
             length=576, alignment=0:        10.10	       11.93 (-18.05%)
             length=576, alignment=7:        10.10	       12.02 (-18.97%)
             length=640, alignment=0:        10.71	       12.73 (-18.86%)
             length=640, alignment=7:        10.67	       12.89 (-20.76%)
             length=704, alignment=0:        11.59	       13.39 (-15.61%)
             length=704, alignment=7:        11.59	       13.61 (-17.45%)
             length=768, alignment=0:        12.27	       14.22 (-15.90%)
             length=768, alignment=7:        12.27	       14.44 (-17.72%)
             length=896, alignment=0:        13.48	       15.70 (-16.47%)
             length=896, alignment=7:        13.47	       15.97 (-18.56%)
             length=960, alignment=0:        14.22	       16.63 (-16.92%)
             length=960, alignment=7:        14.19	       16.70 (-17.66%)
            length=1024, alignment=0:        14.85	       17.46 (-17.54%)
            length=1024, alignment=7:        14.87	       17.68 (-18.94%)
            length=1280, alignment=0:        17.58	       20.91 (-18.94%)
            length=1280, alignment=7:        17.62	       21.35 (-21.13%)
            length=1536, alignment=0:        20.61	       24.54 (-19.07%)
            length=1536, alignment=7:        20.61	       24.21 (-17.48%)
            length=1792, alignment=0:        23.02	       27.94 (-21.39%)
            length=1792, alignment=7:        23.02	       27.83 (-20.90%)
            length=2048, alignment=0:        25.98	       30.71 (-18.23%)
            length=2048, alignment=7:        25.96	       31.26 (-20.45%)
            length=2560, alignment=0:        31.37	       37.82 (-20.57%)
            length=2560, alignment=7:        31.34	       37.69 (-20.26%)
            length=3008, alignment=0:        35.61	       43.29 (-21.56%)
            length=3008, alignment=7:        35.55	       43.84 (-23.31%)
            length=3520, alignment=0:        41.08	       50.48 (-22.90%)
            length=3520, alignment=7:        41.12	       50.63 (-23.13%)
            length=4096, alignment=0:        47.80	       57.96 (-21.25%)
            length=4096, alignment=7:        47.79	       57.66 (-20.66%)

Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>

---
Changes from v1:
  - Added comment about minimum binutils version needed to remove the instruction macros
  - s/reg/vreg/ on CHECK16 for clarity
  
---
 sysdeps/powerpc/powerpc64/le/power10/strlen.S | 221 ++++++++++++++++++
 sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
 .../powerpc64/multiarch/ifunc-impl-list.c     |   2 +
 .../powerpc64/multiarch/strlen-power10.S      |   2 +
 sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   3 +
 5 files changed, 230 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strlen.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
  

Comments

Raphael M Zinsly April 22, 2021, 3:22 p.m. UTC | #1
Hi Matheus, the patch LGTM with some trivial changes.

On 22/04/2021 09:29, Matheus Castanho via Libc-alpha wrote:
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/strlen.S b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> new file mode 100644
> index 0000000000..7eb37a8f54
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> @@ -0,0 +1,221 @@
> [...]
> +# define TAIL(vreg,increment)	   \

nit: the space before define is not needed.

> [...]
> +	/* Test more 112B, 16B at a time.  The main loop is optimized for longer

s/112B/176B/

> +	   strings, so checking the first bytes in 16B chunks benefits a lot
> +	   small strings.  */
> +	.p2align 5
> +L(aligned):
  
Lucas A. M. Magalhaes April 22, 2021, 6:20 p.m. UTC | #2
Hi Matheus, LGTM. Reviewed and all tests pass.

Thanks for working on this.

Quoting Matheus Castanho via Libc-alpha (2021-04-22 09:29:11)
> [...]
  
Matheus Castanho April 22, 2021, 7:50 p.m. UTC | #3
Pushed as 10624a97e8e47004985740cbb04060a84cfada76, with the fixes
suggested by Raphael.

Thanks!
--
Matheus Castanho
  

Patch

diff --git a/sysdeps/powerpc/powerpc64/le/power10/strlen.S b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
new file mode 100644
index 0000000000..7eb37a8f54
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
@@ -0,0 +1,221 @@ 
+/* Optimized strlen implementation for POWER10 LE.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#ifndef STRLEN
+# define STRLEN __strlen
+# define DEFINE_STRLEN_HIDDEN_DEF 1
+#endif
+
+/* TODO: Replace macros by the actual instructions when minimum binutils becomes
+   >= 2.35.  This is used to keep compatibility with older versions.  */
+#define VEXTRACTBM(rt,vrb)	 \
+	.long(((4)<<(32-6))	 \
+	      | ((rt)<<(32-11))	 \
+	      | ((8)<<(32-16))	 \
+	      | ((vrb)<<(32-21)) \
+	      | 1602)
+
+#define LXVP(xtp,dq,ra)		   \
+	.long(((6)<<(32-6))		   \
+	      | ((((xtp)-32)>>1)<<(32-10)) \
+	      | ((1)<<(32-11))		   \
+	      | ((ra)<<(32-16))		   \
+	      | dq)
+
+#define CHECK16(vreg,offset,addr,label) \
+	lxv	  vreg+32,offset(addr);	\
+	vcmpequb. vreg,vreg,v18;	\
+	bne	  cr6,L(label);
+
+/* Load 4 quadwords, merge into one VR for speed and check for NULLs.  r6 has #
+   of bytes already checked.  */
+#define CHECK64(offset,addr,label)	    \
+	li	  r6,offset;		    \
+	LXVP(v4+32,offset,addr);	    \
+	LXVP(v6+32,offset+32,addr);	    \
+	vminub	  v14,v4,v5;		    \
+	vminub	  v15,v6,v7;		    \
+	vminub	  v16,v14,v15;		    \
+	vcmpequb. v0,v16,v18;		    \
+	bne	  cr6,L(label)
+
+#define TAIL(vreg,increment)	   \
+	vctzlsbb  r4,vreg;	   \
+	subf	  r3,r3,r5;	   \
+	addi	  r4,r4,increment; \
+	add	  r3,r3,r4;	   \
+	blr
+
+/* Implements the function
+
+   int [r3] strlen (const void *s [r3])
+
+   The implementation can load bytes past a matching byte, but only
+   up to the next 64B boundary, so it never crosses a page.  */
+
+.machine power9
+
+ENTRY_TOCLESS (STRLEN, 4)
+	CALL_MCOUNT 1
+
+	vspltisb  v18,0
+	vspltisb  v19,-1
+
+	/* Next 16B-aligned address. Prepare address for L(aligned).  */
+	addi	  r5,r3,16
+	clrrdi	  r5,r5,4
+
+	/* Align data and fill bytes not loaded with non matching char.	 */
+	lvx	  v0,0,r3
+	lvsr	  v1,0,r3
+	vperm	  v0,v19,v0,v1
+
+	vcmpequb. v6,v0,v18
+	beq	  cr6,L(aligned)
+
+	vctzlsbb  r3,v6
+	blr
+
+	/* Test the next 176B, 16B at a time.  The main loop is optimized for
+	   longer strings, so checking the first bytes in 16B chunks benefits
+	   small strings a lot.  */
+	.p2align 5
+L(aligned):
+	/* Prepare address for the loop.  */
+	addi	  r4,r3,192
+	clrrdi	  r4,r4,6
+
+	CHECK16(v0,0,r5,tail1)
+	CHECK16(v1,16,r5,tail2)
+	CHECK16(v2,32,r5,tail3)
+	CHECK16(v3,48,r5,tail4)
+	CHECK16(v4,64,r5,tail5)
+	CHECK16(v5,80,r5,tail6)
+	CHECK16(v6,96,r5,tail7)
+	CHECK16(v7,112,r5,tail8)
+	CHECK16(v8,128,r5,tail9)
+	CHECK16(v9,144,r5,tail10)
+	CHECK16(v10,160,r5,tail11)
+
+	addi	  r5,r4,128
+
+	/* Switch to a more aggressive approach checking 64B each time.  Use 2
+	   pointers 128B apart and unroll the loop once to make the pointer
+	   updates and usages separated enough to avoid stalls waiting for
+	   address calculation.  */
+	.p2align 5
+L(loop):
+	CHECK64(0,r4,pre_tail_64b)
+	CHECK64(64,r4,pre_tail_64b)
+	addi	  r4,r4,256
+
+	CHECK64(0,r5,tail_64b)
+	CHECK64(64,r5,tail_64b)
+	addi	  r5,r5,256
+
+	b	  L(loop)
+
+	.p2align  5
+L(pre_tail_64b):
+	mr	r5,r4
+L(tail_64b):
+	/* OK, we found a null byte.  Let's look for it in the current 64-byte
+	   block and mark it in its corresponding VR.  lxvp vx,0(ry) puts the
+	   low 16B bytes into vx+1, and the high into vx, so the order here is
+	   v5, v4, v7, v6.  */
+	vcmpequb  v1,v5,v18
+	vcmpequb  v2,v4,v18
+	vcmpequb  v3,v7,v18
+	vcmpequb  v4,v6,v18
+
+	/* Take into account the other 64B blocks we had already checked.  */
+	add	r5,r5,r6
+
+	/* Extract first bit of each byte.  */
+	VEXTRACTBM(r7,v1)
+	VEXTRACTBM(r8,v2)
+	VEXTRACTBM(r9,v3)
+	VEXTRACTBM(r10,v4)
+
+	/* Shift each value into their corresponding position.  */
+	sldi	  r8,r8,16
+	sldi	  r9,r9,32
+	sldi	  r10,r10,48
+
+	/* Merge the results.  */
+	or	  r7,r7,r8
+	or	  r8,r9,r10
+	or	  r10,r8,r7
+
+	cnttzd	  r0,r10	  /* Count trailing zeros before the match.  */
+	subf	  r5,r3,r5
+	add	  r3,r5,r0	  /* Compute final length.  */
+	blr
+
+	.p2align  5
+L(tail1):
+	TAIL(v0,0)
+
+	.p2align  5
+L(tail2):
+	TAIL(v1,16)
+
+	.p2align  5
+L(tail3):
+	TAIL(v2,32)
+
+	.p2align  5
+L(tail4):
+	TAIL(v3,48)
+
+	.p2align  5
+L(tail5):
+	TAIL(v4,64)
+
+	.p2align  5
+L(tail6):
+	TAIL(v5,80)
+
+	.p2align  5
+L(tail7):
+	TAIL(v6,96)
+
+	.p2align  5
+L(tail8):
+	TAIL(v7,112)
+
+	.p2align  5
+L(tail9):
+	TAIL(v8,128)
+
+	.p2align  5
+L(tail10):
+	TAIL(v9,144)
+
+	.p2align  5
+L(tail11):
+	TAIL(v10,160)
+
+END (STRLEN)
+
+#ifdef DEFINE_STRLEN_HIDDEN_DEF
+weak_alias (__strlen, strlen)
+libc_hidden_builtin_def (strlen)
+#endif
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index f46bf50732..8aa46a3702 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -33,7 +33,8 @@  sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
 
 ifneq (,$(filter %le,$(config-machine)))
 sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
-		   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9
+		   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
+		   strlen-power10
 endif
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
 CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 72f7f83e7e..1a6993616f 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -112,6 +112,8 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c.  */
   IFUNC_IMPL (i, name, strlen,
 #ifdef __LITTLE_ENDIAN__
+	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_1,
+			      __strlen_power10)
 	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
 			      __strlen_power9)
 #endif
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
new file mode 100644
index 0000000000..6a774fad58
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
@@ -0,0 +1,2 @@ 
+#define STRLEN __strlen_power10
+#include <sysdeps/powerpc/powerpc64/le/power10/strlen.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
index c3bbc78df8..109c8a90bd 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
@@ -31,9 +31,12 @@  extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_power10 attribute_hidden;
 
 libc_ifunc (__libc_strlen,
 # ifdef __LITTLE_ENDIAN__
+	(hwcap2 & PPC_FEATURE2_ARCH_3_1)
+	? __strlen_power10 :
 	  (hwcap2 & PPC_FEATURE2_ARCH_3_00)
 	  ? __strlen_power9 :
 # endif