From patchwork Wed Apr 21 21:39:52 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 43063
To: libc-alpha@sourceware.org
Subject: [PATCH v1 1/2] x86: Optimize strchr-avx2.S
Date: Wed, 21 Apr 2021 17:39:52 -0400
Message-Id: <20210421213951.404588-1-goldstein.w.n@gmail.com>
From: Noah Goldstein
Reply-To: Noah Goldstein

No bug. This commit optimizes strchr-avx2.S.
The optimizations are all small things, such as saving an ALU in the alignment process, saving a few instructions in the loop return, saving some bytes in the main loop, and increasing the ILP in the return cases.

test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing.

Signed-off-by: Noah Goldstein
---
Tests were run on the following CPUs:

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

All times are the geometric mean of N=20. The unit of time is seconds.

"Cur" refers to the current implementation.
"New" refers to this patch's implementation.

For strchr-evex the numbers are a near universal improvement. The only exception seems to be the [32, 64] case, which is marginally slower on Tigerlake and about even on Icelake (less than the gain in the [0, 31] case). Overall, though, I think the numbers show a sizable improvement, particularly once the 4x loop is hit.

Results For Tigerlake strchr-evex
size, algn, Cur T , New T , Win , Dif
32  , 0   , 4.89  , 5.23  , Cur , 0.34
32  , 1   , 4.67  , 5.09  , Cur , 0.42
64  , 0   , 5.59  , 5.46  , New , 0.13
64  , 2   , 5.52  , 5.43  , New , 0.09
128 , 0   , 8.04  , 7.44  , New , 0.6
128 , 3   , 8.0   , 7.45  , New , 0.55
256 , 0   , 14.7  , 12.94 , New , 1.76
256 , 4   , 14.78 , 13.03 , New , 1.75
512 , 0   , 20.37 , 19.05 , New , 1.32
512 , 5   , 20.34 , 18.98 , New , 1.36
1024, 0   , 31.62 , 28.24 , New , 3.38
1024, 6   , 31.55 , 28.2  , New , 3.35
2048, 0   , 53.22 , 47.12 , New , 6.1
2048, 7   , 53.15 , 47.0  , New , 6.15
64  , 1   , 5.45  , 5.41  , New , 0.04
64  , 3   , 5.46  , 5.39  , New , 0.07
64  , 4   , 5.48  , 5.39  , New , 0.09
64  , 5   , 5.54  , 5.39  , New , 0.15
64  , 6   , 5.47  , 5.41  , New , 0.06
64  , 7   , 5.46  , 5.39  , New , 0.07
256 , 16  , 14.58 , 12.92 , New , 1.66
256 , 32  , 15.36 , 13.54 , New , 1.82
256 , 48  , 15.49 , 13.71 , New , 1.78
256 , 64  , 16.53 , 14.78 , New , 1.75
256 , 80  , 16.57 , 14.82 , New , 1.75
256 , 96  , 13.26 , 11.99 , New , 1.27
256 , 112 , 13.36 , 12.07 , New , 1.29
0   , 0   , 3.75  , 3.09  , New , 0.66
1   , 0   , 3.75  , 3.09  , New , 0.66
2   , 0   , 3.74  , 3.09  , New , 0.65
3   , 0   , 3.74  , 3.09  , New , 0.65
4   , 0   , 3.74  , 3.09  , New , 0.65
5   , 0   , 3.74  , 3.1   , New , 0.64
6   , 0   , 3.74  , 3.1   , New , 0.64
7   , 0   , 3.74  , 3.09  , New , 0.65
8   , 0   , 3.74  , 3.09  , New , 0.65
9   , 0   , 3.74  , 3.1   , New , 0.64
10  , 0   , 3.75  , 3.09  , New , 0.66
11  , 0   , 3.75  , 3.1   , New , 0.65
12  , 0   , 3.74  , 3.1   , New , 0.64
13  , 0   , 3.77  , 3.1   , New , 0.67
14  , 0   , 3.78  , 3.1   , New , 0.68
15  , 0   , 3.82  , 3.1   , New , 0.72
16  , 0   , 3.76  , 3.1   , New , 0.66
17  , 0   , 3.8   , 3.1   , New , 0.7
18  , 0   , 3.77  , 3.1   , New , 0.67
19  , 0   , 3.81  , 3.1   , New , 0.71
20  , 0   , 3.77  , 3.13  , New , 0.64
21  , 0   , 3.8   , 3.11  , New , 0.69
22  , 0   , 3.82  , 3.11  , New , 0.71
23  , 0   , 3.77  , 3.11  , New , 0.66
24  , 0   , 3.77  , 3.11  , New , 0.66
25  , 0   , 3.76  , 3.11  , New , 0.65
26  , 0   , 3.76  , 3.11  , New , 0.65
27  , 0   , 3.76  , 3.11  , New , 0.65
28  , 0   , 3.77  , 3.11  , New , 0.66
29  , 0   , 3.76  , 3.11  , New , 0.65
30  , 0   , 3.76  , 3.11  , New , 0.65
31  , 0   , 3.76  , 3.11  , New , 0.65

Results For Icelake strchr-evex
size, algn, Cur T , New T , Win , Dif
32  , 0   , 3.57  , 3.77  , Cur , 0.2
32  , 1   , 3.36  , 3.34  , New , 0.02
64  , 0   , 3.77  , 3.64  , New , 0.13
64  , 2   , 3.73  , 3.58  , New , 0.15
128 , 0   , 5.22  , 4.92  , New , 0.3
128 , 3   , 5.16  , 4.94  , New , 0.22
256 , 0   , 9.83  , 8.8   , New , 1.03
256 , 4   , 9.89  , 8.77  , New , 1.12
512 , 0   , 13.47 , 12.77 , New , 0.7
512 , 5   , 13.58 , 12.74 , New , 0.84
1024, 0   , 20.33 , 18.46 , New , 1.87
1024, 6   , 20.28 , 18.39 , New , 1.89
2048, 0   , 35.45 , 31.59 , New , 3.86
2048, 7   , 35.44 , 31.66 , New , 3.78
64  , 1   , 3.76  , 3.62  , New , 0.14
64  , 3   , 3.7   , 3.6   , New , 0.1
64  , 4   , 3.71  , 3.62  , New , 0.09
64  , 5   , 3.74  , 3.61  , New , 0.13
64  , 6   , 3.74  , 3.61  , New , 0.13
64  , 7   , 3.72  , 3.62  , New , 0.1
256 , 16  , 9.81  , 8.77  , New , 1.04
256 , 32  , 10.25 , 9.24  , New , 1.01
256 , 48  , 10.48 , 9.39  , New , 1.09
256 , 64  , 11.09 , 10.11 , New , 0.98
256 , 80  , 11.09 , 10.09 , New , 1.0
256 , 96  , 8.88  , 8.09  , New , 0.79
256 , 112 , 8.84  , 8.16  , New , 0.68
0   , 0   , 2.31  , 2.08  , New , 0.23
1   , 0   , 2.36  , 2.09  , New , 0.27
2   , 0   , 2.39  , 2.12  , New , 0.27
3   , 0   , 2.4   , 2.14  , New , 0.26
4   , 0   , 2.42  , 2.15  , New , 0.27
5   , 0   , 2.4   , 2.15  , New , 0.25
6   , 0   , 2.38  , 2.15  , New , 0.23
7   , 0   , 2.36  , 2.15  , New , 0.21
8   , 0   , 2.41  , 2.16  , New , 0.25
9   , 0   , 2.37  , 2.14  , New , 0.23
10  , 0   , 2.36  , 2.16  , New , 0.2
11  , 0   , 2.36  , 2.17  , New , 0.19
12  , 0   , 2.35  , 2.15  , New , 0.2
13  , 0   , 2.37  , 2.16  , New , 0.21
14  , 0   , 2.37  , 2.16  , New , 0.21
15  , 0   , 2.39  , 2.15  , New , 0.24
16  , 0   , 2.36  , 2.14  , New , 0.22
17  , 0   , 2.35  , 2.14  , New , 0.21
18  , 0   , 2.36  , 2.14  , New , 0.22
19  , 0   , 2.37  , 2.14  , New , 0.23
20  , 0   , 2.37  , 2.16  , New , 0.21
21  , 0   , 2.38  , 2.16  , New , 0.22
22  , 0   , 2.38  , 2.14  , New , 0.24
23  , 0   , 2.33  , 2.11  , New , 0.22
24  , 0   , 2.3   , 2.07  , New , 0.23
25  , 0   , 2.27  , 2.06  , New , 0.21
26  , 0   , 2.26  , 2.06  , New , 0.2
27  , 0   , 2.28  , 2.1   , New , 0.18
28  , 0   , 2.34  , 2.13  , New , 0.21
29  , 0   , 2.34  , 2.09  , New , 0.25
30  , 0   , 2.29  , 2.09  , New , 0.2
31  , 0   , 2.31  , 2.08  , New , 0.23

For strchr-avx2 the results are a lot closer, as the optimizations were smaller, but the trend is still improvement, especially on Skylake (the only one of the benchmark CPUs on which this implementation will actually be used).
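For reference, the per-vector check that both strchr-avx2.S and strchr-evex.S build on -- compare the vector against CHAR, compare it against zero, OR the two results, take a byte mask, and tzcnt it -- can be sketched in C with AVX2 intrinsics. This is only a sketch: the strchr_avx2_sketch name is invented for the example, and unlike the real code it ignores the page-crossing problem that the L(cross_page_boundary) path exists to solve.

#include <immintrin.h>
#include <stddef.h>

/* Sketch only: scan 32 bytes at a time for CHAR or the null byte.
   Assumes unaligned loads never fault, i.e. it skips the page-cross
   handling the real implementation needs.  */
static char *
strchr_avx2_sketch (const char *s, int c)
{
  const __m256i vchar = _mm256_set1_epi8 ((char) c);
  const __m256i vzero = _mm256_setzero_si256 ();
  for (;; s += 32)
    {
      __m256i data = _mm256_loadu_si256 ((const __m256i *) s);
      /* One mask bit per byte that equals CHAR or equals 0 (vpor +
	 vpmovmskb in the assembly).  */
      unsigned int mask
	= _mm256_movemask_epi8 (_mm256_or_si256
				(_mm256_cmpeq_epi8 (data, vchar),
				 _mm256_cmpeq_epi8 (data, vzero)));
      if (mask != 0)
	{
	  size_t idx = __builtin_ctz (mask);	/* tzcntl in the assembly.  */
	  /* strchr must return NULL if the null byte came first;
	     strchrnul would return s + idx unconditionally.  This is
	     the cmp/jne L(zero) at each return site.  */
	  return s[idx] == (char) c ? (char *) (s + idx) : NULL;
	}
    }
}

The jne-to-L(zero) return path in the patch replaces the old xorl %edx / cmovne sequence visible in the removed lines of the diff below.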
Results For Skylake strchr-avx2
size, algn, Cur T , New T , Win , Dif
32  , 0   , 6.04  , 5.02  , New , 1.02
32  , 1   , 6.19  , 4.94  , New , 1.25
64  , 0   , 6.68  , 5.92  , New , 0.76
64  , 2   , 6.59  , 5.95  , New , 0.64
128 , 0   , 7.66  , 7.42  , New , 0.24
128 , 3   , 7.66  , 7.4   , New , 0.26
256 , 0   , 14.68 , 12.93 , New , 1.75
256 , 4   , 14.74 , 12.88 , New , 1.86
512 , 0   , 20.81 , 17.47 , New , 3.34
512 , 5   , 20.73 , 17.44 , New , 3.29
1024, 0   , 33.16 , 27.06 , New , 6.1
1024, 6   , 33.15 , 27.09 , New , 6.06
2048, 0   , 59.06 , 56.15 , New , 2.91
2048, 7   , 59.0  , 53.92 , New , 5.08
64  , 1   , 6.56  , 5.86  , New , 0.7
64  , 3   , 6.55  , 5.99  , New , 0.56
64  , 4   , 6.61  , 5.96  , New , 0.65
64  , 5   , 6.52  , 5.94  , New , 0.58
64  , 6   , 6.62  , 5.95  , New , 0.67
64  , 7   , 6.61  , 6.11  , New , 0.5
256 , 16  , 14.64 , 12.85 , New , 1.79
256 , 32  , 15.2  , 12.97 , New , 2.23
256 , 48  , 15.13 , 13.33 , New , 1.8
256 , 64  , 16.18 , 13.46 , New , 2.72
256 , 80  , 16.26 , 13.49 , New , 2.77
256 , 96  , 13.13 , 11.43 , New , 1.7
256 , 112 , 13.12 , 11.4  , New , 1.72
0   , 0   , 5.36  , 4.25  , New , 1.11
1   , 0   , 5.28  , 4.24  , New , 1.04
2   , 0   , 5.27  , 4.2   , New , 1.07
3   , 0   , 5.27  , 4.23  , New , 1.04
4   , 0   , 5.36  , 4.3   , New , 1.06
5   , 0   , 5.35  , 4.29  , New , 1.06
6   , 0   , 5.38  , 4.35  , New , 1.03
7   , 0   , 5.39  , 4.28  , New , 1.11
8   , 0   , 5.5   , 4.45  , New , 1.05
9   , 0   , 5.47  , 4.43  , New , 1.04
10  , 0   , 5.5   , 4.4   , New , 1.1
11  , 0   , 5.51  , 4.44  , New , 1.07
12  , 0   , 5.49  , 4.44  , New , 1.05
13  , 0   , 5.49  , 4.46  , New , 1.03
14  , 0   , 5.49  , 4.46  , New , 1.03
15  , 0   , 5.51  , 4.43  , New , 1.08
16  , 0   , 5.52  , 4.48  , New , 1.04
17  , 0   , 5.57  , 4.47  , New , 1.1
18  , 0   , 5.56  , 4.52  , New , 1.04
19  , 0   , 5.54  , 4.46  , New , 1.08
20  , 0   , 5.53  , 4.48  , New , 1.05
21  , 0   , 5.54  , 4.48  , New , 1.06
22  , 0   , 5.57  , 4.45  , New , 1.12
23  , 0   , 5.57  , 4.48  , New , 1.09
24  , 0   , 5.53  , 4.43  , New , 1.1
25  , 0   , 5.49  , 4.42  , New , 1.07
26  , 0   , 5.5   , 4.44  , New , 1.06
27  , 0   , 5.48  , 4.44  , New , 1.04
28  , 0   , 5.48  , 4.43  , New , 1.05
29  , 0   , 5.54  , 4.41  , New , 1.13
30  , 0   , 5.49  , 4.4   , New , 1.09
31  , 0   , 5.46  , 4.4   , New , 1.06

Results For Tigerlake strchr-avx2
size, algn, Cur T , New T , Win , Dif
32  , 0   , 5.88  , 5.47  , New , 0.41
32  , 1   , 5.73  , 5.46  , New , 0.27
64  , 0   , 6.32  , 6.1   , New , 0.22
64  , 2   , 6.17  , 6.11  , New , 0.06
128 , 0   , 7.93  , 7.68  , New , 0.25
128 , 3   , 7.93  , 7.73  , New , 0.2
256 , 0   , 14.87 , 14.5  , New , 0.37
256 , 4   , 14.96 , 14.59 , New , 0.37
512 , 0   , 21.25 , 20.18 , New , 1.07
512 , 5   , 21.25 , 20.11 , New , 1.14
1024, 0   , 33.17 , 31.26 , New , 1.91
1024, 6   , 33.14 , 31.13 , New , 2.01
2048, 0   , 53.39 , 52.51 , New , 0.88
2048, 7   , 53.3  , 52.34 , New , 0.96
64  , 1   , 6.11  , 6.09  , New , 0.02
64  , 3   , 6.04  , 6.01  , New , 0.03
64  , 4   , 6.04  , 6.03  , New , 0.01
64  , 5   , 6.13  , 6.05  , New , 0.08
64  , 6   , 6.09  , 6.06  , New , 0.03
64  , 7   , 6.04  , 6.03  , New , 0.01
256 , 16  , 14.77 , 14.39 , New , 0.38
256 , 32  , 15.58 , 15.27 , New , 0.31
256 , 48  , 15.88 , 15.32 , New , 0.56
256 , 64  , 16.85 , 16.01 , New , 0.84
256 , 80  , 16.83 , 16.03 , New , 0.8
256 , 96  , 13.5  , 13.14 , New , 0.36
256 , 112 , 13.71 , 13.24 , New , 0.47
0   , 0   , 3.78  , 3.76  , New , 0.02
1   , 0   , 3.79  , 3.76  , New , 0.03
2   , 0   , 3.82  , 3.77  , New , 0.05
3   , 0   , 3.78  , 3.76  , New , 0.02
4   , 0   , 3.75  , 3.75  , Eq  , 0.0
5   , 0   , 3.77  , 3.74  , New , 0.03
6   , 0   , 3.78  , 3.76  , New , 0.02
7   , 0   , 3.91  , 3.85  , New , 0.06
8   , 0   , 3.76  , 3.77  , Cur , 0.01
9   , 0   , 3.75  , 3.75  , Eq  , 0.0
10  , 0   , 3.76  , 3.76  , Eq  , 0.0
11  , 0   , 3.77  , 3.75  , New , 0.02
12  , 0   , 3.79  , 3.77  ,
New , 0.02 13 , 0 , 3.86 , 3.86 , Eq , 0.0 14 , 0 , 4.2 , 4.2 , Eq , 0.0 15 , 0 , 4.17 , 4.07 , New , 0.1 16 , 0 , 4.1 , 4.1 , Eq , 0.0 17 , 0 , 4.12 , 4.09 , New , 0.03 18 , 0 , 4.12 , 4.12 , Eq , 0.0 19 , 0 , 4.18 , 4.09 , New , 0.09 20 , 0 , 4.14 , 4.09 , New , 0.05 21 , 0 , 4.15 , 4.11 , New , 0.04 22 , 0 , 4.23 , 4.13 , New , 0.1 23 , 0 , 4.18 , 4.16 , New , 0.02 24 , 0 , 4.13 , 4.21 , Cur , 0.08 25 , 0 , 4.17 , 4.15 , New , 0.02 26 , 0 , 4.17 , 4.16 , New , 0.01 27 , 0 , 4.18 , 4.16 , New , 0.02 28 , 0 , 4.17 , 4.15 , New , 0.02 29 , 0 , 4.2 , 4.13 , New , 0.07 30 , 0 , 4.16 , 4.12 , New , 0.04 31 , 0 , 4.15 , 4.15 , Eq , 0.0 Results For Icelake strchr-avx2 size, algn, Cur T , New T , Win , Dif 32 , 0 , 3.73 , 3.72 , New , 0.01 32 , 1 , 3.46 , 3.44 , New , 0.02 64 , 0 , 3.96 , 3.87 , New , 0.09 64 , 2 , 3.92 , 3.87 , New , 0.05 128 , 0 , 5.15 , 4.9 , New , 0.25 128 , 3 , 5.12 , 4.87 , New , 0.25 256 , 0 , 9.79 , 9.45 , New , 0.34 256 , 4 , 9.76 , 9.52 , New , 0.24 512 , 0 , 13.93 , 12.89 , New , 1.04 512 , 5 , 13.84 , 13.02 , New , 0.82 1024, 0 , 21.41 , 19.92 , New , 1.49 1024, 6 , 21.69 , 20.12 , New , 1.57 2048, 0 , 35.12 , 33.83 , New , 1.29 2048, 7 , 35.13 , 33.99 , New , 1.14 64 , 1 , 3.96 , 3.9 , New , 0.06 64 , 3 , 3.88 , 3.86 , New , 0.02 64 , 4 , 3.87 , 3.83 , New , 0.04 64 , 5 , 3.9 , 3.85 , New , 0.05 64 , 6 , 3.9 , 3.89 , New , 0.01 64 , 7 , 3.9 , 3.84 , New , 0.06 256 , 16 , 9.76 , 9.4 , New , 0.36 256 , 32 , 10.36 , 9.97 , New , 0.39 256 , 48 , 10.5 , 10.02 , New , 0.48 256 , 64 , 11.13 , 10.55 , New , 0.58 256 , 80 , 11.14 , 10.56 , New , 0.58 256 , 96 , 8.98 , 8.57 , New , 0.41 256 , 112 , 9.1 , 8.66 , New , 0.44 0 , 0 , 2.52 , 2.49 , New , 0.03 1 , 0 , 2.56 , 2.53 , New , 0.03 2 , 0 , 2.6 , 2.54 , New , 0.06 3 , 0 , 2.63 , 2.58 , New , 0.05 4 , 0 , 2.63 , 2.6 , New , 0.03 5 , 0 , 2.65 , 2.62 , New , 0.03 6 , 0 , 2.75 , 2.73 , New , 0.02 7 , 0 , 2.73 , 2.76 , Cur , 0.03 8 , 0 , 2.61 , 2.6 , New , 0.01 9 , 0 , 2.73 , 2.74 , Cur , 0.01 10 , 0 , 2.72 , 2.71 , New , 0.01 11 , 0 , 2.74 , 2.72 , New , 0.02 12 , 0 , 2.73 , 2.74 , Cur , 0.01 13 , 0 , 2.73 , 2.75 , Cur , 0.02 14 , 0 , 2.74 , 2.72 , New , 0.02 15 , 0 , 2.74 , 2.72 , New , 0.02 16 , 0 , 2.75 , 2.74 , New , 0.01 17 , 0 , 2.73 , 2.74 , Cur , 0.01 18 , 0 , 2.72 , 2.73 , Cur , 0.01 19 , 0 , 2.74 , 2.72 , New , 0.02 20 , 0 , 2.75 , 2.71 , New , 0.04 21 , 0 , 2.74 , 2.74 , Eq , 0.0 22 , 0 , 2.73 , 2.74 , Cur , 0.01 23 , 0 , 2.7 , 2.72 , Cur , 0.02 24 , 0 , 2.68 , 2.68 , Eq , 0.0 25 , 0 , 2.65 , 2.63 , New , 0.02 26 , 0 , 2.64 , 2.62 , New , 0.02 27 , 0 , 2.71 , 2.68 , New , 0.03 28 , 0 , 2.72 , 2.68 , New , 0.04 29 , 0 , 2.68 , 2.74 , Cur , 0.06 30 , 0 , 2.65 , 2.65 , Eq , 0.0 31 , 0 , 2.7 , 2.68 , New , 0.02 sysdeps/x86_64/multiarch/strchr-avx2.S | 294 +++++++++++++++---------- 1 file changed, 173 insertions(+), 121 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S index 25bec38b5d..220165d2ba 100644 --- a/sysdeps/x86_64/multiarch/strchr-avx2.S +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S @@ -49,132 +49,144 @@ .section SECTION(.text),"ax",@progbits ENTRY (STRCHR) - movl %edi, %ecx -# ifndef USE_AS_STRCHRNUL - xorl %edx, %edx -# endif - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax + VPBROADCAST %xmm0, %ymm0 vpxor %xmm9, %xmm9, %xmm9 - VPBROADCAST %xmm0, %ymm0 /* Check if we cross page boundary with one vector load. 
*/ - andl $(PAGE_SIZE - 1), %ecx - cmpl $(PAGE_SIZE - VEC_SIZE), %ecx - ja L(cross_page_boundary) + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Search for both CHAR and the null byte. */ vmovdqu (%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax + vpmovmskb %ymm1, %eax testl %eax, %eax - jz L(more_vecs) + jz L(aligned_more) tzcntl %eax, %eax - /* Found CHAR or the null byte. */ - addq %rdi, %rax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - - .p2align 4 -L(more_vecs): - /* Align data for aligned loads in the loop. */ - andq $-VEC_SIZE, %rdi -L(aligned_more): - - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vmovdqa VEC_SIZE(%rdi), %ymm8 - addq $VEC_SIZE, %rdi - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x0) - - vmovdqa VEC_SIZE(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x1) - - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x2) - - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(prep_loop_4x) + addq %rdi, %rax + VZEROUPPER_RETURN + /* .p2align 5 helps keep performance more consistent if ENTRY() + alignment % 32 was either 16 or 0. As well this makes the + alignment % 32 of the loop_4x_vec fixed which makes tuning it + easier. */ + .p2align 5 +L(first_vec_x4): tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + addq $(VEC_SIZE * 3 + 1), %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif + addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 -L(first_vec_x0): - tzcntl %eax, %eax - /* Found CHAR or the null byte. */ - addq %rdi, %rax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax -# endif +L(zero): + xorl %eax, %eax VZEROUPPER_RETURN +# endif + .p2align 4 L(first_vec_x1): tzcntl %eax, %eax - leaq VEC_SIZE(%rdi, %rax), %rax + incq %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif + addq %rdi, %rax VZEROUPPER_RETURN .p2align 4 L(first_vec_x2): tzcntl %eax, %eax + addq $(VEC_SIZE + 1), %rdi +# ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) +# endif + addq %rdi, %rax + VZEROUPPER_RETURN + + .p2align 4 +L(first_vec_x3): + tzcntl %eax, %eax + addq $(VEC_SIZE * 2 + 1), %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif + addq %rdi, %rax VZEROUPPER_RETURN -L(prep_loop_4x): - /* Align data to 4 * VEC_SIZE. 
*/ - andq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(aligned_more): + /* Align data to VEC_SIZE - 1. This is the same number of + instructions as using andq -VEC_SIZE but saves 4 bytes of code on + x4 check. */ + orq $(VEC_SIZE - 1), %rdi +L(cross_page_continue): + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since + data is only aligned to VEC_SIZE. */ + vmovdqa 1(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x1) + vmovdqa (VEC_SIZE + 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x2) + + vmovdqa (VEC_SIZE * 2 + 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x3) + + vmovdqa (VEC_SIZE * 3 + 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x4) + /* Align data to VEC_SIZE * 4 - 1. */ + addq $(VEC_SIZE * 4 + 1), %rdi + andq $-(VEC_SIZE * 4), %rdi .p2align 4 L(loop_4x_vec): /* Compare 4 * VEC at a time forward. */ - vmovdqa (VEC_SIZE * 4)(%rdi), %ymm5 - vmovdqa (VEC_SIZE * 5)(%rdi), %ymm6 - vmovdqa (VEC_SIZE * 6)(%rdi), %ymm7 - vmovdqa (VEC_SIZE * 7)(%rdi), %ymm8 + vmovdqa (%rdi), %ymm5 + vmovdqa (VEC_SIZE)(%rdi), %ymm6 + vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7 + vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 /* Leaves only CHARS matching esi as 0. */ vpxor %ymm5, %ymm0, %ymm1 @@ -190,62 +202,102 @@ L(loop_4x_vec): VPMINU %ymm1, %ymm2, %ymm5 VPMINU %ymm3, %ymm4, %ymm6 - VPMINU %ymm5, %ymm6, %ymm5 + VPMINU %ymm5, %ymm6, %ymm6 - VPCMPEQ %ymm5, %ymm9, %ymm5 - vpmovmskb %ymm5, %eax + VPCMPEQ %ymm6, %ymm9, %ymm6 + vpmovmskb %ymm6, %ecx + subq $-(VEC_SIZE * 4), %rdi + testl %ecx, %ecx + jz L(loop_4x_vec) - addq $(VEC_SIZE * 4), %rdi - testl %eax, %eax - jz L(loop_4x_vec) - VPCMPEQ %ymm1, %ymm9, %ymm1 - vpmovmskb %ymm1, %eax + VPCMPEQ %ymm1, %ymm9, %ymm1 + vpmovmskb %ymm1, %eax testl %eax, %eax - jnz L(first_vec_x0) + jnz L(last_vec_x0) + - VPCMPEQ %ymm2, %ymm9, %ymm2 - vpmovmskb %ymm2, %eax + VPCMPEQ %ymm5, %ymm9, %ymm2 + vpmovmskb %ymm2, %eax testl %eax, %eax - jnz L(first_vec_x1) + jnz L(last_vec_x1) + + VPCMPEQ %ymm3, %ymm9, %ymm3 + vpmovmskb %ymm3, %eax + /* rcx has combined result from all 4 VEC. It will only be used if + the first 3 other VEC all did not contain a match. */ + salq $32, %rcx + orq %rcx, %rax + tzcntq %rax, %rax + subq $(VEC_SIZE * 2), %rdi +# ifndef USE_AS_STRCHRNUL + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero_end) +# endif + addq %rdi, %rax + VZEROUPPER_RETURN + - VPCMPEQ %ymm3, %ymm9, %ymm3 - VPCMPEQ %ymm4, %ymm9, %ymm4 - vpmovmskb %ymm3, %ecx - vpmovmskb %ymm4, %eax - salq $32, %rax - orq %rcx, %rax - tzcntq %rax, %rax - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax + .p2align 4 +L(last_vec_x0): + tzcntl %eax, %eax + addq $-(VEC_SIZE * 4), %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero_end) # endif + addq %rdi, %rax VZEROUPPER_RETURN +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + VZEROUPPER_RETURN +# endif + + .p2align 4 +L(last_vec_x1): + tzcntl %eax, %eax + subq $(VEC_SIZE * 3), %rdi +# ifndef USE_AS_STRCHRNUL + /* Found CHAR or the null byte. 
*/ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero_end) +# endif + addq %rdi, %rax + VZEROUPPER_RETURN + + /* Cold case for crossing page with first load. */ .p2align 4 L(cross_page_boundary): - andq $-VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - - vmovdqa (%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 + movq %rdi, %rdx + /* Align rdi to VEC_SIZE - 1. */ + orq $(VEC_SIZE - 1), %rdi + vmovdqa -(VEC_SIZE - 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - /* Remove the leading bits. */ - sarxl %ecx, %eax, %eax + vpmovmskb %ymm1, %eax + /* Remove the leading bytes. sarxl only uses bits [5:0] of COUNT + so no need to manually mod edx. */ + sarxl %edx, %eax, %eax testl %eax, %eax - jz L(aligned_more) + jz L(cross_page_continue) tzcntl %eax, %eax - addq %rcx, %rdi - addq %rdi, %rax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + xorl %ecx, %ecx + /* Found CHAR or the null byte. */ + cmp (%rdx, %rax), %CHAR_REG + leaq (%rdx, %rax), %rax + cmovne %rcx, %rax +# else + addq %rdx, %rax # endif - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN END (STRCHR) # endif From patchwork Wed Apr 21 21:39:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 43064 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id BA7F5398B86E; Wed, 21 Apr 2021 21:40:50 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BA7F5398B86E DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1619041250; bh=TwFZph6rpqUQx60nGrcmk58uo9BIAdT/vBPYpbwY3Mk=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=dMjid8CKa/scHSqCuguPFXykLivQStnUNGn2ZnaLOGzwglIAVNQoqXvFa1FNjAIEF vhTYOjk5yQnaN3bSrj5vhiwL+VPz3aVi5m3gEKiredNk/GYWznLpJJo6I+goEAnwiT x5YqnV0aKHVAF1s0/wlwWp0r7ANvyqzmxxd8PuvI= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qk1-x72c.google.com (mail-qk1-x72c.google.com [IPv6:2607:f8b0:4864:20::72c]) by sourceware.org (Postfix) with ESMTPS id 3FB55398B879 for ; Wed, 21 Apr 2021 21:40:47 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 3FB55398B879 Received: by mail-qk1-x72c.google.com with SMTP id 8so10642576qkv.8 for ; Wed, 21 Apr 2021 14:40:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=TwFZph6rpqUQx60nGrcmk58uo9BIAdT/vBPYpbwY3Mk=; b=aO0k/OaDjOTj3Rebsc0lhkRgOu5RarJqN6XaKRAG8nI7UWNdBIlEHbpuE3hoJiiEJo ZV/mh8+4yXsesWiGmUP+CZHNEzJiLoIrGTmkf0SdBFVIrHTK0r1SLaziJaQdlO/Z/roR 4wkgRrx5b2nBywYmEuUcXep/wpfC9OgzBeLI5DZHGb2C2IlIC3h8Fyx+SMbrF97WVvDh bxguf6YNLDr5lCdT0brPA/GecgEMb9cB07PS+NluO9EU/Gys6B8NDTBbJBnD0jqhxp+W GNyQzlRSG6hzSIr49Q00kTcCVzv2ROKm3cXumjkowAsQBiKakrPbcaRbv66qJEuDF0i3 Q45A== X-Gm-Message-State: AOAM5333oWA1pdQoiYEtVGlSwg0e6uWiFBbMpoZMj0j5EOZk0nDV73aO XB1336eyZHuj5Vzhwdt0fAQ306GE15Q= X-Google-Smtp-Source: ABdhPJwL8arWhhZtoSlMej3mVeUwY4xLSrgomYmjyzFgJgUw9c/xgr0sZGKbKWSh9E+KuoI6sG6W2w== X-Received: by 2002:a37:8a46:: with SMTP id m67mr257626qkd.259.1619041246253; Wed, 21 Apr 
2021 14:40:46 -0700 (PDT) Received: from localhost.localdomain (pool-71-245-178-39.pitbpa.fios.verizon.net. [71.245.178.39]) by smtp.googlemail.com with ESMTPSA id m29sm572365qkm.101.2021.04.21.14.40.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 21 Apr 2021 14:40:45 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 2/2] x86: Optimize strchr-evex.S Date: Wed, 21 Apr 2021 17:39:53 -0400 Message-Id: <20210421213951.404588-2-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20210421213951.404588-1-goldstein.w.n@gmail.com> References: <20210421213951.404588-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" No bug. This commit optimizes strlen-evex.S. The optimizations are mostly small things such as save an ALU in the alignment process, saving a few instructions in the loop return. The one significant change is saving 2 instructions in the 4x loop. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein --- sysdeps/x86_64/multiarch/strchr-evex.S | 388 ++++++++++++++----------- 1 file changed, 214 insertions(+), 174 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S index ddc86a7058..7cd111e96c 100644 --- a/sysdeps/x86_64/multiarch/strchr-evex.S +++ b/sysdeps/x86_64/multiarch/strchr-evex.S @@ -24,23 +24,26 @@ # define STRCHR __strchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# define VMOVU vmovdqu64 +# define VMOVA vmovdqa64 # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd # define VPCMP vpcmpd # define VPMINU vpminud # define CHAR_REG esi -# define SHIFT_REG r8d +# define SHIFT_REG ecx +# define CHAR_SIZE 4 # else # define VPBROADCAST vpbroadcastb # define VPCMP vpcmpb # define VPMINU vpminub # define CHAR_REG sil -# define SHIFT_REG ecx +# define SHIFT_REG edx +# define CHAR_SIZE 1 # endif + # define XMMZERO xmm16 # define YMMZERO ymm16 @@ -56,23 +59,20 @@ # define VEC_SIZE 32 # define PAGE_SIZE 4096 +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section .text.evex,"ax",@progbits ENTRY (STRCHR) - movl %edi, %ecx -# ifndef USE_AS_STRCHRNUL - xorl %edx, %edx -# endif - /* Broadcast CHAR to YMM0. */ - VPBROADCAST %esi, %YMM0 - + VPBROADCAST %esi, %YMM0 + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax vpxorq %XMMZERO, %XMMZERO, %XMMZERO - /* Check if we cross page boundary with one vector load. */ - andl $(PAGE_SIZE - 1), %ecx - cmpl $(PAGE_SIZE - VEC_SIZE), %ecx - ja L(cross_page_boundary) + /* Check if we cross page boundary with one vector load. Otherwise + it is safe to use an unaligned load. */ + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Search for both CHAR and the null bytes. 
*/ @@ -83,251 +83,291 @@ ENTRY (STRCHR) VPMINU %YMM2, %YMM1, %YMM2 /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ VPCMP $0, %YMMZERO, %YMM2, %k0 - ktestd %k0, %k0 - jz L(more_vecs) kmovd %k0, %eax + testl %eax, %eax + jz L(aligned_more) tzcntl %eax, %eax - /* Found CHAR or the null byte. */ # ifdef USE_AS_WCSCHR /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (%rdi, %rax, 4), %rax + leaq (%rdi, %rax, CHAR_SIZE), %rax # else addq %rdi, %rax # endif # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rax), %CHAR_REG + jne L(zero) # endif ret - .p2align 4 -L(more_vecs): - /* Align data for aligned loads in the loop. */ - andq $-VEC_SIZE, %rdi -L(aligned_more): - - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - VMOVA VEC_SIZE(%rdi), %YMM1 - addq $VEC_SIZE, %rdi - - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x0) - - VMOVA VEC_SIZE(%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x1) - - VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x2) - - VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - ktestd %k0, %k0 - jz L(prep_loop_4x) - - kmovd %k0, %eax + /* .p2align 5 helps keep performance more consistent if ENTRY() + alignment % 32 was either 16 or 0. As well this makes the + alignment % 32 of the loop_4x_vec fixed which makes tuning it + easier. */ + .p2align 5 +L(first_vec_x3): tzcntl %eax, %eax +# ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (VEC_SIZE * 3)(%rdi, %rax, 4), %rax -# else - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero) # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + ret + # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax -# endif +L(zero): + xorl %eax, %eax ret +# endif .p2align 4 -L(first_vec_x0): +L(first_vec_x4): +# ifndef USE_AS_STRCHRNUL + /* Check to see if first match was CHAR (k0) or null (k1). */ + kmovd %k0, %eax tzcntl %eax, %eax - /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (%rdi, %rax, 4), %rax + kmovd %k1, %ecx + /* bzhil will not be 0 if first match was null. */ + bzhil %eax, %ecx, %ecx + jne L(zero) # else - addq %rdi, %rax -# endif -# ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Combine CHAR and null matches. 
*/ + kord %k0, %k1, %k0 + kmovd %k0, %eax + tzcntl %eax, %eax # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax ret .p2align 4 L(first_vec_x1): tzcntl %eax, %eax - /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq VEC_SIZE(%rdi, %rax, 4), %rax -# else - leaq VEC_SIZE(%rdi, %rax), %rax -# endif # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero) + # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret .p2align 4 L(first_vec_x2): +# ifndef USE_AS_STRCHRNUL + /* Check to see if first match was CHAR (k0) or null (k1). */ + kmovd %k0, %eax tzcntl %eax, %eax - /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, 4), %rax + kmovd %k1, %ecx + /* bzhil will not be 0 if first match was null. */ + bzhil %eax, %ecx, %ecx + jne L(zero) # else - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax -# endif -# ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Combine CHAR and null matches. */ + kord %k0, %k1, %k0 + kmovd %k0, %eax + tzcntl %eax, %eax # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret -L(prep_loop_4x): - /* Align data to 4 * VEC_SIZE. */ + .p2align 4 +L(aligned_more): + /* Align data to VEC_SIZE. */ + andq $-VEC_SIZE, %rdi +L(cross_page_continue): + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since + data is only aligned to VEC_SIZE. Use two alternating methods for + checking VEC to balance latency and port contention. */ + + /* This method has higher latency but has better port + distribution. */ + VMOVA (VEC_SIZE)(%rdi), %YMM1 + /* Leaves only CHARS matching esi as 0. */ + vpxorq %YMM1, %YMM0, %YMM2 + VPMINU %YMM2, %YMM1, %YMM2 + /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ + VPCMP $0, %YMMZERO, %YMM2, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(first_vec_x1) + + /* This method has higher latency but has better port + distribution. */ + VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 + /* Each bit in K0 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMM0, %k0 + /* Each bit in K1 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMMZERO, %k1 + kortestd %k0, %k1 + jnz L(first_vec_x2) + + VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 + /* Leaves only CHARS matching esi as 0. */ + vpxorq %YMM1, %YMM0, %YMM2 + VPMINU %YMM2, %YMM1, %YMM2 + /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ + VPCMP $0, %YMMZERO, %YMM2, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(first_vec_x3) + + VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 + /* Each bit in K0 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMM0, %k0 + /* Each bit in K1 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMMZERO, %k1 + kortestd %k0, %k1 + jnz L(first_vec_x4) + + /* Align data to VEC_SIZE * 4 for the loop. */ + addq $VEC_SIZE, %rdi andq $-(VEC_SIZE * 4), %rdi .p2align 4 L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ + /* Check 4x VEC at a time. No penalty to imm32 offset with evex + encoding. 
*/ VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 VMOVA (VEC_SIZE * 5)(%rdi), %YMM2 VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 VMOVA (VEC_SIZE * 7)(%rdi), %YMM4 - /* Leaves only CHARS matching esi as 0. */ + /* For YMM1 and YMM3 use xor to set the CHARs matching esi to zero. */ vpxorq %YMM1, %YMM0, %YMM5 - vpxorq %YMM2, %YMM0, %YMM6 + /* For YMM2 and YMM4 cmp not equals to CHAR and store result in k + register. Its possible to save either 1 or 2 instructions using cmp no + equals method for either YMM1 or YMM1 and YMM3 respectively but + bottleneck on p5 makes it no worth it. */ + VPCMP $4, %YMM0, %YMM2, %k2 vpxorq %YMM3, %YMM0, %YMM7 - vpxorq %YMM4, %YMM0, %YMM8 - - VPMINU %YMM5, %YMM1, %YMM5 - VPMINU %YMM6, %YMM2, %YMM6 - VPMINU %YMM7, %YMM3, %YMM7 - VPMINU %YMM8, %YMM4, %YMM8 - - VPMINU %YMM5, %YMM6, %YMM1 - VPMINU %YMM7, %YMM8, %YMM2 - - VPMINU %YMM1, %YMM2, %YMM1 - - /* Each bit in K0 represents a CHAR or a null byte. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - - addq $(VEC_SIZE * 4), %rdi - - ktestd %k0, %k0 + VPCMP $4, %YMM0, %YMM4, %k4 + + /* Use min to select all zeros (either from xor or end of string). */ + VPMINU %YMM1, %YMM5, %YMM1 + VPMINU %YMM3, %YMM7, %YMM3 + + /* Use min + zeromask to select for zeros. Since k2 and k4 will be + have 0 as positions that matched with CHAR which will set zero in + the corresponding destination bytes in YMM2 / YMM4. */ + VPMINU %YMM1, %YMM2, %YMM2{%k2}{z} + VPMINU %YMM3, %YMM4, %YMM4 + VPMINU %YMM2, %YMM4, %YMM4{%k4}{z} + + VPCMP $0, %YMMZERO, %YMM4, %k1 + kmovd %k1, %ecx + subq $-(VEC_SIZE * 4), %rdi + testl %ecx, %ecx jz L(loop_4x_vec) - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM5, %k0 + VPCMP $0, %YMMZERO, %YMM1, %k0 kmovd %k0, %eax testl %eax, %eax - jnz L(first_vec_x0) + jnz L(last_vec_x1) - /* Each bit in K1 represents a CHAR or a null byte in YMM2. */ - VPCMP $0, %YMMZERO, %YMM6, %k1 - kmovd %k1, %eax + VPCMP $0, %YMMZERO, %YMM2, %k0 + kmovd %k0, %eax testl %eax, %eax - jnz L(first_vec_x1) - - /* Each bit in K2 represents a CHAR or a null byte in YMM3. */ - VPCMP $0, %YMMZERO, %YMM7, %k2 - /* Each bit in K3 represents a CHAR or a null byte in YMM4. */ - VPCMP $0, %YMMZERO, %YMM8, %k3 + jnz L(last_vec_x2) + VPCMP $0, %YMMZERO, %YMM3, %k0 + kmovd %k0, %eax + /* Combine YMM3 matches (eax) with YMM4 matches (ecx). */ # ifdef USE_AS_WCSCHR - /* NB: Each bit in K2/K3 represents 4-byte element. */ - kshiftlw $8, %k3, %k1 + sall $8, %ecx + orl %ecx, %eax + tzcntl %eax, %eax # else - kshiftlq $32, %k3, %k1 + salq $32, %rcx + orq %rcx, %rax + tzcntq %rax, %rax # endif +# ifndef USE_AS_STRCHRNUL + /* Check if match was CHAR or null. */ + cmp (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) +# endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + ret - /* Each bit in K1 represents a NULL or a mismatch. */ - korq %k1, %k2, %k1 - kmovq %k1, %rax +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + ret +# endif - tzcntq %rax, %rax -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, 4), %rax -# else - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax + .p2align 4 +L(last_vec_x1): + tzcntl %eax, %eax +# ifndef USE_AS_STRCHRNUL + /* Check if match was null. */ + cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. 
*/ + leaq (%rdi, %rax, CHAR_SIZE), %rax + ret + + .p2align 4 +L(last_vec_x2): + tzcntl %eax, %eax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Check if match was null. */ + cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret /* Cold case for crossing page with first load. */ .p2align 4 L(cross_page_boundary): + movq %rdi, %rdx + /* Align rdi. */ andq $-VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - VMOVA (%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ vpxorq %YMM1, %YMM0, %YMM2 VPMINU %YMM2, %YMM1, %YMM2 /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ VPCMP $0, %YMMZERO, %YMM2, %k0 kmovd %k0, %eax - testl %eax, %eax - + /* Remove the leading bits. */ # ifdef USE_AS_WCSCHR + movl %edx, %SHIFT_REG /* NB: Divide shift count by 4 since each bit in K1 represent 4 bytes. */ - movl %ecx, %SHIFT_REG - sarl $2, %SHIFT_REG + sarl $2, %SHIFT_REG + andl $(CHAR_PER_VEC - 1), %SHIFT_REG # endif - - /* Remove the leading bits. */ sarxl %SHIFT_REG, %eax, %eax + /* If eax is zero continue. */ testl %eax, %eax - - jz L(aligned_more) + jz L(cross_page_continue) tzcntl %eax, %eax - addq %rcx, %rdi +# ifndef USE_AS_STRCHRNUL + /* Check to see if match was CHAR or null. */ + cmp (%rdx, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) +# endif # ifdef USE_AS_WCSCHR /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (%rdi, %rax, 4), %rax + leaq (%rdx, %rax, CHAR_SIZE), %rax # else - addq %rdi, %rax -# endif -# ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + addq %rdx, %rax # endif ret
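One detail the rewritten strchr-avx2.S and strchr-evex.S share is the page-cross path: when the first unaligned load could touch the next page, both load the aligned 32-byte block that contains the start of the string and then shift the match mask right by the misalignment, and the patch comments note that sarx only looks at the low bits of the count register, so the unmasked pointer can be used directly as the shift count. The following C sketch of that idea uses AVX2 intrinsics and an invented function name; the wide-character (wcschr) variant additionally divides the shift count by 4, as the strchr-evex.S comments explain.

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: first-vector check when S is within 32 bytes of a page
   end.  The aligned 32-byte block containing S stays inside S's page,
   so loading it cannot fault; shifting the mask right by the
   misalignment drops the bits for bytes that precede S.  */
static unsigned int
cross_page_mask_sketch (const char *s, int c)
{
  unsigned int misalign = (uintptr_t) s & 31;
  const __m256i *block = (const __m256i *) ((uintptr_t) s & ~(uintptr_t) 31);
  __m256i data = _mm256_load_si256 (block);	/* vmovdqa: aligned load.  */
  unsigned int mask
    = _mm256_movemask_epi8 (_mm256_or_si256
			    (_mm256_cmpeq_epi8 (data, _mm256_set1_epi8 ((char) c)),
			     _mm256_cmpeq_epi8 (data, _mm256_setzero_si256 ())));
  /* Counterpart of the sarx in the patches: bit i of the result now
     corresponds to s[i].  A zero result means no CHAR or null byte in
     the rest of this block, so the caller continues at the equivalent
     of L(cross_page_continue).  */
  return mask >> misalign;
}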