From patchwork Wed Apr 21 21:39:52 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 43063
To: libc-alpha@sourceware.org
Subject: [PATCH v1 1/2] x86: Optimize strchr-avx2.S
Date: Wed, 21 Apr 2021 17:39:52 -0400
Message-Id: <20210421213951.404588-1-goldstein.w.n@gmail.com>
From: Noah Goldstein
Reply-To: Noah Goldstein

No bug. This commit optimizes strchr-avx2.S.
The optimizations are all small things, such as saving an ALU in the alignment process, saving a few instructions in the loop return, saving some bytes in the main loop, and increasing the ILP in the return cases.

test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing.

Signed-off-by: Noah Goldstein
---
Tests were run on the following CPUs:

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

All times are the geometric mean of N=20. The unit of time is seconds.

"Cur" refers to the current implementation.
"New" refers to this patch's implementation.

For strchr-evex the numbers are a near universal improvement. The only exception seems to be the [32, 64] case, which is marginally slower on Tigerlake and about even on Icelake (less than the gain in the [0, 31] case). Overall, though, I think the numbers show a sizable improvement, particularly once the 4x loop is hit.

Results For Tigerlake strchr-evex
size, algn, Cur T , New T , Win , Dif
32  , 0   , 4.89  , 5.23  , Cur , 0.34
32  , 1   , 4.67  , 5.09  , Cur , 0.42
64  , 0   , 5.59  , 5.46  , New , 0.13
64  , 2   , 5.52  , 5.43  , New , 0.09
128 , 0   , 8.04  , 7.44  , New , 0.6
128 , 3   , 8.0   , 7.45  , New , 0.55
256 , 0   , 14.7  , 12.94 , New , 1.76
256 , 4   , 14.78 , 13.03 , New , 1.75
512 , 0   , 20.37 , 19.05 , New , 1.32
512 , 5   , 20.34 , 18.98 , New , 1.36
1024, 0   , 31.62 , 28.24 , New , 3.38
1024, 6   , 31.55 , 28.2  , New , 3.35
2048, 0   , 53.22 , 47.12 , New , 6.1
2048, 7   , 53.15 , 47.0  , New , 6.15
64  , 1   , 5.45  , 5.41  , New , 0.04
64  , 3   , 5.46  , 5.39  , New , 0.07
64  , 4   , 5.48  , 5.39  , New , 0.09
64  , 5   , 5.54  , 5.39  , New , 0.15
64  , 6   , 5.47  , 5.41  , New , 0.06
64  , 7   , 5.46  , 5.39  , New , 0.07
256 , 16  , 14.58 , 12.92 , New , 1.66
256 , 32  , 15.36 , 13.54 , New , 1.82
256 , 48  , 15.49 , 13.71 , New , 1.78
256 , 64  , 16.53 , 14.78 , New , 1.75
256 , 80  , 16.57 , 14.82 , New , 1.75
256 , 96  , 13.26 , 11.99 , New , 1.27
256 , 112 , 13.36 , 12.07 , New , 1.29
0   , 0   , 3.75  , 3.09  , New , 0.66
1   , 0   , 3.75  , 3.09  , New , 0.66
2   , 0   , 3.74  , 3.09  , New , 0.65
3   , 0   , 3.74  , 3.09  , New , 0.65
4   , 0   , 3.74  , 3.09  , New , 0.65
5   , 0   , 3.74  , 3.1   , New , 0.64
6   , 0   , 3.74  , 3.1   , New , 0.64
7   , 0   , 3.74  , 3.09  , New , 0.65
8   , 0   , 3.74  , 3.09  , New , 0.65
9   , 0   , 3.74  , 3.1   , New , 0.64
10  , 0   , 3.75  , 3.09  , New , 0.66
11  , 0   , 3.75  , 3.1   , New , 0.65
12  , 0   , 3.74  , 3.1   , New , 0.64
13  , 0   , 3.77  , 3.1   , New , 0.67
14  , 0   , 3.78  , 3.1   , New , 0.68
15  , 0   , 3.82  , 3.1   , New , 0.72
16  , 0   , 3.76  , 3.1   , New , 0.66
17  , 0   , 3.8   , 3.1   , New , 0.7
18  , 0   , 3.77  , 3.1   , New , 0.67
19  , 0   , 3.81  , 3.1   , New , 0.71
20  , 0   , 3.77  , 3.13  , New , 0.64
21  , 0   , 3.8   , 3.11  , New , 0.69
22  , 0   , 3.82  , 3.11  , New , 0.71
23  , 0   , 3.77  , 3.11  , New , 0.66
24  , 0   , 3.77  , 3.11  , New , 0.66
25  , 0   , 3.76  , 3.11  , New , 0.65
26  , 0   , 3.76  , 3.11  , New , 0.65
27  , 0   , 3.76  , 3.11  , New , 0.65
28  , 0   , 3.77  , 3.11  , New , 0.66
29  , 0   , 3.76  , 3.11  , New , 0.65
30  , 0   , 3.76  , 3.11  , New , 0.65
31  , 0   , 3.76  , 3.11  , New , 0.65

Results For Icelake strchr-evex
size, algn, Cur T , New T , Win , Dif
32  , 0   , 3.57  , 3.77  , Cur , 0.2
32  , 1   , 3.36  , 3.34  , New , 0.02
64  , 0   , 3.77  , 3.64  , New , 0.13
64  , 2   , 3.73  , 3.58  , New , 0.15
128 , 0   , 5.22  , 4.92  , New , 0.3
128 , 3   , 5.16  , 4.94  , New , 0.22
256 , 0   , 9.83  , 8.8   , New , 1.03
256 , 4   , 9.89  , 8.77  , New , 1.12
512 , 0   , 13.47 , 12.77 , New , 0.7
512 , 5   , 13.58 , 12.74 , New , 0.84
1024, 0   , 20.33 , 18.46 , New , 1.87
1024, 6   , 20.28 , 18.39 , New , 1.89
2048, 0   , 35.45 , 31.59 , New , 3.86
2048, 7   , 35.44 , 31.66 , New , 3.78
64  , 1   , 3.76  , 3.62  , New , 0.14
64  , 3   , 3.7   , 3.6   , New , 0.1
64  , 4   , 3.71  , 3.62  , New , 0.09
64  , 5   , 3.74  , 3.61  , New , 0.13
64  , 6   , 3.74  , 3.61  , New , 0.13
64  , 7   , 3.72  , 3.62  , New , 0.1
256 , 16  , 9.81  , 8.77  , New , 1.04
256 , 32  , 10.25 , 9.24  , New , 1.01
256 , 48  , 10.48 , 9.39  , New , 1.09
256 , 64  , 11.09 , 10.11 , New , 0.98
256 , 80  , 11.09 , 10.09 , New , 1.0
256 , 96  , 8.88  , 8.09  , New , 0.79
256 , 112 , 8.84  , 8.16  , New , 0.68
0   , 0   , 2.31  , 2.08  , New , 0.23
1   , 0   , 2.36  , 2.09  , New , 0.27
2   , 0   , 2.39  , 2.12  , New , 0.27
3   , 0   , 2.4   , 2.14  , New , 0.26
4   , 0   , 2.42  , 2.15  , New , 0.27
5   , 0   , 2.4   , 2.15  , New , 0.25
6   , 0   , 2.38  , 2.15  , New , 0.23
7   , 0   , 2.36  , 2.15  , New , 0.21
8   , 0   , 2.41  , 2.16  , New , 0.25
9   , 0   , 2.37  , 2.14  , New , 0.23
10  , 0   , 2.36  , 2.16  , New , 0.2
11  , 0   , 2.36  , 2.17  , New , 0.19
12  , 0   , 2.35  , 2.15  , New , 0.2
13  , 0   , 2.37  , 2.16  , New , 0.21
14  , 0   , 2.37  , 2.16  , New , 0.21
15  , 0   , 2.39  , 2.15  , New , 0.24
16  , 0   , 2.36  , 2.14  , New , 0.22
17  , 0   , 2.35  , 2.14  , New , 0.21
18  , 0   , 2.36  , 2.14  , New , 0.22
19  , 0   , 2.37  , 2.14  , New , 0.23
20  , 0   , 2.37  , 2.16  , New , 0.21
21  , 0   , 2.38  , 2.16  , New , 0.22
22  , 0   , 2.38  , 2.14  , New , 0.24
23  , 0   , 2.33  , 2.11  , New , 0.22
24  , 0   , 2.3   , 2.07  , New , 0.23
25  , 0   , 2.27  , 2.06  , New , 0.21
26  , 0   , 2.26  , 2.06  , New , 0.2
27  , 0   , 2.28  , 2.1   , New , 0.18
28  , 0   , 2.34  , 2.13  , New , 0.21
29  , 0   , 2.34  , 2.09  , New , 0.25
30  , 0   , 2.29  , 2.09  , New , 0.2
31  , 0   , 2.31  , 2.08  , New , 0.23

For strchr-avx2 the results are a lot closer, as the optimizations were smaller, but the trend is still improvement, especially on Skylake (the only one of the benchmark CPUs on which this implementation will actually be used).
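For reference, the per-vector check that both strchr-avx2.S and strchr-evex.S build on -- compare the vector against CHAR, compare it against zero, OR the two results, take a byte mask, and tzcnt it -- can be sketched in C with AVX2 intrinsics. This is only a sketch: the strchr_avx2_sketch name is invented for the example, and unlike the real code it ignores the page-crossing problem that the L(cross_page_boundary) path exists to solve.

#include <immintrin.h>
#include <stddef.h>

/* Sketch only: scan 32 bytes at a time for CHAR or the null byte.
   Assumes unaligned loads never fault, i.e. it skips the page-cross
   handling the real implementation needs.  */
static char *
strchr_avx2_sketch (const char *s, int c)
{
  const __m256i vchar = _mm256_set1_epi8 ((char) c);
  const __m256i vzero = _mm256_setzero_si256 ();
  for (;; s += 32)
    {
      __m256i data = _mm256_loadu_si256 ((const __m256i *) s);
      /* One mask bit per byte that equals CHAR or equals 0 (vpor +
	 vpmovmskb in the assembly).  */
      unsigned int mask
	= _mm256_movemask_epi8 (_mm256_or_si256
				(_mm256_cmpeq_epi8 (data, vchar),
				 _mm256_cmpeq_epi8 (data, vzero)));
      if (mask != 0)
	{
	  size_t idx = __builtin_ctz (mask);	/* tzcntl in the assembly.  */
	  /* strchr must return NULL if the null byte came first;
	     strchrnul would return s + idx unconditionally.  This is
	     the cmp/jne L(zero) at each return site.  */
	  return s[idx] == (char) c ? (char *) (s + idx) : NULL;
	}
    }
}

The jne-to-L(zero) return path in the patch replaces the old xorl %edx / cmovne sequence visible in the removed lines of the diff below.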
Results For Skylake strchr-avx2
size, algn, Cur T , New T , Win , Dif
32  , 0   , 6.04  , 5.02  , New , 1.02
32  , 1   , 6.19  , 4.94  , New , 1.25
64  , 0   , 6.68  , 5.92  , New , 0.76
64  , 2   , 6.59  , 5.95  , New , 0.64
128 , 0   , 7.66  , 7.42  , New , 0.24
128 , 3   , 7.66  , 7.4   , New , 0.26
256 , 0   , 14.68 , 12.93 , New , 1.75
256 , 4   , 14.74 , 12.88 , New , 1.86
512 , 0   , 20.81 , 17.47 , New , 3.34
512 , 5   , 20.73 , 17.44 , New , 3.29
1024, 0   , 33.16 , 27.06 , New , 6.1
1024, 6   , 33.15 , 27.09 , New , 6.06
2048, 0   , 59.06 , 56.15 , New , 2.91
2048, 7   , 59.0  , 53.92 , New , 5.08
64  , 1   , 6.56  , 5.86  , New , 0.7
64  , 3   , 6.55  , 5.99  , New , 0.56
64  , 4   , 6.61  , 5.96  , New , 0.65
64  , 5   , 6.52  , 5.94  , New , 0.58
64  , 6   , 6.62  , 5.95  , New , 0.67
64  , 7   , 6.61  , 6.11  , New , 0.5
256 , 16  , 14.64 , 12.85 , New , 1.79
256 , 32  , 15.2  , 12.97 , New , 2.23
256 , 48  , 15.13 , 13.33 , New , 1.8
256 , 64  , 16.18 , 13.46 , New , 2.72
256 , 80  , 16.26 , 13.49 , New , 2.77
256 , 96  , 13.13 , 11.43 , New , 1.7
256 , 112 , 13.12 , 11.4  , New , 1.72
0   , 0   , 5.36  , 4.25  , New , 1.11
1   , 0   , 5.28  , 4.24  , New , 1.04
2   , 0   , 5.27  , 4.2   , New , 1.07
3   , 0   , 5.27  , 4.23  , New , 1.04
4   , 0   , 5.36  , 4.3   , New , 1.06
5   , 0   , 5.35  , 4.29  , New , 1.06
6   , 0   , 5.38  , 4.35  , New , 1.03
7   , 0   , 5.39  , 4.28  , New , 1.11
8   , 0   , 5.5   , 4.45  , New , 1.05
9   , 0   , 5.47  , 4.43  , New , 1.04
10  , 0   , 5.5   , 4.4   , New , 1.1
11  , 0   , 5.51  , 4.44  , New , 1.07
12  , 0   , 5.49  , 4.44  , New , 1.05
13  , 0   , 5.49  , 4.46  , New , 1.03
14  , 0   , 5.49  , 4.46  , New , 1.03
15  , 0   , 5.51  , 4.43  , New , 1.08
16  , 0   , 5.52  , 4.48  , New , 1.04
17  , 0   , 5.57  , 4.47  , New , 1.1
18  , 0   , 5.56  , 4.52  , New , 1.04
19  , 0   , 5.54  , 4.46  , New , 1.08
20  , 0   , 5.53  , 4.48  , New , 1.05
21  , 0   , 5.54  , 4.48  , New , 1.06
22  , 0   , 5.57  , 4.45  , New , 1.12
23  , 0   , 5.57  , 4.48  , New , 1.09
24  , 0   , 5.53  , 4.43  , New , 1.1
25  , 0   , 5.49  , 4.42  , New , 1.07
26  , 0   , 5.5   , 4.44  , New , 1.06
27  , 0   , 5.48  , 4.44  , New , 1.04
28  , 0   , 5.48  , 4.43  , New , 1.05
29  , 0   , 5.54  , 4.41  , New , 1.13
30  , 0   , 5.49  , 4.4   , New , 1.09
31  , 0   , 5.46  , 4.4   , New , 1.06

Results For Tigerlake strchr-avx2
size, algn, Cur T , New T , Win , Dif
32  , 0   , 5.88  , 5.47  , New , 0.41
32  , 1   , 5.73  , 5.46  , New , 0.27
64  , 0   , 6.32  , 6.1   , New , 0.22
64  , 2   , 6.17  , 6.11  , New , 0.06
128 , 0   , 7.93  , 7.68  , New , 0.25
128 , 3   , 7.93  , 7.73  , New , 0.2
256 , 0   , 14.87 , 14.5  , New , 0.37
256 , 4   , 14.96 , 14.59 , New , 0.37
512 , 0   , 21.25 , 20.18 , New , 1.07
512 , 5   , 21.25 , 20.11 , New , 1.14
1024, 0   , 33.17 , 31.26 , New , 1.91
1024, 6   , 33.14 , 31.13 , New , 2.01
2048, 0   , 53.39 , 52.51 , New , 0.88
2048, 7   , 53.3  , 52.34 , New , 0.96
64  , 1   , 6.11  , 6.09  , New , 0.02
64  , 3   , 6.04  , 6.01  , New , 0.03
64  , 4   , 6.04  , 6.03  , New , 0.01
64  , 5   , 6.13  , 6.05  , New , 0.08
64  , 6   , 6.09  , 6.06  , New , 0.03
64  , 7   , 6.04  , 6.03  , New , 0.01
256 , 16  , 14.77 , 14.39 , New , 0.38
256 , 32  , 15.58 , 15.27 , New , 0.31
256 , 48  , 15.88 , 15.32 , New , 0.56
256 , 64  , 16.85 , 16.01 , New , 0.84
256 , 80  , 16.83 , 16.03 , New , 0.8
256 , 96  , 13.5  , 13.14 , New , 0.36
256 , 112 , 13.71 , 13.24 , New , 0.47
0   , 0   , 3.78  , 3.76  , New , 0.02
1   , 0   , 3.79  , 3.76  , New , 0.03
2   , 0   , 3.82  , 3.77  , New , 0.05
3   , 0   , 3.78  , 3.76  , New , 0.02
4   , 0   , 3.75  , 3.75  , Eq  , 0.0
5   , 0   , 3.77  , 3.74  , New , 0.03
6   , 0   , 3.78  , 3.76  , New , 0.02
7   , 0   , 3.91  , 3.85  , New , 0.06
8   , 0   , 3.76  , 3.77  , Cur , 0.01
9   , 0   , 3.75  , 3.75  , Eq  , 0.0
10  , 0   , 3.76  , 3.76  , Eq  , 0.0
11  , 0   , 3.77  , 3.75  , New , 0.02
12  , 0   , 3.79  , 3.77  ,
New , 0.02 13 , 0 , 3.86 , 3.86 , Eq , 0.0 14 , 0 , 4.2 , 4.2 , Eq , 0.0 15 , 0 , 4.17 , 4.07 , New , 0.1 16 , 0 , 4.1 , 4.1 , Eq , 0.0 17 , 0 , 4.12 , 4.09 , New , 0.03 18 , 0 , 4.12 , 4.12 , Eq , 0.0 19 , 0 , 4.18 , 4.09 , New , 0.09 20 , 0 , 4.14 , 4.09 , New , 0.05 21 , 0 , 4.15 , 4.11 , New , 0.04 22 , 0 , 4.23 , 4.13 , New , 0.1 23 , 0 , 4.18 , 4.16 , New , 0.02 24 , 0 , 4.13 , 4.21 , Cur , 0.08 25 , 0 , 4.17 , 4.15 , New , 0.02 26 , 0 , 4.17 , 4.16 , New , 0.01 27 , 0 , 4.18 , 4.16 , New , 0.02 28 , 0 , 4.17 , 4.15 , New , 0.02 29 , 0 , 4.2 , 4.13 , New , 0.07 30 , 0 , 4.16 , 4.12 , New , 0.04 31 , 0 , 4.15 , 4.15 , Eq , 0.0 Results For Icelake strchr-avx2 size, algn, Cur T , New T , Win , Dif 32 , 0 , 3.73 , 3.72 , New , 0.01 32 , 1 , 3.46 , 3.44 , New , 0.02 64 , 0 , 3.96 , 3.87 , New , 0.09 64 , 2 , 3.92 , 3.87 , New , 0.05 128 , 0 , 5.15 , 4.9 , New , 0.25 128 , 3 , 5.12 , 4.87 , New , 0.25 256 , 0 , 9.79 , 9.45 , New , 0.34 256 , 4 , 9.76 , 9.52 , New , 0.24 512 , 0 , 13.93 , 12.89 , New , 1.04 512 , 5 , 13.84 , 13.02 , New , 0.82 1024, 0 , 21.41 , 19.92 , New , 1.49 1024, 6 , 21.69 , 20.12 , New , 1.57 2048, 0 , 35.12 , 33.83 , New , 1.29 2048, 7 , 35.13 , 33.99 , New , 1.14 64 , 1 , 3.96 , 3.9 , New , 0.06 64 , 3 , 3.88 , 3.86 , New , 0.02 64 , 4 , 3.87 , 3.83 , New , 0.04 64 , 5 , 3.9 , 3.85 , New , 0.05 64 , 6 , 3.9 , 3.89 , New , 0.01 64 , 7 , 3.9 , 3.84 , New , 0.06 256 , 16 , 9.76 , 9.4 , New , 0.36 256 , 32 , 10.36 , 9.97 , New , 0.39 256 , 48 , 10.5 , 10.02 , New , 0.48 256 , 64 , 11.13 , 10.55 , New , 0.58 256 , 80 , 11.14 , 10.56 , New , 0.58 256 , 96 , 8.98 , 8.57 , New , 0.41 256 , 112 , 9.1 , 8.66 , New , 0.44 0 , 0 , 2.52 , 2.49 , New , 0.03 1 , 0 , 2.56 , 2.53 , New , 0.03 2 , 0 , 2.6 , 2.54 , New , 0.06 3 , 0 , 2.63 , 2.58 , New , 0.05 4 , 0 , 2.63 , 2.6 , New , 0.03 5 , 0 , 2.65 , 2.62 , New , 0.03 6 , 0 , 2.75 , 2.73 , New , 0.02 7 , 0 , 2.73 , 2.76 , Cur , 0.03 8 , 0 , 2.61 , 2.6 , New , 0.01 9 , 0 , 2.73 , 2.74 , Cur , 0.01 10 , 0 , 2.72 , 2.71 , New , 0.01 11 , 0 , 2.74 , 2.72 , New , 0.02 12 , 0 , 2.73 , 2.74 , Cur , 0.01 13 , 0 , 2.73 , 2.75 , Cur , 0.02 14 , 0 , 2.74 , 2.72 , New , 0.02 15 , 0 , 2.74 , 2.72 , New , 0.02 16 , 0 , 2.75 , 2.74 , New , 0.01 17 , 0 , 2.73 , 2.74 , Cur , 0.01 18 , 0 , 2.72 , 2.73 , Cur , 0.01 19 , 0 , 2.74 , 2.72 , New , 0.02 20 , 0 , 2.75 , 2.71 , New , 0.04 21 , 0 , 2.74 , 2.74 , Eq , 0.0 22 , 0 , 2.73 , 2.74 , Cur , 0.01 23 , 0 , 2.7 , 2.72 , Cur , 0.02 24 , 0 , 2.68 , 2.68 , Eq , 0.0 25 , 0 , 2.65 , 2.63 , New , 0.02 26 , 0 , 2.64 , 2.62 , New , 0.02 27 , 0 , 2.71 , 2.68 , New , 0.03 28 , 0 , 2.72 , 2.68 , New , 0.04 29 , 0 , 2.68 , 2.74 , Cur , 0.06 30 , 0 , 2.65 , 2.65 , Eq , 0.0 31 , 0 , 2.7 , 2.68 , New , 0.02 sysdeps/x86_64/multiarch/strchr-avx2.S | 294 +++++++++++++++---------- 1 file changed, 173 insertions(+), 121 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S index 25bec38b5d..220165d2ba 100644 --- a/sysdeps/x86_64/multiarch/strchr-avx2.S +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S @@ -49,132 +49,144 @@ .section SECTION(.text),"ax",@progbits ENTRY (STRCHR) - movl %edi, %ecx -# ifndef USE_AS_STRCHRNUL - xorl %edx, %edx -# endif - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax + VPBROADCAST %xmm0, %ymm0 vpxor %xmm9, %xmm9, %xmm9 - VPBROADCAST %xmm0, %ymm0 /* Check if we cross page boundary with one vector load. 
*/ - andl $(PAGE_SIZE - 1), %ecx - cmpl $(PAGE_SIZE - VEC_SIZE), %ecx - ja L(cross_page_boundary) + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Search for both CHAR and the null byte. */ vmovdqu (%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax + vpmovmskb %ymm1, %eax testl %eax, %eax - jz L(more_vecs) + jz L(aligned_more) tzcntl %eax, %eax - /* Found CHAR or the null byte. */ - addq %rdi, %rax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - - .p2align 4 -L(more_vecs): - /* Align data for aligned loads in the loop. */ - andq $-VEC_SIZE, %rdi -L(aligned_more): - - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vmovdqa VEC_SIZE(%rdi), %ymm8 - addq $VEC_SIZE, %rdi - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x0) - - vmovdqa VEC_SIZE(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x1) - - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x2) - - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(prep_loop_4x) + addq %rdi, %rax + VZEROUPPER_RETURN + /* .p2align 5 helps keep performance more consistent if ENTRY() + alignment % 32 was either 16 or 0. As well this makes the + alignment % 32 of the loop_4x_vec fixed which makes tuning it + easier. */ + .p2align 5 +L(first_vec_x4): tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + addq $(VEC_SIZE * 3 + 1), %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif + addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 -L(first_vec_x0): - tzcntl %eax, %eax - /* Found CHAR or the null byte. */ - addq %rdi, %rax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax -# endif +L(zero): + xorl %eax, %eax VZEROUPPER_RETURN +# endif + .p2align 4 L(first_vec_x1): tzcntl %eax, %eax - leaq VEC_SIZE(%rdi, %rax), %rax + incq %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif + addq %rdi, %rax VZEROUPPER_RETURN .p2align 4 L(first_vec_x2): tzcntl %eax, %eax + addq $(VEC_SIZE + 1), %rdi +# ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) +# endif + addq %rdi, %rax + VZEROUPPER_RETURN + + .p2align 4 +L(first_vec_x3): + tzcntl %eax, %eax + addq $(VEC_SIZE * 2 + 1), %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero) # endif + addq %rdi, %rax VZEROUPPER_RETURN -L(prep_loop_4x): - /* Align data to 4 * VEC_SIZE. 
*/ - andq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(aligned_more): + /* Align data to VEC_SIZE - 1. This is the same number of + instructions as using andq -VEC_SIZE but saves 4 bytes of code on + x4 check. */ + orq $(VEC_SIZE - 1), %rdi +L(cross_page_continue): + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since + data is only aligned to VEC_SIZE. */ + vmovdqa 1(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x1) + vmovdqa (VEC_SIZE + 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x2) + + vmovdqa (VEC_SIZE * 2 + 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x3) + + vmovdqa (VEC_SIZE * 3 + 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x4) + /* Align data to VEC_SIZE * 4 - 1. */ + addq $(VEC_SIZE * 4 + 1), %rdi + andq $-(VEC_SIZE * 4), %rdi .p2align 4 L(loop_4x_vec): /* Compare 4 * VEC at a time forward. */ - vmovdqa (VEC_SIZE * 4)(%rdi), %ymm5 - vmovdqa (VEC_SIZE * 5)(%rdi), %ymm6 - vmovdqa (VEC_SIZE * 6)(%rdi), %ymm7 - vmovdqa (VEC_SIZE * 7)(%rdi), %ymm8 + vmovdqa (%rdi), %ymm5 + vmovdqa (VEC_SIZE)(%rdi), %ymm6 + vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7 + vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 /* Leaves only CHARS matching esi as 0. */ vpxor %ymm5, %ymm0, %ymm1 @@ -190,62 +202,102 @@ L(loop_4x_vec): VPMINU %ymm1, %ymm2, %ymm5 VPMINU %ymm3, %ymm4, %ymm6 - VPMINU %ymm5, %ymm6, %ymm5 + VPMINU %ymm5, %ymm6, %ymm6 - VPCMPEQ %ymm5, %ymm9, %ymm5 - vpmovmskb %ymm5, %eax + VPCMPEQ %ymm6, %ymm9, %ymm6 + vpmovmskb %ymm6, %ecx + subq $-(VEC_SIZE * 4), %rdi + testl %ecx, %ecx + jz L(loop_4x_vec) - addq $(VEC_SIZE * 4), %rdi - testl %eax, %eax - jz L(loop_4x_vec) - VPCMPEQ %ymm1, %ymm9, %ymm1 - vpmovmskb %ymm1, %eax + VPCMPEQ %ymm1, %ymm9, %ymm1 + vpmovmskb %ymm1, %eax testl %eax, %eax - jnz L(first_vec_x0) + jnz L(last_vec_x0) + - VPCMPEQ %ymm2, %ymm9, %ymm2 - vpmovmskb %ymm2, %eax + VPCMPEQ %ymm5, %ymm9, %ymm2 + vpmovmskb %ymm2, %eax testl %eax, %eax - jnz L(first_vec_x1) + jnz L(last_vec_x1) + + VPCMPEQ %ymm3, %ymm9, %ymm3 + vpmovmskb %ymm3, %eax + /* rcx has combined result from all 4 VEC. It will only be used if + the first 3 other VEC all did not contain a match. */ + salq $32, %rcx + orq %rcx, %rax + tzcntq %rax, %rax + subq $(VEC_SIZE * 2), %rdi +# ifndef USE_AS_STRCHRNUL + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero_end) +# endif + addq %rdi, %rax + VZEROUPPER_RETURN + - VPCMPEQ %ymm3, %ymm9, %ymm3 - VPCMPEQ %ymm4, %ymm9, %ymm4 - vpmovmskb %ymm3, %ecx - vpmovmskb %ymm4, %eax - salq $32, %rax - orq %rcx, %rax - tzcntq %rax, %rax - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax + .p2align 4 +L(last_vec_x0): + tzcntl %eax, %eax + addq $-(VEC_SIZE * 4), %rdi # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero_end) # endif + addq %rdi, %rax VZEROUPPER_RETURN +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + VZEROUPPER_RETURN +# endif + + .p2align 4 +L(last_vec_x1): + tzcntl %eax, %eax + subq $(VEC_SIZE * 3), %rdi +# ifndef USE_AS_STRCHRNUL + /* Found CHAR or the null byte. 
*/ + cmp (%rdi, %rax), %CHAR_REG + jne L(zero_end) +# endif + addq %rdi, %rax + VZEROUPPER_RETURN + + /* Cold case for crossing page with first load. */ .p2align 4 L(cross_page_boundary): - andq $-VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - - vmovdqa (%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 + movq %rdi, %rdx + /* Align rdi to VEC_SIZE - 1. */ + orq $(VEC_SIZE - 1), %rdi + vmovdqa -(VEC_SIZE - 1)(%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - /* Remove the leading bits. */ - sarxl %ecx, %eax, %eax + vpmovmskb %ymm1, %eax + /* Remove the leading bytes. sarxl only uses bits [5:0] of COUNT + so no need to manually mod edx. */ + sarxl %edx, %eax, %eax testl %eax, %eax - jz L(aligned_more) + jz L(cross_page_continue) tzcntl %eax, %eax - addq %rcx, %rdi - addq %rdi, %rax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + xorl %ecx, %ecx + /* Found CHAR or the null byte. */ + cmp (%rdx, %rax), %CHAR_REG + leaq (%rdx, %rax), %rax + cmovne %rcx, %rax +# else + addq %rdx, %rax # endif - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN END (STRCHR) # endif From patchwork Wed Apr 21 21:39:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 43064 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id BA7F5398B86E; Wed, 21 Apr 2021 21:40:50 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BA7F5398B86E DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1619041250; bh=TwFZph6rpqUQx60nGrcmk58uo9BIAdT/vBPYpbwY3Mk=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=dMjid8CKa/scHSqCuguPFXykLivQStnUNGn2ZnaLOGzwglIAVNQoqXvFa1FNjAIEF vhTYOjk5yQnaN3bSrj5vhiwL+VPz3aVi5m3gEKiredNk/GYWznLpJJo6I+goEAnwiT x5YqnV0aKHVAF1s0/wlwWp0r7ANvyqzmxxd8PuvI= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qk1-x72c.google.com (mail-qk1-x72c.google.com [IPv6:2607:f8b0:4864:20::72c]) by sourceware.org (Postfix) with ESMTPS id 3FB55398B879 for ; Wed, 21 Apr 2021 21:40:47 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 3FB55398B879 Received: by mail-qk1-x72c.google.com with SMTP id 8so10642576qkv.8 for ; Wed, 21 Apr 2021 14:40:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=TwFZph6rpqUQx60nGrcmk58uo9BIAdT/vBPYpbwY3Mk=; b=aO0k/OaDjOTj3Rebsc0lhkRgOu5RarJqN6XaKRAG8nI7UWNdBIlEHbpuE3hoJiiEJo ZV/mh8+4yXsesWiGmUP+CZHNEzJiLoIrGTmkf0SdBFVIrHTK0r1SLaziJaQdlO/Z/roR 4wkgRrx5b2nBywYmEuUcXep/wpfC9OgzBeLI5DZHGb2C2IlIC3h8Fyx+SMbrF97WVvDh bxguf6YNLDr5lCdT0brPA/GecgEMb9cB07PS+NluO9EU/Gys6B8NDTBbJBnD0jqhxp+W GNyQzlRSG6hzSIr49Q00kTcCVzv2ROKm3cXumjkowAsQBiKakrPbcaRbv66qJEuDF0i3 Q45A== X-Gm-Message-State: AOAM5333oWA1pdQoiYEtVGlSwg0e6uWiFBbMpoZMj0j5EOZk0nDV73aO XB1336eyZHuj5Vzhwdt0fAQ306GE15Q= X-Google-Smtp-Source: ABdhPJwL8arWhhZtoSlMej3mVeUwY4xLSrgomYmjyzFgJgUw9c/xgr0sZGKbKWSh9E+KuoI6sG6W2w== X-Received: by 2002:a37:8a46:: with SMTP id m67mr257626qkd.259.1619041246253; Wed, 21 Apr 
2021 14:40:46 -0700 (PDT) Received: from localhost.localdomain (pool-71-245-178-39.pitbpa.fios.verizon.net. [71.245.178.39]) by smtp.googlemail.com with ESMTPSA id m29sm572365qkm.101.2021.04.21.14.40.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 21 Apr 2021 14:40:45 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 2/2] x86: Optimize strchr-evex.S Date: Wed, 21 Apr 2021 17:39:53 -0400 Message-Id: <20210421213951.404588-2-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20210421213951.404588-1-goldstein.w.n@gmail.com> References: <20210421213951.404588-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" No bug. This commit optimizes strlen-evex.S. The optimizations are mostly small things such as save an ALU in the alignment process, saving a few instructions in the loop return. The one significant change is saving 2 instructions in the 4x loop. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein --- sysdeps/x86_64/multiarch/strchr-evex.S | 388 ++++++++++++++----------- 1 file changed, 214 insertions(+), 174 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S index ddc86a7058..7cd111e96c 100644 --- a/sysdeps/x86_64/multiarch/strchr-evex.S +++ b/sysdeps/x86_64/multiarch/strchr-evex.S @@ -24,23 +24,26 @@ # define STRCHR __strchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# define VMOVU vmovdqu64 +# define VMOVA vmovdqa64 # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd # define VPCMP vpcmpd # define VPMINU vpminud # define CHAR_REG esi -# define SHIFT_REG r8d +# define SHIFT_REG ecx +# define CHAR_SIZE 4 # else # define VPBROADCAST vpbroadcastb # define VPCMP vpcmpb # define VPMINU vpminub # define CHAR_REG sil -# define SHIFT_REG ecx +# define SHIFT_REG edx +# define CHAR_SIZE 1 # endif + # define XMMZERO xmm16 # define YMMZERO ymm16 @@ -56,23 +59,20 @@ # define VEC_SIZE 32 # define PAGE_SIZE 4096 +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section .text.evex,"ax",@progbits ENTRY (STRCHR) - movl %edi, %ecx -# ifndef USE_AS_STRCHRNUL - xorl %edx, %edx -# endif - /* Broadcast CHAR to YMM0. */ - VPBROADCAST %esi, %YMM0 - + VPBROADCAST %esi, %YMM0 + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax vpxorq %XMMZERO, %XMMZERO, %XMMZERO - /* Check if we cross page boundary with one vector load. */ - andl $(PAGE_SIZE - 1), %ecx - cmpl $(PAGE_SIZE - VEC_SIZE), %ecx - ja L(cross_page_boundary) + /* Check if we cross page boundary with one vector load. Otherwise + it is safe to use an unaligned load. */ + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Search for both CHAR and the null bytes. 
*/ @@ -83,251 +83,291 @@ ENTRY (STRCHR) VPMINU %YMM2, %YMM1, %YMM2 /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ VPCMP $0, %YMMZERO, %YMM2, %k0 - ktestd %k0, %k0 - jz L(more_vecs) kmovd %k0, %eax + testl %eax, %eax + jz L(aligned_more) tzcntl %eax, %eax - /* Found CHAR or the null byte. */ # ifdef USE_AS_WCSCHR /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (%rdi, %rax, 4), %rax + leaq (%rdi, %rax, CHAR_SIZE), %rax # else addq %rdi, %rax # endif # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (%rax), %CHAR_REG + jne L(zero) # endif ret - .p2align 4 -L(more_vecs): - /* Align data for aligned loads in the loop. */ - andq $-VEC_SIZE, %rdi -L(aligned_more): - - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - VMOVA VEC_SIZE(%rdi), %YMM1 - addq $VEC_SIZE, %rdi - - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x0) - - VMOVA VEC_SIZE(%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x1) - - VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x2) - - VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM2, %k0 - ktestd %k0, %k0 - jz L(prep_loop_4x) - - kmovd %k0, %eax + /* .p2align 5 helps keep performance more consistent if ENTRY() + alignment % 32 was either 16 or 0. As well this makes the + alignment % 32 of the loop_4x_vec fixed which makes tuning it + easier. */ + .p2align 5 +L(first_vec_x3): tzcntl %eax, %eax +# ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (VEC_SIZE * 3)(%rdi, %rax, 4), %rax -# else - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero) # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + ret + # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax -# endif +L(zero): + xorl %eax, %eax ret +# endif .p2align 4 -L(first_vec_x0): +L(first_vec_x4): +# ifndef USE_AS_STRCHRNUL + /* Check to see if first match was CHAR (k0) or null (k1). */ + kmovd %k0, %eax tzcntl %eax, %eax - /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (%rdi, %rax, 4), %rax + kmovd %k1, %ecx + /* bzhil will not be 0 if first match was null. */ + bzhil %eax, %ecx, %ecx + jne L(zero) # else - addq %rdi, %rax -# endif -# ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Combine CHAR and null matches. 
*/ + kord %k0, %k1, %k0 + kmovd %k0, %eax + tzcntl %eax, %eax # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax ret .p2align 4 L(first_vec_x1): tzcntl %eax, %eax - /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq VEC_SIZE(%rdi, %rax, 4), %rax -# else - leaq VEC_SIZE(%rdi, %rax), %rax -# endif # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Found CHAR or the null byte. */ + cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero) + # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret .p2align 4 L(first_vec_x2): +# ifndef USE_AS_STRCHRNUL + /* Check to see if first match was CHAR (k0) or null (k1). */ + kmovd %k0, %eax tzcntl %eax, %eax - /* Found CHAR or the null byte. */ -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, 4), %rax + kmovd %k1, %ecx + /* bzhil will not be 0 if first match was null. */ + bzhil %eax, %ecx, %ecx + jne L(zero) # else - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax -# endif -# ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Combine CHAR and null matches. */ + kord %k0, %k1, %k0 + kmovd %k0, %eax + tzcntl %eax, %eax # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret -L(prep_loop_4x): - /* Align data to 4 * VEC_SIZE. */ + .p2align 4 +L(aligned_more): + /* Align data to VEC_SIZE. */ + andq $-VEC_SIZE, %rdi +L(cross_page_continue): + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since + data is only aligned to VEC_SIZE. Use two alternating methods for + checking VEC to balance latency and port contention. */ + + /* This method has higher latency but has better port + distribution. */ + VMOVA (VEC_SIZE)(%rdi), %YMM1 + /* Leaves only CHARS matching esi as 0. */ + vpxorq %YMM1, %YMM0, %YMM2 + VPMINU %YMM2, %YMM1, %YMM2 + /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ + VPCMP $0, %YMMZERO, %YMM2, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(first_vec_x1) + + /* This method has higher latency but has better port + distribution. */ + VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 + /* Each bit in K0 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMM0, %k0 + /* Each bit in K1 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMMZERO, %k1 + kortestd %k0, %k1 + jnz L(first_vec_x2) + + VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 + /* Leaves only CHARS matching esi as 0. */ + vpxorq %YMM1, %YMM0, %YMM2 + VPMINU %YMM2, %YMM1, %YMM2 + /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ + VPCMP $0, %YMMZERO, %YMM2, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(first_vec_x3) + + VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 + /* Each bit in K0 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMM0, %k0 + /* Each bit in K1 represents a CHAR in YMM1. */ + VPCMP $0, %YMM1, %YMMZERO, %k1 + kortestd %k0, %k1 + jnz L(first_vec_x4) + + /* Align data to VEC_SIZE * 4 for the loop. */ + addq $VEC_SIZE, %rdi andq $-(VEC_SIZE * 4), %rdi .p2align 4 L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ + /* Check 4x VEC at a time. No penalty to imm32 offset with evex + encoding. 
*/ VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 VMOVA (VEC_SIZE * 5)(%rdi), %YMM2 VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 VMOVA (VEC_SIZE * 7)(%rdi), %YMM4 - /* Leaves only CHARS matching esi as 0. */ + /* For YMM1 and YMM3 use xor to set the CHARs matching esi to zero. */ vpxorq %YMM1, %YMM0, %YMM5 - vpxorq %YMM2, %YMM0, %YMM6 + /* For YMM2 and YMM4 cmp not equals to CHAR and store result in k + register. Its possible to save either 1 or 2 instructions using cmp no + equals method for either YMM1 or YMM1 and YMM3 respectively but + bottleneck on p5 makes it no worth it. */ + VPCMP $4, %YMM0, %YMM2, %k2 vpxorq %YMM3, %YMM0, %YMM7 - vpxorq %YMM4, %YMM0, %YMM8 - - VPMINU %YMM5, %YMM1, %YMM5 - VPMINU %YMM6, %YMM2, %YMM6 - VPMINU %YMM7, %YMM3, %YMM7 - VPMINU %YMM8, %YMM4, %YMM8 - - VPMINU %YMM5, %YMM6, %YMM1 - VPMINU %YMM7, %YMM8, %YMM2 - - VPMINU %YMM1, %YMM2, %YMM1 - - /* Each bit in K0 represents a CHAR or a null byte. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - - addq $(VEC_SIZE * 4), %rdi - - ktestd %k0, %k0 + VPCMP $4, %YMM0, %YMM4, %k4 + + /* Use min to select all zeros (either from xor or end of string). */ + VPMINU %YMM1, %YMM5, %YMM1 + VPMINU %YMM3, %YMM7, %YMM3 + + /* Use min + zeromask to select for zeros. Since k2 and k4 will be + have 0 as positions that matched with CHAR which will set zero in + the corresponding destination bytes in YMM2 / YMM4. */ + VPMINU %YMM1, %YMM2, %YMM2{%k2}{z} + VPMINU %YMM3, %YMM4, %YMM4 + VPMINU %YMM2, %YMM4, %YMM4{%k4}{z} + + VPCMP $0, %YMMZERO, %YMM4, %k1 + kmovd %k1, %ecx + subq $-(VEC_SIZE * 4), %rdi + testl %ecx, %ecx jz L(loop_4x_vec) - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM5, %k0 + VPCMP $0, %YMMZERO, %YMM1, %k0 kmovd %k0, %eax testl %eax, %eax - jnz L(first_vec_x0) + jnz L(last_vec_x1) - /* Each bit in K1 represents a CHAR or a null byte in YMM2. */ - VPCMP $0, %YMMZERO, %YMM6, %k1 - kmovd %k1, %eax + VPCMP $0, %YMMZERO, %YMM2, %k0 + kmovd %k0, %eax testl %eax, %eax - jnz L(first_vec_x1) - - /* Each bit in K2 represents a CHAR or a null byte in YMM3. */ - VPCMP $0, %YMMZERO, %YMM7, %k2 - /* Each bit in K3 represents a CHAR or a null byte in YMM4. */ - VPCMP $0, %YMMZERO, %YMM8, %k3 + jnz L(last_vec_x2) + VPCMP $0, %YMMZERO, %YMM3, %k0 + kmovd %k0, %eax + /* Combine YMM3 matches (eax) with YMM4 matches (ecx). */ # ifdef USE_AS_WCSCHR - /* NB: Each bit in K2/K3 represents 4-byte element. */ - kshiftlw $8, %k3, %k1 + sall $8, %ecx + orl %ecx, %eax + tzcntl %eax, %eax # else - kshiftlq $32, %k3, %k1 + salq $32, %rcx + orq %rcx, %rax + tzcntq %rax, %rax # endif +# ifndef USE_AS_STRCHRNUL + /* Check if match was CHAR or null. */ + cmp (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) +# endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + ret - /* Each bit in K1 represents a NULL or a mismatch. */ - korq %k1, %k2, %k1 - kmovq %k1, %rax +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + ret +# endif - tzcntq %rax, %rax -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, 4), %rax -# else - leaq (VEC_SIZE * 2)(%rdi, %rax), %rax + .p2align 4 +L(last_vec_x1): + tzcntl %eax, %eax +# ifndef USE_AS_STRCHRNUL + /* Check if match was null. */ + cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. 
*/ + leaq (%rdi, %rax, CHAR_SIZE), %rax + ret + + .p2align 4 +L(last_vec_x2): + tzcntl %eax, %eax # ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + /* Check if match was null. */ + cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) # endif + /* NB: Multiply sizeof char type (1 or 4) to get the number of + bytes. */ + leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret /* Cold case for crossing page with first load. */ .p2align 4 L(cross_page_boundary): + movq %rdi, %rdx + /* Align rdi. */ andq $-VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - VMOVA (%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ vpxorq %YMM1, %YMM0, %YMM2 VPMINU %YMM2, %YMM1, %YMM2 /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ VPCMP $0, %YMMZERO, %YMM2, %k0 kmovd %k0, %eax - testl %eax, %eax - + /* Remove the leading bits. */ # ifdef USE_AS_WCSCHR + movl %edx, %SHIFT_REG /* NB: Divide shift count by 4 since each bit in K1 represent 4 bytes. */ - movl %ecx, %SHIFT_REG - sarl $2, %SHIFT_REG + sarl $2, %SHIFT_REG + andl $(CHAR_PER_VEC - 1), %SHIFT_REG # endif - - /* Remove the leading bits. */ sarxl %SHIFT_REG, %eax, %eax + /* If eax is zero continue. */ testl %eax, %eax - - jz L(aligned_more) + jz L(cross_page_continue) tzcntl %eax, %eax - addq %rcx, %rdi +# ifndef USE_AS_STRCHRNUL + /* Check to see if match was CHAR or null. */ + cmp (%rdx, %rax, CHAR_SIZE), %CHAR_REG + jne L(zero_end) +# endif # ifdef USE_AS_WCSCHR /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq (%rdi, %rax, 4), %rax + leaq (%rdx, %rax, CHAR_SIZE), %rax # else - addq %rdi, %rax -# endif -# ifndef USE_AS_STRCHRNUL - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax + addq %rdx, %rax # endif ret
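One detail the rewritten strchr-avx2.S and strchr-evex.S share is the page-cross path: when the first unaligned load could touch the next page, both load the aligned 32-byte block that contains the start of the string and then shift the match mask right by the misalignment, and the patch comments note that sarx only looks at the low bits of the count register, so the unmasked pointer can be used directly as the shift count. The following C sketch of that idea uses AVX2 intrinsics and an invented function name; the wide-character (wcschr) variant additionally divides the shift count by 4, as the strchr-evex.S comments explain.

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: first-vector check when S is within 32 bytes of a page
   end.  The aligned 32-byte block containing S stays inside S's page,
   so loading it cannot fault; shifting the mask right by the
   misalignment drops the bits for bytes that precede S.  */
static unsigned int
cross_page_mask_sketch (const char *s, int c)
{
  unsigned int misalign = (uintptr_t) s & 31;
  const __m256i *block = (const __m256i *) ((uintptr_t) s & ~(uintptr_t) 31);
  __m256i data = _mm256_load_si256 (block);	/* vmovdqa: aligned load.  */
  unsigned int mask
    = _mm256_movemask_epi8 (_mm256_or_si256
			    (_mm256_cmpeq_epi8 (data, _mm256_set1_epi8 ((char) c)),
			     _mm256_cmpeq_epi8 (data, _mm256_setzero_si256 ())));
  /* Counterpart of the sarx in the patches: bit i of the result now
     corresponds to s[i].  A zero result means no CHAR or null byte in
     the rest of this block, so the caller continues at the equivalent
     of L(cross_page_continue).  */
  return mask >> misalign;
}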