From patchwork Mon Feb 1 00:30:14 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 41878
X-Patchwork-Delegate: hjl.tools@gmail.com
To: libc-alpha@sourceware.org
Subject: [PATCH v2 1/2] x86: Refactor and improve performance of strchr-avx2.S
Date: Sun, 31 Jan 2021 19:30:14 -0500
Message-Id: <20210201003014.785099-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

No bug. Just seemed the performance could be improved a bit.
Observed and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.
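For readers skimming the patch below: the core change in the main loop is replacing the per-vector VPCMPEQ/VPCMPEQ/vpor sequence with a vpxor + VPMINU fold. The following is my own scalar model of one byte lane of that trick (the helper name is hypothetical), not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of the new 4x loop: XOR with the
   broadcast CHAR zeroes the matching bytes, and the unsigned minimum
   with the original data is then zero exactly when the byte equals
   CHAR or is the null terminator, so a single compare against zero
   detects both cases at once.  */
int
matches_char_or_nul (uint8_t b, uint8_t c)
{
  uint8_t x = b ^ c;           /* vpxor:   0 iff b == c */
  uint8_t m = x < b ? x : b;   /* vpminub: 0 iff b == c or b == 0 */
  return m == 0;               /* vpcmpeqb against the zero vector */
}
```

As I read the diff, the payoff is that the compare against the zero vector can be deferred until all four vectors are min-folded together, instead of doing two compares plus an OR per vector.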
Signed-off-by: noah
---
Since V1: optimized more around smaller lengths. The original version
expected the 4x loop to be hit, though the benchmarks in bench-strchr.c
indicate optimization for very short strings is most important. Made
the first 32 byte check expect to find either the end of the string or
the character in question. Also increased the number of vectors in
L(aligned_more) to 4. This does cost for most alignments if the 4x
loop is hit, but is faster for strings < 128 bytes.

 sysdeps/x86_64/multiarch/strchr-avx2.S | 247 ++++++++++++-------------
 sysdeps/x86_64/multiarch/strchr.c      |   1 +
 2 files changed, 124 insertions(+), 124 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..3012cb6ece 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@
 # ifdef USE_AS_WCSCHR
 # define VPBROADCAST	vpbroadcastd
 # define VPCMPEQ	vpcmpeqd
+# define VPMINU	vpminud
 # define CHAR_REG	esi
 # else
 # define VPBROADCAST	vpbroadcastb
 # define VPCMPEQ	vpcmpeqb
+# define VPMINU	vpminub
 # define CHAR_REG	sil
 # endif
@@ -39,19 +41,25 @@
 # endif

 # define VEC_SIZE 32
+# define PAGE_SIZE 4096

 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
-	movl	%edi, %ecx
+	movl	%edi, %ecx
+# ifndef USE_AS_STRCHRNUL
+	xorl	%edx, %edx
+# endif
+
 	/* Broadcast CHAR to YMM0.  */
 	vmovd	%esi, %xmm0
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
-	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
-	ja	L(cros_page_boundary)
-
+
+	/* Check if we cross page boundary with one vector load.  */
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
+	ja	L(cross_page_boundary)
+
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
	   null byte.
 	   */
 	vmovdqu	(%rdi), %ymm8
@@ -60,50 +68,27 @@ ENTRY (STRCHR)
 	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-
-	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ	%ymm8, %ymm0, %ymm1
-	VPCMPEQ	%ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
+	jz	L(more_vecs)
+	tzcntl	%eax, %eax
 	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
 	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
# endif
 	VZEROUPPER
 	ret
-	.p2align 4
+	.p2align 4
+L(more_vecs):
+	/* Align data for aligned loads in the loop.  */
+	andq	$-VEC_SIZE, %rdi
 L(aligned_more):
-	addq	$VEC_SIZE, %rdi
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
-	   since data is only aligned to VEC_SIZE.  */
-	vmovdqa	(%rdi), %ymm8
+	/* Check the next 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	   since data is only aligned to VEC_SIZE.  */
+	vmovdqa	VEC_SIZE(%rdi), %ymm8
+	addq	$VEC_SIZE, %rdi
 	VPCMPEQ	%ymm8, %ymm0, %ymm1
 	VPCMPEQ	%ymm8, %ymm9, %ymm2
 	vpor	%ymm1, %ymm2, %ymm1
@@ -125,7 +110,7 @@ L(more_4x_vec):
 	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x2)
+	jnz	L(first_vec_x2)
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 	VPCMPEQ	%ymm8, %ymm0, %ymm1
@@ -133,122 +118,136 @@ L(more_4x_vec):
 	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
-	/* Align data to 4 * VEC_SIZE.
-	   */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
-	andq	$-(4 * VEC_SIZE), %rdi
-
-	.p2align 4
-L(loop_4x_vec):
-	/* Compare 4 * VEC at a time forward.  */
-	vmovdqa	(%rdi), %ymm5
-	vmovdqa	VEC_SIZE(%rdi), %ymm6
-	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-
-	VPCMPEQ	%ymm5, %ymm0, %ymm1
-	VPCMPEQ	%ymm6, %ymm0, %ymm2
-	VPCMPEQ	%ymm7, %ymm0, %ymm3
-	VPCMPEQ	%ymm8, %ymm0, %ymm4
-
-	VPCMPEQ	%ymm5, %ymm9, %ymm5
-	VPCMPEQ	%ymm6, %ymm9, %ymm6
-	VPCMPEQ	%ymm7, %ymm9, %ymm7
-	VPCMPEQ	%ymm8, %ymm9, %ymm8
-
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
-
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	jz	L(prep_loop_4x)

-	vpor	%ymm5, %ymm6, %ymm5
-
-	vpmovmskb %ymm5, %eax
-	testl	%eax, %eax
-	jnz	L(4x_vec_end)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret

-	.p2align 4
+	.p2align 4
 L(first_vec_x0):
+	tzcntl	%eax, %eax
 	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
 	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+	.p2align 4
 L(first_vec_x1):
-	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	VEC_SIZE(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+	tzcntl	%eax, %eax
+	leaq	VEC_SIZE(%rdi, %rax), %rax
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
-	ret
-
-	.p2align 4
+	ret
+
+	.p2align 4
 L(first_vec_x2):
-	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
+	tzcntl	%eax, %eax
+	/* Found CHAR or the null byte.
+	   */
 	leaq	(VEC_SIZE * 2)(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
+
+L(prep_loop_4x):
+	/* Align data to 4 * VEC_SIZE.  */
+	andq	$-(VEC_SIZE * 4), %rdi

 	.p2align 4
-L(4x_vec_end):
+L(loop_4x_vec):
+	/* Compare 4 * VEC at a time forward.  */
+	vmovdqa	(VEC_SIZE * 4)(%rdi), %ymm5
+	vmovdqa	(VEC_SIZE * 5)(%rdi), %ymm6
+	vmovdqa	(VEC_SIZE * 6)(%rdi), %ymm7
+	vmovdqa	(VEC_SIZE * 7)(%rdi), %ymm8
+
+	/* Leaves only CHARS matching esi as 0.  */
+	vpxor	%ymm5, %ymm0, %ymm1
+	vpxor	%ymm6, %ymm0, %ymm2
+	vpxor	%ymm7, %ymm0, %ymm3
+	vpxor	%ymm8, %ymm0, %ymm4
+
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
+
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
+
+	VPMINU	%ymm5, %ymm6, %ymm5
+
+	VPCMPEQ	%ymm5, %ymm9, %ymm5
+	vpmovmskb %ymm5, %eax
+
+	addq	$(VEC_SIZE * 4), %rdi
+	testl	%eax, %eax
+	jz	L(loop_4x_vec)
+
+	VPCMPEQ	%ymm1, %ymm9, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
+
+	VPCMPEQ	%ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
+
+	VPCMPEQ	%ymm3, %ymm9, %ymm3
+	VPCMPEQ	%ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm3, %ecx
 	vpmovmskb %ymm4, %eax
+	salq	$32, %rax
+	orq	%rcx, %rax
+	tzcntq	%rax, %rax
+	leaq	(VEC_SIZE * 2)(%rdi, %rax), %rax
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
+
+	/* Cold case for crossing page with first load.  */
+	.p2align 4
+L(cross_page_boundary):
+	andq	$-VEC_SIZE, %rdi
+	andl	$(VEC_SIZE - 1), %ecx
+
+	vmovdqa	(%rdi), %ymm8
+	VPCMPEQ	%ymm8, %ymm0, %ymm1
+	VPCMPEQ	%ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.
+	   */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
 	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
+	addq	%rcx, %rdi
 	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
 END (STRCHR)
-#endif
+# endif
diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
index 583a152794..4dfbe3b58b 100644
--- a/sysdeps/x86_64/multiarch/strchr.c
+++ b/sysdeps/x86_64/multiarch/strchr.c
@@ -37,6 +37,7 @@ IFUNC_SELECTOR (void)
   if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
       && CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     return OPTIMIZE (avx2);

From patchwork Mon Feb 1 00:30:16 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 41879
X-Patchwork-Delegate: hjl.tools@gmail.com
To: libc-alpha@sourceware.org
Subject: [PATCH v2 2/2] x86: Add additional benchmarks for strchr
Date: Sun, 31 Jan 2021 19:30:16 -0500
Message-Id: <20210201003014.785099-2-goldstein.w.n@gmail.com>
In-Reply-To: <20210201003014.785099-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

This patch adds additional benchmarks for string size 4096 and several
benchmarks for string size 256 with different alignments.

Signed-off-by: noah
---
Added 2 additional benchmark sizes:

4096: just feels like a natural "large" size to test.

256 with multiple alignments: this essentially tests how expensive the
initial work prior to the 4x loop is at different alignments.

Results from bench-strchr: all times are in seconds and are the median
of 100 runs. Old is the current strchr-avx2.S implementation; New is
this patch.

Summary: New is definitely faster for medium to large sizes. Once the
4x loop is hit there is a 10%+ speedup and New always wins out.
For smaller sizes there is more variance as to which is faster, and
the differences are small. Generally the New version seems to win out.
This is likely because 0-31 byte strings are the fast path for New
(no jmp).

Benchmarking CPU: Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz

size, algn, Old T , New T -------- Win  Dif
0   , 0   , 2.54  , 2.52  -------- New  -0.02
1   , 0   , 2.57  , 2.52  -------- New  -0.05
2   , 0   , 2.56  , 2.52  -------- New  -0.04
3   , 0   , 2.58  , 2.54  -------- New  -0.04
4   , 0   , 2.61  , 2.55  -------- New  -0.06
5   , 0   , 2.65  , 2.62  -------- New  -0.03
6   , 0   , 2.73  , 2.74  -------- Old  -0.01
7   , 0   , 2.75  , 2.74  -------- New  -0.01
8   , 0   , 2.62  , 2.6   -------- New  -0.02
9   , 0   , 2.73  , 2.75  -------- Old  -0.02
10  , 0   , 2.74  , 2.74  -------- Eq   N/A
11  , 0   , 2.76  , 2.72  -------- New  -0.04
12  , 0   , 2.74  , 2.72  -------- New  -0.02
13  , 0   , 2.75  , 2.72  -------- New  -0.03
14  , 0   , 2.74  , 2.73  -------- New  -0.01
15  , 0   , 2.74  , 2.73  -------- New  -0.01
16  , 0   , 2.74  , 2.73  -------- New  -0.01
17  , 0   , 2.74  , 2.74  -------- Eq   N/A
18  , 0   , 2.73  , 2.73  -------- Eq   N/A
19  , 0   , 2.73  , 2.73  -------- Eq   N/A
20  , 0   , 2.73  , 2.73  -------- Eq   N/A
21  , 0   , 2.73  , 2.72  -------- New  -0.01
22  , 0   , 2.71  , 2.74  -------- Old  -0.03
23  , 0   , 2.71  , 2.69  -------- New  -0.02
24  , 0   , 2.68  , 2.67  -------- New  -0.01
25  , 0   , 2.66  , 2.62  -------- New  -0.04
26  , 0   , 2.64  , 2.62  -------- New  -0.02
27  , 0   , 2.71  , 2.64  -------- New  -0.07
28  , 0   , 2.67  , 2.69  -------- Old  -0.02
29  , 0   , 2.72  , 2.72  -------- Eq   N/A
30  , 0   , 2.68  , 2.69  -------- Old  -0.01
31  , 0   , 2.68  , 2.68  -------- Eq   N/A
32  , 0   , 3.51  , 3.52  -------- Old  -0.01
32  , 1   , 3.52  , 3.51  -------- New  -0.01
64  , 0   , 3.97  , 3.93  -------- New  -0.04
64  , 2   , 3.95  , 3.9   -------- New  -0.05
64  , 1   , 4.0   , 3.93  -------- New  -0.07
64  , 3   , 3.97  , 3.88  -------- New  -0.09
64  , 4   , 3.95  , 3.89  -------- New  -0.06
64  , 5   , 3.94  , 3.9   -------- New  -0.04
64  , 6   , 3.97  , 3.9   -------- New  -0.07
64  , 7   , 3.97  , 3.91  -------- New  -0.06
96  , 0   , 4.74  , 4.52  -------- New  -0.22
128 , 0   , 5.29  , 5.19  -------- New  -0.1
128 , 2   , 5.29  , 5.15  -------- New  -0.14
128 , 3   , 5.31  , 5.22  -------- New  -0.09
256 , 0   , 11.19 , 9.81  -------- New  -1.38
256 , 3   , 11.19 , 9.84  -------- New  -1.35
256 , 4   , 11.2  , 9.88  -------- New  -1.32
256 , 16  , 11.21 , 9.79  -------- New  -1.42
256 , 32  , 11.39 , 10.34 -------- New  -1.05
256 , 48  , 11.88 , 10.56 -------- New  -1.32
256 , 64  , 11.82 , 10.83 -------- New  -0.99
256 , 80  , 11.85 , 10.86 -------- New  -0.99
256 , 96  , 9.56  , 8.76  -------- New  -0.8
256 , 112 , 9.55  , 8.9   -------- New  -0.65
512 , 0   , 15.76 , 13.72 -------- New  -2.04
512 , 4   , 15.72 , 13.74 -------- New  -1.98
512 , 5   , 15.73 , 13.74 -------- New  -1.99
1024, 0   , 24.85 , 21.33 -------- New  -3.52
1024, 5   , 24.86 , 21.27 -------- New  -3.59
1024, 6   , 24.87 , 21.32 -------- New  -3.55
2048, 0   , 45.75 , 36.7  -------- New  -9.05
2048, 6   , 43.91 , 35.42 -------- New  -8.49
2048, 7   , 44.43 , 36.37 -------- New  -8.06
4096, 0   , 96.94 , 81.34 -------- New  -15.6
4096, 7   , 97.01 , 81.32 -------- New  -15.69

 benchtests/bench-strchr.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/benchtests/bench-strchr.c b/benchtests/bench-strchr.c
index bf493fe458..5fd98a5d43 100644
--- a/benchtests/bench-strchr.c
+++ b/benchtests/bench-strchr.c
@@ -100,9 +100,13 @@ do_test (size_t align, size_t pos, size_t len, int seek_char, int max_char)
   size_t i;
   CHAR *result;
   CHAR *buf = (CHAR *) buf1;
-  align &= 15;
+
+  align &= 127;
   if ((align + len) * sizeof (CHAR) >= page_size)
-    return;
+    {
+      return;
+    }
+
   for (i = 0; i < len; ++i)
     {
@@ -151,12 +155,24 @@ test_main (void)
       do_test (i, 16 << i, 2048, SMALL_CHAR, MIDDLE_CHAR);
     }

+  for (i = 1; i < 8; ++i)
+    {
+      do_test (0, 16 << i, 4096, SMALL_CHAR, MIDDLE_CHAR);
+      do_test (i, 16 << i, 4096, SMALL_CHAR, MIDDLE_CHAR);
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (i, 64, 256, SMALL_CHAR, MIDDLE_CHAR);
       do_test (i, 64, 256, SMALL_CHAR, BIG_CHAR);
     }

+  for (i = 0; i < 8; ++i)
+    {
+      do_test (16 * i, 256, 512, SMALL_CHAR, MIDDLE_CHAR);
+      do_test (16 * i, 256, 512, SMALL_CHAR, BIG_CHAR);
+    }
+
   for (i = 0; i < 32; ++i)
     {
       do_test (0, i, i + 1, SMALL_CHAR, MIDDLE_CHAR);
@@ -169,12 +185,24 @@ test_main (void)
       do_test (i, 16 << i, 2048, 0, MIDDLE_CHAR);
     }

+  for (i = 1; i < 8; ++i)
+    {
+      do_test (0, 16 << i, 4096, 0, MIDDLE_CHAR);
+      do_test (i, 16 << i, 4096, 0, MIDDLE_CHAR);
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (i, 64, 256, 0, MIDDLE_CHAR);
       do_test (i, 64, 256, 0, BIG_CHAR);
     }

+  for (i = 0; i < 8; ++i)
+    {
+      do_test (16 * i, 256, 512, 0, MIDDLE_CHAR);
+      do_test (16 * i, 256, 512, 0, BIG_CHAR);
+    }
+
   for (i = 0; i < 32; ++i)
     {
       do_test (0, i, i + 1, 0, MIDDLE_CHAR);