x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M

Message ID e6de570b-48bf-88cf-2cec-5f5a5e7821bf@huawei.com
State New

Commit Message

liqingqing May 23, 2020, 4:10 a.m. UTC
Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset
and set the default value of the REP_STOSB_THRESHOLD macro to 2KB. When the input length is less than 2KB, the data flow is unchanged; when the input length is larger than 2KB,
memset uses STOSB instead of MOVQ.
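
The size dispatch described above can be modelled in a few lines. This is only a sketch in Python, not the glibc assembly: the function name `uses_rep_stosb` is hypothetical, and the strict "greater than" comparison is an assumption taken from the wording above.

```python
# Hypothetical model of the size dispatch described above; the real logic
# lives in sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S.

OLD_THRESHOLD = 2048      # default introduced by the commit discussed above
NEW_THRESHOLD = 1048576   # default proposed in this patch


def uses_rep_stosb(length, threshold):
    """Return True if a memset of `length` bytes would take the STOSB path.

    Assumption: the path is taken only when the length is strictly
    greater than the threshold.
    """
    return length > threshold


if __name__ == "__main__":
    sizes = {
        "4k": 4 * 1024,
        "36k": 36 * 1024,
        "1m": 1024 * 1024,
        "10m": 10 * 1024 * 1024,
    }
    for name, size in sizes.items():
        print(name,
              uses_rep_stosb(size, OLD_THRESHOLD),
              uses_rep_stosb(size, NEW_THRESHOLD))
```

Under this model, every benchmark size from 4k upward takes the STOSB path with the 2KB default, while only the 10m case does with a 1M default.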

However, when I tested memset on an x86_64 platform,
I found that this default value is not appropriate for some input lengths. Here are the environment and the results:

test suite: libMicro-0.4.0
	./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
	./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
	./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
	./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000

hardware platform:
	Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
	L1d cache: 32KB
	L1i cache: 32KB
	L2 cache: 1MB
	L3 cache: 60MB

The results show that when the input length is between the processor's L1 data cache size and L2 cache size, REP_STOSB_THRESHOLD=2KB reduces performance.

test          before this commit   after this commit
              (cycles)             (cycles)
memset_4k     249                  96
memset_10k    657                  185
memset_36k    2773                 3767
memset_100k   7594                 10002
memset_500k   37678                52149
memset_1m     86780                108044
memset_10m    1307238              1148994
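
To make the size of each change easier to read, the relative difference can be computed from the cycle counts. The following is a small sketch; the helper names `percent_change` and `regressions` are hypothetical, and the numbers are copied from the table above.

```python
# Cycle counts copied from the table above: (before, after) the commit.
CYCLES = {
    "memset_4k": (249, 96),
    "memset_10k": (657, 185),
    "memset_36k": (2773, 3767),
    "memset_100k": (7594, 10002),
    "memset_500k": (37678, 52149),
    "memset_1m": (86780, 108044),
    "memset_10m": (1307238, 1148994),
}


def percent_change(before, after):
    """Relative change in percent; positive means more cycles (slower)."""
    return (after - before) / before * 100.0


def regressions(table):
    """Names of the tests whose cycle count went up after the commit."""
    return [name for name, (before, after) in table.items() if after > before]


if __name__ == "__main__":
    for name, (before, after) in CYCLES.items():
        print(f"{name:12s} {percent_change(before, after):+7.1f}%")
    print("slower after the commit:", ", ".join(regressions(CYCLES)))
```

With this data, `regressions(CYCLES)` returns the four tests from memset_36k to memset_1m, which are the sizes above the 32KB L1d cache and up to the 1MB L2 cache.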

test          before this commit         after this commit
              (MLC cache misses, 10 s)   (MLC cache misses, 10 s)
memset_4k     1,09,33,823                1,01,79,270
memset_10k    1,23,78,958                1,05,41,087
memset_36k    3,61,64,244                4,07,22,429
memset_100k   8,25,33,052                9,31,81,253
memset_500k   37,32,55,449               43,56,70,395
memset_1m     75,16,28,239               88,29,90,237
memset_10m    9,36,61,67,397             8,96,69,49,522
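
The miss counts above appear to use lakh/crore digit grouping (an assumption based on the comma positions), so 1,09,33,823 reads as 10933823. The sketch below strips the separators and compares the counts; the helper names `parse_count` and `miss_ratio` are hypothetical.

```python
# MLC miss counts copied verbatim from the table above: (before, after).
MLC_MISSES = {
    "memset_4k": ("1,09,33,823", "1,01,79,270"),
    "memset_10k": ("1,23,78,958", "1,05,41,087"),
    "memset_36k": ("3,61,64,244", "4,07,22,429"),
    "memset_100k": ("8,25,33,052", "9,31,81,253"),
    "memset_500k": ("37,32,55,449", "43,56,70,395"),
    "memset_1m": ("75,16,28,239", "88,29,90,237"),
    "memset_10m": ("9,36,61,67,397", "8,96,69,49,522"),
}


def parse_count(text):
    """Drop the digit-group separators and return the integer value."""
    return int(text.replace(",", ""))


def miss_ratio(before, after):
    """Misses after the commit divided by misses before it."""
    return parse_count(after) / parse_count(before)


if __name__ == "__main__":
    for name, (before, after) in MLC_MISSES.items():
        print(f"{name:12s} {miss_ratio(before, after):.3f}")
```

A ratio above 1 means more misses after the commit. In this data that holds for memset_36k through memset_1m, the same tests whose cycle counts went up.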

Although REP_STOSB_THRESHOLD can be changed at build time with -DREP_STOSB_THRESHOLD=xxx,
I think the default value is not a good one, because most processors' L2 caches are larger than 2KB. So I am submitting the patch below:

From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
From: liqingqing <liqingqing3@huawei.com>
Date: Thu, 21 May 2020 11:23:06 +0800
Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M

The current default value of the REP_STOSB_THRESHOLD macro reduces memset performance when the input length is between the processor's L1 data cache size and L2 cache size.
Update the default value to eliminate the regression.

 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index dcd63c92..92c08eed 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -65,7 +65,7 @@ 
    Enhanced REP STOSB.  Since the stored value is fixed, larger register
    size has minimal impact on threshold.  */
-# define REP_STOSB_THRESHOLD           2048
+# define REP_STOSB_THRESHOLD           1048576

 #ifndef SECTION