From patchwork Sat May 23 04:10:01 2020
X-Patchwork-Submitter: Qingqing Li
X-Patchwork-Id: 39355
Subject: [PATCH] x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M
References: <15ec783d-46f5-0166-aee9-f1d16a58ca83@huawei.com>
In-Reply-To: <15ec783d-46f5-0166-aee9-f1d16a58ca83@huawei.com>
To: "libc-alpha@sourceware.org", Hushiyuan
From: liqingqing
Date: Sat, 23 May 2020 12:10:01 +0800
List-Id: Libc-alpha mailing list

Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset and set the default value of the macro REP_STOSB_THRESHOLD to 2KB. When the input length is less than 2KB the data flow is unchanged; when it is larger than 2KB, memset uses STOSB instead of MOVQ.

However, when I tested memset on an x86_64 platform, I found that this default value is not appropriate for some input lengths. Here are the environment and the results.

test suite: libMicro-0.4.0
  ./memset -E -C 200 -L -S -W -N "memset_4k" -s 4k -I 250
  ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k -u -I 400
  ./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
  ./memset -E -C 200 -L -S -W -N "memset_10m" -s 10m -I 2000000

hardware platform: Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
  L1d cache: 32KB
  L1i cache: 32KB
  L2 cache:  1MB
  L3 cache:  60MB

The result is that when the input length is between the processor's L1 data cache size and its L2 cache size, REP_STOSB_THRESHOLD=2KB reduces performance.

              before this commit    after this commit
              cycle                 cycle
memset_4k     249                   96
memset_10k    657                   185
memset_36k    2773                  3767
memset_100k   7594                  10002
memset_500k   37678                 52149
memset_1m     86780                 108044
memset_10m    1307238               1148994

              before this commit      after this commit
              MLC cache miss(10sec)   MLC cache miss(10sec)
memset_4k     1,09,33,823             1,01,79,270
memset_10k    1,23,78,958             1,05,41,087
memset_36k    3,61,64,244             4,07,22,429
memset_100k   8,25,33,052             9,31,81,253
memset_500k   37,32,55,449            43,56,70,395
memset_1m     75,16,28,239            88,29,90,237
memset_10m    9,36,61,67,397          8,96,69,49,522

Although REP_STOSB_THRESHOLD can be changed at build time with -DREP_STOSB_THRESHOLD=xxx, I think the current default is not a good choice, because most processors' L2 caches are larger than 2KB. So I am submitting the patch below:

From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
From: liqingqing
Date: Thu, 21 May 2020 11:23:06 +0800
Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M

The current value of the macro REP_STOSB_THRESHOLD reduces memset
performance when the input length is between the processor's L1 data
cache size and its L2 cache size. Update the default value to eliminate
the regression.
---
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index dcd63c92..92c08eed 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -65,7 +65,7 @@
    Enhanced REP STOSB. Since the stored value is fixed, larger register
    size has minimal impact on threshold. */
 #ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD 2048
+# define REP_STOSB_THRESHOLD 1048576
 #endif
 
 #ifndef SECTION
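
As a rough cross-check of the cycle table in this message, the relative change per input size can be computed with a small C sketch. The cycle counts and cache sizes are copied from the message. The struct, the function names and the report layout are made up for illustration. Two assumptions are made: "4k" means 4 * 1024 bytes (and likewise for the other sizes), and "between L1d and L2" means L1d < n <= L2.

```c
#include <stddef.h>
#include <stdio.h>

/* Cache sizes of the test machine as listed in the message. */
#define L1D_BYTES (32u * 1024u)
#define L2_BYTES  (1024u * 1024u)

struct sample {
    const char *name;
    size_t bytes;        /* assumed length: "4k" taken as 4 * 1024 bytes */
    long cycles_before;  /* "before this commit" column */
    long cycles_after;   /* "after this commit" column */
};

/* Cycle counts copied from the table in the message. */
const struct sample samples[] = {
    {"memset_4k",   4u * 1024u,          249,     96},
    {"memset_10k",  10u * 1024u,         657,     185},
    {"memset_36k",  36u * 1024u,         2773,    3767},
    {"memset_100k", 100u * 1024u,        7594,    10002},
    {"memset_500k", 500u * 1024u,        37678,   52149},
    {"memset_1m",   1024u * 1024u,       86780,   108044},
    {"memset_10m",  10u * 1024u * 1024u, 1307238, 1148994},
};
const size_t n_samples = sizeof samples / sizeof samples[0];

/* Relative change in percent; a positive value means more cycles. */
double percent_change(long before, long after)
{
    return 100.0 * (double)(after - before) / (double)before;
}

/* Assumption: "between L1d and L2" is read as L1d < n <= L2. */
int between_l1d_and_l2(size_t bytes)
{
    return bytes > L1D_BYTES && bytes <= L2_BYTES;
}

/* Print one line per test: name, relative change, cache range. */
void print_report(void)
{
    for (size_t i = 0; i < n_samples; i++) {
        const struct sample *s = &samples[i];
        printf("%-12s %+7.1f%%  %s\n", s->name,
               percent_change(s->cycles_before, s->cycles_after),
               between_l1d_and_l2(s->bytes) ? "L1d < n <= L2" : "outside");
    }
}
```

Calling print_report() from a main function prints one line per test. With the figures above, the 36k to 1m rows show more cycles after the commit, and the 4k, 10k and 10m rows show fewer.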