From patchwork Sat May 23 04:37:39 2020
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 39356
Date: Fri, 22 May 2020 21:37:39 -0700
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
From: "H.J. Lu"
To: liqingqing
Cc: Hushiyuan , "libc-alpha@sourceware.org"

On Fri, May 22, 2020 at 9:10 PM liqingqing wrote:
>
> Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the
> implementation of memset and set macro REP_STOSB_THRESHOLD's default
> value to 2KB.  When the input length is less than 2KB the data flow is
> the same, and when the input length is larger than 2KB this API uses
> STOSB instead of MOVQ.
>
> But when I tested this API on an x86_64 platform, I found that this
> default value is not appropriate for some input lengths.
> Here are the environment and results:
>
> test suite: libMicro-0.4.0
> ./memset -E -C 200 -L -S -W -N "memset_4k" -s 4k -I 250
> ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k -u -I 400
> ./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
> ./memset -E -C 200 -L -S -W -N "memset_10m" -s 10m -I 2000000
>
> hardware platform:
> Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
> L1d cache: 32KB
> L1i cache: 32KB
> L2 cache:  1MB
> L3 cache:  60MB
>
> The result is that when the input length is between the processor's L1
> data cache and L2 cache sizes, REP_STOSB_THRESHOLD=2KB reduces
> performance:
>
>               before this commit   after this commit
>               cycles               cycles
> memset_4k     249                  96
> memset_10k    657                  185
> memset_36k    2773                 3767
> memset_100k   7594                 10002
> memset_500k   37678                52149
> memset_1m     86780                108044
> memset_10m    1307238              1148994
>
>               before this commit        after this commit
>               MLC cache misses (10sec)  MLC cache misses (10sec)
> memset_4k     1,09,33,823               1,01,79,270
> memset_10k    1,23,78,958               1,05,41,087
> memset_36k    3,61,64,244               4,07,22,429
> memset_100k   8,25,33,052               9,31,81,253
> memset_500k   37,32,55,449              43,56,70,395
> memset_1m     75,16,28,239              88,29,90,237
> memset_10m    9,36,61,67,397            8,96,69,49,522
>
> Though REP_STOSB_THRESHOLD can be modified at build time with
> -DREP_STOSB_THRESHOLD=xxx, I think the default value may not be the
> best one, since most processors' L2 caches are larger than 2KB.  So I
> submit a patch as below:
>
> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
> From: liqingqing
> Date: Thu, 21 May 2020 11:23:06 +0800
> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
>
> Macro REP_STOSB_THRESHOLD's current value reduces memset performance
> when the input length is between the processor's L1 data cache and L2
> cache sizes, so update the default value to eliminate the regression.

There is no single threshold value which is good for all workloads.  I
don't think we should change REP_STOSB_THRESHOLD to 1MB.
On the other hand, the fixed threshold isn't flexible.  Please try this
patch to see if you can set the threshold for your specific workload.

From 7d2e0c0b843d509716d92960b9b139b32eacea54 Mon Sep 17 00:00:00 2001
From: "H.J. Lu"
Date: Sat, 9 May 2020 11:13:57 -0700
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables

Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables to
update thresholds for "rep movsb" and "rep stosb" at run-time.

Note that the user specified threshold for "rep movsb" smaller than the
minimum threshold will be ignored.
---
 manual/tunables.texi                             | 16 +++++++
 sysdeps/x86/cacheinfo.c                          | 46 +++++++++++++++++++
 sysdeps/x86/cpu-features.c                       |  4 ++
 sysdeps/x86/cpu-features.h                       |  4 ++
 sysdeps/x86/dl-tunables.list                     |  6 +++
 .../multiarch/memmove-vec-unaligned-erms.S       | 16 +------
 .../multiarch/memset-vec-unaligned-erms.S        | 12 +----
 7 files changed, 78 insertions(+), 26 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..8054f79be0 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,22 @@ to set threshold in bytes for non temporal store.
 
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb".  Note that the
+user specified threshold smaller than the minimum threshold will be
+ignored.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled.
 Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..4322328a1b 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -530,6 +530,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB.  Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size.  */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data.  The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB.  Since the stored value is fixed, larger register
+   size has minimal impact on threshold.  */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef DISABLE_PREFETCHW
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -872,6 +889,35 @@ init_cacheinfo (void)
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
        : __x86_shared_cache_size * threads * 3 / 4);
+
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
+  unsigned int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).
     */
+  unsigned int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+				AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
+    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  else
+    __x86_rep_movsb_threshold = rep_movsb_threshold;
+
+  if (cpu_features->rep_stosb_threshold)
+    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
 }
 #endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 916bbf5242..14f847320f 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -564,6 +564,10 @@ no_cpuid:
   TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
   cpu_features->non_temporal_threshold
     = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+  cpu_features->rep_movsb_threshold
+    = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  cpu_features->rep_stosb_threshold
+    = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
   cpu_features->data_cache_size
     = TUNABLE_GET (x86_data_cache_size, long int, NULL);
   cpu_features->shared_cache_size
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index f05d5ce158..7410324e83 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -91,6 +91,10 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* Threshold to use "rep movsb".  */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb".
      */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB.  Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size.  */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB.  */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data.  The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB.  Since the stored value is fixed, larger register
-   size has minimal impact on threshold.  */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD	2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):
-- 
2.26.2