From patchwork Fri Jul 3 17:52:20 2020
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 39907
From: "H.J. Lu"
To: libc-alpha@sourceware.org
Subject: [PATCH 2/2] x86: Add thresholds for "rep movsb/stosb" to tunables
Date: Fri, 3 Jul 2020 10:52:20 -0700
Message-Id: <20200703175220.1178840-3-hjl.tools@gmail.com>
In-Reply-To: <20200703175220.1178840-1-hjl.tools@gmail.com>
References: <20200703175220.1178840-1-hjl.tools@gmail.com>

Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables to
update the thresholds for "rep movsb" and "rep stosb" at run-time.
Note that a user-specified threshold for "rep movsb" smaller than the
minimum threshold will be ignored.
---
A usage example and a small sketch of the default-threshold selection
follow the patch.

 manual/tunables.texi                          | 14 +++++++
 sysdeps/x86/cacheinfo.c                       | 20 ++++++++++
 sysdeps/x86/cpu-features.h                    |  4 ++
 sysdeps/x86/dl-cacheinfo.c                    | 38 +++++++++++++++++++
 sysdeps/x86/dl-tunables.list                  |  6 +++
 .../multiarch/memmove-vec-unaligned-erms.S    | 16 +-------
 .../multiarch/memset-vec-unaligned-erms.S     | 12 +-----
 7 files changed, 84 insertions(+), 26 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..61edd62425 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,20 @@ to set threshold in bytes for non temporal store.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled.  Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 8c4c7f9972..bb536d96ef 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -41,6 +41,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB.  Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size.  */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data.  The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB.  Since the stored value is fixed, larger register
+   size has minimal impact on threshold.  */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -117,6 +134,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;
 
+  __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
+
 #ifndef __x86_64__
   __x86_prefetchw = cpu_features->prefetchw;
 #endif
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 3aaed33cbc..002e12e11f 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -128,6 +128,10 @@ struct cpu_features
   /* PREFETCHW support flag for use in memory and string routines.  */
   unsigned long int prefetchw;
 #endif
+  /* Threshold to use "rep movsb".  */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb".  */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
index 8e2a6f552c..aff9bd1067 100644
--- a/sysdeps/x86/dl-cacheinfo.c
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -860,6 +860,31 @@ __init_cacheinfo (void)
      total shared cache size.  */
   unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
 
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
+  unsigned long int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  See
+     comments for __x86_rep_movsb_threshold in cacheinfo.c.  */
+  unsigned long int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+                                AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  /* NB: See comments for __x86_rep_stosb_threshold in cacheinfo.c.  */
+  unsigned long int rep_stosb_threshold = 2048;
+
 #if HAVE_TUNABLES
   long int tunable_size;
   tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
@@ -871,11 +896,19 @@ __init_cacheinfo (void)
   tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
   if (tunable_size != 0)
     non_temporal_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  if (tunable_size > minimum_rep_movsb_threshold)
+    rep_movsb_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
+  if (tunable_size != 0)
+    rep_stosb_threshold = tunable_size;
 #endif
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->rep_movsb_threshold = rep_movsb_threshold;
+  cpu_features->rep_stosb_threshold = rep_stosb_threshold;
 
 #if HAVE_TUNABLES
   TUNABLE_UPDATE (x86_data_cache_size, long int,
@@ -884,5 +917,10 @@ __init_cacheinfo (void)
                   shared, 0, (long int) -1);
   TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
                   non_temporal_threshold, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
+                  rep_movsb_threshold, minimum_rep_movsb_threshold,
+                  (long int) -1);
+  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
+                  rep_stosb_threshold, 0, (long int) -1);
 #endif
 }
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB.  Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size.  */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB.  */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data.  The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB.  Since the stored value is fixed, larger register
-   size has minimal impact on threshold.  */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD	2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):
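
Usage example: with a glibc built from this patch, the new thresholds can
be overridden at run time through the GLIBC_TUNABLES environment variable
like any other glibc.cpu tunable, e.g. (the program name is illustrative
only):

  GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1048576:glibc.cpu.x86_rep_stosb_threshold=1048576 ./a.out

A "rep movsb" value that does not exceed the per-ISA minimum (VEC_SIZE * 8)
is ignored and the computed default is kept, per the TUNABLE_GET check in
the dl-cacheinfo.c hunk above.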
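For readers working out the defaults in the dl-cacheinfo.c hunk, here is a
small standalone C sketch of the selection logic.  The vec_size parameter
stands in for the CPU_FEATURES_ARCH_P checks (16 for SSE2, 32 for AVX, 64
for AVX512); it is an illustration, not glibc code.

#include <stdio.h>

/* Default "rep movsb" threshold: 2048 scaled by the vector width,
   i.e. 2048 for SSE2, 4096 for AVX and 8192 for AVX512.  */
static unsigned long int
default_rep_movsb_threshold (unsigned long int vec_size)
{
  return 2048 * (vec_size / 16);
}

/* Minimum acceptable user value, mirroring the requirement that the
   REP MOVSB threshold be greater than VEC_SIZE * 8.  */
static unsigned long int
minimum_rep_movsb_threshold (unsigned long int vec_size)
{
  return vec_size * 8;
}

int
main (void)
{
  unsigned long int vec_sizes[] = { 16, 32, 64 };
  for (int i = 0; i < 3; i++)
    printf ("VEC_SIZE %2lu: default %4lu, minimum %3lu\n",
	    vec_sizes[i],
	    default_rep_movsb_threshold (vec_sizes[i]),
	    minimum_rep_movsb_threshold (vec_sizes[i]));
  return 0;
}

The "rep stosb" default stays at 2048 for every vector width because the
stored value is fixed, as the cacheinfo.c comment explains.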