From patchwork Fri Jul 3 16:54:52 2020
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 39905
Date: Fri, 3 Jul 2020 09:54:52 -0700
From: "H.J. Lu"
To: Carlos O'Donell
Cc: Hushiyuan, Florian Weimer, GNU C Library
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
Message-ID: <20200703165452.GA226121@gmail.com>
References: <20200605224550.GA1253830@gmail.com> <20200702190840.GA1474341@gmail.com> <369c062d-209c-5894-f8c7-e236753f946b@redhat.com>
In-Reply-To: <369c062d-209c-5894-f8c7-e236753f946b@redhat.com>

On Fri, Jul 03, 2020 at 12:14:01PM -0400, Carlos O'Donell wrote:
> On 7/2/20 3:08 PM, H.J. Lu wrote:
> > On Thu, Jul 02, 2020 at 02:00:54PM -0400, Carlos O'Donell wrote:
> >> On 6/6/20 5:51 PM, H.J. Lu wrote:
> >>> On Fri, Jun 5, 2020 at 3:45 PM H.J. Lu wrote:
> >>>>
> >>>> On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
> >>>>> On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell wrote:
> >>>>>>
> >>>>>> On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu wrote:
> >>>>>>> Tunables are designed to pass info from user to glibc, not the other
> >>>>>>> way around.  When __libc_main is called, init_cacheinfo is never
> >>>>>>> called.  I can call init_cacheinfo from __libc_main.  But there is no
> >>>>>>> interface to update min and max values from init_cacheinfo.  I don't
> >>>>>>> think --list-tunables will work here without changes to tunables.
> >>>>>>
> >>>>>> You have a dynamic threshold.
> >>>>>>
> >>>>>> You have to tell the user what that minimum is, otherwise they can't
> >>>>>> use the tunable reliably.
> >>>>>>
> >>>>>> This is the first instance of a min/max that is dynamically determined.
> >>>>>>
> >>>>>> You must fetch the cache info ahead of the tunable initialization, that
> >>>>>> is you must call init_cacheinfo before __init_tunables.
> >>>>>>
> >>>>>> You can initialize the tunable data dynamically like this:
> >>>>>>
> >>>>>>   /* Dynamically set the min and max of glibc.foo.bar.  */
> >>>>>>   tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
> >>>>>>   tunable_list[id].type.min = lowval;
> >>>>>>   tunable_list[id].type.max = highval;
> >>>>>>
> >>>>>> We do something similar for maybe_enable_malloc_check.
> >>>>>>
> >>>>>> Then once the tunables are parsed, and the cpu features are loaded
> >>>>>> you can print the tunables, and the printed tunables will have
> >>>>>> meaningful min and max values.
> >>>>>>
> >>>>>> If you have circular dependency, then you must process the cpu features
> >>>>>> first without reading from the tunables, then allow the tunables to be
> >>>>>> initialized from the system, *then* process the tunables to alter the
> >>>>>> existing cpu feature settings.
> >>>>>>
> >>>>>
> >>>>> How about this?  I got
> >>>>>
> >>>>
> >>>> Here is the updated patch, which depends on
> >>>>
> >>>> https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html
> >>>>
> >>>> to add "%d" support to _dl_debug_vdprintf.  I got
> >>>>
> >>>> $ ./elf/ld.so ./libc.so --list-tunables
> >>>> glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
> >>>> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.malloc.perturb: 0 (min: 0, max: 255)
> >>>> glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
> >>>> glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
> >>>> glibc.elision.enable: 0 (min: 0, max: 1)
> >>>> glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >>>> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
> >>>> glibc.cpu.x86_shstk:
> >>>> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
> >>>> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> >>>> glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
> >>>> glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.cpu.x86_ibt:
> >>>> glibc.cpu.hwcaps:
> >>>> glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
> >>>> glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
> >>>> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
> >>>> glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
> >>>> glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
> >>>> glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
> >>>> glibc.malloc.check: 0 (min: 0, max: 3)
> >>>> $
> >>>>
> >>>> Ok for master?
> >>>>
> >>>
> >>> Here is the updated patch.  To support --list-tunables, a target should add
> >>>
> >>> CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
> >>> CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body
> >>>
> >>> and start.S should be updated to define __libc_main and call
> >>> __libc_main_body:
> >>>
> >>> extern void __libc_main_body (int argc, char **argv)
> >>>   __attribute__ ((noreturn, visibility ("hidden")));
> >>>
> >>> when LIBC_MAIN is defined.
> >>
> >> I like where this patch is going, but the __libc_main wiring up means
> >> we'll have to delay this until glibc 2.33 opens for development and
> >> give the architectures time to fill in the required pieces of assembly.
> >>
> >> Can we split this into:
> >>
> >> (a) Minimum required to implement the feature e.g. just the tunable without
> >>     my requested changes.
> >>
> >> (b) A second patch which implements the --list-tunables that users can
> >>     then use to know what the values they can choose are.
> >>
> >> That way we can commit (a) right now, and then commit (b) when we
> >> reopen for development?
> >>
> >
> > Like this?
>
> Almost.
>
> Why do we still use a constructor?
>
> Why don't we accurately set the min and max?
>
> +#if HAVE_TUNABLES
> +  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
> +		  __x86_shared_non_temporal_threshold, 0,
> +		  (long int) -1);
> +  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
> +		  __x86_rep_movsb_threshold,
> +		  minimum_rep_movsb_threshold, (long int) -1);
> +  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
> +		  __x86_rep_stosb_threshold, 0, (long int) -1);
>
> A min and max of 0 and -1 respectively could have been set in the tunables
> list file and are not dynamic?
>
> I'd expect your patch would do everything except actually implement
> --list-tunables.

Here is the followup patch which does it.

> We need a manual page, and I accept that showing a "lower value" will
> have to wait for --list-tunables.
>
> Otherwise the patch is looking ready.

Are these 2 patches OK for trunk?

Thanks.

H.J.
---
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.

Note that a user-specified threshold for "rep movsb" that is smaller
than the minimum threshold will be ignored.
---
 manual/tunables.texi                          | 14 +++++++
 sysdeps/x86/cacheinfo.c                       | 20 ++++++++++
 sysdeps/x86/cpu-features.h                    |  4 ++
 sysdeps/x86/dl-cacheinfo.c                    | 38 +++++++++++++++++++
 sysdeps/x86/dl-tunables.list                  |  6 +++
 .../multiarch/memmove-vec-unaligned-erms.S    | 16 +-------
 .../multiarch/memset-vec-unaligned-erms.S     | 12 +-----
 7 files changed, 84 insertions(+), 26 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..61edd62425 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,20 @@ to set threshold in bytes for non temporal store.
 
 This tunable is specific to i386 and x86-64.
 @end deftp
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled.  Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 8c4c7f9972..bb536d96ef 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -41,6 +41,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB.  Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size.  */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data.  The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB.  Since the stored value is fixed, larger register
+   size has minimal impact on threshold.  */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -117,6 +134,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;
 
+  __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
+
 #ifndef __x86_64__
   __x86_prefetchw = cpu_features->prefetchw;
 #endif
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 3aaed33cbc..002e12e11f 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -128,6 +128,10 @@ struct cpu_features
   /* PREFETCHW support flag for use in memory and string routines.  */
   unsigned long int prefetchw;
 #endif
+  /* Threshold to use "rep movsb".  */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb".  */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
index 8e2a6f552c..aff9bd1067 100644
--- a/sysdeps/x86/dl-cacheinfo.c
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -860,6 +860,31 @@ __init_cacheinfo (void)
      total shared cache size.  */
   unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
 
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
+  unsigned long int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  See
+     comments for __x86_rep_movsb_threshold in cacheinfo.c.  */
+  unsigned long int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+				AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  /* NB: See comments for __x86_rep_stosb_threshold in cacheinfo.c.  */
+  unsigned long int rep_stosb_threshold = 2048;
+
 #if HAVE_TUNABLES
   long int tunable_size;
   tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
@@ -871,11 +896,19 @@ __init_cacheinfo (void)
   tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
   if (tunable_size != 0)
     non_temporal_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  if (tunable_size > minimum_rep_movsb_threshold)
+    rep_movsb_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
+  if (tunable_size != 0)
+    rep_stosb_threshold = tunable_size;
 #endif
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->rep_movsb_threshold = rep_movsb_threshold;
+  cpu_features->rep_stosb_threshold = rep_stosb_threshold;
 
 #if HAVE_TUNABLES
   TUNABLE_UPDATE (x86_data_cache_size, long int,
@@ -884,5 +917,10 @@ __init_cacheinfo (void)
 		  shared, 0, (long int) -1);
   TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
 		  non_temporal_threshold, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
+		  rep_movsb_threshold, minimum_rep_movsb_threshold,
+		  (long int) -1);
+  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
+		  rep_stosb_threshold, 0, (long int) -1);
 #endif
 }
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB.  Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size.  */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB.  */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data.  The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB.  Since the stored value is fixed, larger register
-   size has minimal impact on threshold.  */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD		2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):
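
Usage note (not part of the patch): with these tunables in place, the
expectation is that a user overrides the thresholds at run time through
the GLIBC_TUNABLES environment variable, the same way as the existing
glibc.cpu tunables.  The threshold values and the program name below are
placeholders for illustration only; per the patch, a "rep movsb" value at
or below the computed minimum (VEC_SIZE * 8) is ignored.

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=4096:glibc.cpu.x86_rep_stosb_threshold=4096 \
    ./your-program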