From patchwork Mon May 22 19:17:53 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 20533
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
References: <9c563a4b-424b-242f-b82f-4650ab2637f7@redhat.com> <28e34264-e8c5-5570-c48c-9125893808b2@redhat.com>
From: "H.J. Lu"
Date: Mon, 22 May 2017 12:17:53 -0700
Subject: Re: memcpy performance regressions 2.19 -> 2.24(5)
To: Erich Elsen
Cc: "Carlos O'Donell", GNU C Library

On Thu, May 18, 2017 at 1:59 PM, Erich Elsen wrote:
> Hi H.J.,
>
> I was on vacation, sorry for the slow reply. The updated benchmark
> still shows the same behavior, thanks.
>
> I'll try my hand at creating a patch that makes that variable
> __x86_shared_non_temporal_threshold a tunable. It will be necessary
> to do internal experiments anyway.
>

__x86_shared_non_temporal_threshold was set to 6 times the per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on an 8-core processor. For a processor with more than 8 cores, the
threshold is too low. Set __x86_shared_non_temporal_threshold to 3/4
of the total shared cache size so that it is unchanged on 8-core
processors. On processors with fewer than 8 cores, the threshold is
lower.

Any comments?

From bfb716e07b77f0ed8e0c2689d5cd01e2c8251fc5 Mon Sep 17 00:00:00 2001
From: "H.J. Lu"
Date: Fri, 12 May 2017 13:38:04 -0700
Subject: [PATCH] x86: Update __x86_shared_non_temporal_threshold

__x86_shared_non_temporal_threshold was set to 6 times the per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on an 8-core processor. For a processor with more than 8 cores, the
threshold is too low. Set __x86_shared_non_temporal_threshold to 3/4
of the total shared cache size so that it is unchanged on 8-core
processors. On processors with fewer than 8 cores, the threshold is
lower.

	* sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold):
	Set to 3/4 of the total shared cache size.
---
 sysdeps/x86/cacheinfo.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 1ccbe41..3434d97 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -766,6 +766,8 @@ intel_bug_no_cache_info:

   /* The large memcpy micro benchmark in glibc shows that 6 times of
      shared cache size is the approximate value above which non-temporal
-     store becomes faster. */
-  __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6;
+     store becomes faster on a 8-core processor. This is the 3/4 of the
+     total shared cache size. */
+  __x86_shared_non_temporal_threshold
+    = __x86_shared_cache_size * threads * 3 / 4;
 }
-- 
2.9.4