Message ID | 878sbqy0rr.fsf@oldenburg2.str.redhat.com |
---|---|
State | Superseded |
Headers |
To: Sajan Karumanchi <sajan.karumanchi@amd.com>
Subject: [PATCH 2.32] x86: Optimizing memcpy for AMD Zen architecture.
Date: Wed, 28 Oct 2020 13:48:24 +0100
Message-ID: <878sbqy0rr.fsf@oldenburg2.str.redhat.com>
From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Florian Weimer <fweimer@redhat.com>
Cc: Premachandra Mallappa <premachandra.mallappa@amd.com>, libc-alpha@sourceware.org |
Series |
[2.32] x86: Optimizing memcpy for AMD Zen architecture.
|
|
Commit Message
Florian Weimer
Oct. 28, 2020, 12:48 p.m. UTC
I would like to backport your fix to various stable release branches.
Due to some refactoring, the patch doesn't apply cleanly. Would you
kindly review this 2.32 backport? Thanks.

Florian

8<------------------------------------------------------------------8<

From: Sajan Karumanchi <sajan.karumanchi@amd.com>

Modifying the shareable cache '__x86_shared_cache_size', which is a
factor in computing the non-temporal threshold parameter
'__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen
architectures.
In the existing implementation, the shareable cache is computed as 'L3
per thread, L2 per core'. Recomputing this shareable cache as 'L3 per
CCX(Core-Complex)' has brought in performance gains.
As per the large bench variant results, this patch also addresses the
regression problem on AMD Zen architectures.

Backport of commit 59803e81f96b479c17f583b31eac44b57591a1bf upstream.

Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>

---
 sysdeps/x86/cacheinfo.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)
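To make the change described in the commit message concrete, here is a minimal sketch of the two ways of deriving the shared cache size. The function and parameter names are hypothetical and are not glibc identifiers; only the arithmetic follows the description ('L3 per thread, L2 per core' versus 'L3 per CCX').

```c
#include <stddef.h>

/* Hypothetical helper: the existing scheme, where each thread gets an
   equal slice of the L3 and the per-core L2 is added on top.  */
size_t
shared_l3_per_thread_plus_l2 (size_t l3_size, size_t l2_size,
                              unsigned int threads)
{
  size_t shared = l3_size;
  if (threads > 0)
    shared /= threads;
  return shared + l2_size;
}

/* Hypothetical helper: the scheme proposed for Zen, where the per-thread
   L3 slice is scaled by the number of threads sharing the L3 in one CCX.  */
size_t
shared_l3_per_ccx (size_t l3_size, unsigned int threads,
                   unsigned int threads_per_ccx)
{
  size_t shared = l3_size;
  if (threads > 0)
    shared /= threads;
  return shared * threads_per_ccx;
}
```

With assumed inputs of a 64 MiB L3, a 512 KiB L2, 32 threads and 8 threads per CCX, the first helper yields 2.5 MiB and the second 16 MiB. These figures are illustrative only and are not measurements from any specific part.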
Comments
On Wed, Oct 28, 2020 at 5:48 AM Florian Weimer via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> I would like to backport your fix to various stable release branches.
> Due to some refactoring, the patch doesn't apply cleanly. Would you
> kindly review this 2.32 backport? Thanks.
>
> Florian
> 8<------------------------------------------------------------------8<
>
> From: Sajan Karumanchi <sajan.karumanchi@amd.com>
>
> Modifying the shareable cache '__x86_shared_cache_size', which is a
> factor in computing the non-temporal threshold parameter
> '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen
> architectures.
> In the existing implementation, the shareable cache is computed as 'L3
> per thread, L2 per core'. Recomputing this shareable cache as 'L3 per
> CCX(Core-Complex)' has brought in performance gains.
> As per the large bench variant results, this patch also addresses the
> regression problem on AMD Zen architectures.
>
> Backport of commit 59803e81f96b479c17f583b31eac44b57591a1bf upstream.
>
> Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
>
> ---
>  sysdeps/x86/cacheinfo.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index dadec5d58f..2c06435170 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -808,7 +808,7 @@ init_cacheinfo (void)
>               threads = 1 << ((ecx >> 12) & 0x0f);
>             }
>
> -         if (threads == 0)
> +         if (threads == 0 || cpu_features->basic.family >= 0x17)
>             {
>               /* If APIC ID width is not available, use logical
>                  processor count. */
> @@ -823,13 +823,30 @@ init_cacheinfo (void)
>           if (threads > 0)
>             shared /= threads;
>
> -         /* Account for exclusive L2 and L3 caches. */
> -         shared += core;
> +         /* Get shared cache per ccx for Zen architectures. */
> +         if (cpu_features->basic.family >= 0x17)
> +           {
> +             unsigned int eax;
> +
> +             /* Get number of threads share the L3 cache in CCX. */
> +             __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx);
> +
> +             unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1;
> +             shared *= threads_per_ccx;
> +           }
> +         else
> +           {
> +             /* Account for exclusive L2 and L3 caches. */
> +             shared += core;
> +           }
>         }
>      }
>
>    if (cpu_features->data_cache_size != 0)
> -    data = cpu_features->data_cache_size;
> +    {
> +      if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
> +        data = cpu_features->data_cache_size;
> +    }

This is wrong.

>    if (data > 0)
>      {
> @@ -842,7 +859,10 @@ init_cacheinfo (void)
>      }
>
>    if (cpu_features->shared_cache_size != 0)
> -    shared = cpu_features->shared_cache_size;
> +    {
> +      if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
> +        shared = cpu_features->shared_cache_size;
> +    }

This is wrong.

>    if (shared > 0)
>      {
>
> --
> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
>
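The two hunks flagged in this review share one condition. The sketch below isolates what that condition does, without claiming to capture the reviewer's reasoning. The function name `select_cache_size` and its parameters are hypothetical: `detected` stands for the locally computed `data` or `shared` value, `stored` for the corresponding `cpu_features` field, and `is_amd` for `cpu_features->basic.kind == arch_kind_amd`.

```c
#include <stddef.h>

/* Hypothetical stand-in for the patched selection logic:
   a non-zero stored value replaces the detected one, except on AMD
   where it is used only when detection produced zero.  */
size_t
select_cache_size (size_t detected, size_t stored, int is_amd)
{
  if (stored != 0)
    {
      if (detected == 0 || !is_amd)
        detected = stored;
    }
  return detected;
}
```

Under this logic, `select_cache_size (32768, 65536, 1)` keeps the detected 32768, while the same call with `is_amd` set to 0 returns 65536. In other words, a non-zero stored value no longer takes effect on AMD once detection has succeeded.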
[AMD Public Use]

Hi Florian,

The backport to 2.32 looks clean. But make sure this backporting is done on top of Patrick McGehearty patch, which tunes the non-temporal threshold '__x86_shared_non_temporal_threshold' parameter by dropping of 'threads' parameter.

-       : __x86_shared_cache_size * threads * 3 / 4);
+       : __x86_shared_cache_size * 3 / 4);

Thanks & Regards,
Sajan K.
* Sajan Karumanchi:

> [AMD Public Use]
>
> Hi Florian,
>
> The backport to 2.32 looks clean. But make sure this backporting is done on top of Patrick McGehearty patch, which tunes the non-temporal threshold '__x86_shared_non_temporal_threshold' parameter by dropping of 'threads' parameter.
>
> -       : __x86_shared_cache_size * threads * 3 / 4);
> +       : __x86_shared_cache_size * 3 / 4);

Yes, that's already on the branch. I'm going to repost this backport
patch with the follow-up fix applied.

Thanks,
Florian
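For reference, the threshold change quoted in this exchange can be sketched as two small functions. The function names are hypothetical; the expressions are taken from the two quoted lines, and the surrounding conditional in glibc is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical name: threshold as computed before the quoted change,
   scaled by the thread count.  */
size_t
non_temporal_threshold_with_threads (size_t shared_cache_size,
                                     unsigned int threads)
{
  return shared_cache_size * threads * 3 / 4;
}

/* Hypothetical name: threshold after the quoted change, which drops
   the 'threads' factor.  */
size_t
non_temporal_threshold_without_threads (size_t shared_cache_size)
{
  return shared_cache_size * 3 / 4;
}
```

Because the thread factor is gone from the threshold, the value of `__x86_shared_cache_size` determines it directly, which is why the order in which the two patches land matters.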
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index dadec5d58f..2c06435170 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -808,7 +808,7 @@ init_cacheinfo (void)
               threads = 1 << ((ecx >> 12) & 0x0f);
             }
 
-          if (threads == 0)
+          if (threads == 0 || cpu_features->basic.family >= 0x17)
             {
               /* If APIC ID width is not available, use logical
                  processor count. */
@@ -823,13 +823,30 @@ init_cacheinfo (void)
           if (threads > 0)
             shared /= threads;
 
-          /* Account for exclusive L2 and L3 caches. */
-          shared += core;
+          /* Get shared cache per ccx for Zen architectures. */
+          if (cpu_features->basic.family >= 0x17)
+            {
+              unsigned int eax;
+
+              /* Get number of threads share the L3 cache in CCX. */
+              __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx);
+
+              unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1;
+              shared *= threads_per_ccx;
+            }
+          else
+            {
+              /* Account for exclusive L2 and L3 caches. */
+              shared += core;
+            }
         }
     }
 
   if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
+    {
+      if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
+        data = cpu_features->data_cache_size;
+    }
 
   if (data > 0)
     {
@@ -842,7 +859,10 @@ init_cacheinfo (void)
     }
 
   if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
+    {
+      if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
+        shared = cpu_features->shared_cache_size;
+    }
 
   if (shared > 0)
     {
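The patch derives the CCX thread count from the EAX value returned by CPUID leaf 0x8000001D, subleaf 3. Below is a minimal sketch of only the bit-field decoding, using the same expression as the patch. The function name is hypothetical, and it takes the raw EAX value as a parameter rather than executing CPUID, so it can be checked on any machine.

```c
#include <stdint.h>

/* Hypothetical helper mirroring ((eax >> 14) & 0xfff) + 1 from the patch:
   the 12-bit field at EAX[25:14] is treated as "threads sharing this
   cache, minus one".  */
unsigned int
threads_sharing_cache (uint32_t eax)
{
  return ((eax >> 14) & 0xfff) + 1;
}
```

For example, an EAX value whose field at bits 25:14 holds 7 decodes to 8 threads; bits outside that field do not affect the result.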