[2.32] x86: Optimizing memcpy for AMD Zen architecture.

Message ID 878sbqy0rr.fsf@oldenburg2.str.redhat.com
State Superseded
Headers
Series [2.32] x86: Optimizing memcpy for AMD Zen architecture. |

Commit Message

Florian Weimer Oct. 28, 2020, 12:48 p.m. UTC
  I would like to backport your fix to various stable release branches.
Due to some refactoring, the patch doesn't apply cleanly.  Would you
kindly review this 2.32 backport?  Thanks.

Florian
8<------------------------------------------------------------------8<

From: Sajan Karumanchi <sajan.karumanchi@amd.com>

Modifying the shareable cache '__x86_shared_cache_size', which is a
factor in computing the non-temporal threshold parameter
'__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen
architectures.
In the existing implementation, the shareable cache is computed as 'L3
per thread, L2 per core'. Recomputing this shareable cache as 'L3 per
CCX(Core-Complex)' has brought in performance gains.
As per the large bench variant results, this patch also addresses the
regression problem on AMD Zen architectures.

Backport of commit 59803e81f96b479c17f583b31eac44b57591a1bf upstream.

Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>

---
 sysdeps/x86/cacheinfo.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)
  

Comments

H.J. Lu Oct. 28, 2020, 2:01 p.m. UTC | #1
On Wed, Oct 28, 2020 at 5:48 AM Florian Weimer via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> I would like to backport your fix to various stable release branches.
> Due to some refactoring, the patch doesn't apply cleanly.  Would you
> kindly review this 2.32 backport?  Thanks.
>
> Florian
> 8<------------------------------------------------------------------8<
>
> From: Sajan Karumanchi <sajan.karumanchi@amd.com>
>
> Modifying the shareable cache '__x86_shared_cache_size', which is a
> factor in computing the non-temporal threshold parameter
> '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen
> architectures.
> In the existing implementation, the shareable cache is computed as 'L3
> per thread, L2 per core'. Recomputing this shareable cache as 'L3 per
> CCX(Core-Complex)' has brought in performance gains.
> As per the large bench variant results, this patch also addresses the
> regression problem on AMD Zen architectures.
>
> Backport of commit 59803e81f96b479c17f583b31eac44b57591a1bf upstream.
>
> Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
>
> ---
>  sysdeps/x86/cacheinfo.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index dadec5d58f..2c06435170 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -808,7 +808,7 @@ init_cacheinfo (void)
>               threads = 1 << ((ecx >> 12) & 0x0f);
>             }
>
> -         if (threads == 0)
> +         if (threads == 0 || cpu_features->basic.family >= 0x17)
>             {
>               /* If APIC ID width is not available, use logical
>                  processor count.  */
> @@ -823,13 +823,30 @@ init_cacheinfo (void)
>           if (threads > 0)
>             shared /= threads;
>
> -         /* Account for exclusive L2 and L3 caches.  */
> -         shared += core;
> +         /* Get shared cache per ccx for Zen architectures.  */
> +         if (cpu_features->basic.family >= 0x17)
> +           {
> +             unsigned int eax;
> +
> +             /* Get number of threads share the L3 cache in CCX.  */
> +             __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx);
> +
> +             unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1;
> +             shared *= threads_per_ccx;
> +           }
> +         else
> +           {
> +             /* Account for exclusive L2 and L3 caches.  */
> +             shared += core;
> +            }
>         }
>      }
>
>    if (cpu_features->data_cache_size != 0)
> -    data = cpu_features->data_cache_size;
> +    {
> +      if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
> +        data = cpu_features->data_cache_size;
> +    }

This is wrong.

>    if (data > 0)
>      {
> @@ -842,7 +859,10 @@ init_cacheinfo (void)
>      }
>
>    if (cpu_features->shared_cache_size != 0)
> -    shared = cpu_features->shared_cache_size;
> +    {
> +      if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
> +        shared = cpu_features->shared_cache_size;
> +    }

This is wrong.

>    if (shared > 0)
>      {
>
> --
> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
>
  
Karumanchi, Sajan Oct. 30, 2020, 6 a.m. UTC | #2
[AMD Public Use]

Hi Florian,

The backport to 2.32 looks clean. But make sure this backporting is done on top of Patrick McGehearty patch, which tunes the non-temporal threshold '__x86_shared_non_temporal_threshold' parameter by dropping of 'threads' parameter.

-       : __x86_shared_cache_size * threads * 3 / 4);
+       : __x86_shared_cache_size * 3 / 4);

Thanks & Regards,
Sajan K.

-----Original Message-----
From: Florian Weimer <fweimer@redhat.com> 
Sent: Wednesday, October 28, 2020 6:18 PM
To: Karumanchi, Sajan <Sajan.Karumanchi@amd.com>
Cc: libc-alpha@sourceware.org; Mallappa, Premachandra <Premachandra.Mallappa@amd.com>
Subject: [PATCH 2.32] x86: Optimizing memcpy for AMD Zen architecture.

[CAUTION: External Email]

I would like to backport your fix to various stable release branches.
Due to some refactoring, the patch doesn't apply cleanly.  Would you kindly review this 2.32 backport?  Thanks.

Florian
8<------------------------------------------------------------------8<

From: Sajan Karumanchi <sajan.karumanchi@amd.com>

Modifying the shareable cache '__x86_shared_cache_size', which is a factor in computing the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen architectures.
In the existing implementation, the shareable cache is computed as 'L3 per thread, L2 per core'. Recomputing this shareable cache as 'L3 per CCX(Core-Complex)' has brought in performance gains.
As per the large bench variant results, this patch also addresses the regression problem on AMD Zen architectures.

Backport of commit 59803e81f96b479c17f583b31eac44b57591a1bf upstream.

Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>

---
 sysdeps/x86/cacheinfo.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c index dadec5d58f..2c06435170 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -808,7 +808,7 @@ init_cacheinfo (void)
              threads = 1 << ((ecx >> 12) & 0x0f);
            }

-         if (threads == 0)
+         if (threads == 0 || cpu_features->basic.family >= 0x17)
            {
              /* If APIC ID width is not available, use logical
                 processor count.  */
@@ -823,13 +823,30 @@ init_cacheinfo (void)
          if (threads > 0)
            shared /= threads;

-         /* Account for exclusive L2 and L3 caches.  */
-         shared += core;
+         /* Get shared cache per ccx for Zen architectures.  */
+         if (cpu_features->basic.family >= 0x17)
+           {
+             unsigned int eax;
+
+             /* Get number of threads share the L3 cache in CCX.  */
+             __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx);
+
+             unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1;
+             shared *= threads_per_ccx;
+           }
+         else
+           {
+             /* Account for exclusive L2 and L3 caches.  */
+             shared += core;
+            }
        }
     }

   if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
+    {
+      if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
+        data = cpu_features->data_cache_size;
+    }

   if (data > 0)
     {
@@ -842,7 +859,10 @@ init_cacheinfo (void)
     }

   if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
+    {
+      if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
+        shared = cpu_features->shared_cache_size;
+    }

   if (shared > 0)
     {

--
Red Hat GmbH, https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fde.redhat.com%2F&amp;data=04%7C01%7Csajan.karumanchi%40amd.com%7Cb690e19cf6b046ddd0f308d87b3fc62c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637394861120154864%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=PSVaJQ6%2BaWCza0ibhKmfwYOGLx9pyJt9yGMH4kntq%2Bk%3D&amp;reserved=0 , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
  
Florian Weimer Oct. 30, 2020, 8:44 a.m. UTC | #3
* Sajan Karumanchi:

> [AMD Public Use]
>
> Hi Florian,
>
> The backport to 2.32 looks clean. But make sure this backporting is done on top of Patrick McGehearty patch, which tunes the non-temporal threshold '__x86_shared_non_temporal_threshold' parameter by dropping of 'threads' parameter.
>
> -       : __x86_shared_cache_size * threads * 3 / 4);
> +       : __x86_shared_cache_size * 3 / 4);

Yes, that's already on the branch.

I'm going to repost this backport patch with the follow-up fix applied.

Thanks,
Florian
  

Patch

diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index dadec5d58f..2c06435170 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -808,7 +808,7 @@  init_cacheinfo (void)
 	      threads = 1 << ((ecx >> 12) & 0x0f);
 	    }
 
-	  if (threads == 0)
+	  if (threads == 0 || cpu_features->basic.family >= 0x17)
 	    {
 	      /* If APIC ID width is not available, use logical
 		 processor count.  */
@@ -823,13 +823,30 @@  init_cacheinfo (void)
 	  if (threads > 0)
 	    shared /= threads;
 
-	  /* Account for exclusive L2 and L3 caches.  */
-	  shared += core;
+	  /* Get shared cache per ccx for Zen architectures.  */
+	  if (cpu_features->basic.family >= 0x17)
+	    {
+	      unsigned int eax;
+
+	      /* Get number of threads share the L3 cache in CCX.  */
+	      __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx);
+
+	      unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1;
+	      shared *= threads_per_ccx;
+	    }
+	  else
+	    {
+	      /* Account for exclusive L2 and L3 caches.  */
+	      shared += core;
+            }
 	}
     }
 
   if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
+    {
+      if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
+        data = cpu_features->data_cache_size;
+    }
 
   if (data > 0)
     {
@@ -842,7 +859,10 @@  init_cacheinfo (void)
     }
 
   if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
+    {
+      if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
+        shared = cpu_features->shared_cache_size;
+    }
 
   if (shared > 0)
     {