Count number of logical processors sharing L2 cache

Message ID: CAMe9rOoq8MNkX0GvoePQ-C51mfUr2ikrRJgqCZE0CoGoJEmOOw@mail.gmail.com
State: New, archived

Commit Message

H.J. Lu May 19, 2016, 5:50 p.m. UTC
  On Fri, May 13, 2016 at 1:39 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
> We need to count the number of available logical processors sharing
> the L2 cache.

For Intel processors, when there are both L2 and L3 caches, the SMT
level type should be used to count the number of available logical
processors sharing the L2 cache.  If there is only an L2 cache, the
core level type should be used to count the number of available
logical processors sharing the L2 cache.  The number of available
logical processors sharing the L2 cache should be used for
non-inclusive L2 and L3 caches.

Any comments?
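
For reference, the level types and sharing counts above come from
CPUID leaves 4 and 11.  Here is a minimal standalone sketch (not the
glibc code; it assumes an Intel x86 CPU and GCC's <cpuid.h>) that
dumps both:

/* Sketch only: dump the cache sharing counts (CPUID leaf 4) and the
   topology level types (CPUID leaf 11) that the patch below reads.
   Build with "gcc -O2 cpuid-sketch.c"; the file name is arbitrary.  */
#include <stdio.h>
#include <cpuid.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (__get_cpuid_max (0, 0) < 11)
    return 1;

  /* Leaf 4: one subleaf per cache.  EAX[4:0] is the cache type (0 ends
     the enumeration), EAX[7:5] the level, and EAX[25:14] the maximum
     number of addressable IDs for logical processors sharing the
     cache, minus 1.  */
  for (int i = 0; ; i++)
    {
      __cpuid_count (4, i, eax, ebx, ecx, edx);
      if ((eax & 0x1f) == 0)
        break;
      printf ("L%u cache: up to %u logical processor IDs share it\n",
              (eax >> 5) & 0x7, ((eax >> 14) & 0xfff) + 1);
    }

  /* Leaf 11: one subleaf per topology level.  ECX[15:8] is the level
     type (1 = SMT, 2 = core) and EBX[15:0] the number of logical
     processors at that level.  */
  for (int i = 0; ; i++)
    {
      __cpuid_count (11, i, eax, ebx, ecx, edx);
      unsigned int type = (ecx >> 8) & 0xff;
      if ((ebx & 0xffff) == 0 || type == 0)
        break;
      printf ("level %d (%s): %u logical processors\n", i,
              type == 1 ? "SMT" : type == 2 ? "core" : "other",
              ebx & 0xffff);
    }
  return 0;
}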
  

Comments

Florian Weimer May 20, 2016, 8:56 a.m. UTC | #1
On 05/19/2016 07:50 PM, H.J. Lu wrote:
> On Fri, May 13, 2016 at 1:39 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> We need to count the number of available logical processors sharing
>> the L2 cache.
>
> For Intel processors, when there are both L2 and L3 caches, the SMT
> level type should be used to count the number of available logical
> processors sharing the L2 cache.  If there is only an L2 cache, the
> core level type should be used to count the number of available
> logical processors sharing the L2 cache.  The number of available
> logical processors sharing the L2 cache should be used for
> non-inclusive L2 and L3 caches.
>
> Any comments?

Is this accounting even relevant anymore, now that cache allocation can 
be tweaked dynamically?

Thanks,
Florian
  
H.J. Lu May 20, 2016, 12:13 p.m. UTC | #2
On Fri, May 20, 2016 at 1:56 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 05/19/2016 07:50 PM, H.J. Lu wrote:
>>
>> On Fri, May 13, 2016 at 1:39 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>>
>>> We need to count the number of available logical processors sharing
>>> the L2 cache.
>>
>>
>> For Intel processors, when there are both L2 and L3 caches, the SMT
>> level type should be used to count the number of available logical
>> processors sharing the L2 cache.  If there is only an L2 cache, the
>> core level type should be used to count the number of available
>> logical processors sharing the L2 cache.  The number of available
>> logical processors sharing the L2 cache should be used for
>> non-inclusive L2 and L3 caches.
>>
>> Any comments?
>
>
> Is this accounting even relevant anymore, now that cache allocation can be
> tweaked dynamically?

Can you elaborate?
  
Florian Weimer May 23, 2016, 2:14 p.m. UTC | #3
On 05/20/2016 02:13 PM, H.J. Lu wrote:
> On Fri, May 20, 2016 at 1:56 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 05/19/2016 07:50 PM, H.J. Lu wrote:
>>>
>>> On Fri, May 13, 2016 at 1:39 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>>
>>>> We need to count the number of available logical processors sharing
>>>> the L2 cache.
>>>
>>>
>>> For Intel processors, when there are both L2 and L3 caches, the SMT
>>> level type should be used to count the number of available logical
>>> processors sharing the L2 cache.  If there is only an L2 cache, the
>>> core level type should be used to count the number of available
>>> logical processors sharing the L2 cache.  The number of available
>>> logical processors sharing the L2 cache should be used for
>>> non-inclusive L2 and L3 caches.
>>>
>>> Any comments?
>>
>>
>> Is this accounting even relevant anymore, now that cache allocation can be
>> tweaked dynamically?
>
> Can you elaborate?

I'm wondering how this

<https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology>

technology affects what glibc records here.  How accurate are the values 
glibc computes?  Do they need to be recomputed during the life of a 
process?  Is it still possible to obtain a reasonable approximation of 
the overall CPU cache system from within a userspace process?

Thanks,
Florian
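
For what it's worth, the values glibc computes are visible from user
space through sysconf.  A quick check, using glibc's
_SC_LEVEL*_CACHE_SIZE extensions (a sketch; it reports what
init_cacheinfo() detected at startup, not any later change):

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* These sysconf names are glibc extensions; a nonpositive result
     means the size was not determined.  */
  printf ("L2: %ld bytes\n", sysconf (_SC_LEVEL2_CACHE_SIZE));
  printf ("L3: %ld bytes\n", sysconf (_SC_LEVEL3_CACHE_SIZE));
  return 0;
}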
  
H.J. Lu May 24, 2016, 3:02 p.m. UTC | #4
On Mon, May 23, 2016 at 7:14 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 05/20/2016 02:13 PM, H.J. Lu wrote:
>>
>> On Fri, May 20, 2016 at 1:56 AM, Florian Weimer <fweimer@redhat.com>
>> wrote:
>>>
>>> On 05/19/2016 07:50 PM, H.J. Lu wrote:
>>>>
>>>>
>>>> On Fri, May 13, 2016 at 1:39 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> We need to count the number of available logical processors sharing
>>>>> the L2 cache.
>>>>
>>>>
>>>>
>>>> For Intel processors, when there are both L2 and L3 caches, the SMT
>>>> level type should be used to count the number of available logical
>>>> processors sharing the L2 cache.  If there is only an L2 cache, the
>>>> core level type should be used to count the number of available
>>>> logical processors sharing the L2 cache.  The number of available
>>>> logical processors sharing the L2 cache should be used for
>>>> non-inclusive L2 and L3 caches.
>>>>
>>>> Any comments?
>>>
>>>
>>>
>>> Is this accounting even relevant anymore, now that cache allocation can
>>> be
>>> tweaked dynamically?
>>
>>
>> Can you elaborate?
>
>
> I'm wondering how this
>
> <https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology>
>
> technology affects what glibc records here.  How accurate are the values
> glibc computes?  Do they need to be recomputed during the life of a process?
> Is it still possible to obtain a reasonable approximation of the overall
> CPU cache system from within a userspace process?
>

CAT applies to a specific thread/process.  Cache sizes in glibc are applied
to string/memory functions for all threads/processes.  Both try to keep a
single thread/process from over-using the shared cache, but they work at
different levels and have different behaviors.  Glibc also uses the cache
size to decide when to use non-temporal stores, which avoid cache pollution
and speed up writing large amounts of data.
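
As a minimal sketch of the non-temporal idea (not glibc's
implementation; the threshold constant is a placeholder, and it
assumes SSE2 with a 16-byte aligned destination and a size that is a
multiple of 16):

#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Placeholder threshold; glibc derives its own value from the
   detected shared cache size.  */
#define NT_THRESHOLD (4UL * 1024 * 1024)

static void
copy_bytes (void *dst, const void *src, size_t n)
{
  if (n < NT_THRESHOLD)
    {
      /* Small copy: ordinary stores, let the data stay cached.  */
      memcpy (dst, src, n);
      return;
    }
  /* Large copy: streaming stores bypass the cache, so one huge copy
     does not evict everything else that is cached.  */
  __m128i *d = (__m128i *) dst;
  const __m128i *s = (const __m128i *) src;
  for (size_t i = 0; i < n / 16; i++)
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  _mm_sfence ();
}

int
main (void)
{
  static char src[32u << 20] __attribute__ ((aligned (16)));
  static char dst[32u << 20] __attribute__ ((aligned (16)));
  copy_bytes (dst, src, sizeof dst);
  return 0;
}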
  
Carlos O'Donell May 24, 2016, 5:49 p.m. UTC | #5
On 05/24/2016 11:02 AM, H.J. Lu wrote:
> CAT applies to a specific thread/process.  Cache sizes in glibc are applied
> to string/memory functions for all threads/processes.  Both try to keep a
> single thread/process from over-using the shared cache, but they work at
> different levels and have different behaviors.  Glibc also uses the cache
> size to decide when to use non-temporal stores, which avoid cache pollution
> and speed up writing large amounts of data.
 
Don't you mean that CAT applies to a core (and all of its logical cores)?

Might it be the case that a thread or process could be migrated by the Linux kernel
between various cores configured with different CAT values, and the glibc heuristics
could be poorly tuned for some of those cores?

As I see it, the values computed by init_cacheinfo() are only average heuristics for
the core.

I agree that Florian has a point, that these values may become less useful in the
presence of the dynamically changing L3<->core partitioning enabled by CAT.

It is silly, though, to think that you would allow a thread or process to migrate
away from the CAT-tuned core. The design of CAT is such that you want to isolate
the tuned application to one or more cores and use CAT to control the L3 allocation
for those cores.

In the case where you have a process pinned to a core, and CAT is used to limit the
L3 of that core, do the glibc heuristics computed in init_cacheinfo() match the
reality of the L3<->core allocation? Or would a lower L3 CAT-tuned value mean that
glibc would be mis-tuned for that core?
  
H.J. Lu May 24, 2016, 9:35 p.m. UTC | #6
On Tue, May 24, 2016 at 10:49 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 05/24/2016 11:02 AM, H.J. Lu wrote:
>> CAT applies to a specific thread/process.  Cache sizes in glibc are applied
>> to string/memory functions for all threads/processes.  Both try to keep a
>> single thread/process from over-using the shared cache, but they work at
>> different levels and have different behaviors.  Glibc also uses the cache
>> size to decide when to use non-temporal stores, which avoid cache pollution
>> and speed up writing large amounts of data.
>
> Don't you mean that CAT applies to a core (and all of its logical cores)?
>
> Might it be the case that a thread or process could be migrated by the Linux kernel
> between various cores configured with different CAT values, and the glibc heuristics
> could be poorly tuned for some of those cores?
>
> As I see it, the values computed by init_cacheinfo() are only average heuristics for
> the core.
>
> I agree that Florian has a point, that these values may become less useful in the
> presence of the dynamically changing L3<->core partitioning enabled by CAT.
>
> It is silly, though, to think that you would allow a thread or process to migrate
> away from the CAT-tuned core. The design of CAT is such that you want to isolate
> the tuned application to one or more cores and use CAT to control the L3 allocation
> for those cores.

I checked with our kernel CAT implementer.  CAT supports both
per-processor and per-process assignment.

> In the case where you have a process pinned to a core, and CAT is used to limit the
> L3 of that core, do the glibc heuristics computed in init_cacheinfo() match the
> reality of the L3<->core allocation? Or would a lower L3 CAT-tuned value mean that
> glibc would be mis-tuned for that core?

CAT dedicates part of the L3 cache to a certain processor or process so
that L3 cache is always available to it.  Glibc tries not to take all of
the L3 cache in memcpy/memset so that L3 cache remains available for
other operations within the same process as well as for other
processors/processes.  CAT and glibc work from different angles.  There
is no direct conflict between CAT and glibc.  At the moment, I am not
sure if a CAT-aware glibc will improve performance.
  
H.J. Lu May 27, 2016, 10 p.m. UTC | #7
On Tue, May 24, 2016 at 2:35 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, May 24, 2016 at 10:49 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 05/24/2016 11:02 AM, H.J. Lu wrote:
>>> CAT applies to a specific thread/process.  Cache sizes in glibc are applied
>>> to string/memory functions for all threads/processes.  Both try to keep a
>>> single thread/process from over-using the shared cache, but they work at
>>> different levels and have different behaviors.  Glibc also uses the cache
>>> size to decide when to use non-temporal stores, which avoid cache pollution
>>> and speed up writing large amounts of data.
>>
>> Don't you mean that CAT applies to a core (and all of its logical cores)?
>>
>> Might it be the case that a thread or process could be migrated by the Linux kernel
>> between various cores configured with different CAT values, and the glibc heuristics
>> could be poorly tuned for some of those cores?
>>
>> As I see it, the values computed by init_cacheinfo() are only average heuristics for
>> the core.
>>
>> I agree that Florian has a point, that these values may become less useful in the
>> presence of the dynamically changing L3<->core partitioning enabled by CAT.
>>
>> It is silly, though, to think that you would allow a thread or process to migrate
>> away from the CAT-tuned core. The design of CAT is such that you want to isolate
>> the tuned application to one or more cores and use CAT to control the L3 allocation
>> for those cores.
>
> I checked with our kernel CAT implementer.  CAT supports both
> per-processor and per-process assignment.
>
>> In the case where you have a process pinned to a core, and CAT is used to limit the
>> L3 of that core, do the glibc heuristics computed in init_cacheinfo() match the
>> reality of the L3<->core allocation? Or would a lower L3 CAT-tuned value mean that
>> glibc would be mis-tuned for that core?
>
> CAT dedicates part of the L3 cache to a certain processor or process so
> that L3 cache is always available to it.  Glibc tries not to take all of
> the L3 cache in memcpy/memset so that L3 cache remains available for
> other operations within the same process as well as for other
> processors/processes.  CAT and glibc work from different angles.  There
> is no direct conflict between CAT and glibc.  At the moment, I am not
> sure if a CAT-aware glibc will improve performance.

I will check in my patch shortly.
  
Carlos O'Donell May 31, 2016, 2:01 p.m. UTC | #8
On 05/24/2016 05:35 PM, H.J. Lu wrote:
> CAT dedicates part of the L3 cache to a certain processor or process so
> that L3 cache is always available to it.  Glibc tries not to take all of
> the L3 cache in memcpy/memset so that L3 cache remains available for
> other operations within the same process as well as for other
> processors/processes.  CAT and glibc work from different angles.  There
> is no direct conflict between CAT and glibc.  At the moment, I am not
> sure if a CAT-aware glibc will improve performance.
 
What do you mean by "no direct conflict"?

If glibc tunes its own algorithms to use 1/4 of L3, but CAT has only
allocated 1/5 of L3 to that process, then glibc's algorithms, whose
intent was to use a small amount of L3, are now using *more* L3 than
the entire process has, and that could impact performance?

Did I miss something?

There could be opportunities for incorrect tunings if glibc is not
CAT-aware?
  
H.J. Lu May 31, 2016, 2:57 p.m. UTC | #9
On Tue, May 31, 2016 at 7:01 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 05/24/2016 05:35 PM, H.J. Lu wrote:
>> CAT dedicates part of the L3 cache to a certain processor or process so
>> that L3 cache is always available to it.  Glibc tries not to take all of
>> the L3 cache in memcpy/memset so that L3 cache remains available for
>> other operations within the same process as well as for other
>> processors/processes.  CAT and glibc work from different angles.  There
>> is no direct conflict between CAT and glibc.  At the moment, I am not
>> sure if a CAT-aware glibc will improve performance.
>
> What do you mean by "no direct conflict"?
>
> If glibc tunes its own algorithms to use 1/4 of L3, but CAT has only
> allocated 1/5 of L3 to that process, then glibc's algorithms, whose
> intent was to use a small amount of L3, are now using *more* L3 than
> the entire process has, and that could impact performance?

Cache sizes are only used for instruction selection in
string/memory functions, which don't use the cache directly.
Glibc has NO control whatsoever over how much cache
string/memory functions use.

> Did I miss something?
>
> There could be opportunities for incorrect tunings if glibc is not
> CAT-aware?
>

Since CAT can change at any time during the life of a process,
checking CAT inside glibc isn't very practical.
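
The selection also happens only once, near startup, via IFUNC
resolution.  A hypothetical sketch of that mechanism, with made-up
names (assumes GCC, ELF, and x86-64; glibc's real resolvers live in
the multiarch code and also consult the computed cache sizes):

#include <stddef.h>
#include <string.h>
#include <cpuid.h>

static void *
copy_generic (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static void *
copy_erms (void *d, const void *s, size_t n)
{
  void *ret = d;
  /* x86-64 only: "rep movsb" copies RCX bytes from (RSI) to (RDI).  */
  asm volatile ("rep movsb" : "+D" (d), "+S" (s), "+c" (n) : : "memory");
  return ret;
}

/* The dynamic linker runs the resolver once, when it resolves my_copy;
   whichever implementation the resolver returns is then called
   directly, and the library has no further say in how much cache that
   code touches.  */
static void *
(*resolve_my_copy (void)) (void *, const void *, size_t)
{
  unsigned int eax, ebx, ecx, edx;

  /* ERMS ("enhanced REP MOVSB") is CPUID.(EAX=7,ECX=0):EBX bit 9.  */
  if (__get_cpuid_max (0, 0) >= 7)
    {
      __cpuid_count (7, 0, eax, ebx, ecx, edx);
      if (ebx & (1u << 9))
        return copy_erms;
    }
  return copy_generic;
}

void *my_copy (void *, const void *, size_t)
     __attribute__ ((ifunc ("resolve_my_copy")));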
  
Carlos O'Donell May 31, 2016, 3:26 p.m. UTC | #10
On 05/31/2016 10:57 AM, H.J. Lu wrote:
> On Tue, May 31, 2016 at 7:01 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 05/24/2016 05:35 PM, H.J. Lu wrote:
>>> CAT dedicates part of the L3 cache to a certain processor or process so
>>> that L3 cache is always available to it.  Glibc tries not to take all of
>>> the L3 cache in memcpy/memset so that L3 cache remains available for
>>> other operations within the same process as well as for other
>>> processors/processes.  CAT and glibc work from different angles.  There
>>> is no direct conflict between CAT and glibc.  At the moment, I am not
>>> sure if a CAT-aware glibc will improve performance.
>>
>> What do you mean by "no direct conflict"?
>>
>> If glibc tunes its own algorithms to use 1/4 of L3, but CAT has only
>> allocated 1/5 of L3 to that process, then glibc's algorithms, whose
>> intent was to use a small amount of L3, are now using *more* L3 than
>> the entire process has, and that could impact performance?
> 
> Cache sizes are only used for instruction selection in
> string/memory functions, which don't use the cache directly.
> Glibc has NO control whatsoever over how much cache
> string/memory functions use.

Even if glibc doesn't use cache directly, it's making algorithm choices
based on those sizes and tunings.

For example:

sysdeps/x86/cacheinfo.c

765   /* The large memcpy micro benchmark in glibc shows that 6 times of
766      shared cache size is the approximate value above which non-temporal
767      store becomes faster.  */
768   __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6;

If, for example, CAT changed the shared L3 cache size to be smaller, would
it invalidate the above static tuning?

If it does, then we should change the comments in several places to mention
CAT.

>> Did I miss something?
>>
>> There could be opportunities for incorrect tunings if glibc is not
>> CAT-aware?
>>
> 
> Since CAT can change at any time during the life of a process,
> checking CAT inside glibc isn't very practical.
 
Most reasonable administrators would use CAT to limit one particular
workload and leave the CAT setting in place until the workload changed.
Therefore it is entirely reasonable IMO to check CAT settings at
process startup and adjust accordingly, e.g. a CAT-aware glibc.
Checking at every use is too expensive and does not mirror what
an administrator would be doing with CAT settings.
  
Paul Eggert May 31, 2016, 3:44 p.m. UTC | #11
On 05/31/2016 08:26 AM, Carlos O'Donell wrote:
> Most reasonable administrators would use CAT to limit one particular
> workload and leave the CAT setting in place until the workload changed.

This sounds like a reasonable assumption. That being said, on fancier 
processors (Intel Xeon E5 v4), the CAT COS tables can change at 
run-time. Plus, even on less-fancy processors I would not be surprised 
if some kernels blindly overwrite a CPU's COSid on context switch, so 
that the last logical thread to switch into a CPU socket determines the 
CAT COS tables for all physical threads currently on that socket. Either 
way, it would be wise to document the assumption, as it reminds me of 
other assumptions (e.g., "the time zone database does not change during 
execution of the program") that formerly were quite reasonable but now 
show signs of strain.
  
Carlos O'Donell May 31, 2016, 3:55 p.m. UTC | #12
On 05/31/2016 11:44 AM, Paul Eggert wrote:
> On 05/31/2016 08:26 AM, Carlos O'Donell wrote:
>> Most reasonable administrators would use CAT to limit one
>> particular workload and leave the CAT setting in place until the
>> workload changed.
> 
> This sounds like a reasonable assumption. That being said, on fancier
> processors (Intel Xeon E5 v4), the CAT COS tables can change at
> run-time. Plus, even on less-fancy processors I would not be
> surprised if some kernels blindly overwrite a CPU's COSid on context
> switch, so that the last logical thread to switch into a CPU socket
> determines the CAT COS tables for all physical threads currently on
> that socket. Either way, it would be wise to document the assumption,
> as it reminds me of other assumptions (e.g., "the time zone database
> does not change during execution of the program") that formerly were
> quite reasonable but now show signs of strain.
 
Agreed. I also believe that "the nameservers do not change during
the execution of the program" is a reasonable thing to want (I'm old),
but I have seen a number of cogent arguments against it, and I'm
slowly starting to change my mind.
  
H.J. Lu May 31, 2016, 4:09 p.m. UTC | #13
On Tue, May 31, 2016 at 8:26 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 05/31/2016 10:57 AM, H.J. Lu wrote:
>> On Tue, May 31, 2016 at 7:01 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>>> On 05/24/2016 05:35 PM, H.J. Lu wrote:
>>>> CAT dedicates part of the L3 cache to a certain processor or process so
>>>> that L3 cache is always available to it.  Glibc tries not to take all of
>>>> the L3 cache in memcpy/memset so that L3 cache remains available for
>>>> other operations within the same process as well as for other
>>>> processors/processes.  CAT and glibc work from different angles.  There
>>>> is no direct conflict between CAT and glibc.  At the moment, I am not
>>>> sure if a CAT-aware glibc will improve performance.
>>>
>>> What do you mean by "no direct conflict"?
>>>
>>> If glibc tunes its own algorithms to use 1/4 of L3, but CAT has only
>>> allocated 1/5 of L3 to that process, then glibc's algorithms, whose
>>> intent was to use a small amount of L3, are now using *more* L3 than
>>> the entire process has, and that could impact performance?
>>
>> Cache sizes are only used for instruction selection in
>> string/memory functions, which don't use the cache directly.
>> Glibc has NO control whatsoever over how much cache
>> string/memory functions use.
>
> Even if glibc doesn't use cache directly, it's making algorithm choices
> based on those sizes and tunings.
>
> For example:
>
> sysdeps/x86/cacheinfo.c
>
> 765   /* The large memcpy micro benchmark in glibc shows that 6 times of
> 766      shared cache size is the approximate value above which non-temporal
> 767      store becomes faster.  */
> 768   __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6;
>
> If, for example, CAT changed the shared L3 cache size to be smaller, would
> it invalidate the above static tuning?

CAT guarantees a lower limit on the cache available to a process, and
glibc makes sure that a process doesn't use too much of it in
string/memory functions.  I don't see a direct conflict there.
  

Patch

From 5f7d26b9f1d85819700150ea9ee97f46d048ca54 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 13 May 2016 13:26:37 -0700
Subject: [PATCH] Count number of logical processors sharing L2 cache

For Intel processors, when there are both L2 and L3 caches, the SMT
level type should be used to count the number of available logical
processors sharing the L2 cache.  If there is only an L2 cache, the
core level type should be used to count the number of available
logical processors sharing the L2 cache.  The number of available
logical processors sharing the L2 cache should be used for
non-inclusive L2 and L3 caches.

	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of
	available logical processors with SMT level type sharing L2
	cache for Intel processors.
---
 sysdeps/x86/cacheinfo.c | 156 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 120 insertions(+), 36 deletions(-)

diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 020d3fd..c357bd4 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -499,11 +499,24 @@  init_cacheinfo (void)
       level  = 3;
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, max_cpuid);
 
+      /* Number of logical processors sharing L2 cache.  */
+      int threads_l2;
+
+      /* Number of logical processors sharing L3 cache.  */
+      int threads_l3;
+
       if (shared <= 0)
 	{
 	  /* Try L2 otherwise.  */
 	  level  = 2;
 	  shared = core;
+	  threads_l2 = 0;
+	  threads_l3 = -1;
+	}
+      else
+	{
+	  threads_l2 = 0;
+	  threads_l3 = 0;
 	}
 
       /* A value of 0 for the HTT bit indicates there is only a single
@@ -519,7 +532,8 @@  init_cacheinfo (void)
 
 	      int i = 0;
 
-	      /* Query until desired cache level is enumerated.  */
+	      /* Query until cache level 2 and 3 are enumerated.  */
+	      int check = 0x1 | (threads_l3 == 0) << 1;
 	      do
 		{
 		  __cpuid_count (4, i++, eax, ebx, ecx, edx);
@@ -530,24 +544,53 @@  init_cacheinfo (void)
 		     assume there is no such information.  */
 		  if ((eax & 0x1f) == 0)
 		    goto intel_bug_no_cache_info;
-		}
-	      while (((eax >> 5) & 0x7) != level);
 
-	      /* Check if cache is inclusive of lower cache levels.  */
-	      inclusive_cache = (edx & 0x2) != 0;
+		  switch ((eax >> 5) & 0x7)
+		    {
+		    default:
+		      break;
+		    case 2:
+		      if ((check & 0x1))
+			{
+			  /* Get maximum number of logical processors
+			     sharing L2 cache.  */
+			  threads_l2 = (eax >> 14) & 0x3ff;
+			  check &= ~0x1;
+			}
+		      break;
+		    case 3:
+		      if ((check & (0x1 << 1)))
+			{
+			  /* Get maximum number of logical processors
+			     sharing L3 cache.  */
+			  threads_l3 = (eax >> 14) & 0x3ff;
 
-	      threads = (eax >> 14) & 0x3ff;
+			  /* Check if L2 and L3 caches are inclusive.  */
+			  inclusive_cache = (edx & 0x2) != 0;
+			  check &= ~(0x1 << 1);
+			}
+		      break;
+		    }
+		}
+	      while (check);
 
-	      /* If max_cpuid >= 11, THREADS is the maximum number of
-		 addressable IDs for logical processors sharing the
-		 cache, instead of the maximum number of threads
+	      /* If max_cpuid >= 11, THREADS_L2/THREADS_L3 are the maximum
+		 numbers of addressable IDs for logical processors sharing
+		 the cache, instead of the maximum number of threads
 		 sharing the cache.  */
-	      if (threads && max_cpuid >= 11)
+	      if (max_cpuid >= 11)
 		{
 		  /* Find the number of logical processors shipped in
 		     one core and apply count mask.  */
 		  i = 0;
-		  while (1)
+
+		  /* Count SMT only if there is L3 cache.  Always count
+		     core if there is no L3 cache.  */
+		  int count = ((threads_l2 > 0 && level == 3)
+			       | ((threads_l3 > 0
+				   || (threads_l2 > 0 && level == 2)) << 1));
+
+		  while (count)
 		    {
 		      __cpuid_count (11, i++, eax, ebx, ecx, edx);
 
@@ -555,38 +598,75 @@  init_cacheinfo (void)
 		      int type = ecx & 0xff00;
 		      if (shipped == 0 || type == 0)
 			break;
+		      else if (type == 0x100)
+			{
+			  /* Count SMT.  */
+			  if ((count & 0x1))
+			    {
+			      int count_mask;
+
+			      /* Compute count mask.  */
+			      asm ("bsr %1, %0"
+				   : "=r" (count_mask) : "g" (threads_l2));
+			      count_mask = ~(-1 << (count_mask + 1));
+			      threads_l2 = (shipped - 1) & count_mask;
+			      count &= ~0x1;
+			    }
+			}
 		      else if (type == 0x200)
 			{
-			  int count_mask;
-
-			  /* Compute count mask.  */
-			  asm ("bsr %1, %0"
-			       : "=r" (count_mask) : "g" (threads));
-			  count_mask = ~(-1 << (count_mask + 1));
-			  threads = (shipped - 1) & count_mask;
-			  break;
+			  /* Count core.  */
+			  if ((count & (0x1 << 1)))
+			    {
+			      int count_mask;
+			      int threads_core
+				= (level == 2 ? threads_l2 : threads_l3);
+
+			      /* Compute count mask.  */
+			      asm ("bsr %1, %0"
+				   : "=r" (count_mask) : "g" (threads_core));
+			      count_mask = ~(-1 << (count_mask + 1));
+			      threads_core = (shipped - 1) & count_mask;
+			      if (level == 2)
+				threads_l2 = threads_core;
+			      else
+				threads_l3 = threads_core;
+			      count &= ~(0x1 << 1);
+			    }
 			}
 		    }
 		}
-	      threads += 1;
-	      if (threads > 2 && level == 2 && family == 6)
+	      if (threads_l2 > 0)
+		threads_l2 += 1;
+	      if (threads_l3 > 0)
+		threads_l3 += 1;
+	      if (level == 2)
 		{
-		  switch (model)
+		  if (threads_l2)
 		    {
-		    case 0x57:
-		      /* Knights Landing has L2 cache shared by 2 cores.  */
-		    case 0x37:
-		    case 0x4a:
-		    case 0x4d:
-		    case 0x5a:
-		    case 0x5d:
-		      /* Silvermont has L2 cache shared by 2 cores.  */
-		      threads = 2;
-		      break;
-		    default:
-		      break;
+		      threads = threads_l2;
+		      if (threads > 2 && family == 6)
+			{
+			  switch (model)
+			    {
+			    case 0x57:
+			      /* Knights Landing has L2 cache shared by 2 cores.  */
+			    case 0x37:
+			    case 0x4a:
+			    case 0x4d:
+			    case 0x5a:
+			    case 0x5d:
+			      /* Silvermont has L2 cache shared by 2 cores.  */
+			      threads = 2;
+			      break;
+			    default:
+			      break;
+			    }
+			}
 		    }
 		}
+	      else if (threads_l3)
+		threads = threads_l3;
 	    }
 	  else
 	    {
@@ -606,8 +686,12 @@  intel_bug_no_cache_info:
 	}
 
       /* Account for non-inclusive L2 and L3 caches.  */
-      if (level == 3 && !inclusive_cache)
-	shared += core;
+      if (!inclusive_cache)
+	{
+	  if (threads_l2 > 0)
+	    core /= threads_l2;
+	  shared += core;
+	}
     }
   /* This spells out "AuthenticAMD".  */
   else if (is_amd)
-- 
2.5.5