diff mbox series

[v3] Reversing calculation of __x86_shared_non_temporal_threshold

Message ID 1601072475-22682-1-git-send-email-patrick.mcgehearty@oracle.com
State New
Series [v3] Reversing calculation of __x86_shared_non_temporal_threshold

Commit Message

Patrick McGehearty Sept. 25, 2020, 10:21 p.m. UTC
The __x86_shared_non_temporal_threshold determines when memcpy on x86
uses non_temporal stores to avoid pushing other data out of the last
level cache.

This patch proposes to revert the calculation change made by H.J. Lu's
patch of June 2, 2017.

H.J. Lu's patch selected a threshold suitable for a single thread
getting maximum performance. It was tuned using the single threaded
large memcpy micro benchmark on an 8 core processor. That change
moved the threshold from using 3/4 of one thread's share of the
cache to using 3/4 of the entire cache of a multi-threaded system
before switching to non-temporal stores. Multi-threaded systems with
more than a few threads are server-class and typically have many
active threads. If one thread consumes 3/4 of the available cache for
all threads, it will cause other active threads to have data removed
from the cache. Two examples show the range of the effect. John
McCalpin's widely parallel Stream benchmark, which runs in parallel
and fetches data sequentially, saw a 20% slowdown with this patch on
an internal system test of 128 threads. This regression was discovered
when comparing OL8 performance to OL7.  An example that compares
normal stores to non-temporal stores may be found at
https://vgatherps.github.io/2018-09-02-nontemporal/.  A simple test
there shows a 4x to 5x slowdown due to a failure to use non-temporal
stores. These performance losses are most likely to occur when the
system load is heaviest and good performance is critical.

The tunable x86_non_temporal_threshold can be used to override the
default for the knowledgeable user who really wants maximum cache
allocation to a single thread in a multi-threaded system.
The manual entry for the tunable has been expanded to provide
more information about its purpose.

	modified: sysdeps/x86/cacheinfo.c
	modified: manual/tunables.texi
---
 manual/tunables.texi    |  6 +++++-
 sysdeps/x86/cacheinfo.c | 16 +++++++++++-----
 2 files changed, 16 insertions(+), 6 deletions(-)

Comments

H.J. Lu Sept. 25, 2020, 10:26 p.m. UTC | #1
On Fri, Sep 25, 2020 at 3:21 PM Patrick McGehearty via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> uses non_temporal stores to avoid pushing other data out of the last
> level cache.
>
> This patch proposes to revert the calculation change made by H.J. Lu's
> patch of June 2, 2017.
>
> H.J. Lu's patch selected a threshold suitable for a single thread
> getting maximum performance. It was tuned using the single threaded
> large memcpy micro benchmark on an 8 core processor. That change
> moved the threshold from using 3/4 of one thread's share of the
> cache to using 3/4 of the entire cache of a multi-threaded system
> before switching to non-temporal stores. Multi-threaded systems with
> more than a few threads are server-class and typically have many
> active threads. If one thread consumes 3/4 of the available cache for
> all threads, it will cause other active threads to have data removed
> from the cache. Two examples show the range of the effect. John
> McCalpin's widely parallel Stream benchmark, which runs in parallel
> and fetches data sequentially, saw a 20% slowdown with this patch on
> an internal system test of 128 threads. This regression was discovered
> when comparing OL8 performance to OL7.  An example that compares
> normal stores to non-temporal stores may be found at
> https://vgatherps.github.io/2018-09-02-nontemporal/.  A simple test
> there shows a 4x to 5x slowdown due to a failure to use non-temporal
> stores. These performance losses are most likely to occur when the
> system load is heaviest and good performance is critical.
>
> The tunable x86_non_temporal_threshold can be used to override the
> default for the knowledgeable user who really wants maximum cache
> allocation to a single thread in a multi-threaded system.
> The manual entry for the tunable has been expanded to provide
> more information about its purpose.
>
>         modified: sysdeps/x86/cacheinfo.c
>         modified: manual/tunables.texi
> ---
>  manual/tunables.texi    |  6 +++++-
>  sysdeps/x86/cacheinfo.c | 16 +++++++++++-----
>  2 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index b6bb54d..94d4fbd 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>
>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> -to set threshold in bytes for non temporal store.
> +to set threshold in bytes for non temporal store. Non temporal stores
> +give a hint to the hardware to move data directly to memory without
> +displacing other data from the cache. This tunable is used by some
> +platforms to determine when to use non temporal stores in operations
> +like memmove and memcpy.
>
>  This tunable is specific to i386 and x86-64.
>  @end deftp
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index b9444dd..42b468d 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -778,14 +778,20 @@ intel_bug_no_cache_info:
>        __x86_shared_cache_size = shared;
>      }
>
> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> -     shared cache size is the approximate value above which non-temporal
> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> -     total shared cache size.  */
> +  /* The default setting for the non_temporal threshold is 3/4 of one
> +     thread's share of the chip's cache. For most Intel and AMD processors
> +     with an initial release date between 2017 and 2020, a thread's typical
> +     share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
> +     threshold leaves 125 KBytes to 500 KBytes of the thread's data
> +     in cache after a maximum temporal copy, which will maintain
> +     in cache a reasonable portion of the thread's stack and other
> +     active data. If the threshold is set higher than one thread's
> +     share of the cache, it has a substantial risk of negatively
> +     impacting the performance of other threads running on the chip. */
>    __x86_shared_non_temporal_threshold
>      = (cpu_features->non_temporal_threshold != 0
>         ? cpu_features->non_temporal_threshold
> -       : __x86_shared_cache_size * threads * 3 / 4);
> +       : __x86_shared_cache_size * 3 / 4);
>  }
>
>  #endif

LGTM.

Thanks.
Carlos O'Donell Sept. 27, 2020, 1:54 p.m. UTC | #2
On 9/25/20 6:21 PM, Patrick McGehearty via Libc-alpha wrote:
> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> uses non_temporal stores to avoid pushing other data out of the last
> level cache.
> 
> This patch proposes to revert the calculation change made by H.J. Lu's
> patch of June 2, 2017.
> 
> H.J. Lu's patch selected a threshold suitable for a single thread
> getting maximum performance. It was tuned using the single threaded
> large memcpy micro benchmark on an 8 core processor. That change
> moved the threshold from using 3/4 of one thread's share of the
> cache to using 3/4 of the entire cache of a multi-threaded system
> before switching to non-temporal stores. Multi-threaded systems with
> more than a few threads are server-class and typically have many
> active threads. If one thread consumes 3/4 of the available cache for
> all threads, it will cause other active threads to have data removed
> from the cache. Two examples show the range of the effect. John
> McCalpin's widely parallel Stream benchmark, which runs in parallel
> and fetches data sequentially, saw a 20% slowdown with this patch on
> an internal system test of 128 threads. This regression was discovered
> when comparing OL8 performance to OL7.  An example that compares
> normal stores to non-temporal stores may be found at
> https://vgatherps.github.io/2018-09-02-nontemporal/.  A simple test
> there shows a 4x to 5x slowdown due to a failure to use non-temporal
> stores. These performance losses are most likely to occur when the
> system load is heaviest and good performance is critical.
> 
> The tunable x86_non_temporal_threshold can be used to override the
> default for the knowledgeable user who really wants maximum cache
> allocation to a single thread in a multi-threaded system.
> The manual entry for the tunable has been expanded to provide
> more information about its purpose.

Patrick,

Thank you for doing this work, and for all of the comments you made
downthread on the original posting.

I agree it is very easy to lose sight of the bigger "up and out"
picture of development when all you do is look at the core C library
performance for one process.

Your shared cautionary tales sparked various discussions within the
platform tools team here at Red Hat :-)

There is no silver bullet here, and the microbenchmarks in glibc are
there to give us a starting point for a discussion.

I'm curious to know if you think there is some kind of balancing
microbenchmark we could write to show the effects of process-to-process
optimizations?

I'm happy if we all agree that the kind of "adjustments" you made
today will only be derived from an adaptive process involving customers,
applications, modern hardware, engineers, and the mixing of all of them
together to make such adjustments.

Thank you again.
Florian Weimer Sept. 28, 2020, 12:55 p.m. UTC | #3
* H. J. Lu via Libc-alpha:

> On Fri, Sep 25, 2020 at 3:21 PM Patrick McGehearty via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>> uses non_temporal stores to avoid pushing other data out of the last
>> level cache.

> LGTM.

Patrick, do you need help with committing this?

Thanks,
Florian
Patrick McGehearty Oct. 1, 2020, 4:04 p.m. UTC | #4
The following is a 'top of my head' response to Carlos's request:

"I'm curious to know if you think there is some kind of balancing
microbenchmark we could write to show the effects of process-to-process
optimizations?"

My comments that follow are not restricted to x86-only systems. For a
general glibc test, I try to think about supporting the range of
platforms glibc runs on.

I can imagine writing a "cache sensitive" single threaded micro benchmark to
combine with a single threaded "large memcpy" benchmark. Run the appropriate
number of copies of each, add some ramp-up, ramp-down runs, then
report the throughput of each during the middle runs. For a particular
platform, it would be necessary to know the cache architecture to
be sure the memcpy operations and cache sensitive operations are large
enough to have potential to exhaust the available cache space.
That seems a doable exercise, although it would only test memcpy
stress on other caches, not any other glibc component's excess use of
memory.

A challenge would be to keep the total test time "short" if it is to
be added to the standard test suite to run on a regular basis. Test
time will tend to grow on larger systems with more threads and larger
caches.

I mention the "only tests memcpy" because I saw some recent
performance counter data for SPECcpu2017 that suggests tuning some
parameters in some malloc implementations may have meaningful cache
performance effects for some applications. Measuring those effects
would require replacing the "large memcpy" micro-test with some sort
of malloc micro-test, but could use the same infrastructure.

After developing the test for a specific platform, it may be a bigger
challenge to make it adaptive to "all platforms" or even "a variety of
common platforms". HW methods and details of cache sharing vary widely
across the range of systems that glibc supports. And new
configurations are likely to be invented because that's the nature of
our industry.

1) Number of threads on systems under test varies widely.
Today, desktops tend to be in the range of 2 to 8 threads on a single
chip. Servers tend to be 24 to 128 threads per chip, and more on
multi-chip systems. With some experimenting with tests on varying
system sizes, this value can probably be parameterized, but as noted,
test times are likely to be higher for larger systems.

2) Hyperthreading complicates testing. Two threads sharing a core
and its L1/L2 caches will have different interactions than two
threads that only share L3 cache. A diagnostic test for regressions
of L3 cache sharing would get many false positives when run
on two threads that share L2 cache. Platforms also differ in which
thread numbers share L2 cache.
For example, assume we have 8 threads with 2 threads per core.
On some platforms, the threads that share cores and L2 cache might
be numbered (0,1), (2,3), (4,5), (6,7). On other platforms, they might
be numbered (0,4), (1,5), (2,6), (3,7). During early system bringup
testing, I've encountered both cases. I don't know of any standard
system call that reports this information. It likely will require a
build or run time parameter setting, with the more common structure as
the default, and a way to deal with alternative layouts. Also, some
platforms are highly multi-threaded (some SPARC chips have offered
8-way hyperthreading, for example).

A first round to development might just ignore hyperthreading and
accept more noisy test results as test threads will sometimes share
cores and sometimes be spread apart.

3) Structure of cache sharing varies widely with different
architectures and different platforms within an architecture. Size
also varies. Modern chips have significant "per core" or "per core
set" L1 and L2 caches. A recent platform I read about had a large L2
cache per core that was not shared by other threads. The same platform
had a sizable L3 cache shared by all threads on a chip. Some chips
share L2 cache among two or four threads, or even more.
Both the size of the L2 cache and its size relative to the shared L3
cache would make a difference in how to construct a diagnostic
L3 cache sharing regression test.

4) Behavior in a VM environment? Some virtual machine environments
hide the ability to find out specific HW details. They may also
present other challenges. For initial development, provide a way to
set the key HW parameters manually.

5) Other issues?  The above are just what immediately comes to mind.
I expect I've overlooked some challenges that may be best discovered
during development and testing across a range of platforms.

Having said that, a preliminary test could be developed by making
(and documenting) as many simplifying assumptions as necessary
to start. Then address each simplifying assumption in turn
according to its difficulty. I would find it to be an interesting project,
but I won't volunteer for it at the current time due to my existing
work backlog. I'd be happy to provide review & morale support to
someone else's efforts in this direction.

- patrick


On 9/27/2020 8:54 AM, Carlos O'Donell wrote:
> On 9/25/20 6:21 PM, Patrick McGehearty via Libc-alpha wrote:
>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>> uses non_temporal stores to avoid pushing other data out of the last
>> level cache.
>>
>> This patch proposes to revert the calculation change made by H.J. Lu's
>> patch of June 2, 2017.
>>
>> H.J. Lu's patch selected a threshold suitable for a single thread
>> getting maximum performance. It was tuned using the single threaded
>> large memcpy micro benchmark on an 8 core processor. That change
>> moved the threshold from using 3/4 of one thread's share of the
>> cache to using 3/4 of the entire cache of a multi-threaded system
>> before switching to non-temporal stores. Multi-threaded systems with
>> more than a few threads are server-class and typically have many
>> active threads. If one thread consumes 3/4 of the available cache for
>> all threads, it will cause other active threads to have data removed
>> from the cache. Two examples show the range of the effect. John
>> McCalpin's widely parallel Stream benchmark, which runs in parallel
>> and fetches data sequentially, saw a 20% slowdown with this patch on
>> an internal system test of 128 threads. This regression was discovered
>> when comparing OL8 performance to OL7.  An example that compares
>> normal stores to non-temporal stores may be found at
>> https://vgatherps.github.io/2018-09-02-nontemporal/.  A simple test
>> there shows a 4x to 5x slowdown due to a failure to use non-temporal
>> stores. These performance losses are most likely to occur when the
>> system load is heaviest and good performance is critical.
>>
>> The tunable x86_non_temporal_threshold can be used to override the
>> default for the knowledgeable user who really wants maximum cache
>> allocation to a single thread in a multi-threaded system.
>> The manual entry for the tunable has been expanded to provide
>> more information about its purpose.
> Patrick,
>
> Thank you for doing this work, and for all of the comments you made
> downthread on the original posting.
>
> I agree it is very easy to lose sight of the bigger "up and out"
> picture of development when all you do is look at the core C library
> performance for one process.
>
> Your shared cautionary tales sparked various discussions within the
> platform tools team here at Red Hat :-)
>
> There is no silver bullet here, and the microbenchmarks in glibc are
> there to give us a starting point for a discussion.
>
> I'm curious to know if you think there is some kind of balancing
> microbenchmark we could write to show the effects of process-to-process
> optimizations?
>
> I'm happy if we all agree that the kind of "adjustments" you made
> today will only be derived from an adaptive process involving customers,
> applications, modern hardware, engineers, and the mixing of all of them
> together to make such adjustments.
>
> Thank you again.
>
Carlos O'Donell Oct. 1, 2020, 9:02 p.m. UTC | #5
On 10/1/20 12:04 PM, Patrick McGehearty wrote:
> Having said that, a preliminary test could be developed by making
> (and documenting) as many simplifying assumptions as necessary
> to start. Then address each simplifying assumption in turn
> according to their difficulty. I would find it to be an interesting project,
> but I won't volunteer for it at the current time due to my existing
> work backlog. I'd be happy to provide review & morale support to
> someone else's efforts in this direction.

Patrick,

Thank you for those notes. I've bookmarked this to refer back to them as
we continue to develop more microbenchmarks.

I think we *can* write more generic microbenchmarks that can range across
the implementation-specific details of the hardware to verify where the
performance cliff lies. Such parameterization of the problem space
seems doable. I think we need some "pin process to cpu" framework and
probably some "what does this topology look like" framework, so we'll
need access to some of the NUMA topology libraries in the benchmark to
suss out at runtime how we lay out and pin things, and provide overrides,
e.g. spit out JSON of our understanding or take JSON as input for what
the developer tells you to do for layout.

I'm OK with `make bench` taking a long time because it is usually
run under direct supervision by someone on a stable, clean, unloaded
system. I am not naive enough to think we're ever going to reliably
automate any of this, even with a smaller benchset.

Thanks again.

Patch

diff --git a/manual/tunables.texi b/manual/tunables.texi
index b6bb54d..94d4fbd 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -364,7 +364,11 @@  set shared cache size in bytes for use in memory and string routines.
 
 @deftp Tunable glibc.tune.x86_non_temporal_threshold
 The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
-to set threshold in bytes for non temporal store.
+to set threshold in bytes for non temporal store. Non temporal stores
+give a hint to the hardware to move data directly to memory without
+displacing other data from the cache. This tunable is used by some
+platforms to determine when to use non temporal stores in operations
+like memmove and memcpy.
 
 This tunable is specific to i386 and x86-64.
 @end deftp
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index b9444dd..42b468d 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -778,14 +778,20 @@  intel_bug_no_cache_info:
       __x86_shared_cache_size = shared;
     }
 
-  /* The large memcpy micro benchmark in glibc shows that 6 times of
-     shared cache size is the approximate value above which non-temporal
-     store becomes faster on a 8-core processor.  This is the 3/4 of the
-     total shared cache size.  */
+  /* The default setting for the non_temporal threshold is 3/4 of one
+     thread's share of the chip's cache. For most Intel and AMD processors
+     with an initial release date between 2017 and 2020, a thread's typical
+     share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
+     threshold leaves 125 KBytes to 500 KBytes of the thread's data
+     in cache after a maximum temporal copy, which will maintain
+     in cache a reasonable portion of the thread's stack and other
+     active data. If the threshold is set higher than one thread's
+     share of the cache, it has a substantial risk of negatively
+     impacting the performance of other threads running on the chip. */
   __x86_shared_non_temporal_threshold
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
+       : __x86_shared_cache_size * 3 / 4);
 }
 
 #endif