[v2] Reversing calculation of __x86_shared_non_temporal_threshold

Message ID: 1600891781-9272-1-git-send-email-patrick.mcgehearty@oracle.com
State: Committed
Series: [v2] Reversing calculation of __x86_shared_non_temporal_threshold

Commit Message

Patrick McGehearty Sept. 23, 2020, 8:09 p.m. UTC
The __x86_shared_non_temporal_threshold determines when memcpy on x86
uses non-temporal stores to avoid pushing other data out of the last
level cache.

This patch proposes to revert the calculation change made by H.J. Lu's
patch of June 2, 2017.

H.J. Lu's patch selected a threshold suitable for a single thread
getting maximum performance. It was tuned using the single-threaded
large-memcpy micro benchmark on an 8-core processor. That change
raised the threshold from 3/4 of one thread's share of the cache to
3/4 of the entire shared cache of a multi-threaded system before
switching to non-temporal stores. Multi-threaded systems with more
than a few threads are server-class and typically have many active
threads. If one thread consumes 3/4 of the available cache for all
threads, it will cause other active threads to have data evicted from
the cache. Two examples show the range of the effect. John McCalpin's
widely used parallel Stream benchmark, which runs in parallel and
fetches data sequentially, saw a 20% slowdown from that change in an
internal test on a 128-thread system. This regression was discovered
when comparing OL8 performance to OL7.  An example that compares
normal stores to non-temporal stores may be found at
https://vgatherps.github.io/2018-09-02-nontemporal/.  A simple test
there shows a 4x to 5x slowdown when non-temporal stores are needed
but not used. These performance losses are most likely to occur when
the system load is heaviest and good performance is critical.
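
For background, here is a minimal sketch of a non-temporal copy loop
using SSE2 intrinsics. This is our own illustration of the mechanism,
not glibc's implementation; the function name and the 16-byte
alignment and size assumptions are ours:

  #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */
  #include <stddef.h>

  /* Copy 16-byte-aligned data whose length is a multiple of 16,
     using non-temporal (streaming) stores, which bypass the cache
     instead of displacing its contents.  */
  static void
  copy_nontemporal (void *dst, const void *src, size_t bytes)
  {
    __m128i *d = (__m128i *) dst;
    const __m128i *s = (const __m128i *) src;
    for (size_t i = 0; i < bytes / 16; i++)
      _mm_stream_si128 (d + i, _mm_load_si128 (s + i));
    _mm_sfence ();   /* order the streaming stores before later accesses */
  }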

The tunable x86_non_temporal_threshold can be used to override the
default for the knowledgeable user who really wants maximum cache
allocation to a single thread in a multi-threaded system.
The manual entry for the tunable has been expanded to provide
more information about its purpose.
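
For example, such a user could raise the threshold for one process as
follows (the 32 MiB value is arbitrary, chosen only for illustration):

  GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=33554432 ./my_app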

	modified: sysdeps/x86/cacheinfo.c
	modified: manual/tunables.texi
---
 manual/tunables.texi    |  6 +++++-
 sysdeps/x86/cacheinfo.c | 12 +++++++-----
 2 files changed, 12 insertions(+), 6 deletions(-)
  

Comments

H.J. Lu Sept. 23, 2020, 8:23 p.m. UTC | #1
On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> [...]
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index b6bb54d..94d4fbd 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>
>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> -to set threshold in bytes for non temporal store.
> +to set the threshold in bytes for non-temporal stores. Non-temporal
> +stores give a hint to the hardware to move data directly to memory
> +without displacing other data from the cache. This tunable is used
> +by some platforms to determine when to use non-temporal stores in
> +operations like memmove and memcpy.
>
>  This tunable is specific to i386 and x86-64.
>  @end deftp
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index b9444dd..c6767d9 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>        __x86_shared_cache_size = shared;
>      }
>
> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> -     shared cache size is the approximate value above which non-temporal
> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> -     total shared cache size.  */
> +  /* The default setting for the non_temporal threshold is 3/4
> +     of one thread's share of the chip's cache. While higher
> +     single-thread performance may be observed with a higher
> +     threshold, having a single thread use more than its share
> +     of the cache will negatively impact the performance of
> +     other threads running on the chip.  */
>    __x86_shared_non_temporal_threshold
>      = (cpu_features->non_temporal_threshold != 0
>         ? cpu_features->non_temporal_threshold
> -       : __x86_shared_cache_size * threads * 3 / 4);
> +       : __x86_shared_cache_size * 3 / 4);
>  }
>

Can we tune it with the number of threads and/or total cache
size?
  
Patrick McGehearty Sept. 23, 2020, 8:57 p.m. UTC | #2
On 9/23/2020 3:23 PM, H.J. Lu wrote:
> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>> [...]
> Can we tune it with the number of threads and/or total cache
> size?
>

When you say "total cache size", is that different from 
shared_cache_size * threads?

I see a fundamental conflict of optimization goals:
1) Provide best single thread performance (current code)
2) Provide best overall system performance under full load (proposed patch)
I don't know of any way to have default behavior meet both goals without 
knowledge
of the system size/usage/requirements.

Consider a hypothetical single-chip system with 64 threads and 128 MB
of total cache on the chip. That won't be uncommon in the coming
years on server-class systems, especially in large databases or HPC
environments (think vision processing or weather modeling, for
example). Suppose a single app owns the whole chip and runs a
multi-threaded application that needs to memcpy a really large block
of data when one phase of computation finishes, before moving to the
next phase. A common practice would be to have 64 parallel calls to
memcpy, as sketched below. The Stream benchmark demonstrates with
OpenMP that current compilers handle that with no trouble.
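
A minimal sketch of that pattern, as our own illustration under
stated assumptions (the function name and the even-split strategy are
not taken from the benchmark):

  #include <string.h>
  #include <omp.h>

  /* Split one large copy across OpenMP threads so that each thread
     issues its own memcpy on a contiguous chunk.  */
  static void
  parallel_memcpy (char *dst, const char *src, size_t total)
  {
  #pragma omp parallel
    {
      int nthreads = omp_get_num_threads ();
      int t = omp_get_thread_num ();
      size_t chunk = total / nthreads;
      size_t off = (size_t) t * chunk;
      /* The last thread also picks up any remainder.  */
      size_t len = (t == nthreads - 1) ? total - off : chunk;
      memcpy (dst + off, src + off, len);
    }
  }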

In the example, the per-thread share of the cache is 2 MB and the
proposed formula will set the threshold at 1.5 MB. If the total copy
size is 96 MB or less, all threads comfortably fit in cache. If the
total copy size is over that, then non-temporal stores are used and
all is well there too.

The current formula would set the threshold at 96 MB for each thread.
Only when the total copy size reached 64 * 96 MB = 6 GB would
non-temporal stores be used. We'd like to switch to non-temporal
stores much sooner, as we will be thrashing all the threads' caches.
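
Spelling out the arithmetic for the hypothetical chip (the values
come from the example above, not from any real system):

  #include <stdio.h>

  int
  main (void)
  {
    /* Hypothetical 64-thread chip with 128 MB of shared cache.  */
    long long total_cache = 128LL * 1024 * 1024;
    long long threads = 64;
    long long per_thread = total_cache / threads;        /* 2 MB */

    long long proposed = per_thread * 3 / 4;             /* 1.5 MB */
    long long current  = per_thread * threads * 3 / 4;   /* 96 MB */

    printf ("proposed threshold: %lld bytes\n", proposed);
    printf ("current  threshold: %lld bytes\n", current);
    /* With 64 threads copying at once, the current formula favors
       non-temporal stores only past 64 * 96 MB = 6 GB in total.  */
    return 0;
  }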

In practical terms, I've had access to typical memcpy copy lengths
for a variety of commercial applications while studying memcpy on
Solaris over the years. The vast majority of copies are for 64 KB or
less. Most modern chips have much more than 64 KB of cache per
thread, allowing in-cache copies for the common case, even without
borrowing cache from other threads. The occasional really large
copies tend to happen when an application is passing a block of data
to prepare for a new phase of computation or as a shared-memory
communication to another thread. In those cases, having the data
remain in cache is usually not relevant, and using non-temporal
stores even when they are not strictly required does not have a
negative effect on performance.

A downside of tuning for a single thread comes in cloud computing
environments, where having neighboring threads be cache hogs, even if
relatively isolated in virtual machines, is a "bad thing" for stable
system performance. Whatever we can do to provide consistent,
reasonable performance, whatever the neighboring threads might be
doing, is a "good thing".

- patrick
  
H.J. Lu Sept. 23, 2020, 9:37 p.m. UTC | #3
On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
> [...]
>

Have you tried the full __x86_shared_cache_size instead of 3 / 4?
  
Patrick McGehearty Sept. 23, 2020, 10:39 p.m. UTC | #4
On 9/23/2020 4:37 PM, H.J. Lu wrote:
> [...]
> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
>

I have not tested larger thresholds; I'd be more comfortable with a
smaller one. We could construct specific tests to show either an
advantage or a disadvantage in shifting from 3/4 to all of the cache,
depending on what data access pattern is used between memcpy
operations.

I consider pushing the limit on cache usage to be a risky approach.
Few applications work on only a single block of data. If all threads
are doing a shared copy and they use all the available cache, then
after the memcpy returns, any other active data will have been pushed
out of the cache. That's likely to cost severe performance loss in
more cases than the modest performance gains for the few cases where
the application is only concerned with using the data that was just
copied.

Just to give a more detailed example where large copies are not
followed by using the data, consider garbage collection followed by
compaction. With a generational garbage collector, stable data that
is active and has survived several garbage collections is in an 'old'
region. It does not need to be copied. The current 'new' region is
full but has both referenced and unreferenced data. After the marking
phase, the individual elements of the referenced data are copied to
the base of the 'new' region. When complete, the rest of the 'new'
region becomes the new free pool. The total amount copied may far
exceed the processor cache. Then the application exits garbage
collection and resumes active use of mostly the stable data, with
some accesses to the just-moved new data and fresh allocations. If we
under-use non-temporal stores, we clear the cache and the whole
application runs slower than otherwise.

Individual memcpy benchmarks are useful for isolation testing and
comparing code patterns, but they can mislead about overall
application performance where cache abuse is possible. I fell into
that tarpit once while tuning memcpy for Solaris: my new, wonderfully
fast copy code (OK, maybe 5% faster for in-cache data) caused a major
customer application to run slower because the new code abused the
cache. I modified my code to use the new "in-cache fast copy" only
for copies smaller than a threshold (64 KB or 128 KB, if I remember
right) and all was well.
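
The shape of that fix, as a hedged sketch (the names, the 64 KB
threshold, and the stand-in bodies are ours; the real Solaris code
differed):

  #include <string.h>

  /* Assumed threshold, purely for illustration.  */
  enum { IN_CACHE_FAST_COPY_MAX = 64 * 1024 };

  /* Stand-ins so the sketch compiles; the real code used a
     cache-friendly loop and a conservative loop respectively.  */
  static void *in_cache_fast_copy (void *d, const void *s, size_t n)
  { return memcpy (d, s, n); }
  static void *conservative_copy (void *d, const void *s, size_t n)
  { return memcpy (d, s, n); }

  /* Use the fast in-cache path only below the size threshold.  */
  void *
  dispatch_copy (void *dst, const void *src, size_t len)
  {
    if (len < IN_CACHE_FAST_COPY_MAX)
      return in_cache_fast_copy (dst, src, len);
    return conservative_copy (dst, src, len);
  }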

- patrick
  
H.J. Lu Sept. 23, 2020, 11:13 p.m. UTC | #5
On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
> [...]

The new threshold can be substantially smaller with a large core
count.  Are you saying that even 3/4 may be too big?  Is there a
reasonable fixed threshold?
  
Patrick McGehearty Sept. 24, 2020, 9:47 p.m. UTC | #6
On 9/23/2020 6:13 PM, H.J. Lu wrote:
> [...]
> The new threshold can be substantially smaller with a large core
> count.  Are you saying that even 3/4 may be too big?  Is there a
> reasonable fixed threshold?
>

I don't have any evidence to say 3/4 is too big for typical
applications and environments. In 2012, the default for memcpy was
set to 1/2 of the shared_cache_size, which is the current default for
Oracle EL7 and Red Hat EL7.

Given the typically larger cache size per thread today than 8 years
ago, 3/4 may work out well, since the remaining 1/4 of today's larger
cache is often greater than 1/2 of yesteryear's smaller cache.
  
H.J. Lu Sept. 24, 2020, 9:54 p.m. UTC | #7
On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
>
> [... deeply quoted discussion trimmed ...]
>
> > The new threshold can be substantially smaller with large core count.
> > Are you saying that even 3 / 4 may be too big?  Is there a reasonable
> > fixed threshold?
> >
>
> I don't have any evidence to say 3/4 is too big for typical applications
> and environments. In 2012, the default for memcpy was set to 1/2 of the
> shared_cache_size, which is still the current default for Oracle el7 and
> Red Hat el7.
>
> Given the typically larger caches per thread today than 8 years ago, 3/4
> may work out well, since the remaining 1/4 of today's larger cache is
> often greater than 1/2 of yesteryear's smaller cache.
>

Please update the comment with your rationale for 3/4.  Don't use
"today" or "current"; use "2020" instead.

Thanks.
  
Patrick McGehearty Sept. 24, 2020, 11:22 p.m. UTC | #8
On 9/24/2020 4:54 PM, H.J. Lu wrote:
> [... deeply quoted discussion trimmed ...]
>
> Please update the comment with your rationale for 3/4.  Don't use
> "today" or "current"; use "2020" instead.
>
> Thanks.
>
I'm unsure about what needs to change in the comment, which does not
mention any dates currently. I'm assuming you are referring to the
following comment in cacheinfo.c:

   /* The default setting for the non_temporal threshold is 3/4
      of one thread's share of the chip's cache. While higher
      single thread performance may be observed with a higher
      threshold, having a single thread use more than its share
      of the cache will negatively impact the performance of
      other threads running on the chip. */

While I could add a comment on why 3/4 vs 1/2 is the best choice, I
don't have hard data to back it up. I'd be comfortable with either 3/4
or 1/2. I selected 3/4 as it was closer to the formula you chose in
2017 than to the formula you chose in 2012.

- patrick
  
H.J. Lu Sept. 24, 2020, 11:57 p.m. UTC | #9
On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
>
> [... deeply quoted discussion trimmed ...]
>
> > Please update the comment with your rationale for 3/4.  Don't use
> > "today" or "current"; use "2020" instead.
> >
> > Thanks.
> >
> I'm unsure about what needs to change in the comment, which does not
> mention any dates currently. I'm assuming you are referring to the
> following comment in cacheinfo.c:
>
>    /* The default setting for the non_temporal threshold is 3/4
>       of one thread's share of the chip's cache. While higher
>       single thread performance may be observed with a higher
>       threshold, having a single thread use more than its share
>       of the cache will negatively impact the performance of
>       other threads running on the chip. */
>
> While I could add a comment on why 3/4 vs 1/2 is the best choice, I
> don't have hard data to back it up. I'd be comfortable with either 3/4
> or 1/2. I selected 3/4 as it was closer to the formula you chose in
> 2017 than to the formula you chose in 2012.

The comment is for readers 5 years from now who may be wondering
where 3/4 came from.  Just add something close to what you have said above.
  
Patrick McGehearty Sept. 25, 2020, 8:53 p.m. UTC | #10
On 9/24/2020 6:57 PM, H.J. Lu wrote:
> [... deeply quoted discussion trimmed ...]
>
> The comment is for readers 5 years from now who may be wondering
> where 3/4 came from.  Just add something close to what you have said above.
>
Before I redo the commit and resubmit the whole patch, I thought I'd
present a revised comment for review. The value of 500 KB to 2 MB per
thread is based on a quick review of the Wikipedia entries for Intel and
AMD processors released since 2017. There may be a few outliers, but the
vast majority fit that range for L3 per thread. I tried to balance giving
a sense of the situation without diving too deeply into
application-specific details.


Comment in v2:
   /* The default setting for the non_temporal threshold is 3/4
      of one thread's share of the chip's cache. While higher
      single thread performance may be observed with a higher
      threshold, having a single thread use more than its share
      of the cache will negatively impact the performance of
      other threads running on the chip. */

Proposed comment for v3:
   /* The default setting for the non_temporal threshold is 3/4 of one
      thread's share of the chip's cache. For most Intel and AMD processors
      with an initial release date between 2017 and 2020, a thread's typical
      share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
      threshold leaves 125 KBytes to 500 KBytes of the thread's data
      in cache after a maximum temporal copy, which will maintain
      in cache a reasonable portion of the thread's stack and other
      active data. If the threshold is set higher than one thread's
      share of the cache, it has a substantial risk of negatively
      impacting the performance of other threads running on the chip. */
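
The endpoints in the proposed comment follow directly from the 3/4 ratio;
a minimal C sketch of the arithmetic (the 500 KB and 2 MB per-thread
shares are the figures cited above, not measurements):

  #include <stdio.h>

  int main (void)
  {
    /* Per-thread LLC shares typical of 2017-2020 parts, per the
       proposed comment.  */
    unsigned long shares[] = { 500UL * 1024, 2UL * 1024 * 1024 };

    for (int i = 0; i < 2; i++)
      {
        unsigned long share = shares[i];
        /* Copies up to this size keep using ordinary (temporal) stores.  */
        unsigned long threshold = share * 3 / 4;
        /* What remains of the thread's share after a copy of exactly
           threshold bytes: 1/4 of the share.  */
        unsigned long residual = share - threshold;
        printf ("share %4lu KB -> threshold %4lu KB, residual %3lu KB\n",
                share / 1024, threshold / 1024, residual / 1024);
      }
    return 0;
  }

This prints residuals of 125 KB and 512 KB, matching the 125 KBytes to
roughly 500 KBytes range given in the comment.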
  
H.J. Lu Sept. 25, 2020, 9:04 p.m. UTC | #11
On Fri, Sep 25, 2020 at 1:53 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
>
>
>
> On 9/24/2020 6:57 PM, H.J. Lu wrote:
> > On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty
> > <patrick.mcgehearty@oracle.com> wrote:
> >>
> >>
> >> On 9/24/2020 4:54 PM, H.J. Lu wrote:
> >>> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty
> >>> <patrick.mcgehearty@oracle.com> wrote:
> >>>>
> >>>> On 9/23/2020 6:13 PM, H.J. Lu wrote:
> >>>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
> >>>>> <patrick.mcgehearty@oracle.com> wrote:
> >>>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
> >>>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
> >>>>>>> <patrick.mcgehearty@oracle.com> wrote:
> >>>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
> >>>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> >>>>>>>>> <libc-alpha@sourceware.org> wrote:
> >>>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> >>>>>>>>>> uses non_temporal stores to avoid pushing other data out of the last
> >>>>>>>>>> level cache.
> >>>>>>>>>>
> >>>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
> >>>>>>>>>> patch of June 2, 2017.
> >>>>>>>>>>
> >>>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
> >>>>>>>>>> getting maximum performance. It was tuned using the single threaded
> >>>>>>>>>> large memcpy micro benchmark on an 8 core processor. The last change
> >>>>>>>>>> changes the threshold from using 3/4 of one thread's share of the
> >>>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system
> >>>>>>>>>> before switching to non-temporal stores. Multi-threaded systems with
> >>>>>>>>>> more than a few threads are server-class and typically have many
> >>>>>>>>>> active threads. If one thread consumes 3/4 of the available cache for
> >>>>>>>>>> all threads, it will cause other active threads to have data removed
> >>>>>>>>>> from the cache. Two examples show the range of the effect. John
> >>>>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel
> >>>>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on
> >>>>>>>>>> an internal system test of 128 threads. This regression was discovered
> >>>>>>>>>> when comparing OL8 performance to OL7.  An example that compares
> >>>>>>>>>> normal stores to non-temporal stores may be found at
> >>>>>>>>>> https://urldefense.com/v3/__https://vgatherps.github.io/2018-09-02-nontemporal/__;!!GqivPVa7Brio!IK1RH6wG0bg4U3NNMDpXf50VgsV9CFOEUaG0kGy6YYtq1G1Ca5VSz5szAxG0Zkiqdl8-IWc$ .  A simple test
> >>>>>>>>>> shows performance loss of 400 to 500% due to a failure to use
> >>>>>>>>>> nontemporal stores. These performance losses are most likely to occur
> >>>>>>>>>> when the system load is heaviest and good performance is critical.
> >>>>>>>>>>
> >>>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the
> >>>>>>>>>> default for the knowledgable user who really wants maximum cache
> >>>>>>>>>> allocation to a single thread in a multi-threaded system.
> >>>>>>>>>> The manual entry for the tunable has been expanded to provide
> >>>>>>>>>> more information about its purpose.
> >>>>>>>>>>
> >>>>>>>>>>              modified: sysdeps/x86/cacheinfo.c
> >>>>>>>>>>              modified: manual/tunables.texi
> >>>>>>>>>> ---
> >>>>>>>>>>       manual/tunables.texi    |  6 +++++-
> >>>>>>>>>>       sysdeps/x86/cacheinfo.c | 12 +++++++-----
> >>>>>>>>>>       2 files changed, 12 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> >>>>>>>>>> index b6bb54d..94d4fbd 100644
> >>>>>>>>>> --- a/manual/tunables.texi
> >>>>>>>>>> +++ b/manual/tunables.texi
> >>>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
> >>>>>>>>>>
> >>>>>>>>>>       @deftp Tunable glibc.tune.x86_non_temporal_threshold
> >>>>>>>>>>       The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> >>>>>>>>>> -to set threshold in bytes for non temporal store.
> >>>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores
> >>>>>>>>>> +give a hint to the hardware to move data directly to memory without
> >>>>>>>>>> +displacing other data from the cache. This tunable is used by some
> >>>>>>>>>> +platforms to determine when to use non temporal stores in operations
> >>>>>>>>>> +like memmove and memcpy.
> >>>>>>>>>>
> >>>>>>>>>>       This tunable is specific to i386 and x86-64.
> >>>>>>>>>>       @end deftp
> >>>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> index b9444dd..c6767d9 100644
> >>>>>>>>>> --- a/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
> >>>>>>>>>>             __x86_shared_cache_size = shared;
> >>>>>>>>>>           }
> >>>>>>>>>>
> >>>>>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> >>>>>>>>>> -     shared cache size is the approximate value above which non-temporal
> >>>>>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> >>>>>>>>>> -     total shared cache size.  */
> >>>>>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
> >>>>>>>>>> +     of one thread's share of the chip's cache. While higher
> >>>>>>>>>> +     single thread performance may be observed with a higher
> >>>>>>>>>> +     threshold, having a single thread use more than it's share
> >>>>>>>>>> +     of the cache will negatively impact the performance of
> >>>>>>>>>> +     other threads running on the chip. */
> >>>>>>>>>>         __x86_shared_non_temporal_threshold
> >>>>>>>>>>           = (cpu_features->non_temporal_threshold != 0
> >>>>>>>>>>              ? cpu_features->non_temporal_threshold
> >>>>>>>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
> >>>>>>>>>> +       : __x86_shared_cache_size * 3 / 4);
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>> Can we tune it with the number of threads and/or total cache
> >>>>>>>>> size?
> >>>>>>>>>
> >>>>>>>> When you say "total cache size", is that different from
> >>>>>>>> shared_cache_size * threads?
> >>>>>>>>
> >>>>>>>> I see a fundamental conflict of optimization goals:
> >>>>>>>> 1) Provide best single thread performance (current code)
> >>>>>>>> 2) Provide best overall system performance under full load (proposed patch)
> >>>>>>>> I don't know of any way to have default behavior meet both goals
> >>>>>>>> without knowledge of the system size/usage/requirements.
> >>>>>>>>
> >>>>>>>> Consider a hypothetical single-chip system with 64 threads and 128 MB
> >>>>>>>> of total cache on the chip. That won't be uncommon in the coming
> >>>>>>>> years on server-class systems, especially in large databases or HPC
> >>>>>>>> environments (think vision processing or weather modeling, for
> >>>>>>>> example). Suppose a single app owns the whole chip, runs
> >>>>>>>> multi-threaded, and needs to memcpy a really large block of data
> >>>>>>>> when one phase of computation finishes before moving to the next. A
> >>>>>>>> common practice would be to have 64 parallel calls to memcpy, as
> >>>>>>>> sketched below. The Stream benchmark demonstrates with OpenMP that
> >>>>>>>> current compilers handle that with no trouble.
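A minimal sketch of that pattern, assuming the hypothetical sizes from this
example (the 128 MB figure and buffer names are illustrative; compile with
-fopenmp):

#include <stdlib.h>
#include <string.h>

int
main (void)
{
  /* Hypothetical sizes from the example: 128 MB copied by 64 threads,
     so each thread's slice is 2 MB.  */
  size_t nthreads = 64;
  size_t total = 128UL * 1024 * 1024;
  size_t slice = total / nthreads;
  char *src = malloc (total);
  char *dst = malloc (total);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 1, total);

  /* 64 parallel memcpy calls; inside glibc, each call compares its
     2 MB length against __x86_shared_non_temporal_threshold.  */
#pragma omp parallel for num_threads (64)
  for (size_t i = 0; i < nthreads; i++)
    memcpy (dst + i * slice, src + i * slice, slice);

  free (src);
  free (dst);
  return 0;
}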
> >>>>>>>>
> >>>>>>>> In the example, the per-thread share of the cache is 2 MB and the
> >>>>>>>> proposed formula sets the threshold at 1.5 Mbytes. If the total copy
> >>>>>>>> size is 96 Mbytes or less, all threads comfortably fit in cache. If
> >>>>>>>> the total copy size is over that, then non-temporal stores are used
> >>>>>>>> and all is well there too.
> >>>>>>>>
> >>>>>>>> The current formula would set the threshold at 96 Mbytes for each
> >>>>>>>> thread. Only when the total copy size exceeded 64*96 Mbytes = 6
> >>>>>>>> GBytes would non-temporal stores be used. We'd like to switch to
> >>>>>>>> non-temporal stores much sooner, as we would otherwise be thrashing
> >>>>>>>> all the threads' caches. (A sketch of this arithmetic follows this
> >>>>>>>> message.)
> >>>>>>>>
> >>>>>>>> In practical terms, I've had access to typical memcpy copy lengths
> >>>>>>>> for a variety of commercial applications while studying memcpy on
> >>>>>>>> Solaris over the years. The vast majority of copies are for 64
> >>>>>>>> Kbytes or less. Most modern chips have much more than 64 Kbytes of
> >>>>>>>> cache per thread, allowing in-cache copies for the common case, even
> >>>>>>>> without borrowing cache from other threads. The occasional really
> >>>>>>>> large copies tend to occur when an application is passing a block of
> >>>>>>>> data to prepare for a new phase of computation or as a shared-memory
> >>>>>>>> communication to another thread. In these cases, having the data
> >>>>>>>> remain in cache is usually not relevant, and using non-temporal
> >>>>>>>> stores even when they are not strictly required does not have a
> >>>>>>>> negative effect on performance.
> >>>>>>>>
> >>>>>>>> A downside of tuning for a single thread shows up in cloud computing
> >>>>>>>> environments, where neighboring threads that hog the cache, even if
> >>>>>>>> relatively isolated in virtual machines, are a "bad thing" for
> >>>>>>>> stable system performance. Whatever we can do to provide consistent,
> >>>>>>>> reasonable performance regardless of what the neighboring threads
> >>>>>>>> might be doing is a "good thing".
> >>>>>>>>
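As a minimal sketch of the arithmetic in the example above (hypothetical
values; shared_cache_size stands in for glibc's per-thread
__x86_shared_cache_size, and the program is an illustration, not the glibc
code):

#include <stdio.h>

int
main (void)
{
  unsigned long threads = 64;
  unsigned long total_cache = 128UL * 1024 * 1024;         /* 128 MB chip */
  unsigned long shared_cache_size = total_cache / threads; /* 2 MB/thread */

  /* 2017 formula: 3/4 of the entire chip's cache, per thread.  */
  unsigned long old_threshold = shared_cache_size * threads * 3 / 4;

  /* Proposed formula: 3/4 of one thread's share.  */
  unsigned long new_threshold = shared_cache_size * 3 / 4;

  /* Prints 100663296 (96 MB) and 1572864 (1.5 MB).  With 64 threads
     copying at once, the old setting defers non-temporal stores until
     64 * 96 MB = 6 GB is being copied in total.  */
  printf ("old: %lu new: %lu\n", old_threshold, new_threshold);
  return 0;
}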
> >>>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
> >>>>>>>
> >>>>>> I have not tested larger thresholds. I'd be more comfortable with a
> >>>>>> smaller one.
> >>>>>> We could construct specific tests to show either an advantage or a
> >>>>>> disadvantage to shifting from 3/4 to all of the cache, depending on
> >>>>>> what data access pattern is used between memcpy operations.
> >>>>>>
> >>>>>> I consider pushing the limit on cache usage to be a risky approach.
> >>>>>> Few applications work on only a single block of data. If all threads
> >>>>>> are doing a shared copy and they use all the available cache, then
> >>>>>> after the memcpy returns, any other active data will have been pushed
> >>>>>> out of the cache. That's likely to cause severe performance loss in
> >>>>>> more cases than the modest performance gains in the few cases where
> >>>>>> the application is only concerned with using the data that was just
> >>>>>> copied.
> >>>>>>
> >>>>>> To give a more detailed example where large copies are not followed
> >>>>>> by use of the data, consider garbage collection followed by
> >>>>>> compaction. With a multi-age garbage collector, stable data that is
> >>>>>> active and has survived several garbage collections sits in an 'old'
> >>>>>> region. It does not need to be copied. The current 'new' region is
> >>>>>> full but holds both referenced and unreferenced data. After the
> >>>>>> marking phase, the individual elements of the referenced data are
> >>>>>> copied to the base of the 'new' region. When complete, the rest of
> >>>>>> the 'new' region becomes the new free pool. The total amount copied
> >>>>>> may far exceed the processor cache. Then the application exits
> >>>>>> garbage collection and resumes active use of mostly the stable data,
> >>>>>> with some accesses to the just-moved new data and fresh allocations.
> >>>>>> If we under-use non-temporal stores, we clear the cache and the whole
> >>>>>> application runs slower than otherwise.
> >>>>>>
> >>>>>> Individual memcpy benchmarks are useful for isolated testing and for
> >>>>>> comparing code patterns, but they can mislead about overall
> >>>>>> application performance when there is potential for cache abuse. I
> >>>>>> fell into that tarpit once while tuning memcpy for Solaris: my new,
> >>>>>> wonderfully fast copy code (ok, maybe 5% faster for in-cache data)
> >>>>>> caused a major customer application to run slower because it abused
> >>>>>> the cache. I modified my code to only use the new "in-cache fast
> >>>>>> copy" for copies smaller than a threshold (64 Kbytes or 128 Kbytes,
> >>>>>> if I remember right) and all was well.
> >>>>>>
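To make that kind of threshold gate concrete, here is a hedged sketch using
SSE2 streaming-store intrinsics; it is an illustration of the technique, not
glibc's implementation, and the names copy_sketch and non_temporal_threshold
are hypothetical. It assumes a 16-byte-aligned destination and a length that
is a multiple of 16:

#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold: 3/4 of a 2 MB per-thread cache share.  */
static const size_t non_temporal_threshold = 3 * (2UL << 20) / 4;

/* Below the threshold, an ordinary copy keeps the destination
   cache-hot; above it, streaming stores bypass the cache so a huge
   copy does not evict other threads' data.  */
static void
copy_sketch (void *dst, const void *src, size_t n)
{
  if (n < non_temporal_threshold)
    {
      memcpy (dst, src, n);            /* temporal: lands in cache */
      return;
    }
  __m128i *d = (__m128i *) dst;
  const __m128i *s = (const __m128i *) src;
  for (size_t i = 0; i < n / 16; i++)
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  _mm_sfence ();                       /* order weakly-ordered stores */
}

The sfence matters because streaming stores are weakly ordered; glibc's
assembly memcpy variants make the same high/low decision, keyed off
__x86_shared_non_temporal_threshold.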
> >>>>> The new threshold can be substantially smaller with a large core count.
> >>>>> Are you saying that even 3 / 4 may be too big?  Is there a reasonable
> >>>>> fixed threshold?
> >>>>>
> >>>> I don't have any evidence to say 3/4 is too big for typical
> >>>> applications and environments. In 2012, the default for memcpy was set
> >>>> to 1/2 of the shared_cache_size, which is still the default for Oracle
> >>>> el7 and Red Hat el7.
> >>>>
> >>>> Given the typically larger caches per thread today than 8 years ago,
> >>>> 3/4 may work out well, since the remaining 1/4 of today's larger cache
> >>>> is often greater than 1/2 of yesteryear's smaller cache (for instance,
> >>>> 1/4 of a 2 MB share is 512 KB, more than 1/2 of a 512 KB share).
> >>>>
> >>> Please update the comment with your rationale for 3/4.  Don't use
> >>> "today" or "current".  Use "2020" instead.
> >>>
> >>> Thanks.
> >>>
> >> I'm unsure about what needs to change in the comment, which does not
> >> mention any dates currently. I'm assuming you are referring to the
> >> following comment in cacheinfo.c:
> >>
> >>     /* The default setting for the non_temporal threshold is 3/4
> >>        of one thread's share of the chip's cache. While higher
> >>        single thread performance may be observed with a higher
> >>        threshold, having a single thread use more than its share
> >>        of the cache will negatively impact the performance of
> >>        other threads running on the chip. */
> >>
> >> While I could add a comment on why 3/4 vs 1/2 is the best choice, I
> >> don't have hard data to back it up. I'd be comfortable with either 3/4
> >> or 1/2. I selected 3/4 as it is closer to the formula you chose in 2017
> >> than to the formula you chose in 2012.
> > The comment is for readers 5 years from now who may be wondering
> > where 3/4 came from.  Just add something close to what you have said above.
> >
> Before I redo the commit and resubmit the whole patch, I thought I'd
> present a revised comment for review. The value of 500 KB to 2 MB/thread
> is based on a quick review of the Wikipedia entries for Intel and AMD
> processors released since 2017. There may be a few outliers, but the
> vast majority fit that range for L3/thread. I tried to balance giving a
> sense of the situation without diving too deeply into
> application-specific details.
>
>
> Comment in v2:
>    /* The default setting for the non_temporal threshold is 3/4
>       of one thread's share of the chip's cache. While higher
>       single thread performance may be observed with a higher
>       threshold, having a single thread use more than its share
>       of the cache will negatively impact the performance of
>       other threads running on the chip. */
>
> Proposed comment for v3:
>    /* The default setting for the non_temporal threshold is 3/4 of one
>       thread's share of the chip's cache. For most Intel and AMD
>       processors with an initial release date between 2017 and 2020, a
>       thread's typical share of the cache is from 500 KBytes to 2 MBytes.
>       Using the 3/4 threshold leaves 125 KBytes to 500 KBytes of the
>       thread's data in cache after a maximum temporal copy, which will
>       maintain in cache a reasonable portion of the thread's stack and
>       other active data. If the threshold is set higher than one thread's
>       share of the cache, it has a substantial risk of negatively
>       impacting the performance of other threads running on the chip. */
>

Comments look good.  Please submit the patch with the updated
comment.

Thanks.
  

Patch

diff --git a/manual/tunables.texi b/manual/tunables.texi
index b6bb54d..94d4fbd 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -364,7 +364,11 @@  set shared cache size in bytes for use in memory and string routines.
 
 @deftp Tunable glibc.tune.x86_non_temporal_threshold
 The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
-to set threshold in bytes for non temporal store.
+to set the threshold in bytes for non-temporal stores. Non-temporal
+stores give a hint to the hardware to move data directly to memory
+without displacing other data from the cache. This tunable is used by
+some platforms to determine when to use non-temporal stores in
+operations like memmove and memcpy.
 
 This tunable is specific to i386 and x86-64.
 @end deftp
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index b9444dd..c6767d9 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -778,14 +778,16 @@  intel_bug_no_cache_info:
       __x86_shared_cache_size = shared;
     }
 
-  /* The large memcpy micro benchmark in glibc shows that 6 times of
-     shared cache size is the approximate value above which non-temporal
-     store becomes faster on a 8-core processor.  This is the 3/4 of the
-     total shared cache size.  */
+  /* The default setting for the non_temporal threshold is 3/4
+     of one thread's share of the chip's cache. While higher
+     single thread performance may be observed with a higher
+     threshold, having a single thread use more than its share
+     of the cache will negatively impact the performance of
+     other threads running on the chip. */
   __x86_shared_non_temporal_threshold
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
+       : __x86_shared_cache_size * 3 / 4);
 }
 
 #endif
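
For reference, glibc tunables are set through the GLIBC_TUNABLES
environment variable, so a user who wants the pre-patch single-thread
behavior can override the default at run time. The byte value below is
illustrative (96 MB, matching the 128 MB-chip example above), not a
recommendation, and the tunable name is as given in the manual entry in
this patch:

GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=100663296 ./app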