memcpy performance regressions 2.19 -> 2.24(5)
Commit Message
On Thu, May 18, 2017 at 1:59 PM, Erich Elsen <eriche@google.com> wrote:
> Hi H.J.,
>
> I was on vacation, sorry for the slow reply. The updated benchmark
> still shows the same behavior, thanks.
>
> I'll try my hand at creating a patch that makes that variable
> __x86_shared_non_temporal_threshold a tunable. It will be necessary
> to do internal experiments anyway.
>
__x86_shared_non_temporal_threshold was set to 6 times of per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on a 8-core processor. For a processor with more than 8 cores, the
threshold is too low. Set __x86_shared_non_temporal_threshold to the
3/4 of the total shared cache size so that it is unchanged on 8-core
processors. On processors with less than 8 cores, the threshold is
lower.
Any comments?
Comments
I definitely think increasing the size in the case of processors with
a large number of cores makes sense. Hopefully with some testing we
can confirm it is a net win and/or find a more empirical number.
Thanks for that patch with the tunable support. I've just put a
similar patch in review for sharing right now. It adds support in the
case that HAVE_TUNABLES isn't defined like the similar code in arena.c
and also makes a minor change that turns init_cacheinfo into a
init_cacheinfo_impl (a hidden callable). init_cacheinfo is now a
constructor that just calls the impl and passes the cpu_features
struct. This is useful in that it makes the code a bit more modular
(something that we'll need to be able to test this internally).
On Mon, May 22, 2017 at 12:17 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, May 18, 2017 at 1:59 PM, Erich Elsen <eriche@google.com> wrote:
>> Hi H.J.,
>>
>> I was on vacation, sorry for the slow reply. The updated benchmark
>> still shows the same behavior, thanks.
>>
>> I'll try my hand at creating a patch that makes that variable
>> __x86_shared_non_temporal_threshold a tunable. It will be necessary
>> to do internal experiments anyway.
>>
>
> __x86_shared_non_temporal_threshold was set to 6 times of per-core
> shared cache size, based on the large memcpy micro benchmark in glibc
> on a 8-core processor. For a processor with more than 8 cores, the
> threshold is too low. Set __x86_shared_non_temporal_threshold to the
> 3/4 of the total shared cache size so that it is unchanged on 8-core
> processors. On processors with less than 8 cores, the threshold is
> lower.
>
> Any comments?
>
> --
> H.J.
On Mon, May 22, 2017 at 6:23 PM, Erich Elsen <eriche@google.com> wrote:
> I definitely think increasing the size in the case of processors with
> a large number of cores makes sense. Hopefully with some testing we
> can confirm it is a net win and/or find a more empirical number.
>
> Thanks for that patch with the tunable support. I've just put a
> similar patch in review for sharing right now. It adds support in the
> case that HAVE_TUNABLES isn't defined like the similar code in arena.c
> and also makes a minor change that turns init_cacheinfo into a
> init_cacheinfo_impl (a hidden callable). init_cacheinfo is now a
> constructor that just calls the impl and passes the cpu_features
> struct. This is useful in that it makes the code a bit more modular
> (something that we'll need to be able to test this internally).
This sounds a good idea. I'd also like to add tunable support in
init_cpu_features to turn on/off CPU features. non_temporal_threshold
will be one of them.
From bfb716e07b77f0ed8e0c2689d5cd01e2c8251fc5 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 12 May 2017 13:38:04 -0700
Subject: [PATCH] x86: Update __x86_shared_non_temporal_threshold
__x86_shared_non_temporal_threshold was set to 6 times of per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on a 8-core processor. For a processor with more than 8 cores, the
threshold is too low. Set __x86_shared_non_temporal_threshold to the
3/4 of the total shared cache size so that it is unchanged on 8-core
processors. On processors with less than 8 cores, the threshold is
lower.
* sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold):
Set to the 3/4 of the total shared cache size.
---
sysdeps/x86/cacheinfo.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
@@ -766,6 +766,8 @@ intel_bug_no_cache_info:
/* The large memcpy micro benchmark in glibc shows that 6 times of
shared cache size is the approximate value above which non-temporal
- store becomes faster. */
- __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6;
+ store becomes faster on a 8-core processor. This is the 3/4 of the
+ total shared cache size. */
+ __x86_shared_non_temporal_threshold
+ = __x86_shared_cache_size * threads * 3 / 4;
}
--
2.9.4