memcpy performance regressions 2.19 -> 2.24(5)

Message ID CAMe9rOpO_q8hMt+U4xUB8HoR9i7E0LO92SCiU_=v_306XdtjJQ@mail.gmail.com
State New, archived
Headers

Commit Message

H.J. Lu May 22, 2017, 7:17 p.m. UTC
  On Thu, May 18, 2017 at 1:59 PM, Erich Elsen <eriche@google.com> wrote:
> Hi H.J.,
>
> I was on vacation, sorry for the slow reply.  The updated benchmark
> still shows the same behavior, thanks.
>
> I'll try my hand at creating a patch that makes that variable
> __x86_shared_non_temporal_threshold a tunable.  It will be necessary
> to do internal experiments anyway.
>

__x86_shared_non_temporal_threshold was set to 6 times of per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on a 8-core processor.  For a processor with more than 8 cores, the
threshold is too low.  Set __x86_shared_non_temporal_threshold to the
3/4 of the total shared cache size so that it is unchanged on 8-core
processors.  On processors with less than 8 cores, the threshold is
lower.

Any comments?
  

Comments

Erich Elsen May 23, 2017, 1:23 a.m. UTC | #1
I definitely think increasing the size in the case of processors with
a large number of cores makes sense.  Hopefully with some testing we
can confirm it is a net win and/or find a more empirical number.

Thanks for that patch with the tunable support.  I've just put a
similar patch in review for sharing right now.  It adds support in the
case that HAVE_TUNABLES isn't defined like the similar code in arena.c
 and also makes a minor change that turns init_cacheinfo into a
init_cacheinfo_impl (a hidden callable).  init_cacheinfo is now a
constructor that just calls the impl and passes the cpu_features
struct.  This is useful in that it makes the code a bit more modular
(something that we'll need to be able to test this internally).

On Mon, May 22, 2017 at 12:17 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, May 18, 2017 at 1:59 PM, Erich Elsen <eriche@google.com> wrote:
>> Hi H.J.,
>>
>> I was on vacation, sorry for the slow reply.  The updated benchmark
>> still shows the same behavior, thanks.
>>
>> I'll try my hand at creating a patch that makes that variable
>> __x86_shared_non_temporal_threshold a tunable.  It will be necessary
>> to do internal experiments anyway.
>>
>
> __x86_shared_non_temporal_threshold was set to 6 times of per-core
> shared cache size, based on the large memcpy micro benchmark in glibc
> on a 8-core processor.  For a processor with more than 8 cores, the
> threshold is too low.  Set __x86_shared_non_temporal_threshold to the
> 3/4 of the total shared cache size so that it is unchanged on 8-core
> processors.  On processors with less than 8 cores, the threshold is
> lower.
>
> Any comments?
>
> --
> H.J.
  
H.J. Lu May 23, 2017, 2:24 a.m. UTC | #2
On Mon, May 22, 2017 at 6:23 PM, Erich Elsen <eriche@google.com> wrote:
> I definitely think increasing the size in the case of processors with
> a large number of cores makes sense.  Hopefully with some testing we
> can confirm it is a net win and/or find a more empirical number.
>
> Thanks for that patch with the tunable support.  I've just put a
> similar patch in review for sharing right now.  It adds support in the
> case that HAVE_TUNABLES isn't defined like the similar code in arena.c
>  and also makes a minor change that turns init_cacheinfo into a
> init_cacheinfo_impl (a hidden callable).  init_cacheinfo is now a
> constructor that just calls the impl and passes the cpu_features
> struct.  This is useful in that it makes the code a bit more modular
> (something that we'll need to be able to test this internally).

This sounds a good idea.  I'd also like to add tunable support in
init_cpu_features to turn on/off CPU features.   non_temporal_threshold
will be one of them.
  

Patch

From bfb716e07b77f0ed8e0c2689d5cd01e2c8251fc5 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 12 May 2017 13:38:04 -0700
Subject: [PATCH] x86: Update __x86_shared_non_temporal_threshold

__x86_shared_non_temporal_threshold was set to 6 times of per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on a 8-core processor.  For a processor with more than 8 cores, the
threshold is too low.  Set __x86_shared_non_temporal_threshold to the
3/4 of the total shared cache size so that it is unchanged on 8-core
processors.  On processors with less than 8 cores, the threshold is
lower.

	* sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold):
	Set to the 3/4 of the total shared cache size.
---
 sysdeps/x86/cacheinfo.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 1ccbe41..3434d97 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -766,6 +766,8 @@  intel_bug_no_cache_info:
 
   /* The large memcpy micro benchmark in glibc shows that 6 times of
      shared cache size is the approximate value above which non-temporal
-     store becomes faster.  */
-  __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6;
+     store becomes faster on a 8-core processor.  This is the 3/4 of the
+     total shared cache size.  */
+  __x86_shared_non_temporal_threshold
+    = __x86_shared_cache_size * threads * 3 / 4;
 }
-- 
2.9.4