From patchwork Sun Jul 26 13:16:22 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ondrej Bilka <neleai@seznam.cz>
X-Patchwork-Id: 7861
Received: (qmail 95710 invoked by alias); 26 Jul 2015 13:16:50 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 94684 invoked by uid 89); 26 Jul 2015 13:16:50 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.7 required=5.0 tests=AWL, BAYES_50,
	FREEMAIL_FROM, SPF_NEUTRAL autolearn=no version=3.3.2
X-HELO: popelka.ms.mff.cuni.cz
Date: Sun, 26 Jul 2015 15:16:22 +0200
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Subject: Re: [PATCH] Save and restore xmm0-xmm7 in _dl_runtime_resolve
Message-ID: <20150726131622.GA10623@domone>
References: 
 <CAMe9rOoUr1fjCHDsd+kbiFZ5KL_HoDB_GG65epxqpX7AcocvZw@mail.gmail.com>
	<0EFAB2BDD0F67E4FB6CCC8B9F87D75696A9220AE@IRSMSX101.ger.corp.intel.com>
	<CAMe9rOo81zoKpt+QmmVYjfjV2=KwLkYiqeADv3kMyeouM+9uug@mail.gmail.com>
	<0EFAB2BDD0F67E4FB6CCC8B9F87D75696A9235F6@IRSMSX101.ger.corp.intel.com>
	<CAMe9rOoXLPUr_LUexoRKjrCdNhP0J8EMY+1XNAaLnpW1qknb7w@mail.gmail.com>
	<20150709142827.GA18030@domone>
	<CAMe9rOoXCwiPdQVP7_tV7599f6y9w_n1P+SXsE7urb69f3v7gA@mail.gmail.com>
	<20150711104654.GA26570@domone> <20150711202742.GA9074@gmail.com>
	<20150711235002.GA7543@gmail.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20150711235002.GA7543@gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> > > > >> >> wrote:
> > > > >> >> > Fixed in the attached patch
> > > > >> >> >
> > > > >> >>
> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> > > > >> >> without on i386 and x86-64.
> > > > >> >
> > > > >> > Done, all works fine
> > > > >> >
> > > > >>
> > > > >> I checked it in for you.
> > > > >>
> > > > > These are nice but you could have same problem with lazy tls allocation.
> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
> > > > > similar patch to solve that? Original purpose was to always save xmm
> > > > > registers so we could use sse2 routines which speeds up lookup time.
> > > > 
> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> > > > much gain it will give us?
> > > >
> > > I couldn't measure that without patch. Gain now would be big as we now
> > > use byte-by-byte loop to check symbol name which is slow, especially
> > > with c++ name mangling. Would be following benchmark good to measure
> > > speedup or do I need to measure startup time which is bit harder?
> > > 
> > 
> > Please try this.
> > 
> 
> We have to use movups instead of movaps due to
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
> 
>
Thanks, this looks promising.

I am still thinking about how to do a definitive benchmark. For now I
have evidence that this is likely an improvement, but nothing definitive.

I found that the benchmark I intended to use is too noisy, and I haven't
gotten useful results from it yet. It created 1000 functions in a
library and called them from main; performance varied between runs
by a factor of 3 for the same implementation.
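
For illustration only, a test of that shape could be generated along
these lines. This is a sketch, not the benchmark that was actually run:
the names emit_library, emit_caller and f0..fN are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Write N trivial exported functions f0 .. f(N-1) to OUT.  Built as a
   shared library, each one needs its own lazy PLT fixup on first call.
   Returns the number of functions written.  */
static int
emit_library (FILE *out, int n)
{
  assert (out != NULL && n > 0);
  for (int i = 0; i < n; i++)
    fprintf (out, "int f%d (void) { return %d; }\n", i, i);
  return n;
}

/* Write a main program to OUT that calls every function once, so each
   call goes through the resolver exactly once.  The generated program
   exits with 0 if the sum of the return values is as expected.  */
static int
emit_caller (FILE *out, int n)
{
  assert (out != NULL && n > 0);
  for (int i = 0; i < n; i++)
    fprintf (out, "extern int f%d (void);\n", i);
  fprintf (out, "int\nmain (void)\n{\n  long sum = 0;\n");
  for (int i = 0; i < n; i++)
    fprintf (out, "  sum += f%d ();\n", i);
  fprintf (out, "  return sum == %ld ? 0 : 1;\n}\n",
           (long) n * (n - 1) / 2);
  return n;
}
```

Calling emit_library (lib, 1000) and emit_caller (prog, 1000) on two
output files gives the two sources; the first would be built with
-shared -fPIC and the second linked against it.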

I do have indirect evidence. With the attached patch to use sse2 routines,
the startup time of the binaries run by "make bench" decreased
by ~6000 cycles, and the dlopen benchmark time by 4%, on haswell and ivy bridge.

See the results on haswell of

LD_DEBUG=statistics make bench &> old_rtld

The outputs are large, so you can browse them here:

http://kam.mff.cuni.cz/~ondra/old_rtld
http://kam.mff.cuni.cz/~ondra/new_rtld

For the dlopen benchmark I measure, ten times, the performance of
dlsym (RTLD_DEFAULT, "memcpy");
dlsym (RTLD_DEFAULT, "strlen");

Without the patch I get
 624.49  559.58  556.60  556.04  558.42  557.86  559.46  555.17  556.93  555.32
and with the patch
 604.71  536.74  536.08  535.78  534.11  533.67  534.80  534.80  533.46  536.08
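
As a sanity check on the 4% figure, the two series above can be
compared by their means. This is a minimal sketch; dropping the first
run as warm-up is an assumption, and the function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* The ten reported runs, without and with the patch.  */
static const double old_runs[10] = {
  624.49, 559.58, 556.60, 556.04, 558.42,
  557.86, 559.46, 555.17, 556.93, 555.32
};
static const double new_runs[10] = {
  604.71, 536.74, 536.08, 535.78, 534.11,
  533.67, 534.80, 534.80, 533.46, 536.08
};

/* Mean of V[SKIP] .. V[N-1]; SKIP drops leading warm-up runs
   (treating the first run as warm-up is an assumption).  */
static double
mean_after_warmup (const double *v, size_t n, size_t skip)
{
  assert (n > skip);
  double sum = 0.0;
  for (size_t i = skip; i < n; i++)
    sum += v[i];
  return sum / (double) (n - skip);
}

/* Relative improvement of NEW_V over OLD_V, in percent.  */
static double
improvement_percent (const double *old_v, const double *new_v,
                     size_t n, size_t skip)
{
  double old_mean = mean_after_warmup (old_v, n, skip);
  double new_mean = mean_after_warmup (new_v, n, skip);
  return 100.0 * (old_mean - new_mean) / old_mean;
}
```

improvement_percent (old_runs, new_runs, 10, 1) comes out close to 4%,
and keeping the first run (skip of 0) changes it only slightly.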

I attached WIP patches; I didn't change memcpy yet.

So if you have an idea how to measure the fixup change directly, it
would be welcome.
From 8bad3c9751cba151b5f4ad2108bf2b860705c6eb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ond=C5=99ej=20B=C3=ADlka?= <neleai@seznam.cz>
Date: Sun, 26 Jul 2015 10:15:56 +0200
Subject: [PATCH 2/4] dlopen benchmark
---
 benchtests/Makefile       |  3 ++-
 benchtests/bench-dlopen.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-dlopen.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 8e615e5..9e82e43 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -30,7 +30,7 @@ bench-pthread := pthread_once
 bench := $(bench-math) $(bench-pthread)
 
 # String function benchmarks.
-string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
+string-bench := dlopen bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
 		mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
 		strcat strchr strchrnul strcmp strcpy strcspn strlen \
 		strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
@@ -57,6 +57,7 @@ CFLAGS-bench-ffsll.c += -fno-builtin
 
 bench-malloc := malloc-thread
 
+$(objpfx)bench-dlopen: -ldl
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
 $(objpfx)bench-malloc-thread: $(shared-thread-library)
diff --git a/benchtests/bench-dlopen.c b/benchtests/bench-dlopen.c
new file mode 100644
index 0000000..b47d18b
--- /dev/null
+++ b/benchtests/bench-dlopen.c
@@ -0,0 +1,47 @@
+/* Measure dlsym lookup performance.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+#include <dlfcn.h>
+#include <string.h>
+#include "bench-timing.h"
+
+int
+main (void)
+{
+  long ret = 0;
+
+  size_t i, j, iters = 100;
+  timing_t start, stop, cur;
+  for (j = 0; j < 10; j++)
+    {
+      TIMING_NOW (start);
+      for (i = 0; i < iters; ++i)
+        {
+          ret += (long) dlsym (RTLD_DEFAULT, "strlen");
+          ret += (long) dlsym (RTLD_DEFAULT, "memcpy");
+        }
+      TIMING_NOW (stop);
+
+      TIMING_DIFF (cur, start, stop);
+
+      TIMING_PRINT_MEAN ((double) cur, (double) iters);
+    }
+
+  return ret;
+}
+
+