From patchwork Sun Jul 26 13:16:22 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ondrej Bilka
X-Patchwork-Id: 7861
Received: (qmail 95710 invoked by alias); 26 Jul 2015 13:16:50 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 94684 invoked by uid 89); 26 Jul 2015 13:16:50 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.7 required=5.0 tests=AWL, BAYES_50, FREEMAIL_FROM, SPF_NEUTRAL autolearn=no version=3.3.2
X-HELO: popelka.ms.mff.cuni.cz
Date: Sun, 26 Jul 2015 15:16:22 +0200
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?=
To: "H.J. Lu"
Cc: "libc-alpha@sourceware.org"
Subject: Re: [PATCH] Save and restore xmm0-xmm7 in _dl_runtime_resolve
Message-ID: <20150726131622.GA10623@domone>
References: <0EFAB2BDD0F67E4FB6CCC8B9F87D75696A9220AE@IRSMSX101.ger.corp.intel.com>
 <0EFAB2BDD0F67E4FB6CCC8B9F87D75696A9235F6@IRSMSX101.ger.corp.intel.com>
 <20150709142827.GA18030@domone>
 <20150711104654.GA26570@domone>
 <20150711202742.GA9074@gmail.com>
 <20150711235002.GA7543@gmail.com>
Content-Disposition: inline
In-Reply-To: <20150711235002.GA7543@gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka wrote:
> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor wrote:
> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor
> > > > >> >> wrote:
> > > > >> >> > Fixed in the attached patch
> > > > >> >> >
> > > > >> >>
> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> > > > >> >> HAVE_MPX_SUPPORT. Please verify both with HAVE_MPX_SUPPORT and
> > > > >> >> without on i386 and x86-64.
> > > > >> >
> > > > >> > Done, all works fine
> > > > >> >
> > > > >>
> > > > >> I checked it in for you.
> > > > >>
> > > > > These are nice but you could have same problem with lazy tls allocation.
> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
> > > > > similar patch to solve that? Original purpose was to always save xmm
> > > > > registers so we could use sse2 routines which speeds up lookup time.
> > > >
> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> > > > much gain it will give us?
> > > >
> > > I couldn't measure that without patch. Gain now would be big as we now
> > > use byte-by-byte loop to check symbol name which is slow, especially
> > > with c++ name mangling. Would be following benchmark good to measure
> > > speedup or do I need to measure startup time which is bit harder?
> > >
> > >
> > Please try this.
>
> We have to use movups instead of movaps due to
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>
>
Thanks, this looks promising.

I am still thinking about how to do a definitive benchmark. So far I
have evidence that this is likely an improvement, but nothing
definitive. The benchmark that I intended turned out to be too noisy,
and I haven't got anything useful from it yet. It created 1000
functions in a library and called them from main; performance between
runs varied by a factor of 3 for the same implementation.

I have indirect evidence.
With the attached patch to use sse2 routines I decreased the startup
time of the binaries run by "make bench" by ~6000 cycles, and dlopen
time by 4%, on haswell and ivy bridge.

See the results on haswell of

LD_DEBUG=statistics make bench &> old_rtld

They are large, so you can browse them here:

http://kam.mff.cuni.cz/~ondra/old_rtld
http://kam.mff.cuni.cz/~ondra/new_rtld

For the dlopen benchmark I measure, ten times, the performance of

dlsym (RTLD_DEFAULT, "memcpy");
dlsym (RTLD_DEFAULT, "strlen");

Without the patch I get

 624.49
 559.58
 556.6
 556.04
 558.42
 557.86
 559.46
 555.17
 556.93
 555.32

and with the patch

 604.71
 536.74
 536.08
 535.78
 534.11
 533.67
 534.8
 534.8
 533.46
 536.08

I attached wip patches; I didn't change memcpy yet.

So if you have an idea how to directly measure the fixup change, it
would be welcome.

From 8bad3c9751cba151b5f4ad2108bf2b860705c6eb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ond=C5=99ej=20B=C3=ADlka?=
Date: Sun, 26 Jul 2015 10:15:56 +0200
Subject: [PATCH 2/4] dlopen benchmark

---
 benchtests/Makefile       |  3 ++-
 benchtests/bench-dlopen.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-dlopen.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 8e615e5..9e82e43 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -30,7 +30,7 @@ bench-pthread := pthread_once
 bench := $(bench-math) $(bench-pthread)
 
 # String function benchmarks.
-string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
+string-bench := dlopen bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
                 mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
                 strcat strchr strchrnul strcmp strcpy strcspn strlen \
                 strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
@@ -57,6 +57,7 @@ CFLAGS-bench-ffsll.c += -fno-builtin
 
 bench-malloc := malloc-thread
 
+$(objpfx)bench-dlopen: -ldl
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
 $(objpfx)bench-malloc-thread: $(shared-thread-library)
diff --git a/benchtests/bench-dlopen.c b/benchtests/bench-dlopen.c
new file mode 100644
index 0000000..b47d18b
--- /dev/null
+++ b/benchtests/bench-dlopen.c
@@ -0,0 +1,47 @@
+/* Measure dlsym symbol lookup.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+#include <dlfcn.h>
+#include <stdio.h>
+#include "bench-timing.h"
+
+int
+main (void)
+{
+  long ret = 0;
+
+  size_t i, j, iters = 100;
+  timing_t start, stop, cur;
+  for (j = 0; j < 10; j++)
+    {
+      TIMING_NOW (start);
+      for (i = 0; i < iters; ++i)
+        {
+          ret += (long) dlsym (RTLD_DEFAULT, "strlen");
+          ret += (long) dlsym (RTLD_DEFAULT, "memcpy");
+        }
+      TIMING_NOW (stop);
+
+      TIMING_DIFF (cur, start, stop);
+
+      TIMING_PRINT_MEAN ((double) cur, (double) iters);
+    }
+
+  return ret;
+}
+
+