Save and restore xmm0-xmm7 in _dl_runtime_resolve

Message ID 20150726131622.GA10623@domone
State New, archived

Commit Message

Ondrej Bilka July 26, 2015, 1:16 p.m. UTC
  On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> > > > >> >> wrote:
> > > > >> >> > Fixed in the attached patch
> > > > >> >> >
> > > > >> >>
> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> > > > >> >> without on i386 and x86-64.
> > > > >> >
> > > > >> > Done, all works fine
> > > > >> >
> > > > >>
> > > > >> I checked it in for you.
> > > > >>
> > > > > These are nice but you could have same problem with lazy tls allocation.
> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
> > > > > similar patch to solve that? Original purpose was to always save xmm
> > > > > registers so we could use sse2 routines which speeds up lookup time.
> > > > 
> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> > > > much gain it will give us?
> > > >
> > > I couldn't measure that without patch. Gain now would be big as we now
> > > use byte-by-byte loop to check symbol name which is slow, especially
> > > with c++ name mangling. Would be following benchmark good to measure
> > > speedup or do I need to measure startup time which is bit harder?
> > > 
> > 
> > Please try this.
> > 
> 
> We have to use movups instead of movaps due to
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
> 
>
Thanks, this looks promising.

I am still thinking about how to do a definitive benchmark.  For now I have
evidence that it is likely an improvement, but nothing definitive.

I found that the benchmark I intended to use causes too much noise and I
haven't gotten anything useful from it yet.  It creates 1000 functions in a
library and calls them from main, and performance between runs varies by a
factor of 3 for the same implementation.

I have indirect evidence.  With the attached patch to use sse2 routines I
decreased the startup time of the binaries run by "make bench"
by ~6000 cycles and dlopen time by 4% on Haswell and Ivy Bridge.

See the results on Haswell of

LD_DEBUG=statistics make bench &> old_rtld

which are large, so you can browse them here:

http://kam.mff.cuni.cz/~ondra/old_rtld
http://kam.mff.cuni.cz/~ondra/new_rtld

For the dlopen benchmark I measure, ten times, the performance of
dlsym(RTLD_DEFAULT,"memcpy");
dlsym(RTLD_DEFAULT,"strlen");
(the benchmark source is in the patch below).

Without patch I get
 624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
and with patch
  604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08

I attached the WIP patches; I didn't change memcpy yet.

So if you have an idea how to measure the fixup change directly, it would be
welcome.
From 8bad3c9751cba151b5f4ad2108bf2b860705c6eb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ond=C5=99ej=20B=C3=ADlka?= <neleai@seznam.cz>
Date: Sun, 26 Jul 2015 10:15:56 +0200
Subject: [PATCH 2/4] dlopen benchmark

---
 benchtests/Makefile       |  3 ++-
 benchtests/bench-dlopen.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-dlopen.c
  

Comments

H.J. Lu July 26, 2015, 7:38 p.m. UTC | #1
On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
>> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
>> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
>> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
>> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
>> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
>> > > > >> >> wrote:
>> > > > >> >> > Fixed in the attached patch
>> > > > >> >> >
>> > > > >> >>
>> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
>> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
>> > > > >> >> without on i386 and x86-64.
>> > > > >> >
>> > > > >> > Done, all works fine
>> > > > >> >
>> > > > >>
>> > > > >> I checked it in for you.
>> > > > >>
>> > > > > These are nice but you could have same problem with lazy tls allocation.
>> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
>> > > > > similar patch to solve that? Original purpose was to always save xmm
>> > > > > registers so we could use sse2 routines which speeds up lookup time.
>> > > >
>> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
>> > > > much gain it will give us?
>> > > >
>> > > I couldn't measure that without patch. Gain now would be big as we now
>> > > use byte-by-byte loop to check symbol name which is slow, especially
>> > > with c++ name mangling. Would be following benchmark good to measure
>> > > speedup or do I need to measure startup time which is bit harder?
>> > >
>> >
>> > Please try this.
>> >
>>
>> We have to use movups instead of movaps due to
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>>
>>
> Thanks, this looks promising.
>
> I think how to do definite benchmark, Now I have evidence that its
> likely improvement but not definite.
>
> I found that benchmark that i intended causes too much noise and I
> didn't get useful from that yet. It was creating 1000 functions in
> library and calling them from main where performance between runs vary
> by factor of 3 for same implementation.
>
> I have indirect evidence. With attached patch to use sse2 routines I
> decreased startup time of running binaries when you run "make bench"
> by ~6000 cycles and dlopen time by 4% on haswell and ivy bridge.
>
> See results on haswell of
>
> LD_DEBUG=statistics make bench &> old_rtld
>
> that are large so you could browse these here
>
> http://kam.mff.cuni.cz/~ondra/old_rtld
> http://kam.mff.cuni.cz/~ondra/new_rtld
>
> For dlopen benchmark I measure ten times performance of
> dlopen(RTLD_DEFAULT,"memcpy");
> dlopen(RTLD_DEFAULT,"strlen");
>
> Without patch I get
>  624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
> and with patch
>   604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08
>
> I attached vip patches, I didn't change memcpy yet.
>
> So if you have idea how directly measure fixup change it would be
> welcome.
>

There is a potential performance issue.  This won't change parameters
passed in 256-bit/512-bit vector registers, because an SSE load only
updates the lower 128 bits of a 256-bit/512-bit vector register while
preserving the upper bits.  But these SSE load operations may not be
fast on all current and future processors.  To load the entire
256-bit/512-bit vector registers, we would need to check the CPU features
on each symbol lookup.  On the other hand, we can compile the x86-64
ld.so with -msse2.  I don't know what the final performance impact is.
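
For reference, the case in question is a PLT call whose arguments are
passed in the wider registers; a minimal, hypothetical illustration (not
from the patch, needs -mavx to build):

#include <immintrin.h>

/* A 256-bit argument is passed in ymm0 per the x86-64 ABI, and the first
   call goes through the PLT and therefore through _dl_runtime_resolve.
   A legacy SSE save/restore of xmm0-xmm7 in the resolver writes only the
   low 128 bits, so the upper halves set by the caller are left untouched.  */
extern __m256d scale (__m256d v);   /* assumed to live in a shared library */

__m256d
call_scale (__m256d v)
{
  return scale (v);                 /* first call triggers lazy resolution */
}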
  
Ondrej Bilka July 27, 2015, 10:10 a.m. UTC | #2
On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
> On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> >> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> >> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> >> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> >> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> >> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> >> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> >> > > > >> >> wrote:
> >> > > > >> >> > Fixed in the attached patch
> >> > > > >> >> >
> >> > > > >> >>
> >> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> >> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> >> > > > >> >> without on i386 and x86-64.
> >> > > > >> >
> >> > > > >> > Done, all works fine
> >> > > > >> >
> >> > > > >>
> >> > > > >> I checked it in for you.
> >> > > > >>
> >> > > > > These are nice but you could have same problem with lazy tls allocation.
> >> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
> >> > > > > similar patch to solve that? Original purpose was to always save xmm
> >> > > > > registers so we could use sse2 routines which speeds up lookup time.
> >> > > >
> >> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> >> > > > much gain it will give us?
> >> > > >
> >> > > I couldn't measure that without patch. Gain now would be big as we now
> >> > > use byte-by-byte loop to check symbol name which is slow, especially
> >> > > with c++ name mangling. Would be following benchmark good to measure
> >> > > speedup or do I need to measure startup time which is bit harder?
> >> > >
> >> >
> >> > Please try this.
> >> >
> >>
> >> We have to use movups instead of movaps due to
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
> >>
> >>
> > Thanks, this looks promising.
> >
> > I think how to do definite benchmark, Now I have evidence that its
> > likely improvement but not definite.
> >
> > I found that benchmark that i intended causes too much noise and I
> > didn't get useful from that yet. It was creating 1000 functions in
> > library and calling them from main where performance between runs vary
> > by factor of 3 for same implementation.
> >
> > I have indirect evidence. With attached patch to use sse2 routines I
> > decreased startup time of running binaries when you run "make bench"
> > by ~6000 cycles and dlopen time by 4% on haswell and ivy bridge.
> >
> > See results on haswell of
> >
> > LD_DEBUG=statistics make bench &> old_rtld
> >
> > that are large so you could browse these here
> >
> > http://kam.mff.cuni.cz/~ondra/old_rtld
> > http://kam.mff.cuni.cz/~ondra/new_rtld
> >
> > For dlopen benchmark I measure ten times performance of
> > dlopen(RTLD_DEFAULT,"memcpy");
> > dlopen(RTLD_DEFAULT,"strlen");
> >
> > Without patch I get
> >  624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
> > and with patch
> >   604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08
> >
> > I attached vip patches, I didn't change memcpy yet.
> >
> > So if you have idea how directly measure fixup change it would be
> > welcome.
> >
> 
> There is a potential performance issue.  This won't change parameters
> passed in S256-bit/512-bit vector registers because SSE load will only
> update the lower 128 bits of 256-bit/512-bit vector registers while
> preserving the upper bits.  But these SSE load operations may not be
> fast on all current and future processors.  To load the entire
> 256-bit/512-bit vector registers, we need to check CPU feature in
> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
> with -msse2.  I don't know what the final performance impact is.
> 
Yes, these should be saved due to the problems with modes.  There could be
a problem that saving them takes longer.  You don't need to check the CPU
features on each call: make _dl_runtime_resolve a function pointer and
initialize it to the correct variant at startup.

This depends on whether the more complicated code is worth the gain.  On the
other hand, I read the original code and found it was also about fixing bug
15128, where an ifunc resolver clobbers xmm registers; there is a similar
problem with LD_AUDIT.

These could be solved this way, or by saving the xmm registers only when an
ifunc resolver is called.  I don't know which is the most maintainable
solution.
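
For context, a minimal sketch of the bug 15128 situation (hypothetical
library code, not from glibc): an ifunc resolver is ordinary C that may
itself use xmm registers, and under lazy binding it runs inside
_dl_runtime_resolve while the caller's floating-point arguments are still
live:

/* The resolver runs during lazy resolution of the first call to scale();
   the division below uses xmm registers, so without a save/restore in the
   trampoline it could clobber the caller's xmm argument registers.  */
static double scale_fast (double x) { return x * 2.0; }
static double scale_slow (double x) { return x + x; }

static double (*resolve_scale (void)) (double)
{
  static volatile double seed = 3.0;     /* defeat constant folding */
  double pick = 1.0 / seed;              /* real xmm work at resolve time */
  return pick > 0.3 ? scale_fast : scale_slow;
}

double scale (double x) __attribute__ ((ifunc ("resolve_scale")));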
  
H.J. Lu July 27, 2015, 1:14 p.m. UTC | #3
On Mon, Jul 27, 2015 at 3:10 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
>> On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
>> >> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
>> >> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
>> >> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
>> >> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> >> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
>> >> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>> >> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
>> >> > > > >> >> wrote:
>> >> > > > >> >> > Fixed in the attached patch
>> >> > > > >> >> >
>> >> > > > >> >>
>> >> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
>> >> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
>> >> > > > >> >> without on i386 and x86-64.
>> >> > > > >> >
>> >> > > > >> > Done, all works fine
>> >> > > > >> >
>> >> > > > >>
>> >> > > > >> I checked it in for you.
>> >> > > > >>
>> >> > > > > These are nice but you could have same problem with lazy tls allocation.
>> >> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
>> >> > > > > similar patch to solve that? Original purpose was to always save xmm
>> >> > > > > registers so we could use sse2 routines which speeds up lookup time.
>> >> > > >
>> >> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
>> >> > > > much gain it will give us?
>> >> > > >
>> >> > > I couldn't measure that without patch. Gain now would be big as we now
>> >> > > use byte-by-byte loop to check symbol name which is slow, especially
>> >> > > with c++ name mangling. Would be following benchmark good to measure
>> >> > > speedup or do I need to measure startup time which is bit harder?
>> >> > >
>> >> >
>> >> > Please try this.
>> >> >
>> >>
>> >> We have to use movups instead of movaps due to
>> >>
>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>> >>
>> >>
>> > Thanks, this looks promising.
>> >
>> > I think how to do definite benchmark, Now I have evidence that its
>> > likely improvement but not definite.
>> >
>> > I found that benchmark that i intended causes too much noise and I
>> > didn't get useful from that yet. It was creating 1000 functions in
>> > library and calling them from main where performance between runs vary
>> > by factor of 3 for same implementation.
>> >
>> > I have indirect evidence. With attached patch to use sse2 routines I
>> > decreased startup time of running binaries when you run "make bench"
>> > by ~6000 cycles and dlopen time by 4% on haswell and ivy bridge.
>> >
>> > See results on haswell of
>> >
>> > LD_DEBUG=statistics make bench &> old_rtld
>> >
>> > that are large so you could browse these here
>> >
>> > http://kam.mff.cuni.cz/~ondra/old_rtld
>> > http://kam.mff.cuni.cz/~ondra/new_rtld
>> >
>> > For dlopen benchmark I measure ten times performance of
>> > dlopen(RTLD_DEFAULT,"memcpy");
>> > dlopen(RTLD_DEFAULT,"strlen");
>> >
>> > Without patch I get
>> >  624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
>> > and with patch
>> >   604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08
>> >
>> > I attached vip patches, I didn't change memcpy yet.
>> >
>> > So if you have idea how directly measure fixup change it would be
>> > welcome.
>> >
>>
>> There is a potential performance issue.  This won't change parameters
>> passed in S256-bit/512-bit vector registers because SSE load will only
>> update the lower 128 bits of 256-bit/512-bit vector registers while
>> preserving the upper bits.  But these SSE load operations may not be
>> fast on all current and future processors.  To load the entire
>> 256-bit/512-bit vector registers, we need to check CPU feature in
>> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
>> with -msse2.  I don't know what the final performance impact is.
>>
> Yes, these should be saved due problems with modes. There could be
> problem that saving these takes longer. You don't need
> check cpu features on each call.
> Make _dl_runtime_resolve a function pointer and on
> startup initialize it to correct variant.

One more indirect call.

> This depends if more complicated code is worth gain. On other hand I
> read original code to find it was also about fixing bug 15128 where
> ifunc resolver clobbers xmm registers, similar problem is with LD_AUDIT.
>
> These could be solved with this or by saving xmm registers only when
> ifunc is called. I don't know whats best maintainable solution.

That is true.  Can you try compiling ld.so with -msse2?
  
Ondrej Bilka July 27, 2015, 1:26 p.m. UTC | #4
On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
> On Mon, Jul 27, 2015 at 3:10 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
> >> On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> > On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> >> >> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> >> >> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> >> >> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> >> >> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> >> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> >> >> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> >> >> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> >> >> > > > >> >> wrote:
> >> >> > > > >> >> > Fixed in the attached patch
> >> >> > > > >> >> >
> >> >> > > > >> >>
> >> >> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> >> >> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> >> >> > > > >> >> without on i386 and x86-64.
> >> >> > > > >> >
> >> >> > > > >> > Done, all works fine
> >> >> > > > >> >
> >> >> > > > >>
> >> >> > > > >> I checked it in for you.
> >> >> > > > >>
> >> >> > > > > These are nice but you could have same problem with lazy tls allocation.
> >> >> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
> >> >> > > > > similar patch to solve that? Original purpose was to always save xmm
> >> >> > > > > registers so we could use sse2 routines which speeds up lookup time.
> >> >> > > >
> >> >> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> >> >> > > > much gain it will give us?
> >> >> > > >
> >> >> > > I couldn't measure that without patch. Gain now would be big as we now
> >> >> > > use byte-by-byte loop to check symbol name which is slow, especially
> >> >> > > with c++ name mangling. Would be following benchmark good to measure
> >> >> > > speedup or do I need to measure startup time which is bit harder?
> >> >> > >
> >> >> >
> >> >> > Please try this.
> >> >> >
> >> >>
> >> >> We have to use movups instead of movaps due to
> >> >>
> >> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
> >> >>
> >> >>
> >> > Thanks, this looks promising.
> >> >
> >> > I think how to do definite benchmark, Now I have evidence that its
> >> > likely improvement but not definite.
> >> >
> >> > I found that benchmark that i intended causes too much noise and I
> >> > didn't get useful from that yet. It was creating 1000 functions in
> >> > library and calling them from main where performance between runs vary
> >> > by factor of 3 for same implementation.
> >> >
> >> > I have indirect evidence. With attached patch to use sse2 routines I
> >> > decreased startup time of running binaries when you run "make bench"
> >> > by ~6000 cycles and dlopen time by 4% on haswell and ivy bridge.
> >> >
> >> > See results on haswell of
> >> >
> >> > LD_DEBUG=statistics make bench &> old_rtld
> >> >
> >> > that are large so you could browse these here
> >> >
> >> > http://kam.mff.cuni.cz/~ondra/old_rtld
> >> > http://kam.mff.cuni.cz/~ondra/new_rtld
> >> >
> >> > For dlopen benchmark I measure ten times performance of
> >> > dlopen(RTLD_DEFAULT,"memcpy");
> >> > dlopen(RTLD_DEFAULT,"strlen");
> >> >
> >> > Without patch I get
> >> >  624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
> >> > and with patch
> >> >   604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08
> >> >
> >> > I attached vip patches, I didn't change memcpy yet.
> >> >
> >> > So if you have idea how directly measure fixup change it would be
> >> > welcome.
> >> >
> >>
> >> There is a potential performance issue.  This won't change parameters
> >> passed in S256-bit/512-bit vector registers because SSE load will only
> >> update the lower 128 bits of 256-bit/512-bit vector registers while
> >> preserving the upper bits.  But these SSE load operations may not be
> >> fast on all current and future processors.  To load the entire
> >> 256-bit/512-bit vector registers, we need to check CPU feature in
> >> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
> >> with -msse2.  I don't know what the final performance impact is.
> >>
> > Yes, these should be saved due problems with modes. There could be
> > problem that saving these takes longer. You don't need
> > check cpu features on each call.
> > Make _dl_runtime_resolve a function pointer and on
> > startup initialize it to correct variant.
> 
> One more indirect call.
>
No, my proposal is different; we could do this:

void *_dl_runtime_resolve;
int startup()
{
  if (has_avx())
    _dl_runtime_resolve = _dl_runtime_resolve_avx;
  else
    _dl_runtime_resolve = _dl_runtime_resolve_sse;
}

Then we assign the correct variant once at startup.
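
A self-contained sketch of that idea, using GCC's __builtin_cpu_supports as
a stand-in for the CPU feature data ld.so already has (names other than the
two variants are hypothetical):

extern void _dl_runtime_resolve_avx (void);
extern void _dl_runtime_resolve_sse (void);

void (*_dl_runtime_resolve) (void);

static void
init_dl_runtime_resolve (void)           /* hypothetical init hook */
{
  if (__builtin_cpu_supports ("avx"))
    _dl_runtime_resolve = _dl_runtime_resolve_avx;
  else
    _dl_runtime_resolve = _dl_runtime_resolve_sse;
}

Presumably the selected address would then simply be written into each
object's GOT[2] slot when lazy binding is set up, so ordinary PLT calls see
no extra indirection.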
  
H.J. Lu July 27, 2015, 1:37 p.m. UTC | #5
On Mon, Jul 27, 2015 at 6:26 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
>> On Mon, Jul 27, 2015 at 3:10 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
>> >> On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> >> > On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
>> >> >> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
>> >> >> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
>> >> >> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
>> >> >> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> >> >> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
>> >> >> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>> >> >> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
>> >> >> > > > >> >> wrote:
>> >> >> > > > >> >> > Fixed in the attached patch
>> >> >> > > > >> >> >
>> >> >> > > > >> >>
>> >> >> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
>> >> >> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
>> >> >> > > > >> >> without on i386 and x86-64.
>> >> >> > > > >> >
>> >> >> > > > >> > Done, all works fine
>> >> >> > > > >> >
>> >> >> > > > >>
>> >> >> > > > >> I checked it in for you.
>> >> >> > > > >>
>> >> >> > > > > These are nice but you could have same problem with lazy tls allocation.
>> >> >> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
>> >> >> > > > > similar patch to solve that? Original purpose was to always save xmm
>> >> >> > > > > registers so we could use sse2 routines which speeds up lookup time.
>> >> >> > > >
>> >> >> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
>> >> >> > > > much gain it will give us?
>> >> >> > > >
>> >> >> > > I couldn't measure that without patch. Gain now would be big as we now
>> >> >> > > use byte-by-byte loop to check symbol name which is slow, especially
>> >> >> > > with c++ name mangling. Would be following benchmark good to measure
>> >> >> > > speedup or do I need to measure startup time which is bit harder?
>> >> >> > >
>> >> >> >
>> >> >> > Please try this.
>> >> >> >
>> >> >>
>> >> >> We have to use movups instead of movaps due to
>> >> >>
>> >> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>> >> >>
>> >> >>
>> >> > Thanks, this looks promising.
>> >> >
>> >> > I think how to do definite benchmark, Now I have evidence that its
>> >> > likely improvement but not definite.
>> >> >
>> >> > I found that benchmark that i intended causes too much noise and I
>> >> > didn't get useful from that yet. It was creating 1000 functions in
>> >> > library and calling them from main where performance between runs vary
>> >> > by factor of 3 for same implementation.
>> >> >
>> >> > I have indirect evidence. With attached patch to use sse2 routines I
>> >> > decreased startup time of running binaries when you run "make bench"
>> >> > by ~6000 cycles and dlopen time by 4% on haswell and ivy bridge.
>> >> >
>> >> > See results on haswell of
>> >> >
>> >> > LD_DEBUG=statistics make bench &> old_rtld
>> >> >
>> >> > that are large so you could browse these here
>> >> >
>> >> > http://kam.mff.cuni.cz/~ondra/old_rtld
>> >> > http://kam.mff.cuni.cz/~ondra/new_rtld
>> >> >
>> >> > For dlopen benchmark I measure ten times performance of
>> >> > dlopen(RTLD_DEFAULT,"memcpy");
>> >> > dlopen(RTLD_DEFAULT,"strlen");
>> >> >
>> >> > Without patch I get
>> >> >  624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
>> >> > and with patch
>> >> >   604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08
>> >> >
>> >> > I attached vip patches, I didn't change memcpy yet.
>> >> >
>> >> > So if you have idea how directly measure fixup change it would be
>> >> > welcome.
>> >> >
>> >>
>> >> There is a potential performance issue.  This won't change parameters
>> >> passed in S256-bit/512-bit vector registers because SSE load will only
>> >> update the lower 128 bits of 256-bit/512-bit vector registers while
>> >> preserving the upper bits.  But these SSE load operations may not be
>> >> fast on all current and future processors.  To load the entire
>> >> 256-bit/512-bit vector registers, we need to check CPU feature in
>> >> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
>> >> with -msse2.  I don't know what the final performance impact is.
>> >>
>> > Yes, these should be saved due problems with modes. There could be
>> > problem that saving these takes longer. You don't need
>> > check cpu features on each call.
>> > Make _dl_runtime_resolve a function pointer and on
>> > startup initialize it to correct variant.
>>
>> One more indirect call.
>>
> no, my proposal is different, we could do this:
>
> void *_dl_runtime_resolve;
> int startup()
> {
>   if (has_avx())
>     _dl_runtime_resolve = _dl_runtime_resolve_avx;
>   else
>     _dl_runtime_resolve = _dl_runtime_resolve_sse;
> }
>
> Then we will assign correct variant.

Yes, this may work for both _dl_runtime_profile and
 _dl_runtime_resolve.  I will see what I can do.
  
H.J. Lu July 28, 2015, 8:55 p.m. UTC | #6
On Mon, Jul 27, 2015 at 6:37 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Jul 27, 2015 at 6:26 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
>>> >>
>>> >> There is a potential performance issue.  This won't change parameters
>>> >> passed in S256-bit/512-bit vector registers because SSE load will only
>>> >> update the lower 128 bits of 256-bit/512-bit vector registers while
>>> >> preserving the upper bits.  But these SSE load operations may not be
>>> >> fast on all current and future processors.  To load the entire
>>> >> 256-bit/512-bit vector registers, we need to check CPU feature in
>>> >> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
>>> >> with -msse2.  I don't know what the final performance impact is.
>>> >>
>>> > Yes, these should be saved due problems with modes. There could be
>>> > problem that saving these takes longer. You don't need
>>> > check cpu features on each call.
>>> > Make _dl_runtime_resolve a function pointer and on
>>> > startup initialize it to correct variant.
>>>
>>> One more indirect call.
>>>
>> no, my proposal is different, we could do this:
>>
>> void *_dl_runtime_resolve;
>> int startup()
>> {
>>   if (has_avx())
>>     _dl_runtime_resolve = _dl_runtime_resolve_avx;
>>   else
>>     _dl_runtime_resolve = _dl_runtime_resolve_sse;
>> }
>>
>> Then we will assign correct variant.
>
> Yes, this may work for both _dl_runtime_profile and
>  _dl_runtime_resolve.  I will see what I can do.
>

Please try the hjl/pr18661 branch.  I implemented:

0000000000016fd0 t _dl_runtime_profile_avx
0000000000016b50 t _dl_runtime_profile_avx512
0000000000017450 t _dl_runtime_profile_sse
00000000000168d0 t _dl_runtime_resolve_avx
0000000000016780 t _dl_runtime_resolve_avx512
0000000000016a20 t _dl_runtime_resolve_sse
  
H.J. Lu July 29, 2015, 2:03 a.m. UTC | #7
On Tue, Jul 28, 2015 at 1:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Jul 27, 2015 at 6:37 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Mon, Jul 27, 2015 at 6:26 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>> On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
>>>> >>
>>>> >> There is a potential performance issue.  This won't change parameters
>>>> >> passed in S256-bit/512-bit vector registers because SSE load will only
>>>> >> update the lower 128 bits of 256-bit/512-bit vector registers while
>>>> >> preserving the upper bits.  But these SSE load operations may not be
>>>> >> fast on all current and future processors.  To load the entire
>>>> >> 256-bit/512-bit vector registers, we need to check CPU feature in
>>>> >> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
>>>> >> with -msse2.  I don't know what the final performance impact is.
>>>> >>
>>>> > Yes, these should be saved due problems with modes. There could be
>>>> > problem that saving these takes longer. You don't need
>>>> > check cpu features on each call.
>>>> > Make _dl_runtime_resolve a function pointer and on
>>>> > startup initialize it to correct variant.
>>>>
>>>> One more indirect call.
>>>>
>>> no, my proposal is different, we could do this:
>>>
>>> void *_dl_runtime_resolve;
>>> int startup()
>>> {
>>>   if (has_avx())
>>>     _dl_runtime_resolve = _dl_runtime_resolve_avx;
>>>   else
>>>     _dl_runtime_resolve = _dl_runtime_resolve_sse;
>>> }
>>>
>>> Then we will assign correct variant.
>>
>> Yes, this may work for both _dl_runtime_profile and
>>  _dl_runtime_resolve.  I will see what I can do.
>>
>
> Please try hjl/pr18661 branch.  I implemented:
>
> 0000000000016fd0 t _dl_runtime_profile_avx
> 0000000000016b50 t _dl_runtime_profile_avx512
> 0000000000017450 t _dl_runtime_profile_sse
> 00000000000168d0 t _dl_runtime_resolve_avx
> 0000000000016780 t _dl_runtime_resolve_avx512
> 0000000000016a20 t _dl_runtime_resolve_sse

I enabled SSE in ld.so and it works fine.
  
H.J. Lu July 29, 2015, 12:11 p.m. UTC | #8
On Tue, Jul 28, 2015 at 7:03 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Jul 28, 2015 at 1:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Mon, Jul 27, 2015 at 6:37 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Mon, Jul 27, 2015 at 6:26 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>>> On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
>>>>> >>
>>>>> >> There is a potential performance issue.  This won't change parameters
>>>>> >> passed in S256-bit/512-bit vector registers because SSE load will only
>>>>> >> update the lower 128 bits of 256-bit/512-bit vector registers while
>>>>> >> preserving the upper bits.  But these SSE load operations may not be
>>>>> >> fast on all current and future processors.  To load the entire
>>>>> >> 256-bit/512-bit vector registers, we need to check CPU feature in
>>>>> >> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
>>>>> >> with -msse2.  I don't know what the final performance impact is.
>>>>> >>
>>>>> > Yes, these should be saved due problems with modes. There could be
>>>>> > problem that saving these takes longer. You don't need
>>>>> > check cpu features on each call.
>>>>> > Make _dl_runtime_resolve a function pointer and on
>>>>> > startup initialize it to correct variant.
>>>>>
>>>>> One more indirect call.
>>>>>
>>>> no, my proposal is different, we could do this:
>>>>
>>>> void *_dl_runtime_resolve;
>>>> int startup()
>>>> {
>>>>   if (has_avx())
>>>>     _dl_runtime_resolve = _dl_runtime_resolve_avx;
>>>>   else
>>>>     _dl_runtime_resolve = _dl_runtime_resolve_sse;
>>>> }
>>>>
>>>> Then we will assign correct variant.
>>>
>>> Yes, this may work for both _dl_runtime_profile and
>>>  _dl_runtime_resolve.  I will see what I can do.
>>>
>>
>> Please try hjl/pr18661 branch.  I implemented:
>>
>> 0000000000016fd0 t _dl_runtime_profile_avx
>> 0000000000016b50 t _dl_runtime_profile_avx512
>> 0000000000017450 t _dl_runtime_profile_sse
>> 00000000000168d0 t _dl_runtime_resolve_avx
>> 0000000000016780 t _dl_runtime_resolve_avx512
>> 0000000000016a20 t _dl_runtime_resolve_sse
>
> I enabled SSE in ld.so and it works fine.
>

I fully enabled SSE in ld.so on the hjl/pr18661 branch.  I didn't notice
any issues on either AVX or non-AVX machines.  There may be some
performance improvement, but it is hard to tell.
  

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 8e615e5..9e82e43 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -30,7 +30,7 @@  bench-pthread := pthread_once
 bench := $(bench-math) $(bench-pthread)
 
 # String function benchmarks.
-string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
+string-bench := dlopen bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
 		mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
 		strcat strchr strchrnul strcmp strcpy strcspn strlen \
 		strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
@@ -57,6 +57,7 @@  CFLAGS-bench-ffsll.c += -fno-builtin
 
 bench-malloc := malloc-thread
 
+$(objpfx)bench-dlopen: -ldl
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
 $(objpfx)bench-malloc-thread: $(shared-thread-library)
diff --git a/benchtests/bench-dlopen.c b/benchtests/bench-dlopen.c
new file mode 100644
index 0000000..b47d18b
--- /dev/null
+++ b/benchtests/bench-dlopen.c
@@ -0,0 +1,47 @@ 
+/* Measure dlsym lookup time.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+#include <dlfcn.h>
+#include <string.h>
+#include "bench-timing.h"
+
+int
+main (void)
+{
+  long ret = 0;
+
+  size_t i, j, iters = 100;
+  timing_t start, stop, cur;
+  for (j = 0; j < 10; j++)
+    {
+      TIMING_NOW (start);
+      for (i = 0; i < iters; ++i)
+        {
+          ret += (long) dlsym (RTLD_DEFAULT, "strlen");
+          ret += (long) dlsym (RTLD_DEFAULT, "memcpy");
+        }
+      TIMING_NOW (stop);
+
+      TIMING_DIFF (cur, start, stop);
+
+      TIMING_PRINT_MEAN ((double) cur, (double) iters);
+    }
+
+  return ret;
+}
+
+