RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]

  On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 10/19/2017 10:41 AM, H.J. Lu wrote:
>> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
>> mask and bound registers.  It simplifies _dl_runtime_resolve and supports
>> different calling conventions.  ld.so code size is reduced by more than
>> 1 KB.  However, use fxsave/xsave/xsavec takes a little bit more cycles
>> than saving and restoring vector and bound registers individually.
>>
>> Latency for _dl_runtime_resolve to lookup the function, foo, from one
>> shared library plus libc.so:
>>
>>                              Before    After     Change
>>
>> Westmere (SSE)/fxsave         345      866       151%
>> IvyBridge (AVX)/xsave         420      643       53%
>> Haswell (AVX)/xsave           713      1252      75%
>> Skylake (AVX+MPX)/xsavec      559      719       28%
>> Skylake (AVX512+MPX)/xsavec   145      272       87%
>
> This is a good baseline, but as you note, the change may not be observable
> in any real world programs.
>
> The case I made to David Kreitzer here:
> https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
> ~~~
>   ... Alternatively a more detailed performance analysis of
>   the impact on applications that don't use __regcall is required before adding
>   instructions to the hot path of the average application (or removing their use
>   in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
>   on hardware that supports those vector registers).
> ~~~
>
>> This is the worst case where portion of time spent for saving and
>> restoring registers is bigger than majority of cases.  With smaller
>> _dl_runtime_resolve code size, overall performance impact is negligible.
>>
>> On IvyBridge, differences in build and test time of binutils with lazy
>> binding GCC and binutils are noises.  On Westmere, differences in
>> bootstrap and "makc check" time of GCC 7 with lazy binding GCC and
>> binutils are also noises.
> Do you have any statistics on the timing for large applications that
> use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
> of the complexity of shared libraries in terms of loaded shared libraries.

_dl_runtime_resolve is only called once when an external function is
called the first time.  Many shared libraries isn't a problem unless
all execution
time is spent in _dl_runtime_resolve.  I don't believe this is a
typical behavior.

> Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
> 103 DSOs. It might be hard to measure if the lazy resolution is impacting
> the performance or if you are hitting some other performance boundary, but
> a black-box test showing performance didn't get *worse* for startup and
> exit, would mean it isn't the bottlneck (but might be some day). To test
> this you should be able to use libreoffice's CLI arguments to batch process
> some files and time that (or the --cat files option).

My machines run Fedora 26, which default to DT_BIND_NOW.  Both
libreoffice and chrom are marked with:

 0x000000000000001e (FLAGS)              BIND_NOW

_dl_runtime_resolve isn't used.

> If we can show that the above latency is in the noise for real applications
> using many DSOs, then it makes your case better for supporting the alternate
> calling conventions.
>

Here is the updated patch which updates xsave state size for

GLIBC_TUNABLES=glibc.tune.hwcaps=-XSAVEC_Usable

RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]

Commit Message

Comments

Patch