[VERY,RFC,2/2] hurd: Make it possible to call memcpy very early

Message ID 20230420184220.300862-2-bugaevc@gmail.com
State Superseded, archived
Headers
Series [1/2] hurd: Don't migrate reply port into __init1_tcbhead |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Sergey Bugaev April 20, 2023, 6:42 p.m. UTC
  Normally, in static builds, the first code that runs is _start, in e.g.
sysdeps/x86_64/start.S, which quickly calls __libc_start_main, passing
it the argv etc. Among the first things __libc_start_main does is
initializing the tunables (based on env), then CPU features, and then
calls _dl_relocate_static_pie (). Specifically, this runs ifunc
resolvers to pick, based on the CPU features discovered earlier, the
most suitable implementation of "string" functions such as memcpy.

Before that point, calling memcpy (or other ifunc-resolved functions)
will not work.

In the Hurd port, things are more complex. In order to get argv/env for
our process, glibc normally needs to do an RPC to the exec server,
unless our args/env are already located on the stack (which is what
happens to bootstrap processes spawned by GNU Mach). Fetching our
argv/env from the exec server has to be done before the call to
__libc_start_main, since we need to know what our argv/env are to pass
them to __libc_start_main.

On the other hand, the implementation of the RPC (and other initial
setup needed on the Hurd before __libc_start_main can be run) is not
very trivial. In particular, it may (and on x86_64, will) use memcpy.
But as described above, calling memcpy before __libc_start_main can not
work, since the GOT entry for it is not yet initialized at that point.

Work around this by pre-filling the GOT entry with the baseline version
of memcpy, __memcpy_sse2_unaligned. This makes it possible for early
calls to memcpy to just work. Once _dl_relocate_static_pie () is called,
the baseline version will get replaced with the most suitable one, and
that's what subsequent calls of memcpy are going to call.

Also, apply the same treatment to __stpncpy, which can also be used by
the RPCs (see mig_strncpy.c), and is an ifunc-resolved function on both
x86_64 and i386.

Tested on x86_64-gnu (!).

Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
---

Please tell me:

* if the approach is at all sane
* if there's a better way to do this without hardcoding
  "__memcpy_sse2_unaligned"
* are the GOT entries for indirect functions supposed to be statically
  initialized to anything (in the binary)? if yes, why? if not, why is
  PROGBITS and not NOBITS?
* am I doing all this _GLOBAL_OFFSET_TABLE_, @GOT, @GOTOFF, @GOTPCREL
  correctly?
* should there be a !PIC version as well? does the GOT exist under
  !PIC (to access indirect functions), and if it does then how do I
  access it? it would seem gcc just generates a direct $function even
  for indirect functions in this case.

 sysdeps/mach/hurd/i386/static-start.S   | 7 +++++++
 sysdeps/mach/hurd/x86_64/static-start.S | 8 ++++++++
 2 files changed, 15 insertions(+)
  

Comments

H.J. Lu April 20, 2023, 8:25 p.m. UTC | #1
On Thu, Apr 20, 2023 at 11:43 AM Sergey Bugaev <bugaevc@gmail.com> wrote:
>
> Normally, in static builds, the first code that runs is _start, in e.g.
> sysdeps/x86_64/start.S, which quickly calls __libc_start_main, passing
> it the argv etc. Among the first things __libc_start_main does is
> initializing the tunables (based on env), then CPU features, and then
> calls _dl_relocate_static_pie (). Specifically, this runs ifunc
> resolvers to pick, based on the CPU features discovered earlier, the
> most suitable implementation of "string" functions such as memcpy.
>
> Before that point, calling memcpy (or other ifunc-resolved functions)
> will not work.
>
> In the Hurd port, things are more complex. In order to get argv/env for
> our process, glibc normally needs to do an RPC to the exec server,
> unless our args/env are already located on the stack (which is what
> happens to bootstrap processes spawned by GNU Mach). Fetching our
> argv/env from the exec server has to be done before the call to
> __libc_start_main, since we need to know what our argv/env are to pass
> them to __libc_start_main.
>
> On the other hand, the implementation of the RPC (and other initial
> setup needed on the Hurd before __libc_start_main can be run) is not
> very trivial. In particular, it may (and on x86_64, will) use memcpy.
> But as described above, calling memcpy before __libc_start_main can not
> work, since the GOT entry for it is not yet initialized at that point.
>
> Work around this by pre-filling the GOT entry with the baseline version
> of memcpy, __memcpy_sse2_unaligned. This makes it possible for early
> calls to memcpy to just work. Once _dl_relocate_static_pie () is called,
> the baseline version will get replaced with the most suitable one, and
> that's what subsequent calls of memcpy are going to call.
>
> Also, apply the same treatment to __stpncpy, which can also be used by
> the RPCs (see mig_strncpy.c), and is an ifunc-resolved function on both
> x86_64 and i386.
>
> Tested on x86_64-gnu (!).
>
> Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
> ---
>
> Please tell me:
>
> * if the approach is at all sane
> * if there's a better way to do this without hardcoding
>   "__memcpy_sse2_unaligned"
> * are the GOT entries for indirect functions supposed to be statically
>   initialized to anything (in the binary)? if yes, why? if not, why is
>   PROGBITS and not NOBITS?
> * am I doing all this _GLOBAL_OFFSET_TABLE_, @GOT, @GOTOFF, @GOTPCREL
>   correctly?
> * should there be a !PIC version as well? does the GOT exist under
>   !PIC (to access indirect functions), and if it does then how do I
>   access it? it would seem gcc just generates a direct $function even
>   for indirect functions in this case.
>
>  sysdeps/mach/hurd/i386/static-start.S   | 7 +++++++
>  sysdeps/mach/hurd/x86_64/static-start.S | 8 ++++++++
>  2 files changed, 15 insertions(+)
>
> diff --git a/sysdeps/mach/hurd/i386/static-start.S b/sysdeps/mach/hurd/i386/static-start.S
> index c5d12645..1b1ae559 100644
> --- a/sysdeps/mach/hurd/i386/static-start.S
> +++ b/sysdeps/mach/hurd/i386/static-start.S
> @@ -19,6 +19,13 @@
>         .text
>         .globl _start
>  _start:
> +#ifdef PIC
> +       call __x86.get_pc_thunk.bx
> +       addl $_GLOBAL_OFFSET_TABLE_, %ebx
> +       leal __stpncpy_ia32@GOTOFF(%ebx), %eax
> +       movl %eax, __stpncpy@GOT(%ebx)
> +#endif
> +
>         call _hurd_stack_setup
>         xorl %edx, %edx
>         jmp _start1
> diff --git a/sysdeps/mach/hurd/x86_64/static-start.S b/sysdeps/mach/hurd/x86_64/static-start.S
> index 982d3d52..81b3c0ac 100644
> --- a/sysdeps/mach/hurd/x86_64/static-start.S
> +++ b/sysdeps/mach/hurd/x86_64/static-start.S
> @@ -19,6 +19,14 @@
>         .text
>         .globl _start
>  _start:
> +
> +#ifdef PIC
> +       leaq __memcpy_sse2_unaligned(%rip), %rax
> +       movq %rax, memcpy@GOTPCREL(%rip)
> +       leaq __stpncpy_sse2_unaligned(%rip), %rax
> +       movq %rax, __stpncpy@GOTPCREL(%rip)
> +#endif
> +
>         call _hurd_stack_setup
>         xorq %rdx, %rdx
>         jmp _start1
> --
> 2.40.0
>

Doesn't it disable IFUNC for memcpy and stpncpy?
  
Adhemerval Zanella April 20, 2023, 8:38 p.m. UTC | #2
On 20/04/23 17:25, H.J. Lu via Libc-alpha wrote:
> On Thu, Apr 20, 2023 at 11:43 AM Sergey Bugaev <bugaevc@gmail.com> wrote:
>>
>> Normally, in static builds, the first code that runs is _start, in e.g.
>> sysdeps/x86_64/start.S, which quickly calls __libc_start_main, passing
>> it the argv etc. Among the first things __libc_start_main does is
>> initializing the tunables (based on env), then CPU features, and then
>> calls _dl_relocate_static_pie (). Specifically, this runs ifunc
>> resolvers to pick, based on the CPU features discovered earlier, the
>> most suitable implementation of "string" functions such as memcpy.
>>
>> Before that point, calling memcpy (or other ifunc-resolved functions)
>> will not work.
>>
>> In the Hurd port, things are more complex. In order to get argv/env for
>> our process, glibc normally needs to do an RPC to the exec server,
>> unless our args/env are already located on the stack (which is what
>> happens to bootstrap processes spawned by GNU Mach). Fetching our
>> argv/env from the exec server has to be done before the call to
>> __libc_start_main, since we need to know what our argv/env are to pass
>> them to __libc_start_main.
>>
>> On the other hand, the implementation of the RPC (and other initial
>> setup needed on the Hurd before __libc_start_main can be run) is not
>> very trivial. In particular, it may (and on x86_64, will) use memcpy.
>> But as described above, calling memcpy before __libc_start_main can not
>> work, since the GOT entry for it is not yet initialized at that point.
>>
>> Work around this by pre-filling the GOT entry with the baseline version
>> of memcpy, __memcpy_sse2_unaligned. This makes it possible for early
>> calls to memcpy to just work. Once _dl_relocate_static_pie () is called,
>> the baseline version will get replaced with the most suitable one, and
>> that's what subsequent calls of memcpy are going to call.
>>
>> Also, apply the same treatment to __stpncpy, which can also be used by
>> the RPCs (see mig_strncpy.c), and is an ifunc-resolved function on both
>> x86_64 and i386.
>>
>> Tested on x86_64-gnu (!).
>>
>> Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
>> ---
>>
>> Please tell me:
>>
>> * if the approach is at all sane
>> * if there's a better way to do this without hardcoding
>>   "__memcpy_sse2_unaligned"
>> * are the GOT entries for indirect functions supposed to be statically
>>   initialized to anything (in the binary)? if yes, why? if not, why is
>>   PROGBITS and not NOBITS?
>> * am I doing all this _GLOBAL_OFFSET_TABLE_, @GOT, @GOTOFF, @GOTPCREL
>>   correctly?
>> * should there be a !PIC version as well? does the GOT exist under
>>   !PIC (to access indirect functions), and if it does then how do I
>>   access it? it would seem gcc just generates a direct $function even
>>   for indirect functions in this case.
>>
>>  sysdeps/mach/hurd/i386/static-start.S   | 7 +++++++
>>  sysdeps/mach/hurd/x86_64/static-start.S | 8 ++++++++
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/sysdeps/mach/hurd/i386/static-start.S b/sysdeps/mach/hurd/i386/static-start.S
>> index c5d12645..1b1ae559 100644
>> --- a/sysdeps/mach/hurd/i386/static-start.S
>> +++ b/sysdeps/mach/hurd/i386/static-start.S
>> @@ -19,6 +19,13 @@
>>         .text
>>         .globl _start
>>  _start:
>> +#ifdef PIC
>> +       call __x86.get_pc_thunk.bx
>> +       addl $_GLOBAL_OFFSET_TABLE_, %ebx
>> +       leal __stpncpy_ia32@GOTOFF(%ebx), %eax
>> +       movl %eax, __stpncpy@GOT(%ebx)
>> +#endif
>> +
>>         call _hurd_stack_setup
>>         xorl %edx, %edx
>>         jmp _start1
>> diff --git a/sysdeps/mach/hurd/x86_64/static-start.S b/sysdeps/mach/hurd/x86_64/static-start.S
>> index 982d3d52..81b3c0ac 100644
>> --- a/sysdeps/mach/hurd/x86_64/static-start.S
>> +++ b/sysdeps/mach/hurd/x86_64/static-start.S
>> @@ -19,6 +19,14 @@
>>         .text
>>         .globl _start
>>  _start:
>> +
>> +#ifdef PIC
>> +       leaq __memcpy_sse2_unaligned(%rip), %rax
>> +       movq %rax, memcpy@GOTPCREL(%rip)
>> +       leaq __stpncpy_sse2_unaligned(%rip), %rax
>> +       movq %rax, __stpncpy@GOTPCREL(%rip)
>> +#endif
>> +
>>         call _hurd_stack_setup
>>         xorq %rdx, %rdx
>>         jmp _start1
>> --
>> 2.40.0
>>
> 
> Doesn't it disable IFUNC for memcpy and stpncpy?
> 


Can't you use a similar strategy done by 5355f9ca7b10183ce06e8a18003ba30f43774858 ?
  
Sergey Bugaev April 21, 2023, 9:27 a.m. UTC | #3
Hello,

On Thu, Apr 20, 2023 at 11:38 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
> Can't you use a similar strategy done by 5355f9ca7b10183ce06e8a18003ba30f43774858 ?

Do I understand it right that that is the moral equivalent of

#define memcpy __memcpy_sse2_unaligned

except that it works aat assembly level and so will catch implicit
calls to memcpy that the compiler may insert?

Assuming for a second that we don't care about implicit memcpys (but
we should) and only about the explicit one made by the MIG runtime, we
could maybe even just do

 void *
 __mig_memcpy (void *dst, const void *src, vm_size_t len)
 {
-  return memcpy (dst, src, len);
+  return __memcpy_sse2_unaligned (dst, src, len);
 }

but that (as well as your proposal) would make *all* calls to memcpy
in this place go through the baseline version, even after the early
startup is done. Whereas my proposal attempted to avoid that -- unless
of course H.J. is right and this prevents the indirect relocation from
replacing the function pointer later, in which case it's even worse
because it would sabotage memcpy for the whole program and not just a
couple of files.

Sergey
  
Adhemerval Zanella April 21, 2023, 11:50 a.m. UTC | #4
On 21/04/23 06:27, Sergey Bugaev wrote:
> Hello,
> 
> On Thu, Apr 20, 2023 at 11:38 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>> Can't you use a similar strategy done by 5355f9ca7b10183ce06e8a18003ba30f43774858 ?
> 
> Do I understand it right that that is the moral equivalent of
> 
> #define memcpy __memcpy_sse2_unaligned
> 
> except that it works aat assembly level and so will catch implicit
> calls to memcpy that the compiler may insert?

Yes, that's the idea.

> 
> Assuming for a second that we don't care about implicit memcpys (but
> we should) and only about the explicit one made by the MIG runtime, we
> could maybe even just do
> 
>  void *
>  __mig_memcpy (void *dst, const void *src, vm_size_t len)
>  {
> -  return memcpy (dst, src, len);
> +  return __memcpy_sse2_unaligned (dst, src, len);
>  }
> 
> but that (as well as your proposal) would make *all* calls to memcpy
> in this place go through the baseline version, even after the early
> startup is done. Whereas my proposal attempted to avoid that -- unless
> of course H.J. is right and this prevents the indirect relocation from
> replacing the function pointer later, in which case it's even worse
> because it would sabotage memcpy for the whole program and not just a
> couple of files.

It might work if you don't care about a different architecture than x86,
and that's why I added the alias (so each architecture is free to name
its ifunc variant).  And the patch was exactly to handle the implicit
created mem* call from compiler, so I think you should take it in
consideration.

And I trying to make reason why you need __mig_memcpy indirection for
MIG.
  
Sergey Bugaev April 21, 2023, 11:51 a.m. UTC | #5
Hello,

On Thu, Apr 20, 2023 at 11:26 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> Doesn't it disable IFUNC for memcpy and stpncpy?

I was hoping you'd tell me whether it does :|

I *think* on i386 it does indeed (so I'd need to rework that part of
the patch), but not on x86_64. This is based on the following
observations:

1. in glibc source, sysdeps/i386/dl-irel.h does this:

static inline void
__attribute ((always_inline))
elf_irel (const Elf32_Rel *reloc)
{
  Elf32_Addr *const reloc_addr = (void *) reloc->r_offset;
  const unsigned long int r_type = ELF32_R_TYPE (reloc->r_info);

  if (__glibc_likely (r_type == R_386_IRELATIVE))
    {
      Elf32_Addr value = elf_ifunc_invoke(*reloc_addr);
      *reloc_addr = value;
    }
  else
    __libc_fatal ("Unexpected reloc type in static binary.\n");
}

i.e. reading the ifunc ptr from the relocated address itself, whereas
on x86_64 it's:

static inline void
__attribute ((always_inline))
elf_irela (const ElfW(Rela) *reloc)
{
  ElfW(Addr) *const reloc_addr = (void *) reloc->r_offset;
  const unsigned long int r_type = ELFW(R_TYPE) (reloc->r_info);

  if (__glibc_likely (r_type == R_X86_64_IRELATIVE))
    {
      ElfW(Addr) value = elf_ifunc_invoke(reloc->r_addend);
      *reloc_addr = value;
    }
  else
    __libc_fatal ("Unexpected reloc type in static binary.\n");
}

i.e. the ifunc resolver is stored in the addend, and the initial value
of *reloc_addr is ignored.

Checking arm and aarch64, I see that arm uses *reloc_addr like i386,
and aarch64 uses r_addend like x86_64. But (unlike i386 and like
x86_64) arm also has an ifunc relocation for memcpy, so (if someone
was to work on a arm-gnu port) we would still have the same issue
there, and this approach wouldn't work -- but see below.

2. When dumping relocations with readelf --wide --relocs, for the
   x86_64-gnu build I see the addends vary, but for i386-gnu they're
   just empty. That means readelf considers R_X86_64_IRELATIVE to
   be a rela, and R_386_IRELATIVE to be a rel.

3. When looking at the initial values of the GOT entries, on i386
   they do point to ifunc resolvers; on x86_64 they don't seem to be.

4. I've now tried asking qemu for a better CPU, and sure enough, I
   get the GOT entry pointing to __memcpy_avx_unaligned_erms.

Here's a little demo:

(gdb) bt
#0  __memcpy_avx_unaligned_erms () at
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:256
#1  0x00000000004b48d8 in __device_write_inband
(device=device@entry=6, mode=mode@entry=0, recnum=recnum@entry=0,
data=data@entry=0x4db0f2 <crlf> "\r\n",
    dataCnt=dataCnt@entry=2,
bytes_written=bytes_written@entry=0xbffffcdc) at
/home/sergey/dev/crosshurd64/src/glibc/build/mach/RPC_device_write_inband.c:158
#2  0x0000000000400d6b in write_some (to_write=2, p=0x4db0f2 <crlf>
"\r\n") at devstream.c:45
#3  write_crlf () at devstream.c:58
#4  devstream_write (cookie=<optimized out>, buffer=0x20000d30 "Well
hello friends!\n", n=20) at devstream.c:70
#5  0x0000000000424841 in _IO_cookie_write (fp=0x20000c10,
buf=<optimized out>, size=20) at iofopncook.c:59
#6  0x0000000000425234 in new_do_write (fp=0x20000c10, data=0x20000d30
"Well hello friends!\n", to_do=to_do@entry=20)
    at /home/sergey/dev/crosshurd64/src/glibc/libio/libioP.h:1031
#7  0x0000000000425959 in _IO_new_do_write (fp=<optimized out>,
data=<optimized out>, to_do=20) at fileops.c:425
#8  0x00000000004266e0 in _IO_new_file_sync (fp=0x20000c10) at fileops.c:798
#9  0x0000000000424542 in _IO_fflush (fp=0x20000c10) at
/home/sergey/dev/crosshurd64/src/glibc/libio/libioP.h:1031
#10 0x0000000000400bea in main (argc=2, argv=0xbfffffa8) at
/home/sergey/dev/mach-bootstrap-hello.c:69
#11 0x000000000040de73 in __libc_start_call_main (argv=0xbfffffa8,
argc=2, main=0x400ad2 <main>) at
../sysdeps/generic/libc_start_call_main.h:23
#12 __libc_start_main_impl (main=<optimized out>, argc=2,
argv=0xbfffffa8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>,
    stack_end=<optimized out>) at ../csu/libc-start.c:360
#13 0x0000000000400961 in _start1 () at ../sysdeps/x86_64/start.S:115

Actually maybe we could make this hack work for the architectures that
do use *reloc_addr too: instead of just rewriting the GOT entry and
leaving it at that, we'd restore the original pointer (i.e.
__memcpy_ifunc) after we've done the early Hurd-specific setup, right
before jumping to _start1.

Maybe this awesome/horrible trick could be even used to enable
ifunc-selected memcpy for i386 -- and not only on the Hurd, but for
i386-linux-gnu as well? To reiterate: set the GOT entry to a
known-good baseline version very early, then call memcpy the usual way
all you like, then before doing _dl_relocate_static_pie reset the GOT
entry back to the ifunc resolver.

Please tell me if what I'm saying makes sense, I may sound confident,
but I'm really not. This really needs someone way more experienced
than me to look into it and judge.

Sergey
  
Sergey Bugaev April 21, 2023, 12:38 p.m. UTC | #6
On Fri, Apr 21, 2023 at 2:50 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
> It might work if you don't care about a different architecture than x86,
> and that's why I added the alias (so each architecture is free to name
> its ifunc variant).  And the patch was exactly to handle the implicit
> created mem* call from compiler, so I think you should take it in
> consideration.

Yes, sure, I wasn't really suggesting we do that change. My point is,
I would like to make the same memcpy callsites both work during early
startup and start calling the more efficient implementation once early
startup is done -- if that's possible.

> And I trying to make reason why you need __mig_memcpy indirection for
> MIG.

See commit 56010b73e81e2cb1082e418699f98353598fe671 (as pointed out by
Samuel yesterday), which references [0]. But that's only relevant for
SHARED; for static we could avoid that redirection and call memcpy
directly indeed.

[0]: https://sourceware.org/pipermail/libc-alpha/2020-November/119575.html

Sergey
  
Adhemerval Zanella April 21, 2023, 12:58 p.m. UTC | #7
On 21/04/23 09:38, Sergey Bugaev wrote:
> On Fri, Apr 21, 2023 at 2:50 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>> It might work if you don't care about a different architecture than x86,
>> and that's why I added the alias (so each architecture is free to name
>> its ifunc variant).  And the patch was exactly to handle the implicit
>> created mem* call from compiler, so I think you should take it in
>> consideration.
> 
> Yes, sure, I wasn't really suggesting we do that change. My point is,
> I would like to make the same memcpy callsites both work during early
> startup and start calling the more efficient implementation once early
> startup is done -- if that's possible.

That's the whole idea of dl-symbol-redir-ifunc.h, since it is explicit
enable by TU.

> 
>> And I trying to make reason why you need __mig_memcpy indirection for
>> MIG.
> 
> See commit 56010b73e81e2cb1082e418699f98353598fe671 (as pointed out by
> Samuel yesterday), which references [0]. But that's only relevant for
> SHARED; for static we could avoid that redirection and call memcpy
> directly indeed.
> 
> [0]: https://sourceware.org/pipermail/libc-alpha/2020-November/119575.html

This seems essentially the very issue dl-symbol-redir-ifunc.h aims to fix.
  
Sergey Bugaev April 21, 2023, 1:56 p.m. UTC | #8
On Fri, Apr 21, 2023 at 3:58 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
> > Yes, sure, I wasn't really suggesting we do that change. My point is,
> > I would like to make the same memcpy callsites both work during early
> > startup and start calling the more efficient implementation once early
> > startup is done -- if that's possible.
>
> That's the whole idea of dl-symbol-redir-ifunc.h, since it is explicit
> enable by TU.

Then I must be misunderstanding how dl-symbol-redir-ifunc.h works
(please explain!),

because I don't see how it would achieve that. Doesn't it change all
memcpy calls in the current translation unit (.c file) to always call
the baseline memcpy -- not only during startup, but even after the
startup has been completed?

To be clear, this is about memcpy calls made *during* very early
startup, yes, but not *in* some special startup-only code (like
init-first.c and hurdstartup.c for instance): that startup-specific
code calls into the fairly generic code (mig_memcpy.c and the various
MIG-generated RPC routines) that is used both during this very early
startup and then much later during the normal runtime, and *that's*
what calls memcpy. I'd like for memcpy calls in that generic code to
both work during early startup and use the efficient memcpy
implementation later. If dl-symbol-redir-ifunc.h can achieve that,
please help me understand how.

Sergey
  
Adhemerval Zanella April 21, 2023, 2:28 p.m. UTC | #9
On 21/04/23 10:56, Sergey Bugaev wrote:
> On Fri, Apr 21, 2023 at 3:58 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>>> Yes, sure, I wasn't really suggesting we do that change. My point is,
>>> I would like to make the same memcpy callsites both work during early
>>> startup and start calling the more efficient implementation once early
>>> startup is done -- if that's possible.
>>
>> That's the whole idea of dl-symbol-redir-ifunc.h, since it is explicit
>> enable by TU.
> 
> Then I must be misunderstanding how dl-symbol-redir-ifunc.h works
> (please explain!),
> 
> because I don't see how it would achieve that. Doesn't it change all
> memcpy calls in the current translation unit (.c file) to always call
> the baseline memcpy -- not only during startup, but even after the
> startup has been completed?

Indeed dl-symbol-redir-ifunc.h does not allow redirect the symbol in 
this dynamic way. But you can't really accomplish it without either 
hacking the ABI or using an override scheme, as we do for loader
malloc (elf/dl-minimal.c) where the minimal malloc is replaced on
final relocation phase.

The memcpy/memset/memmove is tricky because since compiler is free to
auto-generate calls to it we need to redirect it on a more direct way
(as explicit inline asm alias).

> 
> To be clear, this is about memcpy calls made *during* very early
> startup, yes, but not *in* some special startup-only code (like
> init-first.c and hurdstartup.c for instance): that startup-specific
> code calls into the fairly generic code (mig_memcpy.c and the various
> MIG-generated RPC routines) that is used both during this very early
> startup and then much later during the normal runtime, and *that's*
> what calls memcpy. I'd like for memcpy calls in that generic code to
> both work during early startup and use the efficient memcpy
> implementation later. If dl-symbol-redir-ifunc.h can achieve that,
> please help me understand how.

For Linux, the redirection is only used on very specific TU that can't 
really work with ifunc, and it is used only during static process 
initialization (like dl-support.c).

The 56010b73e81e2 commit message describes what seems to be a really
wacky scenario: lib{mach,hurd}user.so is relocated *before* libc.so; 
but they still call the libc.so symbols (because which seems to be
the compiler generated memcpy call).  The __mig_memcpy solution 
might work, but it is *really* hacky; but at same time I don't really
have a solution for this Hurd mess...
  
Samuel Thibault April 22, 2023, 11:59 a.m. UTC | #10
Sergey Bugaev, le ven. 21 avril 2023 16:56:57 +0300, a ecrit:
> On Fri, Apr 21, 2023 at 3:58 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> > > Yes, sure, I wasn't really suggesting we do that change. My point is,
> > > I would like to make the same memcpy callsites both work during early
> > > startup and start calling the more efficient implementation once early
> > > startup is done -- if that's possible.
> >
> > That's the whole idea of dl-symbol-redir-ifunc.h, since it is explicit
> > enable by TU.
> 
> Then I must be misunderstanding how dl-symbol-redir-ifunc.h works
> (please explain!),
> 
> because I don't see how it would achieve that. Doesn't it change all
> memcpy calls in the current translation unit (.c file) to always call
> the baseline memcpy -- not only during startup, but even after the
> startup has been completed?

Perhaps we can generate "init" variants of the RPCs that we need at
initialization, that are made to use the base memcpy implementation, and
only use them in the init code?

Samuel
  
Sergey Bugaev April 22, 2023, 12:18 p.m. UTC | #11
On Sat, Apr 22, 2023 at 2:59 PM Samuel Thibault <samuel.thibault@gnu.org> wrote:
> Perhaps we can generate "init" variants of the RPCs that we need at
> initialization, that are made to use the base memcpy implementation, and
> only use them in the init code?

If we undo 56010b73e81e2cb1082e418699f98353598fe671 for static builds
(basically stick && defined (SHARED) in there) and apply Adhemerval's
suggestion, we might even get away without compiling separate init
versions. We should only need to de-ifunc-ify init-first.c,
thread_set_state, hurdstartup.c, and exec_startup_get_info. If they
all go through __mig_memcpy which is in a separate translation unit,
we'd have to make __mig_memcpy slow for everyone; but if they call
memcpy directly, we can selectively make only those ones use the slow
memcpy.

What do you think?

Sergey
  
Samuel Thibault April 22, 2023, 4:20 p.m. UTC | #12
Sergey Bugaev, le sam. 22 avril 2023 15:18:08 +0300, a ecrit:
> We should only need to de-ifunc-ify init-first.c,
> thread_set_state, hurdstartup.c, and exec_startup_get_info. If they
> all go through __mig_memcpy which is in a separate translation unit,
> we'd have to make __mig_memcpy slow for everyone; but if they call
> memcpy directly, we can selectively make only those ones use the slow
> memcpy.

But what about the memcpy calls from the RPC stubs used by the init
code?

Samuel
  
Sergey Bugaev April 22, 2023, 5:04 p.m. UTC | #13
On Sat, Apr 22, 2023 at 7:20 PM Samuel Thibault <samuel.thibault@gnu.org> wrote:
> But what about the memcpy calls from the RPC stubs used by the init
> code?

Ah, I see that we do actually call more than I thought before _start1.
In particular, we initialize the dtable (so the ctty stuff...) and get
our pid/pgrp from the proc server. We should probably do this later,
at time __libc_init_first probably. We want to do as little as
possible before _start1; we must get our argv and env, so that must
happen before _start1, and to do that we need to set up
__init_tcbhead, but the rest we can do later.

Guess I still like my GOT prefilling approach better then. It's not
that big of a hack, it's just two lines of assembly (we can drop
stpcpy, it's not used during early startup) and it makes memcpy just
work from anywhere (any RPCs, any other code) without any additional
trickery at use sites or in the build system. And all the same code
gets automatically upgraded to the better memcpy when it's available.
We know it works in practice and doesn't prevent ifunc selectors from
doing their thing later. What's not to like?

Sergey
  

Patch

diff --git a/sysdeps/mach/hurd/i386/static-start.S b/sysdeps/mach/hurd/i386/static-start.S
index c5d12645..1b1ae559 100644
--- a/sysdeps/mach/hurd/i386/static-start.S
+++ b/sysdeps/mach/hurd/i386/static-start.S
@@ -19,6 +19,13 @@ 
 	.text
 	.globl _start
 _start:
+#ifdef PIC
+	call __x86.get_pc_thunk.bx
+	addl $_GLOBAL_OFFSET_TABLE_, %ebx
+	leal __stpncpy_ia32@GOTOFF(%ebx), %eax
+	movl %eax, __stpncpy@GOT(%ebx)
+#endif
+
 	call _hurd_stack_setup
 	xorl %edx, %edx
 	jmp _start1
diff --git a/sysdeps/mach/hurd/x86_64/static-start.S b/sysdeps/mach/hurd/x86_64/static-start.S
index 982d3d52..81b3c0ac 100644
--- a/sysdeps/mach/hurd/x86_64/static-start.S
+++ b/sysdeps/mach/hurd/x86_64/static-start.S
@@ -19,6 +19,14 @@ 
 	.text
 	.globl _start
 _start:
+
+#ifdef PIC
+	leaq __memcpy_sse2_unaligned(%rip), %rax
+	movq %rax, memcpy@GOTPCREL(%rip)
+	leaq __stpncpy_sse2_unaligned(%rip), %rax
+	movq %rax, __stpncpy@GOTPCREL(%rip)
+#endif
+
 	call _hurd_stack_setup
 	xorq %rdx, %rdx
 	jmp _start1