[v6,5/5] riscv: Add and use alignment-ignorant memcpy

Message ID 20230802155903.2552780-6-evan@rivosinc.com
State Superseded
Headers
Series RISC-V: ifunced memcpy using new kernel hwprobe interface |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed

Commit Message

Evan Green Aug. 2, 2023, 3:59 p.m. UTC
  For CPU implementations that can perform unaligned accesses with little
or no performance penalty, create a memcpy implementation that does not
bother aligning buffers. It will use a block of integer registers, a
single integer register, and fall back to bytewise copy for the
remainder.

Signed-off-by: Evan Green <evan@rivosinc.com>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>

---

Changes in v6:
 - Fix a couple regressions in the assembly from v5 :/
 - Use passed hwprobe pointer in memcpy ifunc selector.

Changes in v5:
 - Do unaligned word access for final trailing bytes (Richard)

Changes in v4:
 - Fixed comment style (Florian)

Changes in v3:
 - Word align dest for large memcpy()s.
 - Add tags
 - Remove spurious blank line from sysdeps/riscv/memcpy.c

Changes in v2:
 - Used _MASK instead of _FAST value itself.


---
 sysdeps/riscv/memcopy.h                       |  26 ++++
 sysdeps/riscv/memcpy.c                        |  66 +++++++++
 sysdeps/riscv/memcpy_noalignment.S            | 138 ++++++++++++++++++
 sysdeps/unix/sysv/linux/riscv/Makefile        |   4 +
 .../unix/sysv/linux/riscv/memcpy-generic.c    |  24 +++
 5 files changed, 258 insertions(+)
 create mode 100644 sysdeps/riscv/memcopy.h
 create mode 100644 sysdeps/riscv/memcpy.c
 create mode 100644 sysdeps/riscv/memcpy_noalignment.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
  

Comments

Florian Weimer Aug. 3, 2023, 7:25 a.m. UTC | #1
* Evan Green:

> +static inline __typeof (__redirect_memcpy) *
> +select_memcpy_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
> +{
> +  INIT_ARCH ();
> +
> +  struct riscv_hwprobe pair;
> +
> +  pair.key = RISCV_HWPROBE_KEY_CPUPERF_0;
> +  if (!hwprobe_func || hwprobe_func(&pair, 1, 0, NULL, 0) != 0)
> +    return __memcpy_generic;
> +
> +  if ((pair.key > 0) &&
> +      (pair.value & RISCV_HWPROBE_MISALIGNED_MASK) ==
> +       RISCV_HWPROBE_MISALIGNED_FAST)
> +    return __memcpy_noalignment;
> +
> +  return __memcpy_generic;
> +}

In libc, you could call __riscv_hwprobe directly, so the additional
argument isn't needed after all.

Thanks,
Florian
  
Richard Henderson Aug. 3, 2023, 5:50 p.m. UTC | #2
On 8/3/23 00:25, Florian Weimer via Libc-alpha wrote:
> * Evan Green:
> 
>> +static inline __typeof (__redirect_memcpy) *
>> +select_memcpy_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
>> +{
>> +  INIT_ARCH ();
>> +
>> +  struct riscv_hwprobe pair;
>> +
>> +  pair.key = RISCV_HWPROBE_KEY_CPUPERF_0;
>> +  if (!hwprobe_func || hwprobe_func(&pair, 1, 0, NULL, 0) != 0)
>> +    return __memcpy_generic;
>> +
>> +  if ((pair.key > 0) &&
>> +      (pair.value & RISCV_HWPROBE_MISALIGNED_MASK) ==
>> +       RISCV_HWPROBE_MISALIGNED_FAST)
>> +    return __memcpy_noalignment;
>> +
>> +  return __memcpy_generic;
>> +}
> 
> In libc, you could call __riscv_hwprobe directly, so the additional
> argument isn't needed after all.

Outside libc something is required.

An extra parameter to ifunc is surprising though, and clearly not ideal per the extra 
hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that 
could always be called directly.


r~
  
Evan Green Aug. 3, 2023, 6:42 p.m. UTC | #3
On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 8/3/23 00:25, Florian Weimer via Libc-alpha wrote:
> > * Evan Green:
> >
> >> +static inline __typeof (__redirect_memcpy) *
> >> +select_memcpy_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
> >> +{
> >> +  INIT_ARCH ();
> >> +
> >> +  struct riscv_hwprobe pair;
> >> +
> >> +  pair.key = RISCV_HWPROBE_KEY_CPUPERF_0;
> >> +  if (!hwprobe_func || hwprobe_func(&pair, 1, 0, NULL, 0) != 0)
> >> +    return __memcpy_generic;
> >> +
> >> +  if ((pair.key > 0) &&
> >> +      (pair.value & RISCV_HWPROBE_MISALIGNED_MASK) ==
> >> +       RISCV_HWPROBE_MISALIGNED_FAST)
> >> +    return __memcpy_noalignment;
> >> +
> >> +  return __memcpy_generic;
> >> +}
> >
> > In libc, you could call __riscv_hwprobe directly, so the additional
> > argument isn't needed after all.

So you think I should drop the libc-symbols.h change and call
__riscv_hwprobe directly here? Sure, I can do that.

>
> Outside libc something is required.
>
> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> could always be called directly.

My previous spin took that approach, defining a
__riscv_hwprobe_early() in libc_nonshared that could route to the real
function if available, or make the syscall directly if not. But that
approach had the drawback that ifunc users couldn't take advantage of
the vDSO, and then all users had to comprehend the difference between
__riscv_hwprobe() and __riscv_hwprobe_early().

In contrast, IMO this approach is much nicer. Ifunc writers are
already used to getting hwcap info via a parameter. Adding this second
parameter, which also provides hwcap-like things, seems like a natural
extension. I didn't quite follow what you meant by the "extra hoops
above". If you meant the previous patch to libc-symbols.h, that's all
glibc-internal-isms, and shouldn't affect external callers. As per
Florian's comment above I can drop that patch for now since I don't
strictly need it.

-Evan
  
Richard Henderson Aug. 3, 2023, 10:30 p.m. UTC | #4
On 8/3/23 11:42, Evan Green wrote:
> On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> <richard.henderson@linaro.org> wrote:
>> Outside libc something is required.
>>
>> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
>> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
>> could always be called directly.
> 
> My previous spin took that approach, defining a
> __riscv_hwprobe_early() in libc_nonshared that could route to the real
> function if available, or make the syscall directly if not. But that
> approach had the drawback that ifunc users couldn't take advantage of
> the vDSO, and then all users had to comprehend the difference between
> __riscv_hwprobe() and __riscv_hwprobe_early().

I would define __riscv_hwprobe such that it could take advantage of the vDSO once 
initialization reaches a certain point, but cope with being run earlier than that point by 
falling back to the syscall.

That constrains the implementation, I guess, in that it can't set errno, but just 
returning the negative errno from the syscall seems fine.

It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I 
would hope that some application of __attribute__((weak)) might correctly get you a NULL 
prior to full relocations being complete.


> In contrast, IMO this approach is much nicer. Ifunc writers are
> already used to getting hwcap info via a parameter. Adding this second
> parameter, which also provides hwcap-like things, seems like a natural
> extension. I didn't quite follow what you meant by the "extra hoops
> above".

The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is 
going to be used.

It might be worth defining some helper functions for probing a single key or a single 
field.  E.g.

uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
{
   struct riscv_hwprobe pair = { .key = key };
   int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
   if (err)
     return err;
   if (pair.key == -1)
     return -ENOENT;
   return pair.value;
}

This implementation requires that no future hwprobe key define a value which as a valid 
value in the errno range (or better, bit 63 unused).  Alternately, or additionally:

bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
{
   struct riscv_hwprobe pair = { .key = key };
   return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
           && pair.key != -1
           && (pair.value & mask) == val);
}

These yield either

     int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
     if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
       return __memcpy_noalignment;
     return __memcpy_generic;

or

     if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
       return __memcpy_noalignment;
     return __memcpy_generic;

which to my mind looks much better for a pattern you'll be replicating so very many times 
across all of the ifunc implementations in the system.


r~
  
Evan Green Aug. 7, 2023, 10:10 p.m. UTC | #5
On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 8/3/23 11:42, Evan Green wrote:
> > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >> Outside libc something is required.
> >>
> >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> >> could always be called directly.
> >
> > My previous spin took that approach, defining a
> > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > function if available, or make the syscall directly if not. But that
> > approach had the drawback that ifunc users couldn't take advantage of
> > the vDSO, and then all users had to comprehend the difference between
> > __riscv_hwprobe() and __riscv_hwprobe_early().
>
> I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> initialization reaches a certain point, but cope with being run earlier than that point by
> falling back to the syscall.
>
> That constrains the implementation, I guess, in that it can't set errno, but just
> returning the negative errno from the syscall seems fine.
>
> It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> would hope that some application of __attribute__((weak)) might correctly get you a NULL
> prior to full relocations being complete.

Right, this is what we had in the previous iteration of this series,
and it did work ok. But it wasn't as good since it meant ifunc
selectors always got stuck in the null/fallback case and were forced
to make the syscall. With this mechanism they get to take advantage of
the vDSO.

>
>
> > In contrast, IMO this approach is much nicer. Ifunc writers are
> > already used to getting hwcap info via a parameter. Adding this second
> > parameter, which also provides hwcap-like things, seems like a natural
> > extension. I didn't quite follow what you meant by the "extra hoops
> > above".
>
> The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> going to be used.
>
> It might be worth defining some helper functions for probing a single key or a single
> field.  E.g.
>
> uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> {
>    struct riscv_hwprobe pair = { .key = key };
>    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
>    if (err)
>      return err;
>    if (pair.key == -1)
>      return -ENOENT;
>    return pair.value;
> }
>
> This implementation requires that no future hwprobe key define a value which as a valid
> value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
>
> bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> {
>    struct riscv_hwprobe pair = { .key = key };
>    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
>            && pair.key != -1
>            && (pair.value & mask) == val);
> }
>
> These yield either
>
>      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
>      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
>        return __memcpy_noalignment;
>      return __memcpy_generic;
>
> or
>
>      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
>        return __memcpy_noalignment;
>      return __memcpy_generic;
>
> which to my mind looks much better for a pattern you'll be replicating so very many times
> across all of the ifunc implementations in the system.

Ah, I see. I could make a static inline function in the header that
looks something like this (mangled by gmail, sorry):

/* Helper function usable from ifunc selectors that probes a single key. */
static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
signed long long int key,
unsigned long long int *value)
{
struct riscv_hwprobe pair;
int rc;

if (!hwprobe_func)
return -ENOSYS;

pair.key = key;
rc = hwprobe_func(&pair, 1, 0, NULL, 0);
if (rc) {
return rc;
}

if (pair.key < 0) {
return -ENOENT;
}

*value = pair.value;
return 0;
}

The ifunc selector would then be significantly cleaned up, looking
something like:

if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
return __memcpy_generic;

if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
return __memcpy_noalignment;

-Evan
  
Florian Weimer Aug. 7, 2023, 10:21 p.m. UTC | #6
* Evan Green:

> Right, this is what we had in the previous iteration of this series,
> and it did work ok. But it wasn't as good since it meant ifunc
> selectors always got stuck in the null/fallback case and were forced
> to make the syscall. With this mechanism they get to take advantage of
> the vDSO.

The system call is only required when the IFUNC resolver is called in
advance of relocation.  In most cases, the ELF dependencies work as
expected and ensure that the object containing the IFUNC resolver is
already relocation, and use of the fallback is avoided.

Thanks,
Florian
  
Evan Green Aug. 7, 2023, 10:30 p.m. UTC | #7
On Mon, Aug 7, 2023 at 3:21 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Evan Green:
>
> > Right, this is what we had in the previous iteration of this series,
> > and it did work ok. But it wasn't as good since it meant ifunc
> > selectors always got stuck in the null/fallback case and were forced
> > to make the syscall. With this mechanism they get to take advantage of
> > the vDSO.
>
> The system call is only required when the IFUNC resolver is called in
> advance of relocation.  In most cases, the ELF dependencies work as
> expected and ensure that the object containing the IFUNC resolver is
> already relocation, and use of the fallback is avoided.

Ah that's true, we did have to go through some hoops with LD_BIND_NOW
and LD_PRELOAD to observe problems. So the cases where the
__riscv_hwprobe_early() approach doesn't get to use the vDSO isn't
"always" as I stated above but "certain exotic cases". I'm still
leaning towards the ifunc parameter (now with inline helper in the
header), but could be convinced to go back to the _early() function if
there's some advantage to it.

-Evan


>
> Thanks,
> Florian
>
  
enh Aug. 7, 2023, 10:48 p.m. UTC | #8
On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
>
> On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> <richard.henderson@linaro.org> wrote:
> >
> > On 8/3/23 11:42, Evan Green wrote:
> > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > <richard.henderson@linaro.org> wrote:
> > >> Outside libc something is required.
> > >>
> > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > >> could always be called directly.
> > >
> > > My previous spin took that approach, defining a
> > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > function if available, or make the syscall directly if not. But that
> > > approach had the drawback that ifunc users couldn't take advantage of
> > > the vDSO, and then all users had to comprehend the difference between
> > > __riscv_hwprobe() and __riscv_hwprobe_early().
> >
> > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > initialization reaches a certain point, but cope with being run earlier than that point by
> > falling back to the syscall.
> >
> > That constrains the implementation, I guess, in that it can't set errno, but just
> > returning the negative errno from the syscall seems fine.
> >
> > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > prior to full relocations being complete.
>
> Right, this is what we had in the previous iteration of this series,
> and it did work ok. But it wasn't as good since it meant ifunc
> selectors always got stuck in the null/fallback case and were forced
> to make the syscall. With this mechanism they get to take advantage of
> the vDSO.
>
> >
> >
> > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > already used to getting hwcap info via a parameter. Adding this second
> > > parameter, which also provides hwcap-like things, seems like a natural
> > > extension. I didn't quite follow what you meant by the "extra hoops
> > > above".
> >
> > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > going to be used.
> >
> > It might be worth defining some helper functions for probing a single key or a single
> > field.  E.g.
> >
> > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > {
> >    struct riscv_hwprobe pair = { .key = key };
> >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> >    if (err)
> >      return err;
> >    if (pair.key == -1)
> >      return -ENOENT;
> >    return pair.value;
> > }
> >
> > This implementation requires that no future hwprobe key define a value which as a valid
> > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> >
> > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > {
> >    struct riscv_hwprobe pair = { .key = key };
> >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> >            && pair.key != -1
> >            && (pair.value & mask) == val);
> > }
> >
> > These yield either
> >
> >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> >        return __memcpy_noalignment;
> >      return __memcpy_generic;
> >
> > or
> >
> >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> >        return __memcpy_noalignment;
> >      return __memcpy_generic;
> >
> > which to my mind looks much better for a pattern you'll be replicating so very many times
> > across all of the ifunc implementations in the system.
>
> Ah, I see. I could make a static inline function in the header that
> looks something like this (mangled by gmail, sorry):
>
> /* Helper function usable from ifunc selectors that probes a single key. */
> static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> signed long long int key,
> unsigned long long int *value)
> {
> struct riscv_hwprobe pair;
> int rc;
>
> if (!hwprobe_func)
> return -ENOSYS;
>
> pair.key = key;
> rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> if (rc) {
> return rc;
> }
>
> if (pair.key < 0) {
> return -ENOENT;
> }
>
> *value = pair.value;
> return 0;
> }
>
> The ifunc selector would then be significantly cleaned up, looking
> something like:
>
> if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> return __memcpy_generic;
>
> if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> return __memcpy_noalignment;

(Android's libc maintainer here, having joined the list just to talk
about risc-v ifuncs :-) )

has anyone thought about calling ifunc resolvers more like this...

--same part of the dynamic loader that caches the two getauxval()s for arm64--
static struct riscv_hwprobe probes[] = {
 {.value = RISCV_HWPROBE_KEY_MVENDORID},
 {.value = RISCV_HWPROBE_KEY_MARCHID},
 {.value = RISCV_HWPROBE_KEY_MIMPID},
 {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
 {.value = RISCV_HWPROBE_KEY_IMA_EXT},
 {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
... // every time a new key is added to the kernel, we add it here
};
__riscv_hwprobe(...); // called once

--part of the dynamic loader that calls ifunc resolvers--
(*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);

this is similar to what we already have for arm64 (where there's a
getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
potentially others), but more uniform, and avoiding the source
(in)compatibility issues of adding new fields to a struct [even if it
does have a size_t to "version" it like the arm64 ifunc struct].

yes, it means everyone pays to get all the hwprobes, but that gets
amortized. and lookup in the ifunc resolver is simple and quick. if we
know that the keys will be kept dense, we can even have code in ifunc
resolvers like

if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...

though personally for the "big ticket items" that get a letter to
themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
probe_count, probes_ptr)` to the resolver, but i hear that's
controversial :-)

> -Evan
  
Evan Green Aug. 8, 2023, 12:01 a.m. UTC | #9
On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
>
> On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> >
> > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> > >
> > > On 8/3/23 11:42, Evan Green wrote:
> > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > <richard.henderson@linaro.org> wrote:
> > > >> Outside libc something is required.
> > > >>
> > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > >> could always be called directly.
> > > >
> > > > My previous spin took that approach, defining a
> > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > function if available, or make the syscall directly if not. But that
> > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > the vDSO, and then all users had to comprehend the difference between
> > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > >
> > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > falling back to the syscall.
> > >
> > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > returning the negative errno from the syscall seems fine.
> > >
> > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > prior to full relocations being complete.
> >
> > Right, this is what we had in the previous iteration of this series,
> > and it did work ok. But it wasn't as good since it meant ifunc
> > selectors always got stuck in the null/fallback case and were forced
> > to make the syscall. With this mechanism they get to take advantage of
> > the vDSO.
> >
> > >
> > >
> > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > already used to getting hwcap info via a parameter. Adding this second
> > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > above".
> > >
> > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > going to be used.
> > >
> > > It might be worth defining some helper functions for probing a single key or a single
> > > field.  E.g.
> > >
> > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > {
> > >    struct riscv_hwprobe pair = { .key = key };
> > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > >    if (err)
> > >      return err;
> > >    if (pair.key == -1)
> > >      return -ENOENT;
> > >    return pair.value;
> > > }
> > >
> > > This implementation requires that no future hwprobe key define a value which as a valid
> > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > >
> > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > {
> > >    struct riscv_hwprobe pair = { .key = key };
> > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > >            && pair.key != -1
> > >            && (pair.value & mask) == val);
> > > }
> > >
> > > These yield either
> > >
> > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > >        return __memcpy_noalignment;
> > >      return __memcpy_generic;
> > >
> > > or
> > >
> > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > >        return __memcpy_noalignment;
> > >      return __memcpy_generic;
> > >
> > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > across all of the ifunc implementations in the system.
> >
> > Ah, I see. I could make a static inline function in the header that
> > looks something like this (mangled by gmail, sorry):
> >
> > /* Helper function usable from ifunc selectors that probes a single key. */
> > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > signed long long int key,
> > unsigned long long int *value)
> > {
> > struct riscv_hwprobe pair;
> > int rc;
> >
> > if (!hwprobe_func)
> > return -ENOSYS;
> >
> > pair.key = key;
> > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > if (rc) {
> > return rc;
> > }
> >
> > if (pair.key < 0) {
> > return -ENOENT;
> > }
> >
> > *value = pair.value;
> > return 0;
> > }
> >
> > The ifunc selector would then be significantly cleaned up, looking
> > something like:
> >
> > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > return __memcpy_generic;
> >
> > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > return __memcpy_noalignment;
>
> (Android's libc maintainer here, having joined the list just to talk
> about risc-v ifuncs :-) )
>
> has anyone thought about calling ifunc resolvers more like this...
>
> --same part of the dynamic loader that caches the two getauxval()s for arm64--
> static struct riscv_hwprobe probes[] = {
>  {.value = RISCV_HWPROBE_KEY_MVENDORID},
>  {.value = RISCV_HWPROBE_KEY_MARCHID},
>  {.value = RISCV_HWPROBE_KEY_MIMPID},
>  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
>  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
>  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> ... // every time a new key is added to the kernel, we add it here
> };
> __riscv_hwprobe(...); // called once
>
> --part of the dynamic loader that calls ifunc resolvers--
> (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
>
> this is similar to what we already have for arm64 (where there's a
> getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> potentially others), but more uniform, and avoiding the source
> (in)compatibility issues of adding new fields to a struct [even if it
> does have a size_t to "version" it like the arm64 ifunc struct].
>
> yes, it means everyone pays to get all the hwprobes, but that gets
> amortized. and lookup in the ifunc resolver is simple and quick. if we
> know that the keys will be kept dense, we can even have code in ifunc
> resolvers like
>
> if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
>
> though personally for the "big ticket items" that get a letter to
> themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> probe_count, probes_ptr)` to the resolver, but i hear that's
> controversial :-)

Hello, welcome to the fun! :)

What you're describing here is almost exactly what we did inside the
vDSO function. The vDSO function acts as a front for a handful of
probe values that we've already completed and cached in userspace. We
opted to make it a function, rather than exposing the data itself via
vDSO, so that we had future flexibility in what elements we cached in
userspace and their storage format. We can update the kernel as needed
to cache the hottest things in userspace, even if that means
rearranging the data format, passing through some extra information,
or adding an extra snip of code. My hope is callers can directly
interact with the vDSO function (though maybe as Richard suggested
maybe with the help of a tidy inline helper), rather than trying to
add a second layer of userspace caching.

-Evan
  
enh Aug. 12, 2023, 12:01 a.m. UTC | #10
On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
>
> On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> >
> > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > >
> > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > <richard.henderson@linaro.org> wrote:
> > > >
> > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > <richard.henderson@linaro.org> wrote:
> > > > >> Outside libc something is required.
> > > > >>
> > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > >> could always be called directly.
> > > > >
> > > > > My previous spin took that approach, defining a
> > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > function if available, or make the syscall directly if not. But that
> > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > >
> > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > falling back to the syscall.
> > > >
> > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > returning the negative errno from the syscall seems fine.
> > > >
> > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > prior to full relocations being complete.
> > >
> > > Right, this is what we had in the previous iteration of this series,
> > > and it did work ok. But it wasn't as good since it meant ifunc
> > > selectors always got stuck in the null/fallback case and were forced
> > > to make the syscall. With this mechanism they get to take advantage of
> > > the vDSO.
> > >
> > > >
> > > >
> > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > above".
> > > >
> > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > going to be used.
> > > >
> > > > It might be worth defining some helper functions for probing a single key or a single
> > > > field.  E.g.
> > > >
> > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > {
> > > >    struct riscv_hwprobe pair = { .key = key };
> > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > >    if (err)
> > > >      return err;
> > > >    if (pair.key == -1)
> > > >      return -ENOENT;
> > > >    return pair.value;
> > > > }
> > > >
> > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > >
> > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > {
> > > >    struct riscv_hwprobe pair = { .key = key };
> > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > >            && pair.key != -1
> > > >            && (pair.value & mask) == val);
> > > > }
> > > >
> > > > These yield either
> > > >
> > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > >        return __memcpy_noalignment;
> > > >      return __memcpy_generic;
> > > >
> > > > or
> > > >
> > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > >        return __memcpy_noalignment;
> > > >      return __memcpy_generic;
> > > >
> > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > across all of the ifunc implementations in the system.
> > >
> > > Ah, I see. I could make a static inline function in the header that
> > > looks something like this (mangled by gmail, sorry):
> > >
> > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > signed long long int key,
> > > unsigned long long int *value)
> > > {
> > > struct riscv_hwprobe pair;
> > > int rc;
> > >
> > > if (!hwprobe_func)
> > > return -ENOSYS;
> > >
> > > pair.key = key;
> > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > if (rc) {
> > > return rc;
> > > }
> > >
> > > if (pair.key < 0) {
> > > return -ENOENT;
> > > }
> > >
> > > *value = pair.value;
> > > return 0;
> > > }
> > >
> > > The ifunc selector would then be significantly cleaned up, looking
> > > something like:
> > >
> > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > return __memcpy_generic;
> > >
> > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > return __memcpy_noalignment;
> >
> > (Android's libc maintainer here, having joined the list just to talk
> > about risc-v ifuncs :-) )
> >
> > has anyone thought about calling ifunc resolvers more like this...
> >
> > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > static struct riscv_hwprobe probes[] = {
> >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > ... // every time a new key is added to the kernel, we add it here
> > };
> > __riscv_hwprobe(...); // called once
> >
> > --part of the dynamic loader that calls ifunc resolvers--
> > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> >
> > this is similar to what we already have for arm64 (where there's a
> > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > potentially others), but more uniform, and avoiding the source
> > (in)compatibility issues of adding new fields to a struct [even if it
> > does have a size_t to "version" it like the arm64 ifunc struct].
> >
> > yes, it means everyone pays to get all the hwprobes, but that gets
> > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > know that the keys will be kept dense, we can even have code in ifunc
> > resolvers like
> >
> > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> >
> > though personally for the "big ticket items" that get a letter to
> > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > probe_count, probes_ptr)` to the resolver, but i hear that's
> > controversial :-)
>
> Hello, welcome to the fun! :)

(sorry for the delay. i've been thinking :-) )

> What you're describing here is almost exactly what we did inside the
> vDSO function. The vDSO function acts as a front for a handful of
> probe values that we've already completed and cached in userspace. We
> opted to make it a function, rather than exposing the data itself via
> vDSO, so that we had future flexibility in what elements we cached in
> userspace and their storage format. We can update the kernel as needed
> to cache the hottest things in userspace, even if that means
> rearranging the data format, passing through some extra information,
> or adding an extra snip of code. My hope is callers can directly
> interact with the vDSO function (though maybe as Richard suggested
> maybe with the help of a tidy inline helper), rather than trying to
> add a second layer of userspace caching.

on reflection i think i might be too focused on the FMV use case, in
part because we're looking at those compiler-generated ifuncs for
arm64 on Android atm. i think i'm imagining a world where there's a
lot of that, and worrying about having to pay for the setup, call, and
loop for each ifunc, and wondering why we don't just pay once instead.
(as a bit of background context, Android "app" start is actually a
dlopen() in a clone of an existing zygote process, and in general app
launch time is one of the key metrics anyone who's serious is
optimizing for. you'd be surprised how much of my life i spend
explaining to people that if they want dlopen() to be faster, maybe
they shouldn't ask us to run thousands of ELF constructors.)

but... the more time i spend looking at what we actually need in
third-party open source libraries right now i realize that libc and
FMV (which is still a future thing for us anyway) are really the only
_actual_ ifunc users. perhaps in part because macOS/iOS don't have
ifuncs, all the libraries that are part of the OS itself, for example,
are just doing their own thing with function pointers and
pthread_once() or whatever.

(i have yet to try to get any data on actual apps. i have no reason to
think they'll be very different, but that could easily be skewed by
popular middleware or a popular game engine using ifuncs, so i do plan
on following up on that.)

"how do they decide what to set that function pointer to?". well, it
looks like in most cases cpuid on x86 and calls to getauxval()
everywhere else. in some cases that's actually via some other library:
https://github.com/pytorch/cpuinfo or
https://github.com/google/cpu_features for example. so they have a
layer of caching there, even in cases where they don't have a single
function that sets all the function pointers.

so assuming i don't find that apps look very different from the OS
(that is: that apps use lots of ifuncs), i probably don't care at all
until we get to FMV. and i probably don't care for FMV, because
compiler-rt (or gcc's equivalent) will be the "caching layer" there.
(and on Android it'll be a while before i have to worry about libc's
ifuncs because we'll require V and not use ifuncs there for the
foreseeable future.)

so, yeah, given that i've adopted the "pass a null pointer rather than
no arguments" convention you have, we have room for expansion if/when
FMV is a big thing, and until then -- unless i'm shocked by what i
find looking at actual apps -- i don't think i have any reason to
believe that ifuncs matter that much, and if compiler-rt makes one
__riscv_hwprobe() call per .so, that's probably fine. (i already spend
a big chunk of my life advising people to just have one .so file,
exporting nothing but a JNI_OnLoad symbol, so this will just make that
advice even better advice :-) )

> -Evan
  
Evan Green Aug. 15, 2023, 4:40 p.m. UTC | #11
On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
>
> On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> >
> > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > >
> > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > >
> > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > <richard.henderson@linaro.org> wrote:
> > > > >
> > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > <richard.henderson@linaro.org> wrote:
> > > > > >> Outside libc something is required.
> > > > > >>
> > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > >> could always be called directly.
> > > > > >
> > > > > > My previous spin took that approach, defining a
> > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > function if available, or make the syscall directly if not. But that
> > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > >
> > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > falling back to the syscall.
> > > > >
> > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > returning the negative errno from the syscall seems fine.
> > > > >
> > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > prior to full relocations being complete.
> > > >
> > > > Right, this is what we had in the previous iteration of this series,
> > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > selectors always got stuck in the null/fallback case and were forced
> > > > to make the syscall. With this mechanism they get to take advantage of
> > > > the vDSO.
> > > >
> > > > >
> > > > >
> > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > above".
> > > > >
> > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > going to be used.
> > > > >
> > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > field.  E.g.
> > > > >
> > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > {
> > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > >    if (err)
> > > > >      return err;
> > > > >    if (pair.key == -1)
> > > > >      return -ENOENT;
> > > > >    return pair.value;
> > > > > }
> > > > >
> > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > >
> > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > {
> > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > >            && pair.key != -1
> > > > >            && (pair.value & mask) == val);
> > > > > }
> > > > >
> > > > > These yield either
> > > > >
> > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > >        return __memcpy_noalignment;
> > > > >      return __memcpy_generic;
> > > > >
> > > > > or
> > > > >
> > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > >        return __memcpy_noalignment;
> > > > >      return __memcpy_generic;
> > > > >
> > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > across all of the ifunc implementations in the system.
> > > >
> > > > Ah, I see. I could make a static inline function in the header that
> > > > looks something like this (mangled by gmail, sorry):
> > > >
> > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > signed long long int key,
> > > > unsigned long long int *value)
> > > > {
> > > > struct riscv_hwprobe pair;
> > > > int rc;
> > > >
> > > > if (!hwprobe_func)
> > > > return -ENOSYS;
> > > >
> > > > pair.key = key;
> > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > if (rc) {
> > > > return rc;
> > > > }
> > > >
> > > > if (pair.key < 0) {
> > > > return -ENOENT;
> > > > }
> > > >
> > > > *value = pair.value;
> > > > return 0;
> > > > }
> > > >
> > > > The ifunc selector would then be significantly cleaned up, looking
> > > > something like:
> > > >
> > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > return __memcpy_generic;
> > > >
> > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > return __memcpy_noalignment;
> > >
> > > (Android's libc maintainer here, having joined the list just to talk
> > > about risc-v ifuncs :-) )
> > >
> > > has anyone thought about calling ifunc resolvers more like this...
> > >
> > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > static struct riscv_hwprobe probes[] = {
> > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > ... // every time a new key is added to the kernel, we add it here
> > > };
> > > __riscv_hwprobe(...); // called once
> > >
> > > --part of the dynamic loader that calls ifunc resolvers--
> > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > >
> > > this is similar to what we already have for arm64 (where there's a
> > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > potentially others), but more uniform, and avoiding the source
> > > (in)compatibility issues of adding new fields to a struct [even if it
> > > does have a size_t to "version" it like the arm64 ifunc struct].
> > >
> > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > know that the keys will be kept dense, we can even have code in ifunc
> > > resolvers like
> > >
> > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > >
> > > though personally for the "big ticket items" that get a letter to
> > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > controversial :-)
> >
> > Hello, welcome to the fun! :)
>
> (sorry for the delay. i've been thinking :-) )
>
> > What you're describing here is almost exactly what we did inside the
> > vDSO function. The vDSO function acts as a front for a handful of
> > probe values that we've already completed and cached in userspace. We
> > opted to make it a function, rather than exposing the data itself via
> > vDSO, so that we had future flexibility in what elements we cached in
> > userspace and their storage format. We can update the kernel as needed
> > to cache the hottest things in userspace, even if that means
> > rearranging the data format, passing through some extra information,
> > or adding an extra snip of code. My hope is callers can directly
> > interact with the vDSO function (though maybe as Richard suggested
> > maybe with the help of a tidy inline helper), rather than trying to
> > add a second layer of userspace caching.
>
> on reflection i think i might be too focused on the FMV use case, in
> part because we're looking at those compiler-generated ifuncs for
> arm64 on Android atm. i think i'm imagining a world where there's a
> lot of that, and worrying about having to pay for the setup, call, and
> loop for each ifunc, and wondering why we don't just pay once instead.
> (as a bit of background context, Android "app" start is actually a
> dlopen() in a clone of an existing zygote process, and in general app
> launch time is one of the key metrics anyone who's serious is
> optimizing for. you'd be surprised how much of my life i spend
> explaining to people that if they want dlopen() to be faster, maybe
> they shouldn't ask us to run thousands of ELF constructors.)
>
> but... the more time i spend looking at what we actually need in
> third-party open source libraries right now i realize that libc and
> FMV (which is still a future thing for us anyway) are really the only
> _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> ifuncs, all the libraries that are part of the OS itself, for example,
> are just doing their own thing with function pointers and
> pthread_once() or whatever.
>
> (i have yet to try to get any data on actual apps. i have no reason to
> think they'll be very different, but that could easily be skewed by
> popular middleware or a popular game engine using ifuncs, so i do plan
> on following up on that.)
>
> "how do they decide what to set that function pointer to?". well, it
> looks like in most cases cpuid on x86 and calls to getauxval()
> everywhere else. in some cases that's actually via some other library:
> https://github.com/pytorch/cpuinfo or
> https://github.com/google/cpu_features for example. so they have a
> layer of caching there, even in cases where they don't have a single
> function that sets all the function pointers.

Right, function multi-versioning is just the sort of spot where we'd
imagine hwprobe gets used, since it's providing similar/equivalent
information to what cpuid does on x86. It may not be quite as fast as
cpuid (I don't know how fast cpuid actually is). But with the vDSO
function+data in userspace it should be able to match getauxval() in
performance, as they're both a function pointer plus a loop. We're
sort of planning for a world in which RISC-V has a wider set of these
values to fetch, such that a ifunc selector may need a more complex
set of information. Hwprobe and the vDSO gives us the ability both to
answer multiple queries fast, and freely allocate more keys that may
represent versioned features or even compound features.

>
> so assuming i don't find that apps look very different from the OS
> (that is: that apps use lots of ifuncs), i probably don't care at all
> until we get to FMV. and i probably don't care for FMV, because
> compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> (and on Android it'll be a while before i have to worry about libc's
> ifuncs because we'll require V and not use ifuncs there for the
> foreseeable future.)
>
> so, yeah, given that i've adopted the "pass a null pointer rather than
> no arguments" convention you have, we have room for expansion if/when
> FMV is a big thing, and until then -- unless i'm shocked by what i
> find looking at actual apps -- i don't think i have any reason to
> believe that ifuncs matter that much, and if compiler-rt makes one
> __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> a big chunk of my life advising people to just have one .so file,
> exporting nothing but a JNI_OnLoad symbol, so this will just make that
> advice even better advice :-) )

Just to confirm, by "pass a null pointer", you're saying that the
Android libc also passes NULL as the second ifunc selector argument
(or first)? That's good. It sounds like you're planning to just
continue passing NULL for now, and wait for people to start clamoring
for this in android libc?
-Evan
  
enh Aug. 15, 2023, 9:53 p.m. UTC | #12
On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
>
> On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> >
> > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > >
> > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > >
> > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > >
> > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > <richard.henderson@linaro.org> wrote:
> > > > > >
> > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > >> Outside libc something is required.
> > > > > > >>
> > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > >> could always be called directly.
> > > > > > >
> > > > > > > My previous spin took that approach, defining a
> > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > >
> > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > falling back to the syscall.
> > > > > >
> > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > returning the negative errno from the syscall seems fine.
> > > > > >
> > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > prior to full relocations being complete.
> > > > >
> > > > > Right, this is what we had in the previous iteration of this series,
> > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > the vDSO.
> > > > >
> > > > > >
> > > > > >
> > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > above".
> > > > > >
> > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > going to be used.
> > > > > >
> > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > field.  E.g.
> > > > > >
> > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > {
> > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > >    if (err)
> > > > > >      return err;
> > > > > >    if (pair.key == -1)
> > > > > >      return -ENOENT;
> > > > > >    return pair.value;
> > > > > > }
> > > > > >
> > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > >
> > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > {
> > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > >            && pair.key != -1
> > > > > >            && (pair.value & mask) == val);
> > > > > > }
> > > > > >
> > > > > > These yield either
> > > > > >
> > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > >        return __memcpy_noalignment;
> > > > > >      return __memcpy_generic;
> > > > > >
> > > > > > or
> > > > > >
> > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > >        return __memcpy_noalignment;
> > > > > >      return __memcpy_generic;
> > > > > >
> > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > across all of the ifunc implementations in the system.
> > > > >
> > > > > Ah, I see. I could make a static inline function in the header that
> > > > > looks something like this (mangled by gmail, sorry):
> > > > >
> > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > signed long long int key,
> > > > > unsigned long long int *value)
> > > > > {
> > > > > struct riscv_hwprobe pair;
> > > > > int rc;
> > > > >
> > > > > if (!hwprobe_func)
> > > > > return -ENOSYS;
> > > > >
> > > > > pair.key = key;
> > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > if (rc) {
> > > > > return rc;
> > > > > }
> > > > >
> > > > > if (pair.key < 0) {
> > > > > return -ENOENT;
> > > > > }
> > > > >
> > > > > *value = pair.value;
> > > > > return 0;
> > > > > }
> > > > >
> > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > something like:
> > > > >
> > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > return __memcpy_generic;
> > > > >
> > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > return __memcpy_noalignment;
> > > >
> > > > (Android's libc maintainer here, having joined the list just to talk
> > > > about risc-v ifuncs :-) )
> > > >
> > > > has anyone thought about calling ifunc resolvers more like this...
> > > >
> > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > static struct riscv_hwprobe probes[] = {
> > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > ... // every time a new key is added to the kernel, we add it here
> > > > };
> > > > __riscv_hwprobe(...); // called once
> > > >
> > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > >
> > > > this is similar to what we already have for arm64 (where there's a
> > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > potentially others), but more uniform, and avoiding the source
> > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > >
> > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > resolvers like
> > > >
> > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > >
> > > > though personally for the "big ticket items" that get a letter to
> > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > controversial :-)
> > >
> > > Hello, welcome to the fun! :)
> >
> > (sorry for the delay. i've been thinking :-) )
> >
> > > What you're describing here is almost exactly what we did inside the
> > > vDSO function. The vDSO function acts as a front for a handful of
> > > probe values that we've already completed and cached in userspace. We
> > > opted to make it a function, rather than exposing the data itself via
> > > vDSO, so that we had future flexibility in what elements we cached in
> > > userspace and their storage format. We can update the kernel as needed
> > > to cache the hottest things in userspace, even if that means
> > > rearranging the data format, passing through some extra information,
> > > or adding an extra snip of code. My hope is callers can directly
> > > interact with the vDSO function (though maybe as Richard suggested
> > > maybe with the help of a tidy inline helper), rather than trying to
> > > add a second layer of userspace caching.
> >
> > on reflection i think i might be too focused on the FMV use case, in
> > part because we're looking at those compiler-generated ifuncs for
> > arm64 on Android atm. i think i'm imagining a world where there's a
> > lot of that, and worrying about having to pay for the setup, call, and
> > loop for each ifunc, and wondering why we don't just pay once instead.
> > (as a bit of background context, Android "app" start is actually a
> > dlopen() in a clone of an existing zygote process, and in general app
> > launch time is one of the key metrics anyone who's serious is
> > optimizing for. you'd be surprised how much of my life i spend
> > explaining to people that if they want dlopen() to be faster, maybe
> > they shouldn't ask us to run thousands of ELF constructors.)
> >
> > but... the more time i spend looking at what we actually need in
> > third-party open source libraries right now i realize that libc and
> > FMV (which is still a future thing for us anyway) are really the only
> > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > ifuncs, all the libraries that are part of the OS itself, for example,
> > are just doing their own thing with function pointers and
> > pthread_once() or whatever.
> >
> > (i have yet to try to get any data on actual apps. i have no reason to
> > think they'll be very different, but that could easily be skewed by
> > popular middleware or a popular game engine using ifuncs, so i do plan
> > on following up on that.)
> >
> > "how do they decide what to set that function pointer to?". well, it
> > looks like in most cases cpuid on x86 and calls to getauxval()
> > everywhere else. in some cases that's actually via some other library:
> > https://github.com/pytorch/cpuinfo or
> > https://github.com/google/cpu_features for example. so they have a
> > layer of caching there, even in cases where they don't have a single
> > function that sets all the function pointers.
>
> Right, function multi-versioning is just the sort of spot where we'd
> imagine hwprobe gets used, since it's providing similar/equivalent
> information to what cpuid does on x86. It may not be quite as fast as
> cpuid (I don't know how fast cpuid actually is). But with the vDSO
> function+data in userspace it should be able to match getauxval() in
> performance, as they're both a function pointer plus a loop. We're
> sort of planning for a world in which RISC-V has a wider set of these
> values to fetch, such that a ifunc selector may need a more complex
> set of information. Hwprobe and the vDSO gives us the ability both to
> answer multiple queries fast, and freely allocate more keys that may
> represent versioned features or even compound features.

yeah, my incorrect mental model was that -- primarily because of
x86-64 and cpuid -- every function would get its own ifunc resolver
that would have to make a query. but the [in progress] arm64
implementation shows that that's not really the case anyway, and we
can just cache __riscv_hwprobe() in the same [one] place that
getauxval() is already being cached for arm64.

> > so assuming i don't find that apps look very different from the OS
> > (that is: that apps use lots of ifuncs), i probably don't care at all
> > until we get to FMV. and i probably don't care for FMV, because
> > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > (and on Android it'll be a while before i have to worry about libc's
> > ifuncs because we'll require V and not use ifuncs there for the
> > foreseeable future.)
> >
> > so, yeah, given that i've adopted the "pass a null pointer rather than
> > no arguments" convention you have, we have room for expansion if/when
> > FMV is a big thing, and until then -- unless i'm shocked by what i
> > find looking at actual apps -- i don't think i have any reason to
> > believe that ifuncs matter that much, and if compiler-rt makes one
> > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > a big chunk of my life advising people to just have one .so file,
> > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > advice even better advice :-) )
>
> Just to confirm, by "pass a null pointer", you're saying that the
> Android libc also passes NULL as the second ifunc selector argument
> (or first)?

#elif defined(__riscv)
  // This argument and its value is just a placeholder for now,
  // but it means that if we do pass something in future (such as
  // getauxval() and/or hwprobe key/value pairs), callees will be able to
  // recognize what they're being given.
  typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
  return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);

it's arm64 that has the initial getauxval() argument:

#if defined(__aarch64__)
  typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
  static __ifunc_arg_t arg;
  static bool initialized = false;
  if (!initialized) {
    initialized = true;
    arg._size = sizeof(__ifunc_arg_t);
    arg._hwcap = getauxval(AT_HWCAP);
    arg._hwcap2 = getauxval(AT_HWCAP2);
  }
  return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
| _IFUNC_ARG_HWCAP, &arg);

https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp

> That's good. It sounds like you're planning to just
> continue passing NULL for now, and wait for people to start clamoring
> for this in android libc?

yeah, and i'm assuming there will never be any clamor ... yesterday
and today i actually checked a bunch of popular apks, and didn't find
any that were currently using ifuncs.

the only change i'm thinking of making right now is that "there's a
single argument, and it's null" should probably be the default.
obviously since Android doesn't add new architectures very often, this
is only likely to affect x86/x86-64 for the foreseeable future, but
being able to recognize at a glance "am i running under a libc new
enough to pass me arguments?" would certainly have helped for arm64.
even if x86/x86-64 never benefit, it seems like the right default for
the #else clause...

> -Evan
  
Evan Green Aug. 15, 2023, 11:01 p.m. UTC | #13
On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
>
> On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> >
> > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > >
> > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > >
> > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > >
> > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > >
> > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > >> Outside libc something is required.
> > > > > > > >>
> > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > >> could always be called directly.
> > > > > > > >
> > > > > > > > My previous spin took that approach, defining a
> > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > >
> > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > falling back to the syscall.
> > > > > > >
> > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > >
> > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > prior to full relocations being complete.
> > > > > >
> > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > the vDSO.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > above".
> > > > > > >
> > > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > > going to be used.
> > > > > > >
> > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > field.  E.g.
> > > > > > >
> > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > {
> > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > >    if (err)
> > > > > > >      return err;
> > > > > > >    if (pair.key == -1)
> > > > > > >      return -ENOENT;
> > > > > > >    return pair.value;
> > > > > > > }
> > > > > > >
> > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > > >
> > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > {
> > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > >            && pair.key != -1
> > > > > > >            && (pair.value & mask) == val);
> > > > > > > }
> > > > > > >
> > > > > > > These yield either
> > > > > > >
> > > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > >        return __memcpy_noalignment;
> > > > > > >      return __memcpy_generic;
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > >        return __memcpy_noalignment;
> > > > > > >      return __memcpy_generic;
> > > > > > >
> > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > across all of the ifunc implementations in the system.
> > > > > >
> > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > looks something like this (mangled by gmail, sorry):
> > > > > >
> > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > signed long long int key,
> > > > > > unsigned long long int *value)
> > > > > > {
> > > > > > struct riscv_hwprobe pair;
> > > > > > int rc;
> > > > > >
> > > > > > if (!hwprobe_func)
> > > > > > return -ENOSYS;
> > > > > >
> > > > > > pair.key = key;
> > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > if (rc) {
> > > > > > return rc;
> > > > > > }
> > > > > >
> > > > > > if (pair.key < 0) {
> > > > > > return -ENOENT;
> > > > > > }
> > > > > >
> > > > > > *value = pair.value;
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > something like:
> > > > > >
> > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > return __memcpy_generic;
> > > > > >
> > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > return __memcpy_noalignment;
> > > > >
> > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > about risc-v ifuncs :-) )
> > > > >
> > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > >
> > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > static struct riscv_hwprobe probes[] = {
> > > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > };
> > > > > __riscv_hwprobe(...); // called once
> > > > >
> > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > >
> > > > > this is similar to what we already have for arm64 (where there's a
> > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > potentially others), but more uniform, and avoiding the source
> > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > >
> > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > resolvers like
> > > > >
> > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > >
> > > > > though personally for the "big ticket items" that get a letter to
> > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > controversial :-)
> > > >
> > > > Hello, welcome to the fun! :)
> > >
> > > (sorry for the delay. i've been thinking :-) )
> > >
> > > > What you're describing here is almost exactly what we did inside the
> > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > probe values that we've already completed and cached in userspace. We
> > > > opted to make it a function, rather than exposing the data itself via
> > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > userspace and their storage format. We can update the kernel as needed
> > > > to cache the hottest things in userspace, even if that means
> > > > rearranging the data format, passing through some extra information,
> > > > or adding an extra snip of code. My hope is callers can directly
> > > > interact with the vDSO function (though maybe as Richard suggested
> > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > add a second layer of userspace caching.
> > >
> > > on reflection i think i might be too focused on the FMV use case, in
> > > part because we're looking at those compiler-generated ifuncs for
> > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > lot of that, and worrying about having to pay for the setup, call, and
> > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > (as a bit of background context, Android "app" start is actually a
> > > dlopen() in a clone of an existing zygote process, and in general app
> > > launch time is one of the key metrics anyone who's serious is
> > > optimizing for. you'd be surprised how much of my life i spend
> > > explaining to people that if they want dlopen() to be faster, maybe
> > > they shouldn't ask us to run thousands of ELF constructors.)
> > >
> > > but... the more time i spend looking at what we actually need in
> > > third-party open source libraries right now i realize that libc and
> > > FMV (which is still a future thing for us anyway) are really the only
> > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > are just doing their own thing with function pointers and
> > > pthread_once() or whatever.
> > >
> > > (i have yet to try to get any data on actual apps. i have no reason to
> > > think they'll be very different, but that could easily be skewed by
> > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > on following up on that.)
> > >
> > > "how do they decide what to set that function pointer to?". well, it
> > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > everywhere else. in some cases that's actually via some other library:
> > > https://github.com/pytorch/cpuinfo or
> > > https://github.com/google/cpu_features for example. so they have a
> > > layer of caching there, even in cases where they don't have a single
> > > function that sets all the function pointers.
> >
> > Right, function multi-versioning is just the sort of spot where we'd
> > imagine hwprobe gets used, since it's providing similar/equivalent
> > information to what cpuid does on x86. It may not be quite as fast as
> > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > function+data in userspace it should be able to match getauxval() in
> > performance, as they're both a function pointer plus a loop. We're
> > sort of planning for a world in which RISC-V has a wider set of these
> > values to fetch, such that a ifunc selector may need a more complex
> > set of information. Hwprobe and the vDSO gives us the ability both to
> > answer multiple queries fast, and freely allocate more keys that may
> > represent versioned features or even compound features.
>
> yeah, my incorrect mental model was that -- primarily because of
> x86-64 and cpuid -- every function would get its own ifunc resolver
> that would have to make a query. but the [in progress] arm64
> implementation shows that that's not really the case anyway, and we
> can just cache __riscv_hwprobe() in the same [one] place that
> getauxval() is already being cached for arm64.

Sounds good.

>
> > > so assuming i don't find that apps look very different from the OS
> > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > until we get to FMV. and i probably don't care for FMV, because
> > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > (and on Android it'll be a while before i have to worry about libc's
> > > ifuncs because we'll require V and not use ifuncs there for the
> > > foreseeable future.)
> > >
> > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > no arguments" convention you have, we have room for expansion if/when
> > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > find looking at actual apps -- i don't think i have any reason to
> > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > a big chunk of my life advising people to just have one .so file,
> > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > advice even better advice :-) )
> >
> > Just to confirm, by "pass a null pointer", you're saying that the
> > Android libc also passes NULL as the second ifunc selector argument
> > (or first)?
>
> #elif defined(__riscv)
>   // This argument and its value is just a placeholder for now,
>   // but it means that if we do pass something in future (such as
>   // getauxval() and/or hwprobe key/value pairs), callees will be able to
>   // recognize what they're being given.
>   typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
>   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
>
> it's arm64 that has the initial getauxval() argument:
>
> #if defined(__aarch64__)
>   typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
>   static __ifunc_arg_t arg;
>   static bool initialized = false;
>   if (!initialized) {
>     initialized = true;
>     arg._size = sizeof(__ifunc_arg_t);
>     arg._hwcap = getauxval(AT_HWCAP);
>     arg._hwcap2 = getauxval(AT_HWCAP2);
>   }
>   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> | _IFUNC_ARG_HWCAP, &arg);
>
> https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
>
> > That's good. It sounds like you're planning to just
> > continue passing NULL for now, and wait for people to start clamoring
> > for this in android libc?
>
> yeah, and i'm assuming there will never be any clamor ... yesterday
> and today i actually checked a bunch of popular apks, and didn't find
> any that were currently using ifuncs.
>
> the only change i'm thinking of making right now is that "there's a
> single argument, and it's null" should probably be the default.
> obviously since Android doesn't add new architectures very often, this
> is only likely to affect x86/x86-64 for the foreseeable future, but
> being able to recognize at a glance "am i running under a libc new
> enough to pass me arguments?" would certainly have helped for arm64.
> even if x86/x86-64 never benefit, it seems like the right default for
> the #else clause...

Sounds good, thanks for the pointers. The paranoid person in me would
also add a comment in the risc-v section that if a pointer to hwprobe
is added, it should be added as the second argument, behind hwcap as
the first (assuming this change lands).

Come to think of it, the static inline helper I'm proposing in my
discussion with Richard needs to take both arguments, since callers
need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
arg2 is a pointer to __riscv_hwprobe().

-Evan
  
enh Aug. 16, 2023, 11:18 p.m. UTC | #14
On Tue, Aug 15, 2023 at 4:02 PM Evan Green <evan@rivosinc.com> wrote:
>
> On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> > >
> > > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > > >
> > > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > > >
> > > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > >
> > > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > >> Outside libc something is required.
> > > > > > > > >>
> > > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > > >> could always be called directly.
> > > > > > > > >
> > > > > > > > > My previous spin took that approach, defining a
> > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > > >
> > > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > > falling back to the syscall.
> > > > > > > >
> > > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > > >
> > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > > prior to full relocations being complete.
> > > > > > >
> > > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > > the vDSO.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > > above".
> > > > > > > >
> > > > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > > > going to be used.
> > > > > > > >
> > > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > > field.  E.g.
> > > > > > > >
> > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > > {
> > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > > >    if (err)
> > > > > > > >      return err;
> > > > > > > >    if (pair.key == -1)
> > > > > > > >      return -ENOENT;
> > > > > > > >    return pair.value;
> > > > > > > > }
> > > > > > > >
> > > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > > > >
> > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > > {
> > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > > >            && pair.key != -1
> > > > > > > >            && (pair.value & mask) == val);
> > > > > > > > }
> > > > > > > >
> > > > > > > > These yield either
> > > > > > > >
> > > > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > > >        return __memcpy_noalignment;
> > > > > > > >      return __memcpy_generic;
> > > > > > > >
> > > > > > > > or
> > > > > > > >
> > > > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > > >        return __memcpy_noalignment;
> > > > > > > >      return __memcpy_generic;
> > > > > > > >
> > > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > > across all of the ifunc implementations in the system.
> > > > > > >
> > > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > > looks something like this (mangled by gmail, sorry):
> > > > > > >
> > > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > > signed long long int key,
> > > > > > > unsigned long long int *value)
> > > > > > > {
> > > > > > > struct riscv_hwprobe pair;
> > > > > > > int rc;
> > > > > > >
> > > > > > > if (!hwprobe_func)
> > > > > > > return -ENOSYS;
> > > > > > >
> > > > > > > pair.key = key;
> > > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > > if (rc) {
> > > > > > > return rc;
> > > > > > > }
> > > > > > >
> > > > > > > if (pair.key < 0) {
> > > > > > > return -ENOENT;
> > > > > > > }
> > > > > > >
> > > > > > > *value = pair.value;
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > > something like:
> > > > > > >
> > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > > return __memcpy_generic;
> > > > > > >
> > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > > return __memcpy_noalignment;
> > > > > >
> > > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > > about risc-v ifuncs :-) )
> > > > > >
> > > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > > >
> > > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > > static struct riscv_hwprobe probes[] = {
> > > > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > > };
> > > > > > __riscv_hwprobe(...); // called once
> > > > > >
> > > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > > >
> > > > > > this is similar to what we already have for arm64 (where there's a
> > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > > potentially others), but more uniform, and avoiding the source
> > > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > > >
> > > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > > resolvers like
> > > > > >
> > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > > >
> > > > > > though personally for the "big ticket items" that get a letter to
> > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > > controversial :-)
> > > > >
> > > > > Hello, welcome to the fun! :)
> > > >
> > > > (sorry for the delay. i've been thinking :-) )
> > > >
> > > > > What you're describing here is almost exactly what we did inside the
> > > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > > probe values that we've already completed and cached in userspace. We
> > > > > opted to make it a function, rather than exposing the data itself via
> > > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > > userspace and their storage format. We can update the kernel as needed
> > > > > to cache the hottest things in userspace, even if that means
> > > > > rearranging the data format, passing through some extra information,
> > > > > or adding an extra snip of code. My hope is callers can directly
> > > > > interact with the vDSO function (though maybe as Richard suggested
> > > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > > add a second layer of userspace caching.
> > > >
> > > > on reflection i think i might be too focused on the FMV use case, in
> > > > part because we're looking at those compiler-generated ifuncs for
> > > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > > lot of that, and worrying about having to pay for the setup, call, and
> > > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > > (as a bit of background context, Android "app" start is actually a
> > > > dlopen() in a clone of an existing zygote process, and in general app
> > > > launch time is one of the key metrics anyone who's serious is
> > > > optimizing for. you'd be surprised how much of my life i spend
> > > > explaining to people that if they want dlopen() to be faster, maybe
> > > > they shouldn't ask us to run thousands of ELF constructors.)
> > > >
> > > > but... the more time i spend looking at what we actually need in
> > > > third-party open source libraries right now i realize that libc and
> > > > FMV (which is still a future thing for us anyway) are really the only
> > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > > are just doing their own thing with function pointers and
> > > > pthread_once() or whatever.
> > > >
> > > > (i have yet to try to get any data on actual apps. i have no reason to
> > > > think they'll be very different, but that could easily be skewed by
> > > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > > on following up on that.)
> > > >
> > > > "how do they decide what to set that function pointer to?". well, it
> > > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > > everywhere else. in some cases that's actually via some other library:
> > > > https://github.com/pytorch/cpuinfo or
> > > > https://github.com/google/cpu_features for example. so they have a
> > > > layer of caching there, even in cases where they don't have a single
> > > > function that sets all the function pointers.
> > >
> > > Right, function multi-versioning is just the sort of spot where we'd
> > > imagine hwprobe gets used, since it's providing similar/equivalent
> > > information to what cpuid does on x86. It may not be quite as fast as
> > > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > > function+data in userspace it should be able to match getauxval() in
> > > performance, as they're both a function pointer plus a loop. We're
> > > sort of planning for a world in which RISC-V has a wider set of these
> > > values to fetch, such that a ifunc selector may need a more complex
> > > set of information. Hwprobe and the vDSO gives us the ability both to
> > > answer multiple queries fast, and freely allocate more keys that may
> > > represent versioned features or even compound features.
> >
> > yeah, my incorrect mental model was that -- primarily because of
> > x86-64 and cpuid -- every function would get its own ifunc resolver
> > that would have to make a query. but the [in progress] arm64
> > implementation shows that that's not really the case anyway, and we
> > can just cache __riscv_hwprobe() in the same [one] place that
> > getauxval() is already being cached for arm64.
>
> Sounds good.
>
> >
> > > > so assuming i don't find that apps look very different from the OS
> > > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > > until we get to FMV. and i probably don't care for FMV, because
> > > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > > (and on Android it'll be a while before i have to worry about libc's
> > > > ifuncs because we'll require V and not use ifuncs there for the
> > > > foreseeable future.)
> > > >
> > > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > > no arguments" convention you have, we have room for expansion if/when
> > > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > > find looking at actual apps -- i don't think i have any reason to
> > > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > > a big chunk of my life advising people to just have one .so file,
> > > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > > advice even better advice :-) )
> > >
> > > Just to confirm, by "pass a null pointer", you're saying that the
> > > Android libc also passes NULL as the second ifunc selector argument
> > > (or first)?
> >
> > #elif defined(__riscv)
> >   // This argument and its value is just a placeholder for now,
> >   // but it means that if we do pass something in future (such as
> >   // getauxval() and/or hwprobe key/value pairs), callees will be able to
> >   // recognize what they're being given.
> >   typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
> >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
> >
> > it's arm64 that has the initial getauxval() argument:
> >
> > #if defined(__aarch64__)
> >   typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
> >   static __ifunc_arg_t arg;
> >   static bool initialized = false;
> >   if (!initialized) {
> >     initialized = true;
> >     arg._size = sizeof(__ifunc_arg_t);
> >     arg._hwcap = getauxval(AT_HWCAP);
> >     arg._hwcap2 = getauxval(AT_HWCAP2);
> >   }
> >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> > | _IFUNC_ARG_HWCAP, &arg);
> >
> > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
> >
> > > That's good. It sounds like you're planning to just
> > > continue passing NULL for now, and wait for people to start clamoring
> > > for this in android libc?
> >
> > yeah, and i'm assuming there will never be any clamor ... yesterday
> > and today i actually checked a bunch of popular apks, and didn't find
> > any that were currently using ifuncs.
> >
> > the only change i'm thinking of making right now is that "there's a
> > single argument, and it's null" should probably be the default.
> > obviously since Android doesn't add new architectures very often, this
> > is only likely to affect x86/x86-64 for the foreseeable future, but
> > being able to recognize at a glance "am i running under a libc new
> > enough to pass me arguments?" would certainly have helped for arm64.
> > even if x86/x86-64 never benefit, it seems like the right default for
> > the #else clause...
>
> Sounds good, thanks for the pointers. The paranoid person in me would
> also add a comment in the risc-v section that if a pointer to hwprobe
> is added, it should be added as the second argument, behind hwcap as
> the first (assuming this change lands).
>
> Come to think of it, the static inline helper I'm proposing in my
> discussion with Richard needs to take both arguments, since callers
> need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
> arg2 is a pointer to __riscv_hwprobe().

presumably not `(arg1 != 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match arm64?

> -Evan
  
Evan Green Aug. 17, 2023, 4:27 p.m. UTC | #15
On Wed, Aug 16, 2023 at 4:18 PM enh <enh@google.com> wrote:
>
> On Tue, Aug 15, 2023 at 4:02 PM Evan Green <evan@rivosinc.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
> > >
> > > On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> > > >
> > > > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > > > >
> > > > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > >> Outside libc something is required.
> > > > > > > > > >>
> > > > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > > > >> could always be called directly.
> > > > > > > > > >
> > > > > > > > > > My previous spin took that approach, defining a
> > > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > > > >
> > > > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > > > falling back to the syscall.
> > > > > > > > >
> > > > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > > > >
> > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > > > prior to full relocations being complete.
> > > > > > > >
> > > > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > > > the vDSO.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > > > above".
> > > > > > > > >
> > > > > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > > > > going to be used.
> > > > > > > > >
> > > > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > > > field.  E.g.
> > > > > > > > >
> > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > > > {
> > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > > > >    if (err)
> > > > > > > > >      return err;
> > > > > > > > >    if (pair.key == -1)
> > > > > > > > >      return -ENOENT;
> > > > > > > > >    return pair.value;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > > > > >
> > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > > > {
> > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > > > >            && pair.key != -1
> > > > > > > > >            && (pair.value & mask) == val);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > These yield either
> > > > > > > > >
> > > > > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > >      return __memcpy_generic;
> > > > > > > > >
> > > > > > > > > or
> > > > > > > > >
> > > > > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > >      return __memcpy_generic;
> > > > > > > > >
> > > > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > > > across all of the ifunc implementations in the system.
> > > > > > > >
> > > > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > > > looks something like this (mangled by gmail, sorry):
> > > > > > > >
> > > > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > > > signed long long int key,
> > > > > > > > unsigned long long int *value)
> > > > > > > > {
> > > > > > > > struct riscv_hwprobe pair;
> > > > > > > > int rc;
> > > > > > > >
> > > > > > > > if (!hwprobe_func)
> > > > > > > > return -ENOSYS;
> > > > > > > >
> > > > > > > > pair.key = key;
> > > > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > > > if (rc) {
> > > > > > > > return rc;
> > > > > > > > }
> > > > > > > >
> > > > > > > > if (pair.key < 0) {
> > > > > > > > return -ENOENT;
> > > > > > > > }
> > > > > > > >
> > > > > > > > *value = pair.value;
> > > > > > > > return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > > > something like:
> > > > > > > >
> > > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > > > return __memcpy_generic;
> > > > > > > >
> > > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > > > return __memcpy_noalignment;
> > > > > > >
> > > > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > > > about risc-v ifuncs :-) )
> > > > > > >
> > > > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > > > >
> > > > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > > > static struct riscv_hwprobe probes[] = {
> > > > > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > > > };
> > > > > > > __riscv_hwprobe(...); // called once
> > > > > > >
> > > > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > > > >
> > > > > > > this is similar to what we already have for arm64 (where there's a
> > > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > > > potentially others), but more uniform, and avoiding the source
> > > > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > > > >
> > > > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > > > resolvers like
> > > > > > >
> > > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > > > >
> > > > > > > though personally for the "big ticket items" that get a letter to
> > > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > > > controversial :-)
> > > > > >
> > > > > > Hello, welcome to the fun! :)
> > > > >
> > > > > (sorry for the delay. i've been thinking :-) )
> > > > >
> > > > > > What you're describing here is almost exactly what we did inside the
> > > > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > > > probe values that we've already completed and cached in userspace. We
> > > > > > opted to make it a function, rather than exposing the data itself via
> > > > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > > > userspace and their storage format. We can update the kernel as needed
> > > > > > to cache the hottest things in userspace, even if that means
> > > > > > rearranging the data format, passing through some extra information,
> > > > > > or adding an extra snip of code. My hope is callers can directly
> > > > > > interact with the vDSO function (though maybe as Richard suggested
> > > > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > > > add a second layer of userspace caching.
> > > > >
> > > > > on reflection i think i might be too focused on the FMV use case, in
> > > > > part because we're looking at those compiler-generated ifuncs for
> > > > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > > > lot of that, and worrying about having to pay for the setup, call, and
> > > > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > > > (as a bit of background context, Android "app" start is actually a
> > > > > dlopen() in a clone of an existing zygote process, and in general app
> > > > > launch time is one of the key metrics anyone who's serious is
> > > > > optimizing for. you'd be surprised how much of my life i spend
> > > > > explaining to people that if they want dlopen() to be faster, maybe
> > > > > they shouldn't ask us to run thousands of ELF constructors.)
> > > > >
> > > > > but... the more time i spend looking at what we actually need in
> > > > > third-party open source libraries right now i realize that libc and
> > > > > FMV (which is still a future thing for us anyway) are really the only
> > > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > > > are just doing their own thing with function pointers and
> > > > > pthread_once() or whatever.
> > > > >
> > > > > (i have yet to try to get any data on actual apps. i have no reason to
> > > > > think they'll be very different, but that could easily be skewed by
> > > > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > > > on following up on that.)
> > > > >
> > > > > "how do they decide what to set that function pointer to?". well, it
> > > > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > > > everywhere else. in some cases that's actually via some other library:
> > > > > https://github.com/pytorch/cpuinfo or
> > > > > https://github.com/google/cpu_features for example. so they have a
> > > > > layer of caching there, even in cases where they don't have a single
> > > > > function that sets all the function pointers.
> > > >
> > > > Right, function multi-versioning is just the sort of spot where we'd
> > > > imagine hwprobe gets used, since it's providing similar/equivalent
> > > > information to what cpuid does on x86. It may not be quite as fast as
> > > > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > > > function+data in userspace it should be able to match getauxval() in
> > > > performance, as they're both a function pointer plus a loop. We're
> > > > sort of planning for a world in which RISC-V has a wider set of these
> > > > values to fetch, such that a ifunc selector may need a more complex
> > > > set of information. Hwprobe and the vDSO gives us the ability both to
> > > > answer multiple queries fast, and freely allocate more keys that may
> > > > represent versioned features or even compound features.
> > >
> > > yeah, my incorrect mental model was that -- primarily because of
> > > x86-64 and cpuid -- every function would get its own ifunc resolver
> > > that would have to make a query. but the [in progress] arm64
> > > implementation shows that that's not really the case anyway, and we
> > > can just cache __riscv_hwprobe() in the same [one] place that
> > > getauxval() is already being cached for arm64.
> >
> > Sounds good.
> >
> > >
> > > > > so assuming i don't find that apps look very different from the OS
> > > > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > > > until we get to FMV. and i probably don't care for FMV, because
> > > > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > > > (and on Android it'll be a while before i have to worry about libc's
> > > > > ifuncs because we'll require V and not use ifuncs there for the
> > > > > foreseeable future.)
> > > > >
> > > > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > > > no arguments" convention you have, we have room for expansion if/when
> > > > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > > > find looking at actual apps -- i don't think i have any reason to
> > > > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > > > a big chunk of my life advising people to just have one .so file,
> > > > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > > > advice even better advice :-) )
> > > >
> > > > Just to confirm, by "pass a null pointer", you're saying that the
> > > > Android libc also passes NULL as the second ifunc selector argument
> > > > (or first)?
> > >
> > > #elif defined(__riscv)
> > >   // This argument and its value is just a placeholder for now,
> > >   // but it means that if we do pass something in future (such as
> > >   // getauxval() and/or hwprobe key/value pairs), callees will be able to
> > >   // recognize what they're being given.
> > >   typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
> > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
> > >
> > > it's arm64 that has the initial getauxval() argument:
> > >
> > > #if defined(__aarch64__)
> > >   typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
> > >   static __ifunc_arg_t arg;
> > >   static bool initialized = false;
> > >   if (!initialized) {
> > >     initialized = true;
> > >     arg._size = sizeof(__ifunc_arg_t);
> > >     arg._hwcap = getauxval(AT_HWCAP);
> > >     arg._hwcap2 = getauxval(AT_HWCAP2);
> > >   }
> > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> > > | _IFUNC_ARG_HWCAP, &arg);
> > >
> > > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
> > >
> > > > That's good. It sounds like you're planning to just
> > > > continue passing NULL for now, and wait for people to start clamoring
> > > > for this in android libc?
> > >
> > > yeah, and i'm assuming there will never be any clamor ... yesterday
> > > and today i actually checked a bunch of popular apks, and didn't find
> > > any that were currently using ifuncs.
> > >
> > > the only change i'm thinking of making right now is that "there's a
> > > single argument, and it's null" should probably be the default.
> > > obviously since Android doesn't add new architectures very often, this
> > > is only likely to affect x86/x86-64 for the foreseeable future, but
> > > being able to recognize at a glance "am i running under a libc new
> > > enough to pass me arguments?" would certainly have helped for arm64.
> > > even if x86/x86-64 never benefit, it seems like the right default for
> > > the #else clause...
> >
> > Sounds good, thanks for the pointers. The paranoid person in me would
> > also add a comment in the risc-v section that if a pointer to hwprobe
> > is added, it should be added as the second argument, behind hwcap as
> > the first (assuming this change lands).
> >
> > Come to think of it, the static inline helper I'm proposing in my
> > discussion with Richard needs to take both arguments, since callers
> > need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
> > arg2 is a pointer to __riscv_hwprobe().
>
> presumably not `(arg1 != 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match arm64?

It looks like we didn't do that _IFUNC_ARG_HWCAP bit on riscv.
Actually, looking at the history of sysdeps/riscv/dl-irel.h, hwcap has
always been passed as the first argument. So I think I don't need to
check it in the (glibc-specific) inline helper function, I can safely
assume it's there and go straight to checking the second argument.

If you were coding this directly in a library or application, you
would need to check the first arg to be compatible with other libcs
like Android's. I think checking against zero should be ok since bits
like I, M, A should always be set. (I didn't dig through the kernel
history, so maybe not on really old kernels? But you're not going to
get hwprobe on those anyway either so the false bailout is correct).

-Evan
  
enh Aug. 17, 2023, 4:37 p.m. UTC | #16
On Thu, Aug 17, 2023 at 9:27 AM Evan Green <evan@rivosinc.com> wrote:
>
> On Wed, Aug 16, 2023 at 4:18 PM enh <enh@google.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 4:02 PM Evan Green <evan@rivosinc.com> wrote:
> > >
> > > On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
> > > >
> > > > On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> > > > >
> > > > > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > > >> Outside libc something is required.
> > > > > > > > > > >>
> > > > > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > > > > >> could always be called directly.
> > > > > > > > > > >
> > > > > > > > > > > My previous spin took that approach, defining a
> > > > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > > > > >
> > > > > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > > > > falling back to the syscall.
> > > > > > > > > >
> > > > > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > > > > >
> > > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > > > > prior to full relocations being complete.
> > > > > > > > >
> > > > > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > > > > the vDSO.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > > > > above".
> > > > > > > > > >
> > > > > > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > > > > > going to be used.
> > > > > > > > > >
> > > > > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > > > > field.  E.g.
> > > > > > > > > >
> > > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > > > > {
> > > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > > > > >    if (err)
> > > > > > > > > >      return err;
> > > > > > > > > >    if (pair.key == -1)
> > > > > > > > > >      return -ENOENT;
> > > > > > > > > >    return pair.value;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > > > > > >
> > > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > > > > {
> > > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > > > > >            && pair.key != -1
> > > > > > > > > >            && (pair.value & mask) == val);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > These yield either
> > > > > > > > > >
> > > > > > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > > >      return __memcpy_generic;
> > > > > > > > > >
> > > > > > > > > > or
> > > > > > > > > >
> > > > > > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > > >      return __memcpy_generic;
> > > > > > > > > >
> > > > > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > > > > across all of the ifunc implementations in the system.
> > > > > > > > >
> > > > > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > > > > looks something like this (mangled by gmail, sorry):
> > > > > > > > >
> > > > > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > > > > signed long long int key,
> > > > > > > > > unsigned long long int *value)
> > > > > > > > > {
> > > > > > > > > struct riscv_hwprobe pair;
> > > > > > > > > int rc;
> > > > > > > > >
> > > > > > > > > if (!hwprobe_func)
> > > > > > > > > return -ENOSYS;
> > > > > > > > >
> > > > > > > > > pair.key = key;
> > > > > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > > > > if (rc) {
> > > > > > > > > return rc;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > if (pair.key < 0) {
> > > > > > > > > return -ENOENT;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > *value = pair.value;
> > > > > > > > > return 0;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > > > > something like:
> > > > > > > > >
> > > > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > > > > return __memcpy_generic;
> > > > > > > > >
> > > > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > > > > return __memcpy_noalignment;
> > > > > > > >
> > > > > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > > > > about risc-v ifuncs :-) )
> > > > > > > >
> > > > > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > > > > >
> > > > > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > > > > static struct riscv_hwprobe probes[] = {
> > > > > > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > > > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > > > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > > > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > > > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > > > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > > > > };
> > > > > > > > __riscv_hwprobe(...); // called once
> > > > > > > >
> > > > > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > > > > >
> > > > > > > > this is similar to what we already have for arm64 (where there's a
> > > > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > > > > potentially others), but more uniform, and avoiding the source
> > > > > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > > > > >
> > > > > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > > > > resolvers like
> > > > > > > >
> > > > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > > > > >
> > > > > > > > though personally for the "big ticket items" that get a letter to
> > > > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > > > > controversial :-)
> > > > > > >
> > > > > > > Hello, welcome to the fun! :)
> > > > > >
> > > > > > (sorry for the delay. i've been thinking :-) )
> > > > > >
> > > > > > > What you're describing here is almost exactly what we did inside the
> > > > > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > > > > probe values that we've already completed and cached in userspace. We
> > > > > > > opted to make it a function, rather than exposing the data itself via
> > > > > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > > > > userspace and their storage format. We can update the kernel as needed
> > > > > > > to cache the hottest things in userspace, even if that means
> > > > > > > rearranging the data format, passing through some extra information,
> > > > > > > or adding an extra snip of code. My hope is callers can directly
> > > > > > > interact with the vDSO function (though maybe as Richard suggested
> > > > > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > > > > add a second layer of userspace caching.
> > > > > >
> > > > > > on reflection i think i might be too focused on the FMV use case, in
> > > > > > part because we're looking at those compiler-generated ifuncs for
> > > > > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > > > > lot of that, and worrying about having to pay for the setup, call, and
> > > > > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > > > > (as a bit of background context, Android "app" start is actually a
> > > > > > dlopen() in a clone of an existing zygote process, and in general app
> > > > > > launch time is one of the key metrics anyone who's serious is
> > > > > > optimizing for. you'd be surprised how much of my life i spend
> > > > > > explaining to people that if they want dlopen() to be faster, maybe
> > > > > > they shouldn't ask us to run thousands of ELF constructors.)
> > > > > >
> > > > > > but... the more time i spend looking at what we actually need in
> > > > > > third-party open source libraries right now i realize that libc and
> > > > > > FMV (which is still a future thing for us anyway) are really the only
> > > > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > > > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > > > > are just doing their own thing with function pointers and
> > > > > > pthread_once() or whatever.
> > > > > >
> > > > > > (i have yet to try to get any data on actual apps. i have no reason to
> > > > > > think they'll be very different, but that could easily be skewed by
> > > > > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > > > > on following up on that.)
> > > > > >
> > > > > > "how do they decide what to set that function pointer to?". well, it
> > > > > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > > > > everywhere else. in some cases that's actually via some other library:
> > > > > > https://github.com/pytorch/cpuinfo or
> > > > > > https://github.com/google/cpu_features for example. so they have a
> > > > > > layer of caching there, even in cases where they don't have a single
> > > > > > function that sets all the function pointers.
> > > > >
> > > > > Right, function multi-versioning is just the sort of spot where we'd
> > > > > imagine hwprobe gets used, since it's providing similar/equivalent
> > > > > information to what cpuid does on x86. It may not be quite as fast as
> > > > > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > > > > function+data in userspace it should be able to match getauxval() in
> > > > > performance, as they're both a function pointer plus a loop. We're
> > > > > sort of planning for a world in which RISC-V has a wider set of these
> > > > > values to fetch, such that a ifunc selector may need a more complex
> > > > > set of information. Hwprobe and the vDSO gives us the ability both to
> > > > > answer multiple queries fast, and freely allocate more keys that may
> > > > > represent versioned features or even compound features.
> > > >
> > > > yeah, my incorrect mental model was that -- primarily because of
> > > > x86-64 and cpuid -- every function would get its own ifunc resolver
> > > > that would have to make a query. but the [in progress] arm64
> > > > implementation shows that that's not really the case anyway, and we
> > > > can just cache __riscv_hwprobe() in the same [one] place that
> > > > getauxval() is already being cached for arm64.
> > >
> > > Sounds good.
> > >
> > > >
> > > > > > so assuming i don't find that apps look very different from the OS
> > > > > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > > > > until we get to FMV. and i probably don't care for FMV, because
> > > > > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > > > > (and on Android it'll be a while before i have to worry about libc's
> > > > > > ifuncs because we'll require V and not use ifuncs there for the
> > > > > > foreseeable future.)
> > > > > >
> > > > > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > > > > no arguments" convention you have, we have room for expansion if/when
> > > > > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > > > > find looking at actual apps -- i don't think i have any reason to
> > > > > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > > > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > > > > a big chunk of my life advising people to just have one .so file,
> > > > > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > > > > advice even better advice :-) )
> > > > >
> > > > > Just to confirm, by "pass a null pointer", you're saying that the
> > > > > Android libc also passes NULL as the second ifunc selector argument
> > > > > (or first)?
> > > >
> > > > #elif defined(__riscv)
> > > >   // This argument and its value is just a placeholder for now,
> > > >   // but it means that if we do pass something in future (such as
> > > >   // getauxval() and/or hwprobe key/value pairs), callees will be able to
> > > >   // recognize what they're being given.
> > > >   typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
> > > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
> > > >
> > > > it's arm64 that has the initial getauxval() argument:
> > > >
> > > > #if defined(__aarch64__)
> > > >   typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
> > > >   static __ifunc_arg_t arg;
> > > >   static bool initialized = false;
> > > >   if (!initialized) {
> > > >     initialized = true;
> > > >     arg._size = sizeof(__ifunc_arg_t);
> > > >     arg._hwcap = getauxval(AT_HWCAP);
> > > >     arg._hwcap2 = getauxval(AT_HWCAP2);
> > > >   }
> > > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> > > > | _IFUNC_ARG_HWCAP, &arg);
> > > >
> > > > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
> > > >
> > > > > That's good. It sounds like you're planning to just
> > > > > continue passing NULL for now, and wait for people to start clamoring
> > > > > for this in android libc?
> > > >
> > > > yeah, and i'm assuming there will never be any clamor ... yesterday
> > > > and today i actually checked a bunch of popular apks, and didn't find
> > > > any that were currently using ifuncs.
> > > >
> > > > the only change i'm thinking of making right now is that "there's a
> > > > single argument, and it's null" should probably be the default.
> > > > obviously since Android doesn't add new architectures very often, this
> > > > is only likely to affect x86/x86-64 for the foreseeable future, but
> > > > being able to recognize at a glance "am i running under a libc new
> > > > enough to pass me arguments?" would certainly have helped for arm64.
> > > > even if x86/x86-64 never benefit, it seems like the right default for
> > > > the #else clause...
> > >
> > > Sounds good, thanks for the pointers. The paranoid person in me would
> > > also add a comment in the risc-v section that if a pointer to hwprobe
> > > is added, it should be added as the second argument, behind hwcap as
> > > the first (assuming this change lands).
> > >
> > > Come to think of it, the static inline helper I'm proposing in my
> > > discussion with Richard needs to take both arguments, since callers
> > > need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
> > > arg2 is a pointer to __riscv_hwprobe().
> >
> > presumably not `(arg1 != 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match arm64?
>
> It looks like we didn't do that _IFUNC_ARG_HWCAP bit on riscv.
> Actually, looking at the history of sysdeps/riscv/dl-irel.h, hwcap has
> always been passed as the first argument. So I think I don't need to
> check it in the (glibc-specific) inline helper function, I can safely
> assume it's there and go straight to checking the second argument.

oh i misunderstood what you were saying earlier.

> If you were coding this directly in a library or application, you
> would need to check the first arg to be compatible with other libcs
> like Android's.

we haven't shipped yet, so if you're telling me glibc is passing
`(getauxval(AT_HWCAP), nullptr)`, i'll change bionic to do the same
today ... the whole reason i'm here on this thread is to ensure source
compatibility for anyone writing ifuncs :-)

> I think checking against zero should be ok since bits
> like I, M, A should always be set. (I didn't dig through the kernel
> history, so maybe not on really old kernels? But you're not going to
> get hwprobe on those anyway either so the false bailout is correct).
>
> -Evan
  
Evan Green Aug. 17, 2023, 5:40 p.m. UTC | #17
On Thu, Aug 17, 2023 at 9:37 AM enh <enh@google.com> wrote:
>
> On Thu, Aug 17, 2023 at 9:27 AM Evan Green <evan@rivosinc.com> wrote:
> >
> > On Wed, Aug 16, 2023 at 4:18 PM enh <enh@google.com> wrote:
> > >
> > > On Tue, Aug 15, 2023 at 4:02 PM Evan Green <evan@rivosinc.com> wrote:
> > > >
> > > > On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
> > > > >
> > > > > On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> > > > > >
> > > > > > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > > > >> Outside libc something is required.
> > > > > > > > > > > >>
> > > > > > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > > > > > >> could always be called directly.
> > > > > > > > > > > >
> > > > > > > > > > > > My previous spin took that approach, defining a
> > > > > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > > > > > >
> > > > > > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > > > > > falling back to the syscall.
> > > > > > > > > > >
> > > > > > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > > > > > >
> > > > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > > > > > prior to full relocations being complete.
> > > > > > > > > >
> > > > > > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > > > > > the vDSO.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > > > > > above".
> > > > > > > > > > >
> > > > > > > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > > > > > > going to be used.
> > > > > > > > > > >
> > > > > > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > > > > > field.  E.g.
> > > > > > > > > > >
> > > > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > > > > > {
> > > > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > > > > > >    if (err)
> > > > > > > > > > >      return err;
> > > > > > > > > > >    if (pair.key == -1)
> > > > > > > > > > >      return -ENOENT;
> > > > > > > > > > >    return pair.value;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > > > > > > >
> > > > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > > > > > {
> > > > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > > > > > >            && pair.key != -1
> > > > > > > > > > >            && (pair.value & mask) == val);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > These yield either
> > > > > > > > > > >
> > > > > > > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > > > >      return __memcpy_generic;
> > > > > > > > > > >
> > > > > > > > > > > or
> > > > > > > > > > >
> > > > > > > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > > > >      return __memcpy_generic;
> > > > > > > > > > >
> > > > > > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > > > > > across all of the ifunc implementations in the system.
> > > > > > > > > >
> > > > > > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > > > > > looks something like this (mangled by gmail, sorry):
> > > > > > > > > >
> > > > > > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > > > > > signed long long int key,
> > > > > > > > > > unsigned long long int *value)
> > > > > > > > > > {
> > > > > > > > > > struct riscv_hwprobe pair;
> > > > > > > > > > int rc;
> > > > > > > > > >
> > > > > > > > > > if (!hwprobe_func)
> > > > > > > > > > return -ENOSYS;
> > > > > > > > > >
> > > > > > > > > > pair.key = key;
> > > > > > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > > > > > if (rc) {
> > > > > > > > > > return rc;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > if (pair.key < 0) {
> > > > > > > > > > return -ENOENT;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > *value = pair.value;
> > > > > > > > > > return 0;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > > > > > something like:
> > > > > > > > > >
> > > > > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > > > > > return __memcpy_generic;
> > > > > > > > > >
> > > > > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > > > > > return __memcpy_noalignment;
> > > > > > > > >
> > > > > > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > > > > > about risc-v ifuncs :-) )
> > > > > > > > >
> > > > > > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > > > > > >
> > > > > > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > > > > > static struct riscv_hwprobe probes[] = {
> > > > > > > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > > > > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > > > > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > > > > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > > > > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > > > > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > > > > > };
> > > > > > > > > __riscv_hwprobe(...); // called once
> > > > > > > > >
> > > > > > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > > > > > >
> > > > > > > > > this is similar to what we already have for arm64 (where there's a
> > > > > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > > > > > potentially others), but more uniform, and avoiding the source
> > > > > > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > > > > > >
> > > > > > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > > > > > resolvers like
> > > > > > > > >
> > > > > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > > > > > >
> > > > > > > > > though personally for the "big ticket items" that get a letter to
> > > > > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > > > > > controversial :-)
> > > > > > > >
> > > > > > > > Hello, welcome to the fun! :)
> > > > > > >
> > > > > > > (sorry for the delay. i've been thinking :-) )
> > > > > > >
> > > > > > > > What you're describing here is almost exactly what we did inside the
> > > > > > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > > > > > probe values that we've already completed and cached in userspace. We
> > > > > > > > opted to make it a function, rather than exposing the data itself via
> > > > > > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > > > > > userspace and their storage format. We can update the kernel as needed
> > > > > > > > to cache the hottest things in userspace, even if that means
> > > > > > > > rearranging the data format, passing through some extra information,
> > > > > > > > or adding an extra snip of code. My hope is callers can directly
> > > > > > > > interact with the vDSO function (though maybe as Richard suggested
> > > > > > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > > > > > add a second layer of userspace caching.
> > > > > > >
> > > > > > > on reflection i think i might be too focused on the FMV use case, in
> > > > > > > part because we're looking at those compiler-generated ifuncs for
> > > > > > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > > > > > lot of that, and worrying about having to pay for the setup, call, and
> > > > > > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > > > > > (as a bit of background context, Android "app" start is actually a
> > > > > > > dlopen() in a clone of an existing zygote process, and in general app
> > > > > > > launch time is one of the key metrics anyone who's serious is
> > > > > > > optimizing for. you'd be surprised how much of my life i spend
> > > > > > > explaining to people that if they want dlopen() to be faster, maybe
> > > > > > > they shouldn't ask us to run thousands of ELF constructors.)
> > > > > > >
> > > > > > > but... the more time i spend looking at what we actually need in
> > > > > > > third-party open source libraries right now i realize that libc and
> > > > > > > FMV (which is still a future thing for us anyway) are really the only
> > > > > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > > > > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > > > > > are just doing their own thing with function pointers and
> > > > > > > pthread_once() or whatever.
> > > > > > >
> > > > > > > (i have yet to try to get any data on actual apps. i have no reason to
> > > > > > > think they'll be very different, but that could easily be skewed by
> > > > > > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > > > > > on following up on that.)
> > > > > > >
> > > > > > > "how do they decide what to set that function pointer to?". well, it
> > > > > > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > > > > > everywhere else. in some cases that's actually via some other library:
> > > > > > > https://github.com/pytorch/cpuinfo or
> > > > > > > https://github.com/google/cpu_features for example. so they have a
> > > > > > > layer of caching there, even in cases where they don't have a single
> > > > > > > function that sets all the function pointers.
> > > > > >
> > > > > > Right, function multi-versioning is just the sort of spot where we'd
> > > > > > imagine hwprobe gets used, since it's providing similar/equivalent
> > > > > > information to what cpuid does on x86. It may not be quite as fast as
> > > > > > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > > > > > function+data in userspace it should be able to match getauxval() in
> > > > > > performance, as they're both a function pointer plus a loop. We're
> > > > > > sort of planning for a world in which RISC-V has a wider set of these
> > > > > > values to fetch, such that a ifunc selector may need a more complex
> > > > > > set of information. Hwprobe and the vDSO gives us the ability both to
> > > > > > answer multiple queries fast, and freely allocate more keys that may
> > > > > > represent versioned features or even compound features.
> > > > >
> > > > > yeah, my incorrect mental model was that -- primarily because of
> > > > > x86-64 and cpuid -- every function would get its own ifunc resolver
> > > > > that would have to make a query. but the [in progress] arm64
> > > > > implementation shows that that's not really the case anyway, and we
> > > > > can just cache __riscv_hwprobe() in the same [one] place that
> > > > > getauxval() is already being cached for arm64.
> > > >
> > > > Sounds good.
> > > >
> > > > >
> > > > > > > so assuming i don't find that apps look very different from the OS
> > > > > > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > > > > > until we get to FMV. and i probably don't care for FMV, because
> > > > > > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > > > > > (and on Android it'll be a while before i have to worry about libc's
> > > > > > > ifuncs because we'll require V and not use ifuncs there for the
> > > > > > > foreseeable future.)
> > > > > > >
> > > > > > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > > > > > no arguments" convention you have, we have room for expansion if/when
> > > > > > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > > > > > find looking at actual apps -- i don't think i have any reason to
> > > > > > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > > > > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > > > > > a big chunk of my life advising people to just have one .so file,
> > > > > > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > > > > > advice even better advice :-) )
> > > > > >
> > > > > > Just to confirm, by "pass a null pointer", you're saying that the
> > > > > > Android libc also passes NULL as the second ifunc selector argument
> > > > > > (or first)?
> > > > >
> > > > > #elif defined(__riscv)
> > > > >   // This argument and its value is just a placeholder for now,
> > > > >   // but it means that if we do pass something in future (such as
> > > > >   // getauxval() and/or hwprobe key/value pairs), callees will be able to
> > > > >   // recognize what they're being given.
> > > > >   typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
> > > > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
> > > > >
> > > > > it's arm64 that has the initial getauxval() argument:
> > > > >
> > > > > #if defined(__aarch64__)
> > > > >   typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
> > > > >   static __ifunc_arg_t arg;
> > > > >   static bool initialized = false;
> > > > >   if (!initialized) {
> > > > >     initialized = true;
> > > > >     arg._size = sizeof(__ifunc_arg_t);
> > > > >     arg._hwcap = getauxval(AT_HWCAP);
> > > > >     arg._hwcap2 = getauxval(AT_HWCAP2);
> > > > >   }
> > > > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> > > > > | _IFUNC_ARG_HWCAP, &arg);
> > > > >
> > > > > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
> > > > >
> > > > > > That's good. It sounds like you're planning to just
> > > > > > continue passing NULL for now, and wait for people to start clamoring
> > > > > > for this in android libc?
> > > > >
> > > > > yeah, and i'm assuming there will never be any clamor ... yesterday
> > > > > and today i actually checked a bunch of popular apks, and didn't find
> > > > > any that were currently using ifuncs.
> > > > >
> > > > > the only change i'm thinking of making right now is that "there's a
> > > > > single argument, and it's null" should probably be the default.
> > > > > obviously since Android doesn't add new architectures very often, this
> > > > > is only likely to affect x86/x86-64 for the foreseeable future, but
> > > > > being able to recognize at a glance "am i running under a libc new
> > > > > enough to pass me arguments?" would certainly have helped for arm64.
> > > > > even if x86/x86-64 never benefit, it seems like the right default for
> > > > > the #else clause...
> > > >
> > > > Sounds good, thanks for the pointers. The paranoid person in me would
> > > > also add a comment in the risc-v section that if a pointer to hwprobe
> > > > is added, it should be added as the second argument, behind hwcap as
> > > > the first (assuming this change lands).
> > > >
> > > > Come to think of it, the static inline helper I'm proposing in my
> > > > discussion with Richard needs to take both arguments, since callers
> > > > need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
> > > > arg2 is a pointer to __riscv_hwprobe().
> > >
> > > presumably not `(arg1 != 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match arm64?
> >
> > It looks like we didn't do that _IFUNC_ARG_HWCAP bit on riscv.
> > Actually, looking at the history of sysdeps/riscv/dl-irel.h, hwcap has
> > always been passed as the first argument. So I think I don't need to
> > check it in the (glibc-specific) inline helper function, I can safely
> > assume it's there and go straight to checking the second argument.
>
> oh i misunderstood what you were saying earlier.

My bad for causing confusion, I learned something from your reply and
changed my conclusion.

>
> > If you were coding this directly in a library or application, you
> > would need to check the first arg to be compatible with other libcs
> > like Android's.
>
> we haven't shipped yet, so if you're telling me glibc is passing
> `(getauxval(AT_HWCAP), nullptr)`, i'll change bionic to do the same
> today ... the whole reason i'm here on this thread is to ensure source
> compatibility for anyone writing ifuncs :-)

Ah, yes, they are (and have always on risc-v) passed
(getauxval(AT_HWCAP), nullptr), so do exactly that.
-Evan
  
enh Aug. 22, 2023, 3:06 p.m. UTC | #18
On Thu, Aug 17, 2023 at 10:41 AM Evan Green <evan@rivosinc.com> wrote:
>
> On Thu, Aug 17, 2023 at 9:37 AM enh <enh@google.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 9:27 AM Evan Green <evan@rivosinc.com> wrote:
> > >
> > > On Wed, Aug 16, 2023 at 4:18 PM enh <enh@google.com> wrote:
> > > >
> > > > On Tue, Aug 15, 2023 at 4:02 PM Evan Green <evan@rivosinc.com> wrote:
> > > > >
> > > > > On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > > > > >> Outside libc something is required.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > > > > > > >> hoops above.  I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > > > > > > >> could always be called directly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My previous spin took that approach, defining a
> > > > > > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > > > > > > >
> > > > > > > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > > > > > > falling back to the syscall.
> > > > > > > > > > > >
> > > > > > > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > > > > > > >
> > > > > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > > > > > > prior to full relocations being complete.
> > > > > > > > > > >
> > > > > > > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > > > > > > the vDSO.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > > > > > > above".
> > > > > > > > > > > >
> > > > > > > > > > > > The check for null function pointer, for sure.  But also consider how __riscv_hwprobe is
> > > > > > > > > > > > going to be used.
> > > > > > > > > > > >
> > > > > > > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > > > > > > field.  E.g.
> > > > > > > > > > > >
> > > > > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > > > > > > {
> > > > > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > > > >    int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > > > > > > >    if (err)
> > > > > > > > > > > >      return err;
> > > > > > > > > > > >    if (pair.key == -1)
> > > > > > > > > > > >      return -ENOENT;
> > > > > > > > > > > >    return pair.value;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > > > > > > value in the errno range (or better, bit 63 unused).  Alternately, or additionally:
> > > > > > > > > > > >
> > > > > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > > > > > > {
> > > > > > > > > > > >    struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > > > >    return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > > > > > > >            && pair.key != -1
> > > > > > > > > > > >            && (pair.value & mask) == val);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > These yield either
> > > > > > > > > > > >
> > > > > > > > > > > >      int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > > > > > > >      if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > > > > >      return __memcpy_generic;
> > > > > > > > > > > >
> > > > > > > > > > > > or
> > > > > > > > > > > >
> > > > > > > > > > > >      if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > > > > > > >        return __memcpy_noalignment;
> > > > > > > > > > > >      return __memcpy_generic;
> > > > > > > > > > > >
> > > > > > > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > > > > > > across all of the ifunc implementations in the system.
> > > > > > > > > > >
> > > > > > > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > > > > > > looks something like this (mangled by gmail, sorry):
> > > > > > > > > > >
> > > > > > > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > > > > > > signed long long int key,
> > > > > > > > > > > unsigned long long int *value)
> > > > > > > > > > > {
> > > > > > > > > > > struct riscv_hwprobe pair;
> > > > > > > > > > > int rc;
> > > > > > > > > > >
> > > > > > > > > > > if (!hwprobe_func)
> > > > > > > > > > > return -ENOSYS;
> > > > > > > > > > >
> > > > > > > > > > > pair.key = key;
> > > > > > > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > > > > > > if (rc) {
> > > > > > > > > > > return rc;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > if (pair.key < 0) {
> > > > > > > > > > > return -ENOENT;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > *value = pair.value;
> > > > > > > > > > > return 0;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > > > > > > something like:
> > > > > > > > > > >
> > > > > > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > > > > > > return __memcpy_generic;
> > > > > > > > > > >
> > > > > > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > > > > > > return __memcpy_noalignment;
> > > > > > > > > >
> > > > > > > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > > > > > > about risc-v ifuncs :-) )
> > > > > > > > > >
> > > > > > > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > > > > > > >
> > > > > > > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > > > > > > static struct riscv_hwprobe probes[] = {
> > > > > > > > > >  {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > > > > > > >  {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > > > > > > >  {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > > > > > > >  {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > > > > > > >  {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > > > > > > >  {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > > > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > > > > > > };
> > > > > > > > > > __riscv_hwprobe(...); // called once
> > > > > > > > > >
> > > > > > > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > > > > > > >
> > > > > > > > > > this is similar to what we already have for arm64 (where there's a
> > > > > > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > > > > > > potentially others), but more uniform, and avoiding the source
> > > > > > > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > > > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > > > > > > >
> > > > > > > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > > > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > > > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > > > > > > resolvers like
> > > > > > > > > >
> > > > > > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > > > > > > >
> > > > > > > > > > though personally for the "big ticket items" that get a letter to
> > > > > > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > > > > > > controversial :-)
> > > > > > > > >
> > > > > > > > > Hello, welcome to the fun! :)
> > > > > > > >
> > > > > > > > (sorry for the delay. i've been thinking :-) )
> > > > > > > >
> > > > > > > > > What you're describing here is almost exactly what we did inside the
> > > > > > > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > > > > > > probe values that we've already completed and cached in userspace. We
> > > > > > > > > opted to make it a function, rather than exposing the data itself via
> > > > > > > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > > > > > > userspace and their storage format. We can update the kernel as needed
> > > > > > > > > to cache the hottest things in userspace, even if that means
> > > > > > > > > rearranging the data format, passing through some extra information,
> > > > > > > > > or adding an extra snip of code. My hope is callers can directly
> > > > > > > > > interact with the vDSO function (though maybe as Richard suggested
> > > > > > > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > > > > > > add a second layer of userspace caching.
> > > > > > > >
> > > > > > > > on reflection i think i might be too focused on the FMV use case, in
> > > > > > > > part because we're looking at those compiler-generated ifuncs for
> > > > > > > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > > > > > > lot of that, and worrying about having to pay for the setup, call, and
> > > > > > > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > > > > > > (as a bit of background context, Android "app" start is actually a
> > > > > > > > dlopen() in a clone of an existing zygote process, and in general app
> > > > > > > > launch time is one of the key metrics anyone who's serious is
> > > > > > > > optimizing for. you'd be surprised how much of my life i spend
> > > > > > > > explaining to people that if they want dlopen() to be faster, maybe
> > > > > > > > they shouldn't ask us to run thousands of ELF constructors.)
> > > > > > > >
> > > > > > > > but... the more time i spend looking at what we actually need in
> > > > > > > > third-party open source libraries right now i realize that libc and
> > > > > > > > FMV (which is still a future thing for us anyway) are really the only
> > > > > > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > > > > > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > > > > > > are just doing their own thing with function pointers and
> > > > > > > > pthread_once() or whatever.
> > > > > > > >
> > > > > > > > (i have yet to try to get any data on actual apps. i have no reason to
> > > > > > > > think they'll be very different, but that could easily be skewed by
> > > > > > > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > > > > > > on following up on that.)
> > > > > > > >
> > > > > > > > "how do they decide what to set that function pointer to?". well, it
> > > > > > > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > > > > > > everywhere else. in some cases that's actually via some other library:
> > > > > > > > https://github.com/pytorch/cpuinfo or
> > > > > > > > https://github.com/google/cpu_features for example. so they have a
> > > > > > > > layer of caching there, even in cases where they don't have a single
> > > > > > > > function that sets all the function pointers.
> > > > > > >
> > > > > > > Right, function multi-versioning is just the sort of spot where we'd
> > > > > > > imagine hwprobe gets used, since it's providing similar/equivalent
> > > > > > > information to what cpuid does on x86. It may not be quite as fast as
> > > > > > > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > > > > > > function+data in userspace it should be able to match getauxval() in
> > > > > > > performance, as they're both a function pointer plus a loop. We're
> > > > > > > sort of planning for a world in which RISC-V has a wider set of these
> > > > > > > values to fetch, such that a ifunc selector may need a more complex
> > > > > > > set of information. Hwprobe and the vDSO gives us the ability both to
> > > > > > > answer multiple queries fast, and freely allocate more keys that may
> > > > > > > represent versioned features or even compound features.
> > > > > >
> > > > > > yeah, my incorrect mental model was that -- primarily because of
> > > > > > x86-64 and cpuid -- every function would get its own ifunc resolver
> > > > > > that would have to make a query. but the [in progress] arm64
> > > > > > implementation shows that that's not really the case anyway, and we
> > > > > > can just cache __riscv_hwprobe() in the same [one] place that
> > > > > > getauxval() is already being cached for arm64.
> > > > >
> > > > > Sounds good.
> > > > >
> > > > > >
> > > > > > > > so assuming i don't find that apps look very different from the OS
> > > > > > > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > > > > > > until we get to FMV. and i probably don't care for FMV, because
> > > > > > > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > > > > > > (and on Android it'll be a while before i have to worry about libc's
> > > > > > > > ifuncs because we'll require V and not use ifuncs there for the
> > > > > > > > foreseeable future.)
> > > > > > > >
> > > > > > > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > > > > > > no arguments" convention you have, we have room for expansion if/when
> > > > > > > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > > > > > > find looking at actual apps -- i don't think i have any reason to
> > > > > > > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > > > > > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > > > > > > a big chunk of my life advising people to just have one .so file,
> > > > > > > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > > > > > > advice even better advice :-) )
> > > > > > >
> > > > > > > Just to confirm, by "pass a null pointer", you're saying that the
> > > > > > > Android libc also passes NULL as the second ifunc selector argument
> > > > > > > (or first)?
> > > > > >
> > > > > > #elif defined(__riscv)
> > > > > >   // This argument and its value is just a placeholder for now,
> > > > > >   // but it means that if we do pass something in future (such as
> > > > > >   // getauxval() and/or hwprobe key/value pairs), callees will be able to
> > > > > >   // recognize what they're being given.
> > > > > >   typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
> > > > > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
> > > > > >
> > > > > > it's arm64 that has the initial getauxval() argument:
> > > > > >
> > > > > > #if defined(__aarch64__)
> > > > > >   typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
> > > > > >   static __ifunc_arg_t arg;
> > > > > >   static bool initialized = false;
> > > > > >   if (!initialized) {
> > > > > >     initialized = true;
> > > > > >     arg._size = sizeof(__ifunc_arg_t);
> > > > > >     arg._hwcap = getauxval(AT_HWCAP);
> > > > > >     arg._hwcap2 = getauxval(AT_HWCAP2);
> > > > > >   }
> > > > > >   return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> > > > > > | _IFUNC_ARG_HWCAP, &arg);
> > > > > >
> > > > > > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
> > > > > >
> > > > > > > That's good. It sounds like you're planning to just
> > > > > > > continue passing NULL for now, and wait for people to start clamoring
> > > > > > > for this in android libc?
> > > > > >
> > > > > > yeah, and i'm assuming there will never be any clamor ... yesterday
> > > > > > and today i actually checked a bunch of popular apks, and didn't find
> > > > > > any that were currently using ifuncs.
> > > > > >
> > > > > > the only change i'm thinking of making right now is that "there's a
> > > > > > single argument, and it's null" should probably be the default.
> > > > > > obviously since Android doesn't add new architectures very often, this
> > > > > > is only likely to affect x86/x86-64 for the foreseeable future, but
> > > > > > being able to recognize at a glance "am i running under a libc new
> > > > > > enough to pass me arguments?" would certainly have helped for arm64.
> > > > > > even if x86/x86-64 never benefit, it seems like the right default for
> > > > > > the #else clause...
> > > > >
> > > > > Sounds good, thanks for the pointers. The paranoid person in me would
> > > > > also add a comment in the risc-v section that if a pointer to hwprobe
> > > > > is added, it should be added as the second argument, behind hwcap as
> > > > > the first (assuming this change lands).
> > > > >
> > > > > Come to think of it, the static inline helper I'm proposing in my
> > > > > discussion with Richard needs to take both arguments, since callers
> > > > > need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
> > > > > arg2 is a pointer to __riscv_hwprobe().
> > > >
> > > > presumably not `(arg1 != 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match arm64?
> > >
> > > It looks like we didn't do that _IFUNC_ARG_HWCAP bit on riscv.
> > > Actually, looking at the history of sysdeps/riscv/dl-irel.h, hwcap has
> > > always been passed as the first argument. So I think I don't need to
> > > check it in the (glibc-specific) inline helper function, I can safely
> > > assume it's there and go straight to checking the second argument.
> >
> > oh i misunderstood what you were saying earlier.
>
> My bad for causing confusion, I learned something from your reply and
> changed my conclusion.
>
> >
> > > If you were coding this directly in a library or application, you
> > > would need to check the first arg to be compatible with other libcs
> > > like Android's.
> >
> > we haven't shipped yet, so if you're telling me glibc is passing
> > `(getauxval(AT_HWCAP), nullptr)`, i'll change bionic to do the same
> > today ... the whole reason i'm here on this thread is to ensure source
> > compatibility for anyone writing ifuncs :-)
>
> Ah, yes, they are (and have always on risc-v) passed
> (getauxval(AT_HWCAP), nullptr), so do exactly that.

done: https://android-review.googlesource.com/c/platform/bionic/+/2695693

> -Evan
  

Patch

diff --git a/sysdeps/riscv/memcopy.h b/sysdeps/riscv/memcopy.h
new file mode 100644
index 0000000000..2b685c8aa0
--- /dev/null
+++ b/sysdeps/riscv/memcopy.h
@@ -0,0 +1,26 @@ 
+/* memcopy.h -- definitions for memory copy functions. RISC-V version.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdeps/generic/memcopy.h>
+
+/* Redefine the generic memcpy implementation to __memcpy_generic, so
+   the memcpy ifunc can select between generic and special versions.
+   In rtld, don't bother with all the ifunciness. */
+#if IS_IN (libc)
+#define MEMCPY __memcpy_generic
+#endif
diff --git a/sysdeps/riscv/memcpy.c b/sysdeps/riscv/memcpy.c
new file mode 100644
index 0000000000..ecadd96433
--- /dev/null
+++ b/sysdeps/riscv/memcpy.c
@@ -0,0 +1,66 @@ 
+/* Multiple versions of memcpy.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2017-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if IS_IN (libc)
+/* Redefine memcpy so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memcpy
+# define memcpy __redirect_memcpy
+# include <stdint.h>
+# include <string.h>
+# include <ifunc-init.h>
+# include <riscv-ifunc.h>
+# include <sys/hwprobe.h>
+
+# define INIT_ARCH()
+
+extern __typeof (__redirect_memcpy) __libc_memcpy;
+
+extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_noalignment attribute_hidden;
+
+static inline __typeof (__redirect_memcpy) *
+select_memcpy_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
+{
+  INIT_ARCH ();
+
+  struct riscv_hwprobe pair;
+
+  pair.key = RISCV_HWPROBE_KEY_CPUPERF_0;
+  if (!hwprobe_func || hwprobe_func(&pair, 1, 0, NULL, 0) != 0)
+    return __memcpy_generic;
+
+  if ((pair.key > 0) &&
+      (pair.value & RISCV_HWPROBE_MISALIGNED_MASK) ==
+       RISCV_HWPROBE_MISALIGNED_FAST)
+    return __memcpy_noalignment;
+
+  return __memcpy_generic;
+}
+
+riscv_libc_ifunc (__libc_memcpy, select_memcpy_ifunc);
+
+# undef memcpy
+strong_alias (__libc_memcpy, memcpy);
+# ifdef SHARED
+__hidden_ver1 (memcpy, __GI_memcpy, __redirect_memcpy)
+  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (memcpy);
+# endif
+
+#endif
diff --git a/sysdeps/riscv/memcpy_noalignment.S b/sysdeps/riscv/memcpy_noalignment.S
new file mode 100644
index 0000000000..f3bf8e5867
--- /dev/null
+++ b/sysdeps/riscv/memcpy_noalignment.S
@@ -0,0 +1,138 @@ 
+/* memcpy for RISC-V, ignoring buffer alignment
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+/* void *memcpy(void *, const void *, size_t) */
+ENTRY (__memcpy_noalignment)
+	move t6, a0  /* Preserve return value */
+
+	/* Bail if 0 */
+	beqz a2, 7f
+
+	/* Jump to byte copy if size < SZREG */
+	li a4, SZREG
+	bltu a2, a4, 5f
+
+	/* Round down to the nearest "page" size */
+	andi a4, a2, ~((16*SZREG)-1)
+	beqz a4, 2f
+	add a3, a1, a4
+
+	/* Copy the first word to get dest word aligned */
+	andi a5, t6, SZREG-1
+	beqz a5, 1f
+	REG_L a6, (a1)
+	REG_S a6, (t6)
+
+	/* Align dst up to a word, move src and size as well. */
+	addi t6, t6, SZREG-1
+	andi t6, t6, ~(SZREG-1)
+	sub a5, t6, a0
+	add a1, a1, a5
+	sub a2, a2, a5
+
+	/* Recompute page count */
+	andi a4, a2, ~((16*SZREG)-1)
+	beqz a4, 2f
+
+1:
+	/* Copy "pages" (chunks of 16 registers) */
+	REG_L a4,       0(a1)
+	REG_L a5,   SZREG(a1)
+	REG_L a6, 2*SZREG(a1)
+	REG_L a7, 3*SZREG(a1)
+	REG_L t0, 4*SZREG(a1)
+	REG_L t1, 5*SZREG(a1)
+	REG_L t2, 6*SZREG(a1)
+	REG_L t3, 7*SZREG(a1)
+	REG_L t4, 8*SZREG(a1)
+	REG_L t5, 9*SZREG(a1)
+	REG_S a4,       0(t6)
+	REG_S a5,   SZREG(t6)
+	REG_S a6, 2*SZREG(t6)
+	REG_S a7, 3*SZREG(t6)
+	REG_S t0, 4*SZREG(t6)
+	REG_S t1, 5*SZREG(t6)
+	REG_S t2, 6*SZREG(t6)
+	REG_S t3, 7*SZREG(t6)
+	REG_S t4, 8*SZREG(t6)
+	REG_S t5, 9*SZREG(t6)
+	REG_L a4, 10*SZREG(a1)
+	REG_L a5, 11*SZREG(a1)
+	REG_L a6, 12*SZREG(a1)
+	REG_L a7, 13*SZREG(a1)
+	REG_L t0, 14*SZREG(a1)
+	REG_L t1, 15*SZREG(a1)
+	addi a1, a1, 16*SZREG
+	REG_S a4, 10*SZREG(t6)
+	REG_S a5, 11*SZREG(t6)
+	REG_S a6, 12*SZREG(t6)
+	REG_S a7, 13*SZREG(t6)
+	REG_S t0, 14*SZREG(t6)
+	REG_S t1, 15*SZREG(t6)
+	addi t6, t6, 16*SZREG
+	bltu a1, a3, 1b
+	andi a2, a2, (16*SZREG)-1  /* Update count */
+
+2:
+	/* Remainder is smaller than a page, compute native word count */
+	beqz a2, 7f
+	andi a5, a2, ~(SZREG-1)
+	andi a2, a2, (SZREG-1)
+	add a3, a1, a5
+	/* Jump directly to last word if no words. */
+	beqz a5, 4f
+
+3:
+	/* Use single native register copy */
+	REG_L a4, 0(a1)
+	addi a1, a1, SZREG
+	REG_S a4, 0(t6)
+	addi t6, t6, SZREG
+	bltu a1, a3, 3b
+
+	/* Jump directly out if no more bytes */
+	beqz a2, 7f
+
+4:
+	/* Copy the last word unaligned */
+	add a3, a1, a2
+	add a4, t6, a2
+	REG_L a5, -SZREG(a3)
+	REG_S a5, -SZREG(a4)
+	ret
+
+5:
+	/* Copy bytes when the total copy is <SZREG */
+	add a3, a1, a2
+
+6:
+	lb a4, 0(a1)
+	addi a1, a1, 1
+	sb a4, 0(t6)
+	addi t6, t6, 1
+	bltu a1, a3, 6b
+
+7:
+	ret
+
+END (__memcpy_noalignment)
+
+hidden_def (__memcpy_noalignment)
diff --git a/sysdeps/unix/sysv/linux/riscv/Makefile b/sysdeps/unix/sysv/linux/riscv/Makefile
index 45cc29e40d..aa9ea443d6 100644
--- a/sysdeps/unix/sysv/linux/riscv/Makefile
+++ b/sysdeps/unix/sysv/linux/riscv/Makefile
@@ -7,6 +7,10 @@  ifeq ($(subdir),stdlib)
 gen-as-const-headers += ucontext_i.sym
 endif
 
+ifeq ($(subdir),string)
+sysdep_routines += memcpy memcpy-generic memcpy_noalignment
+endif
+
 abi-variants := ilp32 ilp32d lp64 lp64d
 
 ifeq (,$(filter $(default-abi),$(abi-variants)))
diff --git a/sysdeps/unix/sysv/linux/riscv/memcpy-generic.c b/sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
new file mode 100644
index 0000000000..0abe03f7f5
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
@@ -0,0 +1,24 @@ 
+/* Re-include the default memcpy implementation.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+extern __typeof (memcpy) __memcpy_generic;
+hidden_proto(__memcpy_generic)
+
+#include <string/memcpy.c>