[RFC,00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines

Message ID 20230207001618.458947-1-christoph.muellner@vrull.eu
Headers
Series riscv: ifunc support with optimized mem*/str*/cpu_relax routines |

Message

Christoph Müllner Feb. 7, 2023, 12:15 a.m. UTC
  From: Christoph Müllner <christoph.muellner@vrull.eu>

This RFC series introduces ifunc support for RISC-V and adds
optimized routines of memset(), memcpy()/memmove(), strlen(),
strcmp(), strncmp(), and cpu_relax().

The ifunc mechanism desides based on the following hart features:
- Available extensions
- Cache block size
- Fast unaligned accesses

Since we don't have an interface to get this information from the
kernel (at the moment), this patch uses environment variables instead,
which is also why this patch should not be considered for upstream
inclusion and is explicitly tagged as RFC.

The environment variables are:
- RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
- RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
- RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
- RISCV_RT_FAST_UNALIGNED (e.g. "1")

The environment variables are looked up and parsed early during
startup, where other architectures query similar properties from
the kernel or the CPU.
The ifunc implementation can use test macros to select a matching
implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

The following optimized routines exist:
- memset
- memcpy/memmove
- strlen
- strcmp
- strncmp
- cpu_relax

The following optimizations have been applied:
- excessive loop unrolling
- Zbb's orc.b instruction
- Zbb's ctz intruction
- Zicboz/Zic64b ability to clear a cache block in memory
- Fast unaligned accesses (but with keeping exception guarantees intact)
- Fast overlapping accesses

The patch was developed more than a year ago and was tested as part
of a vendor SDK since then. One of the areas where this patchset
was used is benchmarking (e.g. SPEC CPU2017).
The optimized string functions have been tested with the glibc tests
for that purpose.

The first patch of the series does not strictly belong to this series,
but was required to build and test SPEC CPU2017 benchmarks.

To build a cross-toolchain that includes these patches,
the riscv-gnu-toolchain or any other cross-toolchain
builder can be used.

Christoph Müllner (19):
  Inhibit early libcalls before ifunc support is ready
  riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
  riscv: Add ENTRY_ALIGN() macro
  riscv: Add hart feature run-time detection framework
  riscv: Introduction of ISA extensions
  riscv: Adding ISA string parser for environment variables
  riscv: hart-features: Add fast_unaligned property
  riscv: Add (empty) ifunc framework
  riscv: Add ifunc support for memset
  riscv: Add accelerated memset routines for RV64
  riscv: Add ifunc support for memcpy/memmove
  riscv: Add accelerated memcpy/memmove routines for RV64
  riscv: Add ifunc support for strlen
  riscv: Add accelerated strlen routine
  riscv: Add ifunc support for strcmp
  riscv: Add accelerated strcmp routines
  riscv: Add ifunc support for strncmp
  riscv: Add an optimized strncmp routine
  riscv: Add __riscv_cpu_relax() to allow yielding in busy loops

 csu/libc-start.c                              |   1 +
 elf/dl-support.c                              |   1 +
 sysdeps/riscv/dl-machine.h                    |  13 +
 sysdeps/riscv/ldsodefs.h                      |   1 +
 sysdeps/riscv/multiarch/Makefile              |  24 +
 sysdeps/riscv/multiarch/cpu_relax.c           |  36 ++
 sysdeps/riscv/multiarch/cpu_relax_impl.S      |  40 ++
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |  70 +++
 sysdeps/riscv/multiarch/init-arch.h           |  24 +
 sysdeps/riscv/multiarch/memcpy.c              |  49 ++
 sysdeps/riscv/multiarch/memcpy_generic.c      |  32 ++
 .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c             |  49 ++
 sysdeps/riscv/multiarch/memmove_generic.c     |  32 ++
 sysdeps/riscv/multiarch/memset.c              |  52 ++
 sysdeps/riscv/multiarch/memset_generic.c      |  32 ++
 .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
 .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++
 sysdeps/riscv/multiarch/strcmp.c              |  47 ++
 sysdeps/riscv/multiarch/strcmp_generic.c      |  32 ++
 sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 ++++
 .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++
 sysdeps/riscv/multiarch/strlen.c              |  44 ++
 sysdeps/riscv/multiarch/strlen_generic.c      |  32 ++
 sysdeps/riscv/multiarch/strlen_zbb.S          | 105 ++++
 sysdeps/riscv/multiarch/strncmp.c             |  44 ++
 sysdeps/riscv/multiarch/strncmp_generic.c     |  32 ++
 sysdeps/riscv/multiarch/strncmp_zbb.S         | 119 +++++
 sysdeps/riscv/sys/asm.h                       |  14 +-
 .../unix/sysv/linux/riscv/atomic-machine.h    |   3 +
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   |  62 +++
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   |  46 ++
 sysdeps/unix/sysv/linux/riscv/hart-features.c | 356 +++++++++++++
 sysdeps/unix/sysv/linux/riscv/hart-features.h |  58 +++
 .../unix/sysv/linux/riscv/isa-extensions.def  |  72 +++
 sysdeps/unix/sysv/linux/riscv/libc-start.c    |  29 ++
 .../unix/sysv/linux/riscv/macro-for-each.h    |  24 +
 37 files changed, 2610 insertions(+), 5 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/Makefile
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
 create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
 create mode 100644 sysdeps/riscv/multiarch/init-arch.h
 create mode 100644 sysdeps/riscv/multiarch/memcpy.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memmove.c
 create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memset.c
 create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/strlen.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strncmp.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
 create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h
  

Comments

Kito Cheng Feb. 7, 2023, 2:59 a.m. UTC | #1
nit: copyright year should be 2023
  
Adhemerval Zanella Netto Feb. 7, 2023, 4:40 p.m. UTC | #2
On 06/02/23 21:15, Christoph Muellner wrote:
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> 
> This RFC series introduces ifunc support for RISC-V and adds
> optimized routines of memset(), memcpy()/memmove(), strlen(),
> strcmp(), strncmp(), and cpu_relax().
> 
> The ifunc mechanism desides based on the following hart features:
> - Available extensions
> - Cache block size
> - Fast unaligned accesses
> 
> Since we don't have an interface to get this information from the
> kernel (at the moment), this patch uses environment variables instead,
> which is also why this patch should not be considered for upstream
> inclusion and is explicitly tagged as RFC.
> 
> The environment variables are:
> - RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
> - RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
> - RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
> - RISCV_RT_FAST_UNALIGNED (e.g. "1")
> 
> The environment variables are looked up and parsed early during
> startup, where other architectures query similar properties from
> the kernel or the CPU.
> The ifunc implementation can use test macros to select a matching
> implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

So now we have 3 different proposal mechanism to provide implementation runtime 
selection on riscv:

  1. The sysdep mechanism to select optimized routines based on compiler/ABI
     done at build time.  It is the current mechanism and it is also used
     on rvv routines [1].

  2. A ifunc one using a new riscv syscall to query the kernel the required 
     information.

  3. Another ifunc one using riscv specific environment variable.

Although all of them are interchangeable in a sense they can be used independently,
RISCV is following MIPS on having uncountable minor ABI variants due this exactly
available permutations.  This incurs in extra maintanance, extra documentation, 
extra testing, etc.

So I would like you RISCV arch-maintainers to first figure out what scheme you
want focus on, instead of trying to push multiple fronts with different ad-hoc 
schemes.

The first scheme, which is the oldest one used by architectures like arm, powerpc,
mips, etc. is the sysdep where you select the variant at build time.  It has the
advantage of no need to extra runtime cost or probing, and a slight code size 
reduction.  However it ties the ABI used to build glibc, which means you need 
multiple libc build if you targeting different chips/ABIs.

I recall that Red Hat and SuSE used to provided specialized glibc build for POWER
machines to try leverage new chips optimization (libm showed some gain, specially
back when ISA 2.05 added rounding instruction, and isa 2.07 GRP to FP special
register).  But I also recall that it was deprecated over using ifunc to optimize
only the required functions that does show performance improvement, since each
glibc build variantion required all the steps to validation.

And that's why aarch64 and x86_64 initially followed the patch to avoid using 
sysdeps folder and have a minimum default implementation that works on the minimum 
support ISA and provide any optimized variant through iFUNC.  And that is what I 
suggest you to do for *rvv*.

You can follow x86_64/s390 and add an extra optimization to only build certain 
variant if the ISA is high enough (for instance, if you targeting rbb, use it as 
default).  It requires a *lot* of boilerplate code, as you can see for the
x86_64-vX code recently; but it should integrate better with current ldconfig
and newer RPM support (and I expect that other packages managers to follow as
well).

And it lead us on *how* to select the ABI variants. I am sorry, but I will *block 
the new environment variables* as is. You might rework it through glibc hardware 
tunables [2], nevertheless I *strong* suggest you to *first* figure out the kernel
interface first prior starting working on providing the optimized routine in glibc.
The glibc tunable might then work a way to tune/test/filter the already in place
mechanism, ideally it should not rely on user intervention as default.

It was not clear from the 'hardware probing user interface' thread [3] why current 
Linux auxv advertise mechanism are not suffice enough for this specific interface 
(maybe you want something more generic like a cpuid-like interface).  It works for
aarch64 and powerpc, so I am not sure why RISCV can't start using it.

[1] https://sourceware.org/pipermail/libc-alpha/2023-February/145102.html
[2] https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
[3] https://yhbt.net/lore/all/20221013163551.6775-1-palmer@rivosinc.com/

> 
> The following optimized routines exist:
> - memset

It seems that main gain here is unaligned access, loop unrolling, and cache clear 
instruction.  Unfortuantely current implementation does not provide support for 
any of this, however I wonder if we could parametrize the generic implementation 
to allow at least some support for fast unaligned memory (we can factor cache clear 
as well).

I am working on refactor memcpy, memmove, memset, and memcmp to get rid of old
code and allow to work toward it.

> - memcpy/memmove

The generic implementation already does some loop unrolling, so I wonder if we can
improve the generic implementation by adding a swtich to assume unaligned access
(so there is no need to use the two load/merge strategy).

One advantage that is not easily reproducable on C is to branch to memcpy on memmove
if the copy shoud be fone fowards.  This is not easily done on generic implementation 
because we can't simply call memcpy in such case (since source and destiny can overlap
and it might call a memcpy routine that does not support it).

My approach on my generic refactor is just to remove the wordcopy and make memcpy
and memmove using the same strategy, but with different code.

> - strlen

The optimized routine seems quite similar to the generic one I installed recently [4],
which should use both cbz and orc.b with RISCV hooks [5]

[4] https://sourceware.org/git/?p=glibc.git;a=commit;h=350d8d13661a863e6b189f02d876fa265fe71302
[5] https://sourceware.org/git/?p=glibc.git;a=commit;h=25788431c0f5264c4830415de0cdd4d9926cbad9

> - strcmp
> - strncmp

The current generic implementations [6][7] now have a small advantage where unaligned
inputs are also improved by first aligning one input and operating with a double
load and merge comparision.

[6] https://sourceware.org/git/?p=glibc.git;a=commit;h=30cf54bf3072be942847400c1669bcd63aab039e
[7] https://sourceware.org/git/?p=glibc.git;a=commit;h=367c31b5d61164db97834917f5487094ebef2f58

> - cpu_relax
> 
> The following optimizations have been applied:
> - excessive loop unrolling
> - Zbb's orc.b instruction
> - Zbb's ctz intruction
> - Zicboz/Zic64b ability to clear a cache block in memory
> - Fast unaligned accesses (but with keeping exception guarantees intact)
> - Fast overlapping accesses
> 
> The patch was developed more than a year ago and was tested as part
> of a vendor SDK since then. One of the areas where this patchset
> was used is benchmarking (e.g. SPEC CPU2017).
> The optimized string functions have been tested with the glibc tests
> for that purpose.
> 
> The first patch of the series does not strictly belong to this series,
> but was required to build and test SPEC CPU2017 benchmarks.
> 
> To build a cross-toolchain that includes these patches,
> the riscv-gnu-toolchain or any other cross-toolchain
> builder can be used.
> 
> Christoph Müllner (19):
>   Inhibit early libcalls before ifunc support is ready
>   riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
>   riscv: Add ENTRY_ALIGN() macro
>   riscv: Add hart feature run-time detection framework
>   riscv: Introduction of ISA extensions
>   riscv: Adding ISA string parser for environment variables
>   riscv: hart-features: Add fast_unaligned property
>   riscv: Add (empty) ifunc framework
>   riscv: Add ifunc support for memset
>   riscv: Add accelerated memset routines for RV64
>   riscv: Add ifunc support for memcpy/memmove
>   riscv: Add accelerated memcpy/memmove routines for RV64
>   riscv: Add ifunc support for strlen
>   riscv: Add accelerated strlen routine
>   riscv: Add ifunc support for strcmp
>   riscv: Add accelerated strcmp routines
>   riscv: Add ifunc support for strncmp
>   riscv: Add an optimized strncmp routine
>   riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
> 
>  csu/libc-start.c                              |   1 +
>  elf/dl-support.c                              |   1 +
>  sysdeps/riscv/dl-machine.h                    |  13 +
>  sysdeps/riscv/ldsodefs.h                      |   1 +
>  sysdeps/riscv/multiarch/Makefile              |  24 +
>  sysdeps/riscv/multiarch/cpu_relax.c           |  36 ++
>  sysdeps/riscv/multiarch/cpu_relax_impl.S      |  40 ++
>  sysdeps/riscv/multiarch/ifunc-impl-list.c     |  70 +++
>  sysdeps/riscv/multiarch/init-arch.h           |  24 +
>  sysdeps/riscv/multiarch/memcpy.c              |  49 ++
>  sysdeps/riscv/multiarch/memcpy_generic.c      |  32 ++
>  .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
>  sysdeps/riscv/multiarch/memmove.c             |  49 ++
>  sysdeps/riscv/multiarch/memmove_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/memset.c              |  52 ++
>  sysdeps/riscv/multiarch/memset_generic.c      |  32 ++
>  .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
>  .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++
>  sysdeps/riscv/multiarch/strcmp.c              |  47 ++
>  sysdeps/riscv/multiarch/strcmp_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 ++++
>  .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++
>  sysdeps/riscv/multiarch/strlen.c              |  44 ++
>  sysdeps/riscv/multiarch/strlen_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strlen_zbb.S          | 105 ++++
>  sysdeps/riscv/multiarch/strncmp.c             |  44 ++
>  sysdeps/riscv/multiarch/strncmp_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/strncmp_zbb.S         | 119 +++++
>  sysdeps/riscv/sys/asm.h                       |  14 +-
>  .../unix/sysv/linux/riscv/atomic-machine.h    |   3 +
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   |  62 +++
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   |  46 ++
>  sysdeps/unix/sysv/linux/riscv/hart-features.c | 356 +++++++++++++
>  sysdeps/unix/sysv/linux/riscv/hart-features.h |  58 +++
>  .../unix/sysv/linux/riscv/isa-extensions.def  |  72 +++
>  sysdeps/unix/sysv/linux/riscv/libc-start.c    |  29 ++
>  .../unix/sysv/linux/riscv/macro-for-each.h    |  24 +
>  37 files changed, 2610 insertions(+), 5 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/Makefile
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
>  create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
>  create mode 100644 sysdeps/riscv/multiarch/init-arch.h
>  create mode 100644 sysdeps/riscv/multiarch/memcpy.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memmove.c
>  create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/strlen.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strncmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h
>
  
DJ Delorie Feb. 7, 2023, 5:16 p.m. UTC | #3
Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> So now we have 3 different proposal mechanism to provide implementation runtime 
> selection on riscv:
>
>   1. The sysdep mechanism to select optimized routines based on compiler/ABI
>      done at build time.  It is the current mechanism and it is also used
>      on rvv routines [1].
>
>   2. A ifunc one using a new riscv syscall to query the kernel the required 
>      information.
>
>   3. Another ifunc one using riscv specific environment variable.

I'm also going to oppose #3 on principles.  We've been removing the use
of environment variables for tuning, in favor of tunables.

If we have a way to auto-detect the best implementation without relying
on the user, that's my preference.  Users are unreliable and require
documentation.  The compiler likely doesn't have access to the
hardware[*], so must rely on the user.  Thus, my preference is #2 - the
kernel has access to the hardware and its device tree, and can tell the
userspace what capabilities are available.

I would not be opposed to a tunable that overrides the autodetection; we
have something similar for x86.  But the default (and should be) is
"works basically correctly without user intervention".

[*] you can run gcc on the "right" hardware, but typically we
    build-once-run-everywhere.
  
Philipp Tomsich Feb. 7, 2023, 7:32 p.m. UTC | #4
On Tue, 7 Feb 2023 at 18:16, DJ Delorie <dj@redhat.com> wrote:
>
> Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> > So now we have 3 different proposal mechanism to provide implementation runtime
> > selection on riscv:
> >
> >   1. The sysdep mechanism to select optimized routines based on compiler/ABI
> >      done at build time.  It is the current mechanism and it is also used
> >      on rvv routines [1].
> >
> >   2. A ifunc one using a new riscv syscall to query the kernel the required
> >      information.
> >
> >   3. Another ifunc one using riscv specific environment variable.
>
> I'm also going to oppose #3 on principles.  We've been removing the use
> of environment variables for tuning, in favor of tunables.

You may have missed the essential part of the commit message:
> > Since we don't have an interface to get this information from the
> > kernel (at the moment), this patch uses environment variables instead,
> > which is also why this patch should not be considered for upstream
> > inclusion and is explicitly tagged as RFC.

So this patch has always been a stand-in until option #2 is ready.
I am strongly opinionated towards a mechanism that uses existing
mechanisms in the ELF auxiliary vector to pass information — and tries
to avoid the introduction of a new arch-specific syscall. if possible.

> If we have a way to auto-detect the best implementation without relying
> on the user, that's my preference.  Users are unreliable and require
> documentation.  The compiler likely doesn't have access to the
> hardware[*], so must rely on the user.  Thus, my preference is #2 - the
> kernel has access to the hardware and its device tree, and can tell the
> userspace what capabilities are available.
>
> I would not be opposed to a tunable that overrides the autodetection; we
> have something similar for x86.  But the default (and should be) is
> "works basically correctly without user intervention".
>
> [*] you can run gcc on the "right" hardware, but typically we
>     build-once-run-everywhere.
>
  
DJ Delorie Feb. 7, 2023, 9:14 p.m. UTC | #5
Philipp Tomsich <philipp.tomsich@vrull.eu> writes:
> So this patch has always been a stand-in until option #2 is ready.
> I am strongly opinionated towards a mechanism that uses existing
> mechanisms in the ELF auxiliary vector to pass information — and tries
> to avoid the introduction of a new arch-specific syscall. if possible.

If the patch were converted to use tunables, it could be more than a
standin.  It's the environment variable itself I'm opposed to.
  
Christoph Müllner Feb. 8, 2023, 11:26 a.m. UTC | #6
On Tue, Feb 7, 2023 at 10:14 PM DJ Delorie <dj@redhat.com> wrote:

>
> Philipp Tomsich <philipp.tomsich@vrull.eu> writes:
> > So this patch has always been a stand-in until option #2 is ready.
> > I am strongly opinionated towards a mechanism that uses existing
> > mechanisms in the ELF auxiliary vector to pass information — and tries
> > to avoid the introduction of a new arch-specific syscall. if possible.
>
> If the patch were converted to use tunables, it could be more than a
> standin.  It's the environment variable itself I'm opposed to.
>

Thanks DJ and  Adhemerval for your valuable inputs!

As said in the cover letter, the environment variable approach was not
meant to be merged but represents a starting point for discussions.
It is not what we want but serves as a dirty placeholder, that allows
development of optimized routines.

The IFUNC support and the kernel-userspace API was discussed
multiple times in the past (e.g. on LPC2021 and LPC2022).
There are different opinions on the approaches so the whole process
is regularly getting stuck.

The topic, where most (if not all) in the RISC-V community agree on,
is that we don't want a compile-time-only approach.
Patches that rely on a compile-time-only approach are most likely
written such, because of the absence of ifunc support.

Meanwhile, we have heard from multiple vendors to work on their own
solutions downstream, which results in a duplication of work and not
necessarily in a common solution upstream.

This patchset was sent out to move the discussion from the idea level
to actual code, which can be reviewed, criticized, tested, improved, and
reused (get a common ground for RISC-V vendors).

Both of your comments show how we can move this patchset forward:
- eliminate the first patch
  See also: https://sourceware.org/bugzilla/show_bug.cgi?id=30095
- work on the kernel-userspace interface to query capabilities
- use tunables instead of env variable
- look how we can reuse more of the generic implementation
  to minimize ASM code with little or no benefits

I fully agree with all the mentioned points.

Thanks,
Christoph