[v2,0/3] RISC-V: ifunced memcpy using new kernel hwprobe interface

Message ID 20230221191537.3159966-1-evan@rivosinc.com
Series RISC-V: ifunced memcpy using new kernel hwprobe interface

Message

Evan Green Feb. 21, 2023, 7:15 p.m. UTC
  This series illustrates the use of a proposed Linux syscall that
enumerates architectural information about the RISC-V cores the system
is running on. In this series we expose a small wrapper function around
the syscall. An ifunc selector for memcpy queries it to see if unaligned
access is "fast" on this hardware. If it is, it selects a newly provided
implementation of memcpy that doesn't work hard at aligning the src and
destination buffers.
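
For concreteness, the check the ifunc selector performs boils down to the
stand-alone sketch below. The syscall number and the key/value encodings are
taken from the proposed kernel series and should be treated as illustrative,
not final:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_riscv_hwprobe
# define __NR_riscv_hwprobe 258          /* number used by the proposed series */
#endif

/* Illustrative key/value encodings from the proposed uapi header.  */
#define RISCV_HWPROBE_KEY_CPUPERF_0    5
#define RISCV_HWPROBE_MISALIGNED_FAST  3
#define RISCV_HWPROBE_MISALIGNED_MASK  7

struct riscv_hwprobe { int64_t key; uint64_t value; };

int
main (void)
{
  struct riscv_hwprobe pair = { RISCV_HWPROBE_KEY_CPUPERF_0, 0 };

  /* One key/value pair, queried for all CPUs (cpu_count == 0, cpus == NULL).  */
  if (syscall (__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
    {
      perror ("riscv_hwprobe");
      return 1;
    }

  /* The ifunc selector picks __memcpy_noalignment only in the "fast" case.  */
  if ((pair.value & RISCV_HWPROBE_MISALIGNED_MASK)
      == RISCV_HWPROBE_MISALIGNED_FAST)
    puts ("unaligned access is fast: select the alignment-ignorant memcpy");
  else
    puts ("unaligned access not known to be fast: keep the generic memcpy");
  return 0;
}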

This is somewhat of a proof of concept for the syscall itself, but I do
find that in my goofy memcpy test [1], the unaligned memcpy performed at
least as well as the generic C version. This is, however, on QEMU on an M1
Mac, so not a test of any real hardware (more a smoke test that the
implementation isn't silly).

v3 of the Linux series can be found at [2].

[1] https://pastebin.com/Nj8ixpkX
[2] https://lore.kernel.org/lkml/20230221190858.3159617-1-evan@rivosinc.com/T/#t

Changes in v2:
 - hwprobe.h: Use __has_include and duplicate Linux content to make
   compilation work when Linux headers are absent (Adhemerval)
 - hwprobe.h: Put declaration under __USE_GNU (Adhemerval)
 - Use INLINE_SYSCALL_CALL (Adhemerval)
 - Update versions
 - Update UNALIGNED_MASK to match kernel v3 series.
 - Add vDSO interface
 - Use _MASK instead of the _FAST value itself.

Evan Green (3):
  riscv: Add Linux hwprobe syscall support
  riscv: Add hwprobe vdso call support
  riscv: Add and use alignment-ignorant memcpy

 sysdeps/riscv/memcopy.h                       |  28 +++++
 sysdeps/riscv/memcpy.c                        |  65 +++++++++++
 sysdeps/riscv/memcpy_noalignment.S            | 103 ++++++++++++++++++
 sysdeps/unix/sysv/linux/dl-vdso-setup.c       |  10 ++
 sysdeps/unix/sysv/linux/dl-vdso-setup.h       |   3 +
 sysdeps/unix/sysv/linux/riscv/Makefile        |   8 +-
 sysdeps/unix/sysv/linux/riscv/Versions        |   3 +
 sysdeps/unix/sysv/linux/riscv/hwprobe.c       |  36 ++++++
 .../unix/sysv/linux/riscv/memcpy-generic.c    |  24 ++++
 .../unix/sysv/linux/riscv/rv32/arch-syscall.h |   1 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   1 +
 .../unix/sysv/linux/riscv/rv64/arch-syscall.h |   1 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   1 +
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h   |  67 ++++++++++++
 sysdeps/unix/sysv/linux/riscv/sysdep.h        |   1 +
 sysdeps/unix/sysv/linux/syscall-names.list    |   1 +
 16 files changed, 351 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/memcopy.h
 create mode 100644 sysdeps/riscv/memcpy.c
 create mode 100644 sysdeps/riscv/memcpy_noalignment.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hwprobe.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
  

Comments

Palmer Dabbelt March 28, 2023, 10:54 p.m. UTC | #1
On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>
> This series illustrates the use of a proposed Linux syscall that
> enumerates architectural information about the RISC-V cores the system
> is running on. In this series we expose a small wrapper function around
> the syscall. An ifunc selector for memcpy queries it to see if unaligned
> access is "fast" on this hardware. If it is, it selects a newly provided
> implementation of memcpy that doesn't work hard at aligning the src and
> destination buffers.
>
> This is somewhat of a proof of concept for the syscall itself, but I do
> find that in my goofy memcpy test [1], the unaligned memcpy performed at
> least as well as the generic C version. This is however on Qemu on an M1
> mac, so not a test of any real hardware (more a smoke test that the
> implementation isn't silly).

QEMU isn't a good enough benchmark to justify a new memcpy routine in 
glibc.  Evan has a D1, which does support misaligned access and runs 
some simple benchmarks faster.  There have also been some minor changes to 
the Linux side of things that warrant a v3 anyway, so he'll just post 
some benchmarks on HW along with that.

Aside from those comments,

Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>

There's a lot more stuff to probe for, but I think we've got enough of a 
proof of concept for the hwprobe stuff that we can move forward with the 
core interface bits in Linux/glibc and then unleash the chaos...

Unless anyone else has comments?

> v3 of the Linux series can be found at [2].
>
> [1] https://pastebin.com/Nj8ixpkX
> [2] https://lore.kernel.org/lkml/20230221190858.3159617-1-evan@rivosinc.com/T/#t
>
> Changes in v2:
>  - hwprobe.h: Use __has_include and duplicate Linux content to make
>    compilation work when Linux headers are absent (Adhemerval)
>  - hwprobe.h: Put declaration under __USE_GNU (Adhemerval)
>  - Use INLINE_SYSCALL_CALL (Adhemerval)
>  - Update versions
>  - Update UNALIGNED_MASK to match kernel v3 series.
>  - Add vDSO interface
>  - Used _MASK instead of _FAST value itself.
>
> Evan Green (3):
>   riscv: Add Linux hwprobe syscall support
>   riscv: Add hwprobe vdso call support
>   riscv: Add and use alignment-ignorant memcpy
>
>  sysdeps/riscv/memcopy.h                       |  28 +++++
>  sysdeps/riscv/memcpy.c                        |  65 +++++++++++
>  sysdeps/riscv/memcpy_noalignment.S            | 103 ++++++++++++++++++
>  sysdeps/unix/sysv/linux/dl-vdso-setup.c       |  10 ++
>  sysdeps/unix/sysv/linux/dl-vdso-setup.h       |   3 +
>  sysdeps/unix/sysv/linux/riscv/Makefile        |   8 +-
>  sysdeps/unix/sysv/linux/riscv/Versions        |   3 +
>  sysdeps/unix/sysv/linux/riscv/hwprobe.c       |  36 ++++++
>  .../unix/sysv/linux/riscv/memcpy-generic.c    |  24 ++++
>  .../unix/sysv/linux/riscv/rv32/arch-syscall.h |   1 +
>  .../unix/sysv/linux/riscv/rv32/libc.abilist   |   1 +
>  .../unix/sysv/linux/riscv/rv64/arch-syscall.h |   1 +
>  .../unix/sysv/linux/riscv/rv64/libc.abilist   |   1 +
>  sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h   |  67 ++++++++++++
>  sysdeps/unix/sysv/linux/riscv/sysdep.h        |   1 +
>  sysdeps/unix/sysv/linux/syscall-names.list    |   1 +
>  16 files changed, 351 insertions(+), 2 deletions(-)
>  create mode 100644 sysdeps/riscv/memcopy.h
>  create mode 100644 sysdeps/riscv/memcpy.c
>  create mode 100644 sysdeps/riscv/memcpy_noalignment.S
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hwprobe.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h

Thanks!
  
Adhemerval Zanella Netto March 28, 2023, 11:41 p.m. UTC | #2
On 28/03/23 19:54, Palmer Dabbelt wrote:
> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>>
>> This series illustrates the use of a proposed Linux syscall that
>> enumerates architectural information about the RISC-V cores the system
>> is running on. In this series we expose a small wrapper function around
>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>> access is "fast" on this hardware. If it is, it selects a newly provided
>> implementation of memcpy that doesn't work hard at aligning the src and
>> destination buffers.
>>
>> This is somewhat of a proof of concept for the syscall itself, but I do
>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>> least as well as the generic C version. This is however on Qemu on an M1
>> mac, so not a test of any real hardware (more a smoke test that the
>> implementation isn't silly).
> 
> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
> 
> Aside from those comments,
> 
> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
> 
> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
> 
> Unless anyone else has comments?

Until riscv_hwprobe is in Linus' tree as an official Linux ABI, this patchset 
cannot be installed.  We failed to enforce this on some occasions (like Intel 
CET) and it turned into a complete mess after some years...
  
Palmer Dabbelt March 29, 2023, 12:01 a.m. UTC | #3
On Tue, 28 Mar 2023 16:41:10 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>
>
> On 28/03/23 19:54, Palmer Dabbelt wrote:
>> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>>>
>>> This series illustrates the use of a proposed Linux syscall that
>>> enumerates architectural information about the RISC-V cores the system
>>> is running on. In this series we expose a small wrapper function around
>>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>>> access is "fast" on this hardware. If it is, it selects a newly provided
>>> implementation of memcpy that doesn't work hard at aligning the src and
>>> destination buffers.
>>>
>>> This is somewhat of a proof of concept for the syscall itself, but I do
>>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>>> least as well as the generic C version. This is however on Qemu on an M1
>>> mac, so not a test of any real hardware (more a smoke test that the
>>> implementation isn't silly).
>>
>> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
>>
>> Aside from those comments,
>>
>> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
>>
>> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
>>
>> Unless anyone else has comments?
>
> Until riscv_hwprobe is not on Linus tree as official Linux ABI this patchset
> can not be installed.  We failed to enforce it on some occasion (like Intel
> CET) and it turned out a complete mess after some years...

Sorry if that wasn't clear; I was asking whether there were any more comments 
from the glibc side of things before merging the Linux code.
  
Adhemerval Zanella Netto March 29, 2023, 7:16 p.m. UTC | #4
On 28/03/23 21:01, Palmer Dabbelt wrote:
> On Tue, 28 Mar 2023 16:41:10 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>>
>>
>> On 28/03/23 19:54, Palmer Dabbelt wrote:
>>> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>>>>
>>>> This series illustrates the use of a proposed Linux syscall that
>>>> enumerates architectural information about the RISC-V cores the system
>>>> is running on. In this series we expose a small wrapper function around
>>>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>>>> access is "fast" on this hardware. If it is, it selects a newly provided
>>>> implementation of memcpy that doesn't work hard at aligning the src and
>>>> destination buffers.
>>>>
>>>> This is somewhat of a proof of concept for the syscall itself, but I do
>>>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>>>> least as well as the generic C version. This is however on Qemu on an M1
>>>> mac, so not a test of any real hardware (more a smoke test that the
>>>> implementation isn't silly).
>>>
>>> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
>>>
>>> Aside from those comments,
>>>
>>> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
>>>
>>> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
>>>
>>> Unless anyone else has comments?
>>
>> Until riscv_hwprobe is not on Linus tree as official Linux ABI this patchset
>> can not be installed.  We failed to enforce it on some occasion (like Intel
>> CET) and it turned out a complete mess after some years...
> 
> Sorry if that wasn't clear, I was asking if there were any more comments from the glibc side of things before merging the Linux code.

Right, so is this already settled to be the de facto ABI to query for system
information on RISC-V? Or is it still being discussed? Is it in a -next branch
already, and/or has it been tested with a patched glibc?

In any case, I added some minimal comments.  With the vDSO approach I think
there is no need to cache the result at startup, as aarch64 and x86 do.
  
Palmer Dabbelt March 29, 2023, 7:45 p.m. UTC | #5
On Wed, 29 Mar 2023 12:16:39 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>
>
> On 28/03/23 21:01, Palmer Dabbelt wrote:
>> On Tue, 28 Mar 2023 16:41:10 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>>>
>>>
>>> On 28/03/23 19:54, Palmer Dabbelt wrote:
>>>> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>>>>>
>>>>> This series illustrates the use of a proposed Linux syscall that
>>>>> enumerates architectural information about the RISC-V cores the system
>>>>> is running on. In this series we expose a small wrapper function around
>>>>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>>>>> access is "fast" on this hardware. If it is, it selects a newly provided
>>>>> implementation of memcpy that doesn't work hard at aligning the src and
>>>>> destination buffers.
>>>>>
>>>>> This is somewhat of a proof of concept for the syscall itself, but I do
>>>>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>>>>> least as well as the generic C version. This is however on Qemu on an M1
>>>>> mac, so not a test of any real hardware (more a smoke test that the
>>>>> implementation isn't silly).
>>>>
>>>> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
>>>>
>>>> Aside from those comments,
>>>>
>>>> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
>>>>
>>>> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
>>>>
>>>> Unless anyone else has comments?
>>>
>>> Until riscv_hwprobe is not on Linus tree as official Linux ABI this patchset
>>> can not be installed.  We failed to enforce it on some occasion (like Intel
>>> CET) and it turned out a complete mess after some years...
>>
>> Sorry if that wasn't clear, I was asking if there were any more comments from the glibc side of things before merging the Linux code.
>
> Right, so is this already settle to be the de-factor ABI to query for system
> information in RISCV? Or is it still being discussed? Is it in a next branch
> already, and/or have been tested with a patch glibc?

It's not in for-next yet, but various patch sets / proposals have been 
on the lists for a few months and it seems like discussion on the kernel 
side has pretty much died down.  That's why I was pinging the glibc side 
of things: if anyone here has comments on the interface, then it's time 
to chime in.  If there are no comments then we're likely to end up with 
this in the next release (so queued into for-next soon, Linus' master in 
a month or so).

IIUC Evan's been testing the kernel+glibc stuff on QEMU, but he should 
be able to ack that explicitly (it's a little vague in the cover 
letter).  There's also a glibc-independent kselftest as part of the 
kernel patch set: 
https://lore.kernel.org/all/20230327163203.2918455-6-evan@rivosinc.com/

>
> In any case I added some minimal comments.  With the vDSO approach I think
> there is no need to cache the result at startup, as aarch64 and x86 does.
  
Adhemerval Zanella Netto March 29, 2023, 8:13 p.m. UTC | #6
On 29/03/23 16:45, Palmer Dabbelt wrote:
> On Wed, 29 Mar 2023 12:16:39 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>>
>>
>> On 28/03/23 21:01, Palmer Dabbelt wrote:
>>> On Tue, 28 Mar 2023 16:41:10 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>>>>
>>>>
>>>> On 28/03/23 19:54, Palmer Dabbelt wrote:
>>>>> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>>>>>>
>>>>>> This series illustrates the use of a proposed Linux syscall that
>>>>>> enumerates architectural information about the RISC-V cores the system
>>>>>> is running on. In this series we expose a small wrapper function around
>>>>>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>>>>>> access is "fast" on this hardware. If it is, it selects a newly provided
>>>>>> implementation of memcpy that doesn't work hard at aligning the src and
>>>>>> destination buffers.
>>>>>>
>>>>>> This is somewhat of a proof of concept for the syscall itself, but I do
>>>>>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>>>>>> least as well as the generic C version. This is however on Qemu on an M1
>>>>>> mac, so not a test of any real hardware (more a smoke test that the
>>>>>> implementation isn't silly).
>>>>>
>>>>> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
>>>>>
>>>>> Aside from those comments,
>>>>>
>>>>> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
>>>>>
>>>>> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
>>>>>
>>>>> Unless anyone else has comments?
>>>>
>>>> Until riscv_hwprobe is not on Linus tree as official Linux ABI this patchset
>>>> can not be installed.  We failed to enforce it on some occasion (like Intel
>>>> CET) and it turned out a complete mess after some years...
>>>
>>> Sorry if that wasn't clear, I was asking if there were any more comments from the glibc side of things before merging the Linux code.
>>
>> Right, so is this already settle to be the de-factor ABI to query for system
>> information in RISCV? Or is it still being discussed? Is it in a next branch
>> already, and/or have been tested with a patch glibc?
> 
> It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
> 
> IIUC Evan's been testing the kernel+glibc stuff on QEMU, but he should be able to ack that explicitly (it's a little vague in the cover letter).  There's also a glibc-independent kselftest as part of the kernel patch set: https://lore.kernel.org/all/20230327163203.2918455-6-evan@rivosinc.com/ .

I am not sure if this is the latest thread, but it seems from the cover letter
link that Arnd has raised some concerns about the interface [1] that have not
been fully addressed.

From the libc perspective, the need to specify the query keys on riscv_hwprobe 
should not be a problem (libc must know what to handle; unknown tags are of no 
use) and it simplifies the buffer management (so there is no need to query for 
an unknown set of keys or to allocate a large buffer to handle multiple 
non-required pairs).

However, I agree with Arnd that there should be no need to optimize for hardware
that has an asymmetric set of features and, at least for glibc usage and most 
runtime feature selection, it does not make sense to query per-cpu information
(unless you do some very specific programming, like pinning the process to
specific cores and enabling core-specific code).

I also wonder how hotplug or cpusets would play with the vDSO support, and how
the kernel would synchronize updates, if any, to the private vDSO data.

[1] https://lore.kernel.org/lkml/20230221190858.3159617-1-evan@rivosinc.com/T/#m452cffd9f60684e9d6d6dccf595f33ecfbc99be2
  
Jeff Law March 30, 2023, 6:20 a.m. UTC | #7
On 3/29/23 13:45, Palmer Dabbelt wrote:

> It's not in for-next yet, but various patch sets / proposals have been 
> on the lists for a few months and it seems like discussion on the kernel 
> side has pretty much died down.  That's why I was pinging the glibc side 
> of things, if anyone here has comments on the interface then it's time 
> to chime in.  If there's no comments then we're likely to end up with 
> this in the next release (so queue into for-next soon, Linus' master in 
> a month or so).
Right.  And I've suggested that we at least try to settle on the various 
mem* and str* implementations independently of the kernel->glibc 
interface question.

I don't much care how we break down the problem of selecting 
implementations, just that we get started.   That can and probably 
should be happening in parallel with the kernel->glibc API work.

I've got some performance testing to do in this space (primarily of the 
VRULL implementations).  It's just going to take a long time to get the 
data.  And that implementation probably needs some revamping after all 
the work on the mem* and str* infrastructure that landed earlier this year.

jeff
  
Evan Green March 30, 2023, 6:31 p.m. UTC | #8
Hi Adhemerval,

On Wed, Mar 29, 2023 at 1:13 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 29/03/23 16:45, Palmer Dabbelt wrote:
> > On Wed, 29 Mar 2023 12:16:39 PDT (-0700), adhemerval.zanella@linaro.org wrote:
> >>
> >>
> >> On 28/03/23 21:01, Palmer Dabbelt wrote:
> >>> On Tue, 28 Mar 2023 16:41:10 PDT (-0700), adhemerval.zanella@linaro.org wrote:
> >>>>
> >>>>
> >>>> On 28/03/23 19:54, Palmer Dabbelt wrote:
> >>>>> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
> >>>>>>
> >>>>>> This series illustrates the use of a proposed Linux syscall that
> >>>>>> enumerates architectural information about the RISC-V cores the system
> >>>>>> is running on. In this series we expose a small wrapper function around
> >>>>>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
> >>>>>> access is "fast" on this hardware. If it is, it selects a newly provided
> >>>>>> implementation of memcpy that doesn't work hard at aligning the src and
> >>>>>> destination buffers.
> >>>>>>
> >>>>>> This is somewhat of a proof of concept for the syscall itself, but I do
> >>>>>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
> >>>>>> least as well as the generic C version. This is however on Qemu on an M1
> >>>>>> mac, so not a test of any real hardware (more a smoke test that the
> >>>>>> implementation isn't silly).
> >>>>>
> >>>>> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
> >>>>>
> >>>>> Aside from those comments,
> >>>>>
> >>>>> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
> >>>>>
> >>>>> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
> >>>>>
> >>>>> Unless anyone else has comments?
> >>>>
> >>>> Until riscv_hwprobe is not on Linus tree as official Linux ABI this patchset
> >>>> can not be installed.  We failed to enforce it on some occasion (like Intel
> >>>> CET) and it turned out a complete mess after some years...
> >>>
> >>> Sorry if that wasn't clear, I was asking if there were any more comments from the glibc side of things before merging the Linux code.
> >>
> >> Right, so is this already settle to be the de-factor ABI to query for system
> >> information in RISCV? Or is it still being discussed? Is it in a next branch
> >> already, and/or have been tested with a patch glibc?
> >
> > It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
> >
> > IIUC Evan's been testing the kernel+glibc stuff on QEMU, but he should be able to ack that explicitly (it's a little vague in the cover letter).  There's also a glibc-independent kselftest as part of the kernel patch set: https://lore.kernel.org/all/20230327163203.2918455-6-evan@rivosinc.com/ .
>
> I am not sure if this is latest thread, but it seems that from cover letter link
> Arnd has raised some concerns about the interface [1] that has not been fully
> addressed.

I've replied to that thread.

>
> From libc perspective, the need to specify the query key on riscv_hwprobe should
> not be a problem (libc must know what tohandle, unknown tags are no use) and it
> simplifies the buffer management (so there is no need to query for unknown set of
> keys of a allocate a large buffer to handle multiple non-required pairs).
>
> However, I agree with Arnd that there should be no need to optimize for hardware
> that has an asymmetric set of features and, at least for glibc usage and most
> runtime feature selection, it does not make sense to query per-cpu information
> (unless you some very specific programming, like pine the process to specific
> cores and enable core-specific code).

I pushed back on that in my reply upstream; feel free to jump in
there. I think you're right that glibc probably wouldn't ever use the
cpuset aspect of the interface, but the gist of my reply upstream is
that more specialized apps may.

>
> I also wonder how hotplug or cpusets would play with the vDSO support, and how
> kernel would synchronize the update, if any, to the prive vDSO data.

The good news is that the cached data in the vDSO is not ABI; it's
hidden behind the vDSO function. So as things like hotplug start
evolving and interacting with the vDSO cache data, we can update what
data we cache and when we fall back to the syscall.
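
Concretely, the pattern being described is roughly the sketch below; the vDSO
function-pointer handling and the syscall number stand in for the real
dl-vdso-setup plumbing and are assumptions, not the actual patch:

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

struct riscv_hwprobe { int64_t key; uint64_t value; };

/* Pointer that startup code would fill in if the kernel exports a
   __vdso_riscv_hwprobe entry; NULL means no vDSO help is available.
   The symbol name and prototype are assumptions.  */
static long (*hwprobe_vdso) (struct riscv_hwprobe *, size_t, size_t,
                             unsigned long *, unsigned int);

static long
hwprobe (struct riscv_hwprobe *pairs, size_t pair_count,
         size_t cpu_count, unsigned long *cpus, unsigned int flags)
{
  /* Prefer the vDSO entry point: it can answer common queries from data
     the kernel cached for this process, without a context switch.  */
  if (hwprobe_vdso != NULL)
    return hwprobe_vdso (pairs, pair_count, cpu_count, cpus, flags);

  /* Old kernel or no vDSO symbol: fall back to the real system call
     (258 is the number proposed for __NR_riscv_hwprobe).  */
  return syscall (258, pairs, pair_count, cpu_count, cpus, flags);
}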

-Evan

>
> [1] https://lore.kernel.org/lkml/20230221190858.3159617-1-evan@rivosinc.com/T/#m452cffd9f60684e9d6d6dccf595f33ecfbc99be2
  
Evan Green March 30, 2023, 6:43 p.m. UTC | #9
On Wed, Mar 29, 2023 at 11:20 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 3/29/23 13:45, Palmer Dabbelt wrote:
>
> > It's not in for-next yet, but various patch sets / proposals have been
> > on the lists for a few months and it seems like discussion on the kernel
> > side has pretty much died down.  That's why I was pinging the glibc side
> > of things, if anyone here has comments on the interface then it's time
> > to chime in.  If there's no comments then we're likely to end up with
> > this in the next release (so queue into for-next soon, Linus' master in
> > a month or so).
> Right.  And I've suggested that we at least try to settle on the various
> mem* and str* implementations independently of the kernel->glibc
> interface question.

This works for me. As we talked about off-list, this series cleaves
pretty cleanly. One option would be to take this series now(ish,
whenever the kernel series lands), then cleave off my memcpy and
replace it with VRULL's when it's ready, the hope being that two
incremental improvements land faster than waiting to try to land
everything perfectly all at once.
-Evan

>
> I don't much care how we break down the problem of selecting
> implementations, just that we get started.   That can and probably
> should be happening in parallel with the kernel->glibc API work.
>
> I've got some performance testing to do in this space (primarily of the
> VRULL implementations).  It's just going to take a long time to get the
> data.  And that implementation probably needs some revamping after all
> the work on the mem* and str* infrastructure that landed earlier this year.
>
> jeff
  
Adhemerval Zanella Netto March 30, 2023, 7:38 p.m. UTC | #10
On 30/03/23 03:20, Jeff Law wrote:
> 
> 
> On 3/29/23 13:45, Palmer Dabbelt wrote:
> 
>> It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
> Right.  And I've suggested that we at least try to settle on the various mem* and str* implementations independently of the kernel->glibc interface question.
> 
> I don't much care how we break down the problem of selecting implementations, just that we get started.   That can and probably should be happening in parallel with the kernel->glibc API work.
> 
> I've got some performance testing to do in this space (primarily of the VRULL implementations).  It's just going to take a long time to get the data.  And that implementation probably needs some revamping after all the work on the mem* and str* infrastructure that landed earlier this year.
> 

I don't think glibc is the right place for a code dump, especially for implementations
that do not have representative performance numbers on real hardware and might
require further tuning.  It can be even trickier if you require a different build
config for testing, as we used to have for some ABIs (for instance on powerpc with
--with-cpu); at least for ifunc we have some mechanism to test multiple variants,
assuming the chip supports them (which should be the case for unaligned access).

For ARM we have the optimized-routines [1] project, which we use as a testbed for 
multiple implementations and which also, due to its licensing, makes it easier to 
reuse the optimized routines in different projects.  We used to have a similar 
project, cortex-strings, at Linaro.

So for experimental routines, where you expect frequent tuning once you have
tested and benchmarked on different chips, an external project might be a better
idea; then sync with glibc once the routines are tested and validated.  And these
RISC-V ones do seem to be still very experimental, with performance numbers that
are still synthetic ones from emulators.

Another possibility might be to improve the generic implementation, as we have done
recently, where RISC-V bitmanip support was a matter of adding just 2 files and 4
functions to optimize multiple string functions [2].  I have some WIP patches to add
support for unaligned memcpy/memmove with a very simple strategy.

[1] https://github.com/ARM-software/optimized-routines
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=25788431c0f5264c4830415de0cdd4d9926cbad9
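
To give a flavor of what [2] builds on: with Zbb, a single orc.b lets the
generic word-at-a-time string code test a whole register for a zero byte.  A
minimal sketch (assuming the compiler targets a Zbb-enabled -march; the helper
name is illustrative, not glibc's):

#include <stdbool.h>

/* orc.b sets each result byte to 0xff if the corresponding source byte is
   non-zero and to 0x00 if it is zero, so a word contains a zero byte exactly
   when the result is not all-ones.  */
static inline bool
word_has_zero_byte (unsigned long word)
{
  unsigned long mask;
  asm ("orc.b %0, %1" : "=r" (mask) : "r" (word));
  return mask != (unsigned long) -1;
}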
  
Adhemerval Zanella Netto March 30, 2023, 7:43 p.m. UTC | #11
On 30/03/23 15:31, Evan Green wrote:
> Hi Adhemerval,
> 
> On Wed, Mar 29, 2023 at 1:13 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 29/03/23 16:45, Palmer Dabbelt wrote:
>>> On Wed, 29 Mar 2023 12:16:39 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>>>>
>>>>
>>>> On 28/03/23 21:01, Palmer Dabbelt wrote:
>>>>> On Tue, 28 Mar 2023 16:41:10 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>>>>>>
>>>>>>
>>>>>> On 28/03/23 19:54, Palmer Dabbelt wrote:
>>>>>>> On Tue, 21 Feb 2023 11:15:34 PST (-0800), Evan Green wrote:
>>>>>>>>
>>>>>>>> This series illustrates the use of a proposed Linux syscall that
>>>>>>>> enumerates architectural information about the RISC-V cores the system
>>>>>>>> is running on. In this series we expose a small wrapper function around
>>>>>>>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>>>>>>>> access is "fast" on this hardware. If it is, it selects a newly provided
>>>>>>>> implementation of memcpy that doesn't work hard at aligning the src and
>>>>>>>> destination buffers.
>>>>>>>>
>>>>>>>> This is somewhat of a proof of concept for the syscall itself, but I do
>>>>>>>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>>>>>>>> least as well as the generic C version. This is however on Qemu on an M1
>>>>>>>> mac, so not a test of any real hardware (more a smoke test that the
>>>>>>>> implementation isn't silly).
>>>>>>>
>>>>>>> QEMU isn't a good enough benchmark to justify a new memcpy routine in glibc.  Evan has a D1, which does support misaligned access and runs some simple benchmarks faster.  There's also been some minor changes to the Linux side of things that warrant a v3 anyway, so he'll just post some benchmarks on HW along with that.
>>>>>>>
>>>>>>> Aside from those comments,
>>>>>>>
>>>>>>> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
>>>>>>>
>>>>>>> There's a lot more stuff to probe for, but I think we've got enough of a proof of concept for the hwprobe stuff that we can move forward with the core interface bits in Linux/glibc and then unleash the chaos...
>>>>>>>
>>>>>>> Unless anyone else has comments?
>>>>>>
>>>>>> Until riscv_hwprobe is not on Linus tree as official Linux ABI this patchset
>>>>>> can not be installed.  We failed to enforce it on some occasion (like Intel
>>>>>> CET) and it turned out a complete mess after some years...
>>>>>
>>>>> Sorry if that wasn't clear, I was asking if there were any more comments from the glibc side of things before merging the Linux code.
>>>>
>>>> Right, so is this already settle to be the de-factor ABI to query for system
>>>> information in RISCV? Or is it still being discussed? Is it in a next branch
>>>> already, and/or have been tested with a patch glibc?
>>>
>>> It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
>>>
>>> IIUC Evan's been testing the kernel+glibc stuff on QEMU, but he should be able to ack that explicitly (it's a little vague in the cover letter).  There's also a glibc-independent kselftest as part of the kernel patch set: https://lore.kernel.org/all/20230327163203.2918455-6-evan@rivosinc.com/ .
>>
>> I am not sure if this is latest thread, but it seems that from cover letter link
>> Arnd has raised some concerns about the interface [1] that has not been fully
>> addressed.
> 
> I've replied to that thread.
> 
>>
>> From libc perspective, the need to specify the query key on riscv_hwprobe should
>> not be a problem (libc must know what tohandle, unknown tags are no use) and it
>> simplifies the buffer management (so there is no need to query for unknown set of
>> keys of a allocate a large buffer to handle multiple non-required pairs).
>>
>> However, I agree with Arnd that there should be no need to optimize for hardware
>> that has an asymmetric set of features and, at least for glibc usage and most
>> runtime feature selection, it does not make sense to query per-cpu information
>> (unless you some very specific programming, like pine the process to specific
>> cores and enable core-specific code).
> 
> I pushed back on that in my reply upstream, feel free to jump in
> there. I think you're right that glibc probably wouldn't ever use the
> cpuset aspect of the interface, but the gist of my reply upstream is
> that more specialized apps may.

Well, I still think providing userland with an asymmetric set of features is
a complexity that does not pay off, but at least the interface does allow
returning a concise view of the supported features.

> 
>>
>> I also wonder how hotplug or cpusets would play with the vDSO support, and how
>> kernel would synchronize the update, if any, to the prive vDSO data.
> 
> The good news is that the cached data in the vDSO is not ABI, it's
> hidden behind the vDSO function. So as things like hotplug start
> evolving and interacting with the vDSO cache data, we can update what
> data we cache and when we fall back to the syscall.

Right, I was just curious how one would synchronize the vDSO code with
concurrent updates from the kernel.  Some time ago, I was working with another
kernel developer on a vDSO getrandom; it required a lot of boilerplate, and
even then we did not come up with a good interface for concurrent access to a
structure that the kernel might change concurrently.
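
For illustration only, the usual way kernels let the vDSO read data they may
update is a seqcount retry loop on the reader side; the layout below is
invented, since no hwprobe vDSO data format is defined yet:

#include <stdint.h>

/* Invented layout: the kernel would bump `seq` to an odd value before
   updating the cached data and to the next even value afterwards.  */
struct vdso_hwprobe_data
{
  volatile uint32_t seq;
  uint64_t cached_value;
};

static uint64_t
read_cached_value (const struct vdso_hwprobe_data *data)
{
  uint32_t start;
  uint64_t value;

  do
    {
      start = data->seq;                  /* snapshot the sequence counter */
      __atomic_thread_fence (__ATOMIC_ACQUIRE);
      value = data->cached_value;
      __atomic_thread_fence (__ATOMIC_ACQUIRE);
    }
  while ((start & 1) != 0                 /* writer in progress */
         || data->seq != start);          /* data changed underneath us */

  return value;
}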
  
Jeff Law March 31, 2023, 5:09 a.m. UTC | #12
On 3/30/23 12:43, Evan Green wrote:
> On Wed, Mar 29, 2023 at 11:20 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>>
>>
>> On 3/29/23 13:45, Palmer Dabbelt wrote:
>>
>>> It's not in for-next yet, but various patch sets / proposals have been
>>> on the lists for a few months and it seems like discussion on the kernel
>>> side has pretty much died down.  That's why I was pinging the glibc side
>>> of things, if anyone here has comments on the interface then it's time
>>> to chime in.  If there's no comments then we're likely to end up with
>>> this in the next release (so queue into for-next soon, Linus' master in
>>> a month or so).
>> Right.  And I've suggested that we at least try to settle on the various
>> mem* and str* implementations independently of the kernel->glibc
>> interface question.
> 
> This works for me. As we talked about off-list, this series cleaves
> pretty cleanly. One option would be to take this series now(ish,
> whenever the kernel series lands), then cleave off my memcpy and
> replace it with Vrull's when it's ready. The hope being that two
> incremental improvements go faster than waiting to try and land
> everything perfectly all at once.
No idea at this point if VRULL's is better or worse ;-)  Right now I'm 
focused on their cboz implementation of memset.  Assuming no uarch 
quirks it should be a slam dunk.  But of course there's a quirk in our 
uarch, so testing testing testing.

I did just spend a fair amount of time in the hottest path of their 
strcmp.  It seems quite reasonable.

Jeff
  
Jeff Law March 31, 2023, 6:07 p.m. UTC | #13
On 3/30/23 13:38, Adhemerval Zanella Netto wrote:
> 
> 
> On 30/03/23 03:20, Jeff Law wrote:
>>
>>
>> On 3/29/23 13:45, Palmer Dabbelt wrote:
>>
>>> It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
>> Right.  And I've suggested that we at least try to settle on the various mem* and str* implementations independently of the kernel->glibc interface question.
>>
>> I don't much care how we break down the problem of selecting implementations, just that we get started.   That can and probably should be happening in parallel with the kernel->glibc API work.
>>
>> I've got some performance testing to do in this space (primarily of the VRULL implementations).  It's just going to take a long time to get the data.  And that implementation probably needs some revamping after all the work on the mem* and str* infrastructure that landed earlier this year.
>>
> 
> I don't think glibc is the right place for code dump, specially for implementations
> that does not have representative performance numbers in real hardware and might
> require further tuning.  It can be even tricky if you require different build config
> to testing as used to have for some ABI (for instance on powerpc with --with-cpu),
> at least for ifunc we have some mechanism to test multiple variants assuming the
> chips at least support (which should be case for unaligned).
It's not meant to be a "code dump".  It's "these are the recommended 
implementations and we're just waiting for the final ifunc wiring to use 
them automatically."

But I understand your point.  Even just agreeing on the implementations, 
without committing them until the ifunc interface is settled, would be a 
major step forward.

My larger point is that we need to work through the str* and mem* 
implementations and settle on those implementations, and that can happen 
independently of the interface discussion with the kernel team.  If 
we've settled on specific implementations, why not go ahead and put them 
into the repo with the expectation that we can trivially wire them into 
the ifunc resolver once the ABI interface is sorted out?




> 
> So for experimental routines, where you expect to have frequent tuning based on
> once you have tested and benchmarks on different chips; an external project
> might a better idea; and sync with glibc once the routines are tested and validate.
> And these RISCV does seemed to be still very experimental, where performance numbers
> are still synthetic ones from emulators.
I think we're actually a lot closer than you might think :-)  My goal 
would be that we're not doing frequent tuning and that we avoid 
uarch-specific versions if we at all can.  There's a reasonable chance we 
can do that if we have good baseline, Zbb and vector versions.  I'm not 
including cboz memory clearing right now -- there's already evidence that 
uarch considerations around cboz may be significant.


> 
> Another possibility might to improve the generic implementation, as we have done
> recently where RISCV bitmanip was a matter to add just 2 files and 4 functions
> to optimize multiple string functions [2].  I have some WIP patches to add support
> for unaligned memcpy/memmove with a very simple strategy.
As I noted elsewhere, I was on the fence about pushing for improvements 
to the generic strcmp bits, but could easily be swayed to that position.

jeff
  
Palmer Dabbelt March 31, 2023, 6:34 p.m. UTC | #14
On Fri, 31 Mar 2023 11:07:02 PDT (-0700), jeffreyalaw@gmail.com wrote:
>
>
> On 3/30/23 13:38, Adhemerval Zanella Netto wrote:
>>
>>
>> On 30/03/23 03:20, Jeff Law wrote:
>>>
>>>
>>> On 3/29/23 13:45, Palmer Dabbelt wrote:
>>>
>>>> It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
>>> Right.  And I've suggested that we at least try to settle on the various mem* and str* implementations independently of the kernel->glibc interface question.
>>>
>>> I don't much care how we break down the problem of selecting implementations, just that we get started.   That can and probably should be happening in parallel with the kernel->glibc API work.
>>>
>>> I've got some performance testing to do in this space (primarily of the VRULL implementations).  It's just going to take a long time to get the data.  And that implementation probably needs some revamping after all the work on the mem* and str* infrastructure that landed earlier this year.
>>>
>>
>> I don't think glibc is the right place for code dump, specially for implementations
>> that does not have representative performance numbers in real hardware and might
>> require further tuning.  It can be even tricky if you require different build config
>> to testing as used to have for some ABI (for instance on powerpc with --with-cpu),
>> at least for ifunc we have some mechanism to test multiple variants assuming the
>> chips at least support (which should be case for unaligned).
> It's not meant to be "code dump".  It's "these are the recommended
> implementation and we're just waiting for the final ifunc wiring to use
> them automatically."
>
> But I understand your point. Even if we just agree on the
> implementations without committing until the ifunc interface is settled
> is a major step forward.
>
> My larger point is that we need to work through the str* and mem*
> implementations and settle on those implementations and that can happen
> in independently of the interface discussion with the kernel team.  If
> we've settled on specific implementations, why not go ahead and put them
> into the repo with the expectation that we can trivially wire them into
> the ifunc resolver once the abi interface is sorted out.

IMO that's fine: we've got a bunch of other infrastructure around these 
optimized routines that will need to get built (glibc_hwcaps, for 
example) so it's not like just having hwprobe means we're done.

The only issue I see with having these in tree is that we'll end up with 
glibc binaries that have vendor-specific tunings, but no way to provide 
those with generic binaries.  That means vendors will end up shipping 
these non-portable binaries.  We've historically tried to avoid that 
wherever possible, but it's probably time to call that a pipe dream -- 
the only base we could really have is rv64gc, and that's going to be so 
slow it's essentially useless for any real systems.

So if you guys have actual performance gain numbers to talk about, then 
I'm happy taking the optimized glibc routines (or at least whatever bits 
of them are in RISC-V land) for that hardware -- even if it means 
there's a build-time configuration that results in Ventana-specific 
binaries.

I think we do want to keep pushing on the dynamic flavors of stuff, just 
so we can try to dig out of this hole at some point, but we're going to 
have a mess until the ISA gets sorted out.  My guess is that will take 
years, and blocking the optimizations until then is just going to lead 
to a bunch of out-of-tree ports from vendors and an even bigger mess.

>> So for experimental routines, where you expect to have frequent tuning based on
>> once you have tested and benchmarks on different chips; an external project
>> might a better idea; and sync with glibc once the routines are tested and validate.
>> And these RISCV does seemed to be still very experimental, where performance numbers
>> are still synthetic ones from emulators.
> I think we're actually a lot closer than you might think :-)  My goal
> would be that we're not doing frequent tuning and avoid uarch specific
> versions if we at all can.  There's a reasonable chance we can do that
> if we have good baseline, zbb and vector versions.  I'm not including

Unfortunately there's going to be very wide variation in performance 
between vendors for the vector extension; we're going to have at least 3 
flavors of anything there (plus whatever Allwinner/T-Head ends up 
needing, but that's a whole can of worms).  So I think at this point 
we'd be better off just calling these vendor-specific routines; if 
there's some commonality between them we can sort it out later.

> cboz memory clear right now -- there's already evidence that uarch
> considerations around cboz may be significant.

Yep, again there's at least 3 ways of implementing CBOZ that I've seen 
floating around so we're going to have a vendor-specific mess there.

>> Another possibility might to improve the generic implementation, as we have done
>> recently where RISCV bitmanip was a matter to add just 2 files and 4 functions
>> to optimize multiple string functions [2].  I have some WIP patches to add support
>> for unaligned memcpy/memmove with a very simple strategy.
> As I noted elsewhere.  I was on the fence with pushing for improvements
> to the generic strcmp bits, but could be easily swayed to that position.
>
> jeff
  
Adhemerval Zanella Netto March 31, 2023, 7:32 p.m. UTC | #15
On 31/03/23 15:34, Palmer Dabbelt wrote:
> On Fri, 31 Mar 2023 11:07:02 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>
>>
>> On 3/30/23 13:38, Adhemerval Zanella Netto wrote:
>>>
>>>
>>> On 30/03/23 03:20, Jeff Law wrote:
>>>>
>>>>
>>>> On 3/29/23 13:45, Palmer Dabbelt wrote:
>>>>
>>>>> It's not in for-next yet, but various patch sets / proposals have been on the lists for a few months and it seems like discussion on the kernel side has pretty much died down.  That's why I was pinging the glibc side of things, if anyone here has comments on the interface then it's time to chime in.  If there's no comments then we're likely to end up with this in the next release (so queue into for-next soon, Linus' master in a month or so).
>>>> Right.  And I've suggested that we at least try to settle on the various mem* and str* implementations independently of the kernel->glibc interface question.
>>>>
>>>> I don't much care how we break down the problem of selecting implementations, just that we get started.   That can and probably should be happening in parallel with the kernel->glibc API work.
>>>>
>>>> I've got some performance testing to do in this space (primarily of the VRULL implementations).  It's just going to take a long time to get the data.  And that implementation probably needs some revamping after all the work on the mem* and str* infrastructure that landed earlier this year.
>>>>
>>>
>>> I don't think glibc is the right place for code dump, specially for implementations
>>> that does not have representative performance numbers in real hardware and might
>>> require further tuning.  It can be even tricky if you require different build config
>>> to testing as used to have for some ABI (for instance on powerpc with --with-cpu),
>>> at least for ifunc we have some mechanism to test multiple variants assuming the
>>> chips at least support (which should be case for unaligned).
>> It's not meant to be "code dump".  It's "these are the recommended
>> implementation and we're just waiting for the final ifunc wiring to use
>> them automatically."
>>
>> But I understand your point. Even if we just agree on the
>> implementations without committing until the ifunc interface is settled
>> is a major step forward.
>>
>> My larger point is that we need to work through the str* and mem*
>> implementations and settle on those implementations and that can happen
>> in independently of the interface discussion with the kernel team.  If
>> we've settled on specific implementations, why not go ahead and put them
>> into the repo with the expectation that we can trivially wire them into
>> the ifunc resolver once the abi interface is sorted out.
> 
> IMO that's fine: we've got a bunch of other infrastructure around these optimized routines that will need to get built (glibc_hwcaps, for example) so it's not like just having hwprobe means we're done.
> 
> The only issue I see with having these in tree is that we'll end up with glibc binaries that have vendor-specific tunings, but no way to provide those with generic binaries.  That means vendors will end up shipping these non-portable binaries.  We've historically tried to avoid that wherever possible, but it's probably time to call that a pipe dream -- the only base we could really have is rv64gc, and that's going to be so slow it's essentially useless for any real systems.
> 
> So if you guys have actual performance gain numbers to talk about, then I'm happy taking the optimized glibc routines (or at least whatever bits of them are in RISC-V land) for that hardware -- even if it means there's a build-time configuration that results in Ventana-specific binaries.
> 
> I think we do want to keep pushing on the dynamic flavors of stuff, just so we can try to dig out of this hole at some point, but we're going to have a mess until the ISA get sorted out.  My guess is that will take years, and blocking the optimizations until then is just going to lead to a bunch of out-of-tree ports from vendors and an even bigger mess.

It is still not clear to me how RISC-V, as an ABI and not as a specific vendor, 
wants to provide arch- and vendor-specific str* and mem* routines.  Christophe
has hinted that the focus is not a compile-only approach, so I take it that
--with-cpu support (similar to what some old ABIs used to provide, like powerpc)
is not an option.  However, this is not what the RVV proposal does [3], which is
to enable RVV iff you target glibc at RVV (so compile-only).

And that's why I asked you guys to first define how you want to approach
it.  

So I take it that RISC-V wants to follow what x86_64 and aarch64 do, which is to
provide optimized routines for a minimum ABI (say rv64gc), and then provide
runtime selection through ifunc for either ABI- or vendor-specific routines
(including variants like the unaligned optimization).  You can still follow
what x86_64 and s390 recently did, which is: if you define a minimum ABI
version, you default to the optimized version and either skip ifunc selection
or set up a more restricted set (so in the future, you can have an RVV-only
build that does not need to provide old Zbb or rv64gc support).

Which then leads to how to actually test and provide such support.  The
str* and mem* tests consult which ifunc variants are supported
(ifunc-impl-list.c) on the underlying hardware, while the selector returns
the best option.  Both rely on being able to query the hardware, or at least
which variants are supported, so I think RISC-V should first figure out this
part (unless you do want to follow the compile-only approach...)
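
(For reference, the test-side hook referred to above is __libc_ifunc_impl_list;
a rough sketch for memcpy, modeled on other ports' ifunc-impl-list.c and with
the hardware query reduced to a placeholder flag:)

#include <ifunc-impl-list.h>
#include <string.h>
#include <stdbool.h>

extern __typeof (memcpy) __memcpy_generic;
extern __typeof (memcpy) __memcpy_noalignment;

size_t
__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                        size_t max)
{
  size_t i = 0;
  bool fast_unaligned = false;   /* would be filled in from __riscv_hwprobe */

  /* Advertise every variant the hardware can run, so the string tests and
     benchtests exercise all of them, not only the resolver's choice.  */
  IFUNC_IMPL (i, name, memcpy,
              IFUNC_IMPL_ADD (array, i, memcpy, fast_unaligned,
                              __memcpy_noalignment)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))

  return i;
}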

So it does not make sense to me to have ifunc variants in the repo that are 
neither selected nor tested, only to be enabled at some foreseen future point.

[1] https://sourceware.org/pipermail/libc-alpha/2023-February/145392.html
[2] https://sourceware.org/pipermail/libc-alpha/2023-February/145414.html
[3] https://sourceware.org/pipermail/libc-alpha/2023-March/thread.html

> 
>>> So for experimental routines, where you expect to have frequent tuning based on
>>> once you have tested and benchmarks on different chips; an external project
>>> might a better idea; and sync with glibc once the routines are tested and validate.
>>> And these RISCV does seemed to be still very experimental, where performance numbers
>>> are still synthetic ones from emulators.
>> I think we're actually a lot closer than you might think :-)  My goal
>> would be that we're not doing frequent tuning and avoid uarch specific
>> versions if we at all can.  There's a reasonable chance we can do that
>> if we have good baseline, zbb and vector versions.  I'm not including
> 
> Unfortunately there's going to be very wide variation in performance between vendors for the vector extension, we're going to have at least 3 flavors of anything there (plus whatever Allwinner/T-Head ends up needing, but that's a whole can of worms).  So I think at this point we'd be better off just calling these vendor-specific routines, if there's some commonality between them we can sort it out later.
> 
>> cboz memory clear right now -- there's already evidence that uarch
>> considerations around cboz may be significant.
> 
> Yep, again there's at least 3 ways of implementing CBOZ that I've seen floating around so we're going to have a vendor-specific mess there.
> 
>>> Another possibility might to improve the generic implementation, as we have done
>>> recently where RISCV bitmanip was a matter to add just 2 files and 4 functions
>>> to optimize multiple string functions [2].  I have some WIP patches to add support
>>> for unaligned memcpy/memmove with a very simple strategy.
>> As I noted elsewhere.  I was on the fence with pushing for improvements
>> to the generic strcmp bits, but could be easily swayed to that position.
>>
>> jeff
  
Jeff Law March 31, 2023, 8:19 p.m. UTC | #16
On 3/31/23 13:32, Adhemerval Zanella Netto wrote:
> 
> It is still not clear to me what RISCV, as ABI and not as an specific vendor,
> wants to provide arch and vendor specific str* and mem* routines.  Christophe
> has hinted that the focus is not compile-only approach, so I take --with-cpu
> support (similar to what some old ABI used to provide, like powerpc) is not
> an option.  However, this is not what the RVV proposal does [3], which is to
> enable RVV iff you target glibc to rvv (so compile-only).
I believe there is consensus on the desire to use dynamic dispatch via 
an ifunc resolver.
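
For reference, the mechanism itself is just the GNU ifunc attribute: the
resolver runs at relocation time and whatever it returns becomes the target
of every later call.  A toy, self-contained sketch with made-up names and a
placeholder probe, independent of how the hardware query ends up working:

  #include <string.h>

  /* Two stand-in implementations to dispatch between.  */
  static void *
  copy_generic (void *d, const void *s, size_t n)
  {
    return memcpy (d, s, n);
  }

  static void *
  copy_fancy (void *d, const void *s, size_t n)
  {
    return memcpy (d, s, n);   /* placeholder for an optimized version */
  }

  static int
  hardware_is_fancy (void)
  {
    return 0;   /* placeholder for hwprobe/HWCAP probing */
  }

  typedef void *copy_fn (void *, const void *, size_t);

  /* The resolver runs once, during relocation; whatever it returns is
     what every later call to my_copy jumps to.  */
  static copy_fn *
  my_copy_resolver (void)
  {
    return hardware_is_fancy () ? copy_fancy : copy_generic;
  }

  void *my_copy (void *, const void *, size_t)
       __attribute__ ((ifunc ("my_copy_resolver")));

Inside glibc the same thing is normally spelled with the libc_ifunc helper
macro rather than the raw attribute.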




> 
> And that's why I asked you guys to first define how you want to approach
> it.
I think that's already done.  I don't really see any confusion in this 
space.

The patch from the SiFive team has static dispatch, though they made it clear 
they want dynamic dispatch.  Static dispatch is just a stopgap until the 
dynamic dispatch work is ready, AFAICT.

Rivos had a dynamic dispatch mechanism based on riscv_hwprobe.

VRULL had a dynamic dispatch based on an environment variable.  This was 
acknowledged to be a hack which would be dropped once the kernel->glibc 
interface bits were sorted out.

Ventana doesn't have patches in this space, but had been using the VRULL 
bits.  I don't really have a preference as far as implementations.  I 
just want to define good ones that cover the most important cases, 
particularly with regard to ISA extensions, but I'm even willing to 
narrow the immediate focus down further (see below).


> 
> So I take it that RISC-V wants to follow what x86_64 and aarch64 do, which is
> to provide optimized routines for a minimum ABI (say rv64gc), and then provide
> runtime selection through ifunc for either ABI- or vendor-specific routines
> (including variants like the unaligned optimization).

Right.  That's basically what I think we're trying to do.  Find a 
suitable implementation we can agree upon for a given ISA. 
The belief right now is that we need one for the baseline architecture, 
one for architectures implementing ZBB and another for architectures 
that implement RVV.  ZBB and RVV are not uarch variants; they are 
standardized, but optional ISA features.
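
Concretely, that presumably ends up as a single resolver choosing between
the three flavours, keyed off whatever extension-discovery interface gets
settled on (hwprobe bits, HWCAP, ...).  A rough sketch -- the probe helpers
and variant names below are entirely hypothetical, since the ZBB/V discovery
bits aren't nailed down yet:

  #include <string.h>

  extern __typeof (strcmp) __strcmp_generic;   /* rv64gc baseline, hypothetical */
  extern __typeof (strcmp) __strcmp_zbb;       /* hypothetical */
  extern __typeof (strcmp) __strcmp_rvv;       /* hypothetical */

  /* Placeholders for whatever hwprobe/HWCAP query ends up existing.  */
  extern int have_rvv (void);
  extern int have_zbb (void);

  static __typeof (strcmp) *
  select_strcmp (void)
  {
    if (have_rvv ())
      return __strcmp_rvv;
    if (have_zbb ())
      return __strcmp_zbb;
    return __strcmp_generic;   /* always-safe baseline */
  }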

I don't think anyone is (yet!) pushing for uarch variants.  In fact, I 
would very much like to avoid that as much as I can.  Palmer might see 
uarch variants as inevitable; I don't (and maybe I'm being naive).




> You can still follow
> what x86_64 and s390 recently did, which is: if you define a minimum ABI
> version, you default to the optimized version and either skip ifunc selection
> or set up a more restricted set (so in the future, you can have an RVV-only build
> that does not need to provide the old zbb or rv64gc support).
I'm focused on defining an implementation for the baseline architecture 
as well as ones for the ZBB and RVV ISAs.


> 
> Which then leads to how to actually test and provide such support.  The
> str* and mem* tests consult which ifunc variants are supported
> (ifunc-impl-list.c) on the underlying hardware, while the selector returns
> the best option.  Both rely on being able to query the hardware, or at least
> which versions are supported, so I think RISC-V should first figure out this
> part (unless you do want to follow the compile-only approach...)

> 
> So it does not make sense to me to have ifunc variants in the repo that are
> not selected or tested, only to be enabled at some foreseen point in the future.
I think this is the core point we disagree on.  I understand your 
position and respectfully disagree, but I'm willing to set it aside.

So perhaps we can narrow down the scope right now even further.  Can we 
agree to try and settle on a base implementation with no ISA extensions 
and no uarch variants?  ISTM if we can settle on those implementations 
that it should be usable immediately by the RV community at large and 
doesn't depend on the kernel->glibc interface work.


Jeff
  
Palmer Dabbelt March 31, 2023, 9:03 p.m. UTC | #17
On Fri, 31 Mar 2023 13:19:19 PDT (-0700), jeffreyalaw@gmail.com wrote:

[just snipping the rest so we can focus on Jeff's ask, the other stuff 
is interesting but a longer reply and we'd probably want to fork the 
thread anyway...]

> So perhaps we can narrow down the scope right now even further.  Can we
> agree to try and settle on a base implementation with no ISA extensions
> and no uarch variants?  ISTM if we can settle on those implementations
> that it should be usable immediately by the RV community at large and
> doesn't depend on the kernel->glibc interface work.

That base includes V and ZBB?  In that case we'd be dropping support for 
all existing hardware, which I would be very much against.
  
Jeff Law March 31, 2023, 9:35 p.m. UTC | #18
On 3/31/23 15:03, Palmer Dabbelt wrote:
> On Fri, 31 Mar 2023 13:19:19 PDT (-0700), jeffreyalaw@gmail.com wrote:
> 
> [just snipping the rest so we can focus on Jeff's ask, the other stuff 
> is interesting but a longer reply and we'd probably want to fork the 
> thread anyway...]
> 
>> So perhaps we can narrow down the scope right now even further.  Can we
>> agree to try and settle on a base implementation with no ISA extensions
>> and no uarch variants?  ISTM if we can settle on those implementations
>> that it should be usable immediately by the RV community at large and
>> doesn't depend on the kernel->glibc interface work.
> 
> That base includes V and ZBB?  In that case we'd be dropping support for 
> all existing hardware, which I would be very much against.
No, it would not include V or ZBB.  It would be something that could 
work on any risc-v hardware.  Sorry if I wasn't clear about that.

jeff
  
Palmer Dabbelt March 31, 2023, 9:38 p.m. UTC | #19
On Fri, 31 Mar 2023 14:35:36 PDT (-0700), jeffreyalaw@gmail.com wrote:
>
>
> On 3/31/23 15:03, Palmer Dabbelt wrote:
>> On Fri, 31 Mar 2023 13:19:19 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>
>> [just snipping the rest so we can focus on Jeff's ask, the other stuff
>> is interesting but a longer reply and we'd probably want to fork the
>> thread anyway...]
>>
>>> So perhaps we can narrow down the scope right now even further.  Can we
>>> agree to try and settle on a base implementation with no ISA extensions
>>> and no uarch variants?  ISTM if we can settle on those implementations
>>> that it should be usable immediately by the RV community at large and
>>> doesn't depend on the kernel->glibc interface work.
>>
>> That base includes V and ZBB?  In that case we'd be dropping support for
>> all existing hardware, which I would be very much against.
> No, it would not include V or ZBB.  It would be something that could
> work on any risc-v hardware.  Sorry if I wasn't clear about that.

I'm still kind of confused then, maybe it's just too abstract?  Is there 
something you could propose as being the base?
  
Jeff Law March 31, 2023, 10:10 p.m. UTC | #20
On 3/31/23 15:38, Palmer Dabbelt wrote:
> On Fri, 31 Mar 2023 14:35:36 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>
>>
>> On 3/31/23 15:03, Palmer Dabbelt wrote:
>>> On Fri, 31 Mar 2023 13:19:19 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>>
>>> [just snipping the rest so we can focus on Jeff's ask, the other stuff
>>> is interesting but a longer reply and we'd probably want to fork the
>>> thread anyway...]
>>>
>>>> So perhaps we can narrow down the scope right now even further.  Can we
>>>> agree to try and settle on a base implementation with no ISA extensions
>>>> and no uarch variants?  ISTM if we can settle on those implementations
>>>> that it should be usable immediately by the RV community at large and
>>>> doesn't depend on the kernel->glibc interface work.
>>>
>>> That base includes V and ZBB?  In that case we'd be dropping support for
>>> all existing hardware, which I would be very much against.
>> No, it would not include V or ZBB.  It would be something that could
>> work on any risc-v hardware.  Sorry if I wasn't clear about that.
> 
> I'm still kind of confused then, maybe it's just too abstract?  Is there 
> something you could propose as being the base?
So right now we use the generic (architecture independent) routines for 
str* and mem*.

If we look at (for example) strcmp, there are hand-written variants out 
there that are purported to have better performance than the generic code 
in glibc.

Note that any such performance claims likely predate the work from 
Adhemerval and others earlier this year to reduce the reliance on 
hand-coded assembly.

So the first step is to answer the question: for any str* or mem* where 
we've received a patch submission of a hand-coded assembly variant 
(which isn't using ZBB or V), does that hand-coded assembly variant 
significantly outperform the generic code currently in glibc?  If yes, 
and the generic code can't be significantly improved, then we should 
declare that hand-written variant as the standard baseline for risc-v in 
glibc.  Review, adjust, commit and move on.

My hope would be that many (most, all?) of the base architecture hand 
coded assembly variants no longer provide any significant benefit over 
the current generic versions.

That's my minimal proposal for now.  It's not meant to solve everything 
in this space, but at least carve out a chunk of the work and get it 
resolved one way or the other.

Does that help clarify what I'm suggesting?

Jeff
  
Palmer Dabbelt April 7, 2023, 3:36 p.m. UTC | #21
On Fri, 31 Mar 2023 15:10:24 PDT (-0700), jeffreyalaw@gmail.com wrote:
>
>
> On 3/31/23 15:38, Palmer Dabbelt wrote:
>> On Fri, 31 Mar 2023 14:35:36 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>>
>>>
>>> On 3/31/23 15:03, Palmer Dabbelt wrote:
>>>> On Fri, 31 Mar 2023 13:19:19 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>>>
>>>> [just snipping the rest so we can focus on Jeff's ask, the other stuff
>>>> is interesting but a longer reply and we'd probably want to fork the
>>>> thread anyway...]
>>>>
>>>>> So perhaps we can narrow down the scope right now even further.  Can we
>>>>> agree to try and settle on a base implementation with no ISA extensions
>>>>> and no uarch variants?  ISTM if we can settle on those implementations
>>>>> that it should be usable immediately by the RV community at large and
>>>>> doesn't depend on the kernel->glibc interface work.
>>>>
>>>> That base includes V and ZBB?  In that case we'd be dropping support for
>>>> all existing hardware, which I would be very much against.
>>> No, it would not include V or ZBB.  It would be something that could
>>> work on any risc-v hardware.  Sorry if I wasn't clear about that.
>>
>> I'm still kind of confused then, maybe it's just too abstract?  Is there
>> something you could propose as being the base?
> So right now we use the generic (architecture independent) routines for
> str* and mem*.
>
> If we look at (for example) strcmp, there are hand-written variants out
> there that are purported to have better performance than the generic code
> in glibc.
>
> Note that any such performance claims likely predate the work from
> Adhemerval and others earlier this year to reduce the reliance on
> hand-coded assembly.
>
> So the first step is to answer the question: for any str* or mem* where
> we've received a patch submission of a hand-coded assembly variant
> (which isn't using ZBB or V), does that hand-coded assembly variant
> significantly outperform the generic code currently in glibc?  If yes,
> and the generic code can't be significantly improved, then we should
> declare that hand-written variant as the standard baseline for risc-v in
> glibc.  Review, adjust, commit and move on.
>
> My hope would be that many (most, all?) of the base architecture hand
> coded assembly variants no longer provide any significant benefit over
> the current generic versions.
>
> That's my minimal proposal for now.  It's not meant to solve everything
> in this space, but at least carve out a chunk of the work and get it
> resolved one way or the other.
>
> Does that help clarify what I'm suggesting?

Sorry for being slow here, this fell off the queue.

I think this proposal is in theory what we've done; it's just that 
nobody's posted patches like that -- unless I missed something?  
Certainly the original port had some assembly routines and we tossed 
those because we didn't care enough to justify them.  If someone's got 
code then I'm happy to look, but we'd also need some benchmarks (on real 
HW that's publicly available), and that's usually the sticking point.

That said, I'd guess that anyone trying to ship real product is going to 
need at least V (or some other explicitly data parallel instructions) 
before the performance of these routines matters.