[v14,0/9] Add rseq extensible ABI support

Message ID 20241121190924.837446-1-mjeanson@efficios.com (mailing list archive)
Headers
Series Add rseq extensible ABI support |

Message

Michael Jeanson Nov. 21, 2024, 7:08 p.m. UTC
  Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
rseq features past the initial 32 bytes of the original ABI.

While the rseq features in the latest kernel still fit within the
original ABI size, there are currently only 4 bytes left. It would thus
be a good time to add support for the extensible ABI so that when new
features are added, they are immediately available to GNU libc users.

We use the ELF auxiliary vector to query the kernel for the size and
alignment of the rseq area, if this fails we default to the original
fixed size and alignment of '32' which the kernel will accept as a
compatibility mode with the original ABI.

This makes the size of the rseq area variable and thus requires to
relocate it out of 'struct pthread'. We chose to move it after (in block
allocation order) the last TLS block inside the static TLS block
allocation. It required a fairly small modification to the TLS block
allocator and did not interfere with the main executable TLS block which
must always be the first block relative to the thread pointer.

[1] https://lore.kernel.org/all/20221122203932.231377-4-mathieu.desnoyers@efficios.com/

Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: DJ Delorie <dj@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
Changes since v13:
- Ensure that the VOLATILE variants of the RSEQ_ accessors static
  asserts on 64bit types on 32bit architectures
- Move rtld_hidden_proto rseq symbols to a separate patch
Changes since v12:
- Set _rseq_size to 0 on registration failure
- Split RSEQ_SET/GETMEM from THREAD_SET/GETMEM
- Rename rseq_get_area() to RSEQ_SELF()
- Add rtld_hidden_proto to __rseq_size and __rseq_offset
- Add comment and variable array member to 'struct rseq_area'
- Style nits
Changes since v11:
- Removed _dl_rseq_feature_size, use __rseq_size instead
- Replace GLRO(dl_rseq_align) with a hidden global variable _rseq_align
- __rseq_size is now set directly in _dl_parse_auxv, set it to 0 when
  the main thread registration fails or is disabled by tunable
Changes since v10:
- Split the patchset in smaller patches
- Rebased on 'Make __rseq_size useful for feature detection'
- Remove 'rseq' from the generic TLS code, add 'extra TLS'
- Add thread_pointer.h for all Linux architectures
- Fix build on the Hurd
Changes since v8:
- Fix copyright year in sysdeps/generic/dl-rseq.h
- Clarify the the tcb math comments
- Add a comment to clarify what enforces the aligment requirements of a
  pointer calculated from the rseq_offset
- Remove nonsensical test in tst-rseq-disable
- Add comments to clarify why the rseq size is 0 when registration fails
  or is disabled
- Add comments to explain why whe allocate and rseq area block even when
  the registration is disabled by tunable
- Rename 'rseq_size' -> 'rseq_alloc_size' and 'dl_tls_rseq_size' ->
  'dl_tls_rseq_alloc_size' to clarify the distinction between the
  allocated rseq size and the size reported to application code in
  '__rseq_size'
Changes since v6:
- Fix tst-rseq for feature size over 32 bytes
- Rebased on 'nptl: fix potential merge of __rseq_* relro symbols'
Changes since v5:
- Fix TLS_DTV_AT_TP rseq offset with statically linked executables
Changes since RFC v4:
- Move dynamic linker defines to a header file
- Fix alignment when tls block align is smaller than rseq align with
  statically linked executables
- Add statically linked rseq tests
- Revert: Set __rseq_size even when the registration fails
- Use minimum size when rseq is disabled by tunable
Changes since RFC v3:
- Fix RSEQ_SETMEM for rseq disabled
- Replace sys/auxv.h usage with dl-parse_auxv.h
- Fix offset for TLS_TCB_AT_TP with statically linked executables
- Zero the rseq area before registration
Changes since RFC v2:
- Set __rseq_size even when the registration fails
- Adjust rseq tests to the new ABI
- Added support for statically linked executables
Changes since RFC v1:
- Insert the rseq area after the last TLS block
- Add proper support for TLS_TCB_AT_TP variant

---
Michael Jeanson (9):
  nptl: initialize cpu_id_start prior to rseq registration
  nptl: Add rseq auxvals
  Add generic 'extra TLS'
  Add Linux 'extra TLS'
  nptl: add rtld_hidden_proto to __rseq_size and __rseq_offset
  nptl: Introduce <rseq-access.h> for RSEQ_* accessors
  nptl: Move the rseq area to the 'extra TLS' block
  nptl: Remove the rseq area from 'struct pthread'
  Linux: Update internal copy of '<sys/rseq.h>'

 csu/libc-tls.c                                | 59 +++++++++--
 elf/dl-tls.c                                  | 59 +++++++++++
 nptl/descr.h                                  | 22 +----
 nptl/pthread_create.c                         |  2 +-
 sysdeps/generic/dl-extra_tls.h                | 45 +++++++++
 sysdeps/i386/nptl/rseq-access.h               | 98 +++++++++++++++++++
 sysdeps/nptl/dl-tls_init_tp.c                 | 23 ++---
 sysdeps/nptl/rseq-access.h                    | 58 +++++++++++
 sysdeps/unix/sysv/linux/Makefile              | 10 ++
 sysdeps/unix/sysv/linux/dl-extra_tls.h        | 70 +++++++++++++
 sysdeps/unix/sysv/linux/dl-parse_auxv.h       |  7 ++
 sysdeps/unix/sysv/linux/dl-rseq-symbols.S     | 27 +++--
 sysdeps/unix/sysv/linux/rseq-internal.h       | 94 ++++++++++++++----
 sysdeps/unix/sysv/linux/sched_getcpu.c        |  3 +-
 sysdeps/unix/sysv/linux/sys/rseq.h            | 11 +++
 .../unix/sysv/linux/tst-rseq-disable-static.c |  1 +
 sysdeps/unix/sysv/linux/tst-rseq-disable.c    | 64 ++++++++++--
 .../unix/sysv/linux/tst-rseq-nptl-static.c    |  1 +
 sysdeps/unix/sysv/linux/tst-rseq-static.c     |  1 +
 sysdeps/unix/sysv/linux/tst-rseq.c            | 84 +++++++++++++---
 sysdeps/unix/sysv/linux/tst-rseq.h            |  3 +-
 sysdeps/x86_64/nptl/rseq-access.h             | 79 +++++++++++++++
 22 files changed, 725 insertions(+), 96 deletions(-)
 create mode 100644 sysdeps/generic/dl-extra_tls.h
 create mode 100644 sysdeps/i386/nptl/rseq-access.h
 create mode 100644 sysdeps/nptl/rseq-access.h
 create mode 100644 sysdeps/unix/sysv/linux/dl-extra_tls.h
 create mode 100644 sysdeps/unix/sysv/linux/tst-rseq-disable-static.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-rseq-nptl-static.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-rseq-static.c
 create mode 100644 sysdeps/x86_64/nptl/rseq-access.h
  

Comments

Michael Jeanson Dec. 6, 2024, 3:50 p.m. UTC | #1
On 2024-11-21 14:08, Michael Jeanson wrote:
> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
> rseq features past the initial 32 bytes of the original ABI.
> 
> While the rseq features in the latest kernel still fit within the
> original ABI size, there are currently only 4 bytes left. It would thus
> be a good time to add support for the extensible ABI so that when new
> features are added, they are immediately available to GNU libc users.
> 
> We use the ELF auxiliary vector to query the kernel for the size and
> alignment of the rseq area, if this fails we default to the original
> fixed size and alignment of '32' which the kernel will accept as a
> compatibility mode with the original ABI.
> 
> This makes the size of the rseq area variable and thus requires to
> relocate it out of 'struct pthread'. We chose to move it after (in block
> allocation order) the last TLS block inside the static TLS block
> allocation. It required a fairly small modification to the TLS block
> allocator and did not interfere with the main executable TLS block which
> must always be the first block relative to the thread pointer.
> 
> [1] https://lore.kernel.org/all/20221122203932.231377-4-mathieu.desnoyers@efficios.com/

Ping? I see the release date for 2.41 and I'm getting a bit nervous.

Michael
  
Mathieu Desnoyers Dec. 7, 2024, 3:37 p.m. UTC | #2
On 2024-12-06 10:50, Michael Jeanson wrote:
> On 2024-11-21 14:08, Michael Jeanson wrote:
>> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
>> rseq features past the initial 32 bytes of the original ABI.
>>
>> While the rseq features in the latest kernel still fit within the
>> original ABI size, there are currently only 4 bytes left. It would thus
>> be a good time to add support for the extensible ABI so that when new
>> features are added, they are immediately available to GNU libc users.
>>
>> We use the ELF auxiliary vector to query the kernel for the size and
>> alignment of the rseq area, if this fails we default to the original
>> fixed size and alignment of '32' which the kernel will accept as a
>> compatibility mode with the original ABI.
>>
>> This makes the size of the rseq area variable and thus requires to
>> relocate it out of 'struct pthread'. We chose to move it after (in block
>> allocation order) the last TLS block inside the static TLS block
>> allocation. It required a fairly small modification to the TLS block
>> allocator and did not interfere with the main executable TLS block which
>> must always be the first block relative to the thread pointer.
>>
>> [1] https://lore.kernel.org/all/20221122203932.231377-4-mathieu.desnoyers@efficios.com/
> 
> Ping? I see the release date for 2.41 and I'm getting a bit nervous.

Florian, Carlos, DJ, what is missing to get this into 2.41 ?

Thanks,

Mathieu

> 
> Michael
>
  
Mathieu Desnoyers Dec. 27, 2024, 2:52 p.m. UTC | #3
On 2024-11-21 14:08, Michael Jeanson wrote:
> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
> rseq features past the initial 32 bytes of the original ABI.

Hi Florian,

I am currently implementing a new rseq feature which requires adding
an 8-byte field, which goes beyond the 32 initial bytes (it reaches 36
bytes).

I need to disable the glibc rseq integration with the tunable for
my development testing, but I am reaching a point where we will shortly
need the extensible rseq integration in glibc.

How is review progressing ? How can we help ?

Thanks,

Mathieu

> 
> While the rseq features in the latest kernel still fit within the
> original ABI size, there are currently only 4 bytes left. It would thus
> be a good time to add support for the extensible ABI so that when new
> features are added, they are immediately available to GNU libc users.
> 
> We use the ELF auxiliary vector to query the kernel for the size and
> alignment of the rseq area, if this fails we default to the original
> fixed size and alignment of '32' which the kernel will accept as a
> compatibility mode with the original ABI.
> 
> This makes the size of the rseq area variable and thus requires to
> relocate it out of 'struct pthread'. We chose to move it after (in block
> allocation order) the last TLS block inside the static TLS block
> allocation. It required a fairly small modification to the TLS block
> allocator and did not interfere with the main executable TLS block which
> must always be the first block relative to the thread pointer.
> 
> [1] https://lore.kernel.org/all/20221122203932.231377-4-mathieu.desnoyers@efficios.com/
> 
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: Carlos O'Donell <carlos@redhat.com>
> Cc: DJ Delorie <dj@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
> Changes since v13:
> - Ensure that the VOLATILE variants of the RSEQ_ accessors static
>    asserts on 64bit types on 32bit architectures
> - Move rtld_hidden_proto rseq symbols to a separate patch
> Changes since v12:
> - Set _rseq_size to 0 on registration failure
> - Split RSEQ_SET/GETMEM from THREAD_SET/GETMEM
> - Rename rseq_get_area() to RSEQ_SELF()
> - Add rtld_hidden_proto to __rseq_size and __rseq_offset
> - Add comment and variable array member to 'struct rseq_area'
> - Style nits
> Changes since v11:
> - Removed _dl_rseq_feature_size, use __rseq_size instead
> - Replace GLRO(dl_rseq_align) with a hidden global variable _rseq_align
> - __rseq_size is now set directly in _dl_parse_auxv, set it to 0 when
>    the main thread registration fails or is disabled by tunable
> Changes since v10:
> - Split the patchset in smaller patches
> - Rebased on 'Make __rseq_size useful for feature detection'
> - Remove 'rseq' from the generic TLS code, add 'extra TLS'
> - Add thread_pointer.h for all Linux architectures
> - Fix build on the Hurd
> Changes since v8:
> - Fix copyright year in sysdeps/generic/dl-rseq.h
> - Clarify the the tcb math comments
> - Add a comment to clarify what enforces the aligment requirements of a
>    pointer calculated from the rseq_offset
> - Remove nonsensical test in tst-rseq-disable
> - Add comments to clarify why the rseq size is 0 when registration fails
>    or is disabled
> - Add comments to explain why whe allocate and rseq area block even when
>    the registration is disabled by tunable
> - Rename 'rseq_size' -> 'rseq_alloc_size' and 'dl_tls_rseq_size' ->
>    'dl_tls_rseq_alloc_size' to clarify the distinction between the
>    allocated rseq size and the size reported to application code in
>    '__rseq_size'
> Changes since v6:
> - Fix tst-rseq for feature size over 32 bytes
> - Rebased on 'nptl: fix potential merge of __rseq_* relro symbols'
> Changes since v5:
> - Fix TLS_DTV_AT_TP rseq offset with statically linked executables
> Changes since RFC v4:
> - Move dynamic linker defines to a header file
> - Fix alignment when tls block align is smaller than rseq align with
>    statically linked executables
> - Add statically linked rseq tests
> - Revert: Set __rseq_size even when the registration fails
> - Use minimum size when rseq is disabled by tunable
> Changes since RFC v3:
> - Fix RSEQ_SETMEM for rseq disabled
> - Replace sys/auxv.h usage with dl-parse_auxv.h
> - Fix offset for TLS_TCB_AT_TP with statically linked executables
> - Zero the rseq area before registration
> Changes since RFC v2:
> - Set __rseq_size even when the registration fails
> - Adjust rseq tests to the new ABI
> - Added support for statically linked executables
> Changes since RFC v1:
> - Insert the rseq area after the last TLS block
> - Add proper support for TLS_TCB_AT_TP variant
> 
> ---
> Michael Jeanson (9):
>    nptl: initialize cpu_id_start prior to rseq registration
>    nptl: Add rseq auxvals
>    Add generic 'extra TLS'
>    Add Linux 'extra TLS'
>    nptl: add rtld_hidden_proto to __rseq_size and __rseq_offset
>    nptl: Introduce <rseq-access.h> for RSEQ_* accessors
>    nptl: Move the rseq area to the 'extra TLS' block
>    nptl: Remove the rseq area from 'struct pthread'
>    Linux: Update internal copy of '<sys/rseq.h>'
> 
>   csu/libc-tls.c                                | 59 +++++++++--
>   elf/dl-tls.c                                  | 59 +++++++++++
>   nptl/descr.h                                  | 22 +----
>   nptl/pthread_create.c                         |  2 +-
>   sysdeps/generic/dl-extra_tls.h                | 45 +++++++++
>   sysdeps/i386/nptl/rseq-access.h               | 98 +++++++++++++++++++
>   sysdeps/nptl/dl-tls_init_tp.c                 | 23 ++---
>   sysdeps/nptl/rseq-access.h                    | 58 +++++++++++
>   sysdeps/unix/sysv/linux/Makefile              | 10 ++
>   sysdeps/unix/sysv/linux/dl-extra_tls.h        | 70 +++++++++++++
>   sysdeps/unix/sysv/linux/dl-parse_auxv.h       |  7 ++
>   sysdeps/unix/sysv/linux/dl-rseq-symbols.S     | 27 +++--
>   sysdeps/unix/sysv/linux/rseq-internal.h       | 94 ++++++++++++++----
>   sysdeps/unix/sysv/linux/sched_getcpu.c        |  3 +-
>   sysdeps/unix/sysv/linux/sys/rseq.h            | 11 +++
>   .../unix/sysv/linux/tst-rseq-disable-static.c |  1 +
>   sysdeps/unix/sysv/linux/tst-rseq-disable.c    | 64 ++++++++++--
>   .../unix/sysv/linux/tst-rseq-nptl-static.c    |  1 +
>   sysdeps/unix/sysv/linux/tst-rseq-static.c     |  1 +
>   sysdeps/unix/sysv/linux/tst-rseq.c            | 84 +++++++++++++---
>   sysdeps/unix/sysv/linux/tst-rseq.h            |  3 +-
>   sysdeps/x86_64/nptl/rseq-access.h             | 79 +++++++++++++++
>   22 files changed, 725 insertions(+), 96 deletions(-)
>   create mode 100644 sysdeps/generic/dl-extra_tls.h
>   create mode 100644 sysdeps/i386/nptl/rseq-access.h
>   create mode 100644 sysdeps/nptl/rseq-access.h
>   create mode 100644 sysdeps/unix/sysv/linux/dl-extra_tls.h
>   create mode 100644 sysdeps/unix/sysv/linux/tst-rseq-disable-static.c
>   create mode 100644 sysdeps/unix/sysv/linux/tst-rseq-nptl-static.c
>   create mode 100644 sysdeps/unix/sysv/linux/tst-rseq-static.c
>   create mode 100644 sysdeps/x86_64/nptl/rseq-access.h
>
  
Florian Weimer Dec. 27, 2024, 4:46 p.m. UTC | #4
* Mathieu Desnoyers:

> On 2024-11-21 14:08, Michael Jeanson wrote:
>> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
>> rseq features past the initial 32 bytes of the original ABI.
>
> Hi Florian,
>
> I am currently implementing a new rseq feature which requires adding
> an 8-byte field, which goes beyond the 32 initial bytes (it reaches 36
> bytes).
>
> I need to disable the glibc rseq integration with the tunable for
> my development testing, but I am reaching a point where we will shortly
> need the extensible rseq integration in glibc.
>
> How is review progressing ? How can we help ?

There doesn't seem to be a __thread_pointer patch for C-SKY.

There's a simple to resolve conflict in nptl/descr.h.

TLS is currently broken due to the botched GET_ADDR_ARGS removal.  I'm
testing the rseq series after reverting 5e249192cac7354af02a7347a0d8c.

Thanks,
Florian
  
Florian Weimer Dec. 27, 2024, 6:40 p.m. UTC | #5
* Florian Weimer:

> * Mathieu Desnoyers:
>
>> On 2024-11-21 14:08, Michael Jeanson wrote:
>>> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
>>> rseq features past the initial 32 bytes of the original ABI.
>>
>> Hi Florian,
>>
>> I am currently implementing a new rseq feature which requires adding
>> an 8-byte field, which goes beyond the 32 initial bytes (it reaches 36
>> bytes).
>>
>> I need to disable the glibc rseq integration with the tunable for
>> my development testing, but I am reaching a point where we will shortly
>> need the extensible rseq integration in glibc.
>>
>> How is review progressing ? How can we help ?
>
> There doesn't seem to be a __thread_pointer patch for C-SKY.
>
> There's a simple to resolve conflict in nptl/descr.h.
>
> TLS is currently broken due to the botched GET_ADDR_ARGS removal.  I'm
> testing the rseq series after reverting 5e249192cac7354af02a7347a0d8c.

So it looks like that TLS_DTV_AT_TP part doesn't work, unfortunately.  I
see many new failures (after the mentioned revert) on powerpc64le, which
is one of those targets.  I haven't tried to reproduce them yet on
AArch64.  The GCC compile farm has a powerpc64le test machine
(gcc120.fsffrance.org).

I should be able to look into this in more detail on Monday.

Thanks,
Florian
  
Mathieu Desnoyers Dec. 29, 2024, 9:16 p.m. UTC | #6
On 2024-12-27 13:40, Florian Weimer wrote:
> * Florian Weimer:
> 
>> * Mathieu Desnoyers:
>>
>>> On 2024-11-21 14:08, Michael Jeanson wrote:
>>>> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
>>>> rseq features past the initial 32 bytes of the original ABI.
>>>
>>> Hi Florian,
>>>
>>> I am currently implementing a new rseq feature which requires adding
>>> an 8-byte field, which goes beyond the 32 initial bytes (it reaches 36
>>> bytes).
>>>
>>> I need to disable the glibc rseq integration with the tunable for
>>> my development testing, but I am reaching a point where we will shortly
>>> need the extensible rseq integration in glibc.
>>>
>>> How is review progressing ? How can we help ?
>>
>> There doesn't seem to be a __thread_pointer patch for C-SKY.
>>
>> There's a simple to resolve conflict in nptl/descr.h.
>>
>> TLS is currently broken due to the botched GET_ADDR_ARGS removal.  I'm
>> testing the rseq series after reverting 5e249192cac7354af02a7347a0d8c.
> 
> So it looks like that TLS_DTV_AT_TP part doesn't work, unfortunately.  I
> see many new failures (after the mentioned revert) on powerpc64le, which
> is one of those targets.  I haven't tried to reproduce them yet on
> AArch64.  The GCC compile farm has a powerpc64le test machine
> (gcc120.fsffrance.org).

I recall that Michael did test on at least one architecture
with TLS_DTV_AT_TP (aarch64), and one with TLS_TCB_AT_TP (x86-64).

The issue may be specific to powerpc.

> 
> I should be able to look into this in more detail on Monday.

Michael is currently on vacation for the holiday break, he should be
back on January 6.

Thanks!

Mathieu

> 
> Thanks,
> Florian
>
  
Florian Weimer Dec. 30, 2024, 9:32 a.m. UTC | #7
* Mathieu Desnoyers:

>> So it looks like that TLS_DTV_AT_TP part doesn't work, unfortunately.
>> I see many new failures (after the mentioned revert) on powerpc64le,
>> which is one of those targets.  I haven't tried to reproduce them yet
>> on AArch64.  The GCC compile farm has a powerpc64le test machine
>> (gcc120.fsffrance.org).
>
> I recall that Michael did test on at least one architecture
> with TLS_DTV_AT_TP (aarch64), and one with TLS_TCB_AT_TP (x86-64).
>
> The issue may be specific to powerpc.

Maybe it's caused by a non-zero TLS_TP_OFFSET.  We use a shifted thread
pointer on POWER, presumably to increase the reach of signed
displacements in TLS-accessing instructions.

This is alluded to in this comment:

#ifdef RSEQ_SIG
    /* This should be a compile-time constant, but the current
       infrastructure makes it difficult to determine its value.  Not
       all targets support __thread_pointer, so set __rseq_offset only
       if the rseq registration may have happened because RSEQ_SIG is
       defined.  */
    _rseq_offset = (char *) &pd->rseq_area - (char *) __thread_pointer ();
#endif

If this is indeed the cause, we should define TLS_TP_OFFSET on all
architectures and use it to initialize __rseq_offset.  Similar to what I
started here:

  [PATCH 1/4] elf: Introduce generic <dl-tls.h>
  <https://inbox.sourceware.org/libc-alpha/1b470b147d0bfe51af6256b10cd38c14285a61b2.1735313702.git.fweimer@redhat.com/>

Thanks,
Florian
  
Florian Weimer Dec. 30, 2024, 10:06 a.m. UTC | #8
* Florian Weimer:

> * Mathieu Desnoyers:
>
>>> So it looks like that TLS_DTV_AT_TP part doesn't work, unfortunately.
>>> I see many new failures (after the mentioned revert) on powerpc64le,
>>> which is one of those targets.  I haven't tried to reproduce them yet
>>> on AArch64.  The GCC compile farm has a powerpc64le test machine
>>> (gcc120.fsffrance.org).
>>
>> I recall that Michael did test on at least one architecture
>> with TLS_DTV_AT_TP (aarch64), and one with TLS_TCB_AT_TP (x86-64).
>>
>> The issue may be specific to powerpc.
>
> Maybe it's caused by a non-zero TLS_TP_OFFSET.  We use a shifted thread
> pointer on POWER, presumably to increase the reach of signed
> displacements in TLS-accessing instructions.
>
> This is alluded to in this comment:
>
> #ifdef RSEQ_SIG
>     /* This should be a compile-time constant, but the current
>        infrastructure makes it difficult to determine its value.  Not
>        all targets support __thread_pointer, so set __rseq_offset only
>        if the rseq registration may have happened because RSEQ_SIG is
>        defined.  */
>     _rseq_offset = (char *) &pd->rseq_area - (char *) __thread_pointer ();
> #endif
>
> If this is indeed the cause, we should define TLS_TP_OFFSET on all
> architectures and use it to initialize __rseq_offset.  Similar to what I
> started here:
>
>   [PATCH 1/4] elf: Introduce generic <dl-tls.h>
>   <https://inbox.sourceware.org/libc-alpha/1b470b147d0bfe51af6256b10cd38c14285a61b2.1735313702.git.fweimer@redhat.com/>

With

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 57e72be4..1055032a 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -419,7 +419,7 @@ _dl_determine_tlsoffset (void)
     The alignment requirements of the pointer resulting from this offset and
     the thread pointer are enforced by 'max_align' which is used to align the
     tcb_offset.  */
-  _dl_extra_tls_set_offset(offset);
+  _dl_extra_tls_set_offset(offset - TLS_TP_OFFSET);
 
   /* Add the extra TLS block to the global offset.  */
   offset += extra_tls_size;

the results look better, but I'm not sure if this is the correct fix.

I see crashes on the scv 0 instruction in statically linked binaries.
Looks like the TCB placement is wrong and the flag that indicates
support for the new system call instruction (scv 0) is not loaded
correctly.

Thanks,
Florian
  
Florian Weimer Jan. 3, 2025, 8:36 a.m. UTC | #9
* Florian Weimer:

> * Mathieu Desnoyers:
>
>> On 2024-11-21 14:08, Michael Jeanson wrote:
>>> Introduced in Linux v6.3 the rseq extensible ABI [1] will allow adding
>>> rseq features past the initial 32 bytes of the original ABI.
>>
>> Hi Florian,
>>
>> I am currently implementing a new rseq feature which requires adding
>> an 8-byte field, which goes beyond the 32 initial bytes (it reaches 36
>> bytes).
>>
>> I need to disable the glibc rseq integration with the tunable for
>> my development testing, but I am reaching a point where we will shortly
>> need the extensible rseq integration in glibc.
>>
>> How is review progressing ? How can we help ?
>
> There doesn't seem to be a __thread_pointer patch for C-SKY.

No worries, found it and reviewed it.

Thanks,
Florian
  
Michael Jeanson Jan. 6, 2025, 6:54 p.m. UTC | #10
On 2024-12-30 05:06, Florian Weimer wrote:
> * Florian Weimer:
> 
>> * Mathieu Desnoyers:
>>
>>>> So it looks like that TLS_DTV_AT_TP part doesn't work, unfortunately.
>>>> I see many new failures (after the mentioned revert) on powerpc64le,
>>>> which is one of those targets.  I haven't tried to reproduce them yet
>>>> on AArch64.  The GCC compile farm has a powerpc64le test machine
>>>> (gcc120.fsffrance.org).
>>>
>>> I recall that Michael did test on at least one architecture
>>> with TLS_DTV_AT_TP (aarch64), and one with TLS_TCB_AT_TP (x86-64).
>>>
>>> The issue may be specific to powerpc.
>>
>> Maybe it's caused by a non-zero TLS_TP_OFFSET.  We use a shifted thread
>> pointer on POWER, presumably to increase the reach of signed
>> displacements in TLS-accessing instructions.
>>
>> This is alluded to in this comment:
>>
>> #ifdef RSEQ_SIG
>>     /* This should be a compile-time constant, but the current
>>        infrastructure makes it difficult to determine its value.  Not
>>        all targets support __thread_pointer, so set __rseq_offset only
>>        if the rseq registration may have happened because RSEQ_SIG is
>>        defined.  */
>>     _rseq_offset = (char *) &pd->rseq_area - (char *) __thread_pointer ();
>> #endif
>>
>> If this is indeed the cause, we should define TLS_TP_OFFSET on all
>> architectures and use it to initialize __rseq_offset.  Similar to what I
>> started here:
>>
>>   [PATCH 1/4] elf: Introduce generic <dl-tls.h>
>>   <https://inbox.sourceware.org/libc-alpha/1b470b147d0bfe51af6256b10cd38c14285a61b2.1735313702.git.fweimer@redhat.com/>
> 
> With
> 
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 57e72be4..1055032a 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -419,7 +419,7 @@ _dl_determine_tlsoffset (void)
>      The alignment requirements of the pointer resulting from this offset and
>      the thread pointer are enforced by 'max_align' which is used to align the
>      tcb_offset.  */
> -  _dl_extra_tls_set_offset(offset);
> +  _dl_extra_tls_set_offset(offset - TLS_TP_OFFSET);
>  
>    /* Add the extra TLS block to the global offset.  */
>    offset += extra_tls_size;
> 
> the results look better, but I'm not sure if this is the correct fix.
> 
> I see crashes on the scv 0 instruction in statically linked binaries.
> Looks like the TCB placement is wrong and the flag that indicates
> support for the new system call instruction (scv 0) is not loaded
> correctly.
> 
> Thanks,
> Florian
> 

Hi,

Sorry for the delay, I'm now back from holidays.

I'm setting up a ppc64el test VM to investigate this, I had only tested
the TLS_DTV_AT_TP implementation on aarch64 with the assumption that
the other architectures would behave similarly, that might have been
too optimistic.

Thanks,

Michael
  
Michael Jeanson Jan. 6, 2025, 10:39 p.m. UTC | #11
On 2025-01-06 13:54, Michael Jeanson wrote:
> On 2024-12-30 05:06, Florian Weimer wrote:
>> * Florian Weimer:
>>
>>> * Mathieu Desnoyers:
>>>
>>>>> So it looks like that TLS_DTV_AT_TP part doesn't work, unfortunately.
>>>>> I see many new failures (after the mentioned revert) on powerpc64le,
>>>>> which is one of those targets.  I haven't tried to reproduce them yet
>>>>> on AArch64.  The GCC compile farm has a powerpc64le test machine
>>>>> (gcc120.fsffrance.org).
>>>>
>>>> I recall that Michael did test on at least one architecture
>>>> with TLS_DTV_AT_TP (aarch64), and one with TLS_TCB_AT_TP (x86-64).
>>>>
>>>> The issue may be specific to powerpc.
>>>
>>> Maybe it's caused by a non-zero TLS_TP_OFFSET.  We use a shifted thread
>>> pointer on POWER, presumably to increase the reach of signed
>>> displacements in TLS-accessing instructions.
>>>
>>> This is alluded to in this comment:
>>>
>>> #ifdef RSEQ_SIG
>>>     /* This should be a compile-time constant, but the current
>>>        infrastructure makes it difficult to determine its value.  Not
>>>        all targets support __thread_pointer, so set __rseq_offset only
>>>        if the rseq registration may have happened because RSEQ_SIG is
>>>        defined.  */
>>>     _rseq_offset = (char *) &pd->rseq_area - (char *) __thread_pointer ();
>>> #endif
>>>
>>> If this is indeed the cause, we should define TLS_TP_OFFSET on all
>>> architectures and use it to initialize __rseq_offset.  Similar to what I
>>> started here:
>>>
>>>   [PATCH 1/4] elf: Introduce generic <dl-tls.h>
>>>   <https://inbox.sourceware.org/libc-alpha/1b470b147d0bfe51af6256b10cd38c14285a61b2.1735313702.git.fweimer@redhat.com/>
>>
>> With
>>
>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>> index 57e72be4..1055032a 100644
>> --- a/elf/dl-tls.c
>> +++ b/elf/dl-tls.c
>> @@ -419,7 +419,7 @@ _dl_determine_tlsoffset (void)
>>      The alignment requirements of the pointer resulting from this offset and
>>      the thread pointer are enforced by 'max_align' which is used to align the
>>      tcb_offset.  */
>> -  _dl_extra_tls_set_offset(offset);
>> +  _dl_extra_tls_set_offset(offset - TLS_TP_OFFSET);
>>  
>>    /* Add the extra TLS block to the global offset.  */
>>    offset += extra_tls_size;
>>
>> the results look better, but I'm not sure if this is the correct fix.
>>
>> I see crashes on the scv 0 instruction in statically linked binaries.
>> Looks like the TCB placement is wrong and the flag that indicates
>> support for the new system call instruction (scv 0) is not loaded
>> correctly.

Your patch looks good to me but it's missing the same change for the
statically linked binary code, like this :

diff --git a/csu/libc-tls.c b/csu/libc-tls.c                                                                                                 index 20d6b48a4d..65ce9ddb3d 100644
--- a/csu/libc-tls.c
+++ b/csu/libc-tls.c
@@ -20,6 +20,7 @@
 #include <errno.h>
 #include <ldsodefs.h>
 #include <tls.h>
+#include <dl-tls.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <sys/param.h>
@@ -195,7 +196,7 @@ __libc_setup_tls (void)
     The alignment requirements of the pointer resulting from this offset and
     the thread pointer are enforced by 'max_align' which is used to align the
     tcb_offset.  */
-  _dl_extra_tls_set_offset(tls_blocks_size - extra_tls_size);
+  _dl_extra_tls_set_offset(tls_blocks_size - extra_tls_size - TLS_TP_OFFSET);
 
   tlsblock = _dl_early_allocate (tls_blocks_size + max_align
                                 + TLS_PRE_TCB_SIZE


There are also minor adjustments required in the rseq tests since we can't
assume the rseq_offset to be greater than zero on architectures that have a
non-zero TLS_TP_OFFSET.

From what I can see, all the architectures that define TLS_TP_OFFSET are of
the TLS_DTV_AT_TP type so we don't need to change the offset calculation in
the TLS_TCB_AT_TP code paths.

I ran the test suite on a ppc64el VM with the current master and then with
the RSEQ patchset and your patch plus my changes, the tests results are the
same.

	=== Summary of results ===
     18 FAIL
   5877 PASS
     21 UNSUPPORTED
     16 XFAIL
      2 XPASS

Do you want me to send an updated patchset or should I wait for further
review?

Michael
  
Florian Weimer Jan. 7, 2025, 11:15 a.m. UTC | #12
* Michael Jeanson:

> Your patch looks good to me but it's missing the same change for the
> statically linked binary code, like this :
>
> diff --git a/csu/libc-tls.c b/csu/libc-tls.c                                                                                                 index 20d6b48a4d..65ce9ddb3d 100644
> --- a/csu/libc-tls.c
> +++ b/csu/libc-tls.c
> @@ -20,6 +20,7 @@
>  #include <errno.h>
>  #include <ldsodefs.h>
>  #include <tls.h>
> +#include <dl-tls.h>
>  #include <unistd.h>
>  #include <stdio.h>
>  #include <sys/param.h>
> @@ -195,7 +196,7 @@ __libc_setup_tls (void)
>      The alignment requirements of the pointer resulting from this offset and
>      the thread pointer are enforced by 'max_align' which is used to align the
>      tcb_offset.  */
> -  _dl_extra_tls_set_offset(tls_blocks_size - extra_tls_size);
> +  _dl_extra_tls_set_offset(tls_blocks_size - extra_tls_size - TLS_TP_OFFSET);
>  
>    tlsblock = _dl_early_allocate (tls_blocks_size + max_align
>                                  + TLS_PRE_TCB_SIZE
>

Ah, makes sense.

I've worked on posting a patch for exposing TLS_TP_OFFSET
unconditionally, but the test case for it hits a snag: we still don't
have <thread_pointer.h> on all architectures.  I see build failures on:

  alpha
  arc
  arm
  mips
  s390
  s390x
  sh

Of these, only mips and sh do not yet assume that GCC's
__builtin_thread_pointer work.

We could disable the new test, but perhaps we should complete the
<thread_pointer.h> implementation?  Not sure where this puts us in terms
of inclusion in the upcoming release.  I'll try to add a generic
<thread_pointer.h> and implementations for mips and sh.

I also looked at an alternative implementation: make the size of struct
pthread dynamic.  This minimizes changes on architectures such as
x86-64, at least if we assume that 32-byte alignment of the rseq area is
enough.  It's more complicated on architectures such as AArch64, where
THREAD_SELF would need to perform variable adjustment.  It does not look
like an overall win in terms of complexity.

Michael, before reposting your changes, you could fix some style nits?
Mostly missing space before '(' in function calls, missing two spaces
after '.' at the end of the comment.  Copyright years will need
updating, too.

grep -E '^\+.*(Contributed by|http://|Copyright.*(19[0-9][0-9]|20[01][0-9]|202[0-4])|XXX|FIXME|TODO|\?\?\?|internal_function|\bu_(char|int)|the\s+the|[^	 *$({~]\(|\b(if|for|while|do)\b.*\{|[[:space:]]$|\($|\. \*/)'

Thanks,
Florian
  
Michael Jeanson Jan. 7, 2025, 4:16 p.m. UTC | #13
On 2025-01-07 06:15, Florian Weimer wrote:
> * Michael Jeanson:
> 
>> Your patch looks good to me but it's missing the same change for the
>> statically linked binary code, like this :
>>
>> diff --git a/csu/libc-tls.c b/csu/libc-tls.c                                                                                                 index 20d6b48a4d..65ce9ddb3d 100644
>> --- a/csu/libc-tls.c
>> +++ b/csu/libc-tls.c
>> @@ -20,6 +20,7 @@
>>  #include <errno.h>
>>  #include <ldsodefs.h>
>>  #include <tls.h>
>> +#include <dl-tls.h>
>>  #include <unistd.h>
>>  #include <stdio.h>
>>  #include <sys/param.h>
>> @@ -195,7 +196,7 @@ __libc_setup_tls (void)
>>      The alignment requirements of the pointer resulting from this offset and
>>      the thread pointer are enforced by 'max_align' which is used to align the
>>      tcb_offset.  */
>> -  _dl_extra_tls_set_offset(tls_blocks_size - extra_tls_size);
>> +  _dl_extra_tls_set_offset(tls_blocks_size - extra_tls_size - TLS_TP_OFFSET);
>>  
>>    tlsblock = _dl_early_allocate (tls_blocks_size + max_align
>>                                  + TLS_PRE_TCB_SIZE
>>
> 
> Ah, makes sense.
> 
> I've worked on posting a patch for exposing TLS_TP_OFFSET
> unconditionally, but the test case for it hits a snag: we still don't
> have <thread_pointer.h> on all architectures.  I see build failures on:
> 
>   alpha
>   arc
>   arm
>   mips
>   s390
>   s390x
>   sh
> 
> Of these, only mips and sh do not yet assume that GCC's
> __builtin_thread_pointer work.
> 
> We could disable the new test, but perhaps we should complete the
> <thread_pointer.h> implementation?  Not sure where this puts us in terms
> of inclusion in the upcoming release.  I'll try to add a generic
> <thread_pointer.h> and implementations for mips and sh.

I agree we should complete the <thread_pointer.h> implementation, I was
under the impression it was mostly done with my previous patches. I'll
have a look at the architectures you listed.

I can do mips and sh if you haven't already started and it saves you
some time?

> 
> I also looked at an alternative implementation: make the size of struct
> pthread dynamic.  This minimizes changes on architectures such as
> x86-64, at least if we assume that 32-byte alignment of the rseq area is
> enough.  It's more complicated on architectures such as AArch64, where
> THREAD_SELF would need to perform variable adjustment.  It does not look
> like an overall win in terms of complexity.

I don't think we can assume 32-byte alignment will always be enough and
as you say it doesn't seem to reduce complexity compared to this patchset.

> 
> Michael, before reposting your changes, you could fix some style nits?
> Mostly missing space before '(' in function calls, missing two spaces
> after '.' at the end of the comment.  Copyright years will need
> updating, too.

Will do.

> 
> grep -E '^\+.*(Contributed by|http://|Copyright.*(19[0-9][0-9]|20[01][0-9]|202[0-4])|XXX|FIXME|TODO|\?\?\?|internal_function|\bu_(char|int)|the\s+the|[^	 *$({~]\(|\b(if|for|while|do)\b.*\{|[[:space:]]$|\($|\. \*/)'
> 
> Thanks,
> Florian
>
  
Michael Jeanson Jan. 7, 2025, 8:22 p.m. UTC | #14
On 2025-01-07 11:16, Michael Jeanson wrote:
>> I've worked on posting a patch for exposing TLS_TP_OFFSET
>> unconditionally, but the test case for it hits a snag: we still don't
>> have <thread_pointer.h> on all architectures.  I see build failures on:
>>
>>   alpha
>>   arc
>>   arm
>>   mips
>>   s390
>>   s390x
>>   sh
>>
>> Of these, only mips and sh do not yet assume that GCC's
>> __builtin_thread_pointer work.
>>
>> We could disable the new test, but perhaps we should complete the
>> <thread_pointer.h> implementation?  Not sure where this puts us in terms
>> of inclusion in the upcoming release.  I'll try to add a generic
>> <thread_pointer.h> and implementations for mips and sh.
> 
> I agree we should complete the <thread_pointer.h> implementation, I was
> under the impression it was mostly done with my previous patches. I'll
> have a look at the architectures you listed.
> 
> I can do mips and sh if you haven't already started and it saves you
> some time?

I've tried build-many-glibcs.py builds for all the architectures you
mentioned here with your TLS_TP_OFFSET v2 patchset on top of my
updated RSEQ ABI patchset and I had no build failures other than
something weird on mips-n32 where I had to add an include guard
to 'sysdeps/mips/dl-tls.h'.

Is there something different in your environment?

Maybe some older GCC version on some of those architectures
without __builtin_thread_pointer() support?
  
Florian Weimer Jan. 7, 2025, 8:40 p.m. UTC | #15
* Michael Jeanson:

> On 2025-01-07 11:16, Michael Jeanson wrote:
>>> I've worked on posting a patch for exposing TLS_TP_OFFSET
>>> unconditionally, but the test case for it hits a snag: we still don't
>>> have <thread_pointer.h> on all architectures.  I see build failures on:
>>>
>>>   alpha
>>>   arc
>>>   arm
>>>   mips
>>>   s390
>>>   s390x
>>>   sh
>>>
>>> Of these, only mips and sh do not yet assume that GCC's
>>> __builtin_thread_pointer work.
>>>
>>> We could disable the new test, but perhaps we should complete the
>>> <thread_pointer.h> implementation?  Not sure where this puts us in terms
>>> of inclusion in the upcoming release.  I'll try to add a generic
>>> <thread_pointer.h> and implementations for mips and sh.
>> 
>> I agree we should complete the <thread_pointer.h> implementation, I was
>> under the impression it was mostly done with my previous patches. I'll
>> have a look at the architectures you listed.
>> 
>> I can do mips and sh if you haven't already started and it saves you
>> some time?
>
> I've tried build-many-glibcs.py builds for all the architectures you
> mentioned here with your TLS_TP_OFFSET v2 patchset on top of my
> updated RSEQ ABI patchset and I had no build failures other than
> something weird on mips-n32 where I had to add an include guard
> to 'sysdeps/mips/dl-tls.h'.
>
> Is there something different in your environment?

No, the first patch was buggy, as you indicated: the generic
implementation in sysdeps/thread_pointer.h was not found anymore, so I
forgot that it existed.  After moving it to
sysdeps/generic/thread_pointer.h, it worked as expected.

Thanks,
Florian
  
Michael Jeanson Jan. 7, 2025, 8:45 p.m. UTC | #16
On 2025-01-07 15:40, Florian Weimer wrote:
> No, the first patch was buggy, as you indicated: the generic
> implementation in sysdeps/thread_pointer.h was not found anymore, so I
> forgot that it existed.  After moving it to
> sysdeps/generic/thread_pointer.h, it worked as expected.

Ahhh, makes sense, I'll send v15 once I've finished running the tests.