RFC: Provide a function to reset IFUNC PLTs

Message ID DM6PR11MB40736D2E35582FCFDA780C5FC6B69@DM6PR11MB4073.namprd11.prod.outlook.com
State Rejected
Headers
Series RFC: Provide a function to reset IFUNC PLTs |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit fail Patch series failed to build

Commit Message

Jan Kratochvil March 6, 2023, 8:04 a.m. UTC
  Some projects snapshot+restore process images to migrate them to
a different machine. The target machine may have different (particularly
lower) set of CPU features. Restored process does crash in glibc IFUNC
functions which have been already set in PLTs due to the former more
rich CPU features on the snapshotting machine.

Providing a glibc function which can be called during the restore.

I understand the code may need more adjustments before its upstreaming
but is this an acceptable approach?

An example of the problem reproducibility with: http://criu.org

int main(void) {
  setbuf(stdout, NULL);
  for (;;) {
    time_t t = time(NULL);
    fputs(ctime(&t), stdout);
    sleep(strcmp("hello", "world") != 0);
  }
}

newcpu$ gcc -o strcmper strcmper.c -Wall -g
newcpu$ rm -rf dir;mkdir dir;./strcmper &p=$!;sleep 1;sudo criu dump --shell-job -t $p -D dir
oldcpu$ sudo criu restore --shell-job -D dir
Segmentation fault

newcpu$ gcc -o strcmper strcmper.c -Wall -g glibc/ld-linux-x86-64.so.2 glibc/libc.so.6 # patched files are in glibc/
newcpu$ rm -rf dir;mkdir dir;LD_LIBRARY_PATH=glibc ./glibc/ld-linux-x86-64.so.2 ./strcmper &p=$!;sleep 1;sudo criu dump --shell-job -t $p -D dir
oldcpu$ sudo criu restore --shell-job -D dir
running...
---
 elf/Makefile                       |   2 +
 elf/Versions                       |   9 +
 elf/dl-reloc.c                     |   2 +
 elf/dl-reset-ifunc.c               | 392 +++++++++++++++++++++++++++++
 elf/dl-tunables.c                  |  14 ++
 elf/dl-tunables.h                  |   4 +
 elf/link.h                         |   2 +
 elf/tst-reset-ifunc.c              |  70 ++++++
 sysdeps/generic/ldsodefs.h         |   4 +-
 sysdeps/x86/dl-get-cpu-features.c  |   1 +
 sysdeps/x86/include/cpu-features.h |   3 +-
 11 files changed, 500 insertions(+), 3 deletions(-)
 create mode 100644 elf/dl-reset-ifunc.c
 create mode 100644 elf/tst-reset-ifunc.c
  

Comments

Florian Weimer March 7, 2023, 8:40 a.m. UTC | #1
* Jan Kratochvil via Libc-alpha:

> Some projects snapshot+restore process images to migrate them to
> a different machine. The target machine may have different (particularly
> lower) set of CPU features. Restored process does crash in glibc IFUNC
> functions which have been already set in PLTs due to the former more
> rich CPU features on the snapshotting machine.
>
> Providing a glibc function which can be called during the restore.
>
> I understand the code may need more adjustments before its upstreaming
> but is this an acceptable approach?

Do you have a high-level overview how this is supposed to work?  Is CRIU
expected to call the new _dl_reset_ifunc symbol after restore?

Use of <sys/platform/x86.h> is somewhat rare.  Not even GCC uses it
AFAIK.  It has its own cached CPU data used for target clones and
similar features.  Re-running those IFUNC resolvers will just give the
same results.

Distributions have started to deploy alternate builds in
glibc-hwcaps/x86-64-v3 directories:

  openSUSE Tumbleweed gains optional x86-64-v3 optimization
  <https://news.opensuse.org/2023/03/02/tw-gains-optional-optimizations/>

This might be a common situation fairly soon.  This is not IFUNC-based
at all, so any IFUNC-based approach is not going to help with that.

Practically speaking, I think cluster heterogeneity needs to be hidden
at the cluster level.

Thanks,
Florian
  
Adhemerval Zanella March 7, 2023, 1:07 p.m. UTC | #2
On 07/03/23 05:40, Florian Weimer via Libc-alpha wrote:
> * Jan Kratochvil via Libc-alpha:
> 
>> Some projects snapshot+restore process images to migrate them to
>> a different machine. The target machine may have different (particularly
>> lower) set of CPU features. Restored process does crash in glibc IFUNC
>> functions which have been already set in PLTs due to the former more
>> rich CPU features on the snapshotting machine.
>>
>> Providing a glibc function which can be called during the restore.
>>
>> I understand the code may need more adjustments before its upstreaming
>> but is this an acceptable approach?
> 
> Do you have a high-level overview how this is supposed to work?  Is CRIU
> expected to call the new _dl_reset_ifunc symbol after restore?
> 
> Use of <sys/platform/x86.h> is somewhat rare.  Not even GCC uses it
> AFAIK.  It has its own cached CPU data used for target clones and
> similar features.  Re-running those IFUNC resolvers will just give the
> same results.

I am not sure if the ifunc reset will be really safe without adding CRIU
migrate sync points, to avoid suspend execution in a context that the
ifunc variants are already being executed or its address is being in a
function point (for instance in PLT code).

Besides, I also not sure if adding way to remove RELRO protection won't
add more security issues (we can disable for sesuid binaries, but even
though it is not a good security practice).

> 
> Distributions have started to deploy alternate builds in
> glibc-hwcaps/x86-64-v3 directories:
> 
>   openSUSE Tumbleweed gains optional x86-64-v3 optimization
>   <https://news.opensuse.org/2023/03/02/tw-gains-optional-optimizations/>
> 
> This might be a common situation fairly soon.  This is not IFUNC-based
> at all, so any IFUNC-based approach is not going to help with that.
> 
> Practically speaking, I think cluster heterogeneity needs to be hidden
> at the cluster level.


For glibc standpoint I think heterogeneity should be handled by masking
off the higher cpu features sets so the programs always run with the 
lowest ifunc variants set.
  
Jan Kratochvil March 8, 2023, 10:21 a.m. UTC | #3
On Tue, 07 Mar 2023 21:07:58 +0800, Adhemerval Zanella Netto wrote:
> I am not sure if the ifunc reset will be really safe without adding CRIU
> migrate sync points, to avoid suspend execution in a context that the
> ifunc variants are already being executed or its address is being in a
> function point (for instance in PLT code).

You are right but I left the thread safety up to the caller ("Freezer"):
	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029

It could be moved to the glibc part.


> Besides, I also not sure if adding way to remove RELRO protection won't
> add more security issues (we can disable for sesuid binaries, but even
> though it is not a good security practice).

RELRO is removed only temporarily, it gets re-engaged. And that time other
threads should be even stopped (see above). Is it still a security issues?


Thanks,
Jan
  
Jan Kratochvil March 8, 2023, 10:23 a.m. UTC | #4
On Tue, 07 Mar 2023 16:40:46 +0800, Florian Weimer wrote:
> Do you have a high-level overview how this is supposed to work?  Is CRIU
> expected to call the new _dl_reset_ifunc symbol after restore?

Yes, it should be called by CRIU. The current implementation was calling it
from CRaC restore (which uses CRIU underneath) but I agree it is not the
proper placement:
	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6274
The function is called differently there as the IFUNC reset implementation is
outside of glibc there.


> Use of <sys/platform/x86.h> is somewhat rare.  Not even GCC uses it
> AFAIK.  It has its own cached CPU data used for target clones and
> similar features.  Re-running those IFUNC resolvers will just give the
> same results.

I do not use <sys/platform/x86.h> anywhere, why do you mention it?
I am doing there
	memset((volatile uint8_t *)cpu_features, 0, sizeof(*cpu_features));
	...
	_dl_x86_init_cpu_features();
to prevent the cache of CPU data. And it does work.


> Distributions have started to deploy alternate builds in
> glibc-hwcaps/x86-64-v3 directories:
> 
>   openSUSE Tumbleweed gains optional x86-64-v3 optimization
>   <https://news.opensuse.org/2023/03/02/tw-gains-optional-optimizations/>
> 
> This might be a common situation fairly soon.  This is not IFUNC-based
> at all, so any IFUNC-based approach is not going to help with that.

What about some "dlreopen" to replace the variant of glibc?


> Practically speaking, I think cluster heterogeneity needs to be hidden
> at the cluster level.

The question is how to implement it. I have found
	https://github.com/ddcc/libcpuidoverride
but it is not compatible with latest glibc (maybe its loader code needs to
implement ELF relocations, it would be definitely fixable).

Another option is GLIBC_TUNABLES=glibc.cpu.hwcaps=... although then one has to
translate cpuid feature bits into the glibc names like SSE4_2 etc.
Also re-exec() after setenv("GLIBC_TUNABLES") is not too nice.

There are multiple projects solving the same problem such as:
	https://github.com/eclipse-openj9/openj9/issues/14253
So I was thinking some more standardized solution may be already appropriate.


Thanks,
Jan
  
Florian Weimer March 8, 2023, 10:44 a.m. UTC | #5
* Jan Kratochvil:

> On Tue, 07 Mar 2023 16:40:46 +0800, Florian Weimer wrote:
>> Use of <sys/platform/x86.h> is somewhat rare.  Not even GCC uses it
>> AFAIK.  It has its own cached CPU data used for target clones and
>> similar features.  Re-running those IFUNC resolvers will just give the
>> same results.
>
> I do not use <sys/platform/x86.h> anywhere, why do you mention it?
> I am doing there
> 	memset((volatile uint8_t *)cpu_features, 0, sizeof(*cpu_features));
> 	...
> 	_dl_x86_init_cpu_features();
> to prevent the cache of CPU data. And it does work.

This code resets the data queried by <sys/platform/x86.h>.  But GCC and
others do not use these interfaces, they use CPUID/XGETBV directly and
cache the result, so this reset does not alter the outcome of their
IFUNC resolvers.

>> Distributions have started to deploy alternate builds in
>> glibc-hwcaps/x86-64-v3 directories:
>> 
>>   openSUSE Tumbleweed gains optional x86-64-v3 optimization
>>   <https://news.opensuse.org/2023/03/02/tw-gains-optional-optimizations/>
>> 
>> This might be a common situation fairly soon.  This is not IFUNC-based
>> at all, so any IFUNC-based approach is not going to help with that.
>
> What about some "dlreopen" to replace the variant of glibc?

The trivial implementation is not going to work because most of the
function and data addresses will change as a result of the reload.  Most
checkpoints probably occur in a blocking system call, and that would
need some form of on-stack replacement to work properly.

>> Practically speaking, I think cluster heterogeneity needs to be hidden
>> at the cluster level.
>
> The question is how to implement it. I have found
> 	https://github.com/ddcc/libcpuidoverride
> but it is not compatible with latest glibc (maybe its loader code needs to
> implement ELF relocations, it would be definitely fixable).

It needs kernel support for sure.  I do not know to what extent CPUID
can be context-switched at the silicon level, maybe the masking is
system-wide (or per-guest in a virtualized configuration). 

> Another option is GLIBC_TUNABLES=glibc.cpu.hwcaps=... although then one has to
> translate cpuid feature bits into the glibc names like SSE4_2 etc.
> Also re-exec() after setenv("GLIBC_TUNABLES") is not too nice.

Again, that's not going to work in practice because libraries and
applications do not use <sys/platform/x86.h>.

Thanks,
Florian
  
Jan Kratochvil March 8, 2023, 11:03 a.m. UTC | #6
On Wed, 08 Mar 2023 18:44:45 +0800, Florian Weimer wrote:
> This code resets the data queried by <sys/platform/x86.h>.  But GCC and
> others do not use these interfaces, they use CPUID/XGETBV directly and
> cache the result, so this reset does not alter the outcome of their
> IFUNC resolvers.

We are on glibc mailing list so this patch and mail thread is about glibc.
I am aware other programs and libraries need their own reset. For OpenJDK JIT
compiler I wrote the CPU reset already; but that is offtopic here:
	https://github.com/openjdk/crac/pull/41


> > The question is how to implement it. I have found
> > 	https://github.com/ddcc/libcpuidoverride
> > but it is not compatible with latest glibc (maybe its loader code needs to
> > implement ELF relocations, it would be definitely fixable).
> 
> It needs kernel support for sure.  I do not know to what extent CPUID
> can be context-switched at the silicon level, maybe the masking is
> system-wide (or per-guest in a virtualized configuration). 

There is already its kernel support - ARCH_SET_CPUID. It is per process.


> > Another option is GLIBC_TUNABLES=glibc.cpu.hwcaps=... although then one has to
> > translate cpuid feature bits into the glibc names like SSE4_2 etc.
> > Also re-exec() after setenv("GLIBC_TUNABLES") is not too nice.
> 
> Again, that's not going to work in practice because libraries and
> applications do not use <sys/platform/x86.h>.

This mailing list is for glibc so let's talk about glibc.


Thanks,
Jan
  
Adhemerval Zanella March 8, 2023, 1:04 p.m. UTC | #7
On 08/03/23 07:21, Jan Kratochvil wrote:
> On Tue, 07 Mar 2023 21:07:58 +0800, Adhemerval Zanella Netto wrote:
>> I am not sure if the ifunc reset will be really safe without adding CRIU
>> migrate sync points, to avoid suspend execution in a context that the
>> ifunc variants are already being executed or its address is being in a
>> function point (for instance in PLT code).
> 
> You are right but I left the thread safety up to the caller ("Freezer"):
> 	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029
> 
> It could be moved to the glibc part.
> 
> 
>> Besides, I also not sure if adding way to remove RELRO protection won't
>> add more security issues (we can disable for sesuid binaries, but even
>> though it is not a good security practice).
> 
> RELRO is removed only temporarily, it gets re-engaged. And that time other
> threads should be even stopped (see above). Is it still a security issues?

Yes, without a stop-the-world scheme where a helper thread sets PR_GET_DUMPABLE 
and PTRACE_ATTACH the process can not really be sure that any new thread will not 
be created between the time you enumerate the process threads and call the 'freeze'
function.

I really don't think glibc should provide an interface to temporary disable any
security hardening, it should always opt-in at either program startup or by
building time.  The ifunc mechanism is already full or corner cases and I think
adding a runtime mechanism to reset them is *not* a way forward.  

As I said, I think CRIU heterogeneity should be handled by masking off the higher 
cpu features.  I am not if ARCH_SET_CPUID would a solution here, it means that
we will need to handle SIGSEGV in the loader and come up with a sane subset
in case of failure (we now have x86_64-vx, so we can use it as default). 

But as Florian has said, fixing on glibc won't work consistently on other
libraries that uses cpuid instruction.  And yes, I am aware we are discussing
glibc, but I prefer a composable solution (like adding proper cpuid maskoff in 
the kernel) to add an ah-hoc one.
  
Jan Kratochvil March 9, 2023, 11:32 a.m. UTC | #8
On Wed, 08 Mar 2023 21:04:59 +0800, Adhemerval Zanella Netto wrote:
> Yes, without a stop-the-world scheme where a helper thread sets PR_GET_DUMPABLE 
> and PTRACE_ATTACH the process can not really be sure that any new thread will not 
> be created between the time you enumerate the process threads and call the 'freeze'
> function.

That "Freezer" class does solve the race of new threads. I also do not see
a need for ptrace there, it is self-snapshotting/restoring.
	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029


> I really don't think glibc should provide an interface to temporary disable any
> security hardening, it should always opt-in at either program startup or by
> building time.  The ifunc mechanism is already full or corner cases and I think
> adding a runtime mechanism to reset them is *not* a way forward.  
> 
> As I said, I think CRIU heterogeneity should be handled by masking off the higher 
> cpu features.  I am not if ARCH_SET_CPUID would a solution here, it means that
> we will need to handle SIGSEGV in the loader and come up with a sane subset
> in case of failure (we now have x86_64-vx, so we can use it as default). 

So the only remaining option is that all the programs will be doing
setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...") and re-exec(). That is
a peformance kill and definitely not nice compared to any method of an IFUNC
reset.


> But as Florian has said, fixing on glibc won't work consistently on other
> libraries that uses cpuid instruction.

In Java world the other libraries (in general, there are some JNI exceptions)
do not matter as they are a Java code JIT-compiled by JVM.


Jan
  
Adhemerval Zanella March 9, 2023, 3:47 p.m. UTC | #9
On 09/03/23 08:32, Jan Kratochvil wrote:
> On Wed, 08 Mar 2023 21:04:59 +0800, Adhemerval Zanella Netto wrote:
>> Yes, without a stop-the-world scheme where a helper thread sets PR_GET_DUMPABLE 
>> and PTRACE_ATTACH the process can not really be sure that any new thread will not 
>> be created between the time you enumerate the process threads and call the 'freeze'
>> function.
> 
> That "Freezer" class does solve the race of new threads. I also do not see
> a need for ptrace there, it is self-snapshotting/restoring.
> 	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029

I am not sure how the kernel would enumerate new tasks that are created
while iterating over /proc/self/task.  On closefrom Linux fallback we have
a similar problem, where the code iterates over /proc/self/fd, and everytime
it closes a file descriptor it lseeks back to beginning.  It works because
eventually there will be no more entries on /proc/self/fd, so either you
will need to certify that kernel adds new tasks at the end of the getdents
call (used by readdir, or lseek and keep track of all tasks already signaled.

While it might work on the JVM where you can not fully control who change 
SIGUSR1  disposition (and I am not sure JVM would prevent a JNI call to do so),
so you can't really make it generic without explicit reserve a signal to do so, 
similar to what glibc does for SIGCANCEL and SIGSETXID (used to synchronize 
setuid functions over threads).  Meaning that callers of sigaction can't 
not explicit set such reserved signal.

This is similar to what we do for SIGSETXID, so I think a proper way to
do it would to do always install a new signal handler to this on pthread_create,
on signal handle synchronize with proper async-signal-safe interface
(pthread_mutex_lock is not, you might accomplish with sem_post but most likely
you will need a atomic+futex way similar to a barrier), iterate over all 
dl_stack_used (so the interface can work without access to procfs), issue the 
signal handler or each thread, operate on the maps, then synchronize to resume
threads.  We can't really make it generic without accessing the internal
glibc thread states.

And you will also need to reallocate not only glibc, but potentially *all*
libraries (since ifunc can be used by any function).

> 
> 
>> I really don't think glibc should provide an interface to temporary disable any
>> security hardening, it should always opt-in at either program startup or by
>> building time.  The ifunc mechanism is already full or corner cases and I think
>> adding a runtime mechanism to reset them is *not* a way forward.  
>>
>> As I said, I think CRIU heterogeneity should be handled by masking off the higher 
>> cpu features.  I am not if ARCH_SET_CPUID would a solution here, it means that
>> we will need to handle SIGSEGV in the loader and come up with a sane subset
>> in case of failure (we now have x86_64-vx, so we can use it as default). 
> 
> So the only remaining option is that all the programs will be doing
> setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...") and re-exec(). That is
> a peformance kill and definitely not nice compared to any method of an IFUNC
> reset.

Assuming you don't reset env variable on process spawning, you can set it as 
default for the session. Another option would to deploy a glibc built with 
--disable-multi-arch; it will disable ifunc generation.

And IMHO this is way nicer because this IFUNC reset as-is, without a proper 
stop-the-word support, is not safe and adds another corner case for the already
over-complicated ifunc interface.

> 
> 
>> But as Florian has said, fixing on glibc won't work consistently on other
>> libraries that uses cpuid instruction.
> 
> In Java world the other libraries (in general, there are some JNI exceptions)
> do not matter as they are a Java code JIT-compiled by JVM.

And this won't be a Java specific interface, but rather a GNU extension for C
library.  So we must make it as concise as possible, without adding any other
security or undefined behavior.
  
Adhemerval Zanella March 9, 2023, 5:43 p.m. UTC | #10
On 09/03/23 12:47, Adhemerval Zanella Netto wrote:
> 
> 
> On 09/03/23 08:32, Jan Kratochvil wrote:
>> On Wed, 08 Mar 2023 21:04:59 +0800, Adhemerval Zanella Netto wrote:
>>> Yes, without a stop-the-world scheme where a helper thread sets PR_GET_DUMPABLE 
>>> and PTRACE_ATTACH the process can not really be sure that any new thread will not 
>>> be created between the time you enumerate the process threads and call the 'freeze'
>>> function.
>>
>> That "Freezer" class does solve the race of new threads. I also do not see
>> a need for ptrace there, it is self-snapshotting/restoring.
>> 	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029
> 
> I am not sure how the kernel would enumerate new tasks that are created
> while iterating over /proc/self/task.  On closefrom Linux fallback we have
> a similar problem, where the code iterates over /proc/self/fd, and everytime
> it closes a file descriptor it lseeks back to beginning.  It works because
> eventually there will be no more entries on /proc/self/fd, so either you
> will need to certify that kernel adds new tasks at the end of the getdents
> call (used by readdir, or lseek and keep track of all tasks already signaled.

And the 'handler' signal handler has some potential shortcomings as well: 

  * backtrace is not async-signal-safe: glibc implementation on first call
    issues dlopen, which calls malloc; and libgcc_eh.so *might* also calls malloc.
  
  * pthread calls are not async-signal-safe either.

  * it only handles libc.so, other libraries that uses ifunc for function
    selection also fails.

  * the syscall heuristics do not handle partial results (for instance if
    write syscall returns do EINTR).

So this code has the potential of deadlock, specially if you have another
thread issuing malloc.
  
Florian Weimer March 13, 2023, 1:59 p.m. UTC | #11
* Jan Kratochvil:

> On Wed, 08 Mar 2023 21:04:59 +0800, Adhemerval Zanella Netto wrote:
>> But as Florian has said, fixing on glibc won't work consistently on other
>> libraries that uses cpuid instruction.
>
> In Java world the other libraries (in general, there are some JNI exceptions)
> do not matter as they are a Java code JIT-compiled by JVM.

That's not really true for GUI applications, which use quite a few
system libraries.  Even in headless applications, font rendering code
may be loaded, say for PDF generation.  If system versions are used,
that pulls in compression libraries and PCRE, and those tend to have
CPUID-based switching, too (but perhaps not in the form of IFUNC
resolvers).  This is just on x86-64—on aarch64, you will have to deal
with the outline atomics most distributions use.

I suggest you base your OpenJDK work on the OpenJDK musl port.  This
way, you can carefully control what you build and how you build it, and
also make it impossible to pick up system libraries by accident.  And
static JNI would be a good fit if you do not expect JNI to be used
anyway.

Thanks,
Florian
  
Jan Kratochvil March 14, 2023, 12:55 p.m. UTC | #12
On Mon, 13 Mar 2023 21:59:26 +0800, Florian Weimer wrote:
> I suggest you base your OpenJDK work on the OpenJDK musl port.  This
> way, you can carefully control what you build and how you build it, and
> also make it impossible to pick up system libraries by accident.

I really cannot say OpenJDK feature CRaC ( https://github.com/CRaC/docs )
will require musl. In such case including CRaC to upstream OpenJDK is
pointless as nobody will use it anyway.


Jan Kratochvil
  
Florian Weimer March 14, 2023, 2:49 p.m. UTC | #13
* Jan Kratochvil:

> On Mon, 13 Mar 2023 21:59:26 +0800, Florian Weimer wrote:
>> I suggest you base your OpenJDK work on the OpenJDK musl port.  This
>> way, you can carefully control what you build and how you build it, and
>> also make it impossible to pick up system libraries by accident.
>
> I really cannot say OpenJDK feature CRaC ( https://github.com/CRaC/docs )
> will require musl. In such case including CRaC to upstream OpenJDK is
> pointless as nobody will use it anyway.

You will have to rebuild everything with a custom toolchain anyway, and
avoid compiling in any incompatible features.  Use the musl port as the
baseline makes this explicit.

Thanks,
Florian
  
Jan Kratochvil March 14, 2023, 3:06 p.m. UTC | #14
On Tue, 14 Mar 2023 22:49:38 +0800, Florian Weimer wrote:
> You will have to rebuild everything with a custom toolchain anyway, and
> avoid compiling in any incompatible features.

I do not see why, it normally works on Fedora and Debian systems.


Jan
  
Jan Kratochvil March 16, 2023, 2:38 p.m. UTC | #15
On Thu, 09 Mar 2023 16:47:49 +0100, Adhemerval Zanella Netto wrote:
> I am not sure how the kernel would enumerate new tasks that are created
> while iterating over /proc/self/task.

I have updated the code as you have found a race there. Now this is no longer
relevant as all known tasks are already verified as stopped. So there is no
more running task to create another task. While iterating /proc/self/task:
(1) either there must be already an unstopped task when opendir() was called,
    in such case the iteration of /proc/self/task will be retried anyway.
(2) or all tasks are stopped and therefore no task can create any new task.


> On closefrom Linux fallback we have
> a similar problem, where the code iterates over /proc/self/fd, and everytime
> it closes a file descriptor it lseeks back to beginning.  It works because
> eventually there will be no more entries on /proc/self/fd, so either you
> will need to certify that kernel adds new tasks at the end of the getdents
> call (used by readdir, or lseek and keep track of all tasks already signaled.

That is not needed, see above.


> While it might work on the JVM where you can not fully control who change 
> SIGUSR1  disposition (and I am not sure JVM would prevent a JNI call to do so),
> so you can't really make it generic without explicit reserve a signal to do so, 
> similar to what glibc does for SIGCANCEL and SIGSETXID (used to synchronize 
> setuid functions over threads).  Meaning that callers of sigaction can't 
> not explicit set such reserved signal.
> 
> This is similar to what we do for SIGSETXID, so I think a proper way to
> do it would to do always install a new signal handler to this on pthread_create,
> on signal handle synchronize with proper async-signal-safe interface
> (pthread_mutex_lock is not, you might accomplish with sem_post but most likely
> you will need a atomic+futex way similar to a barrier), iterate over all 
> dl_stack_used (so the interface can work without access to procfs), issue the 
> signal handler or each thread, operate on the maps, then synchronize to resume
> threads.  We can't really make it generic without accessing the internal
> glibc thread states.

Good to know, if the patch gets a serious consideration for upstreaming
I understand the signal number needs to be handled better.


> And you will also need to reallocate not only glibc, but potentially *all*
> libraries (since ifunc can be used by any function).

This is what the patch already does by _dl_relocate_object().


> > So the only remaining option is that all the programs will be doing
> > setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...") and re-exec(). That is
> > a peformance kill and definitely not nice compared to any method of an IFUNC
> > reset.
> 
> Assuming you don't reset env variable on process spawning, you can set it as 
> default for the session.

The /usr/bin/java program needs to setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...")
and then it can either system("itself") or exec("itself"). This is what you
mean by the session?


> Another option would to deploy a glibc built with 
> --disable-multi-arch; it will disable ifunc generation.

That is not an option. OpenJDK must be compatible with normal existing Linux
OSes.


> And IMHO this is way nicer because this IFUNC reset as-is, without a proper 
> stop-the-word support, is not safe and adds another corner case for the already
> over-complicated ifunc interface.

stop-the-world is already implemented modulo possible bugfixes.
	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029


> > In Java world the other libraries (in general, there are some JNI exceptions)
> > do not matter as they are a Java code JIT-compiled by JVM.
> 
> And this won't be a Java specific interface, but rather a GNU extension for C
> library.  So we must make it as concise as possible, without adding any other
> security or undefined behavior.  

I agree. Handling IFUNC for other libraries is also possible but it has to be
a next step. It does not make sense to handle IFUNC in other libraries when
glibc still crashes first.


On Thu, 09 Mar 2023 18:43:08 +0100, Adhemerval Zanella Netto wrote:
> And the 'handler' signal handler has some potential shortcomings as well: 
> 
>   * backtrace is not async-signal-safe: glibc implementation on first call
>     issues dlopen, which calls malloc; and libgcc_eh.so *might* also calls malloc.

I do not see it in practice:
	Temporary breakpoint 1, main () at backtrace.c:5
	5	  backtrace(buf, sizeof(buf)/sizeof(*buf));
	(gdb) b dlopen
	Breakpoint 2 at 0x7ffff7c88f20: file dlopen.c, line 77.
	(gdb) c
	Continuing.
	[Inferior 1 (process 825695) exited normally]

And neither in the sources:
glibc$ grep dlopen $(find -iname "*backtrace*")


>   * pthread calls are not async-signal-safe either.

There are no pthread_* calls, everything is based on kernel tasks.


>   * it only handles libc.so, other libraries that uses ifunc for function
>     selection also fails.

You are right, I have mostly implemented this hard-coded "libc.so.6" to make
it general (for any libraries containing at least one STT_GNU_IFUNC) although
I haven't finished this implementation due to the last paragraph below.


>   * the syscall heuristics do not handle partial results (for instance if
>     write syscall returns do EINTR).

I do not think EINTR would matter. The syscall heuristics is there expecting
that any library function which contains syscalls is not an IFUNC function.


> So this code has the potential of deadlock, specially if you have another
> thread issuing malloc.

I may have missed something but I do not see it so according to the answers
above.


According to the other reactions here I doubt this functionality would get
accepted to glibc so we have decided to give up on its upstreaming and use the
setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...") + re-exec workaround instead.
That would need to be coded for compatibility with existing/old glibcs anyway.


Thanks,
Jan
  
Adhemerval Zanella March 20, 2023, 4:47 p.m. UTC | #16
On 16/03/23 11:38, Jan Kratochvil wrote:
> On Thu, 09 Mar 2023 16:47:49 +0100, Adhemerval Zanella Netto wrote:
>> I am not sure how the kernel would enumerate new tasks that are created
>> while iterating over /proc/self/task.
> 
> I have updated the code as you have found a race there. Now this is no longer
> relevant as all known tasks are already verified as stopped. So there is no
> more running task to create another task. While iterating /proc/self/task:
> (1) either there must be already an unstopped task when opendir() was called,
>     in such case the iteration of /proc/self/task will be retried anyway.
> (2) or all tasks are stopped and therefore no task can create any new task.
> 
> 
>> On closefrom Linux fallback we have
>> a similar problem, where the code iterates over /proc/self/fd, and everytime
>> it closes a file descriptor it lseeks back to beginning.  It works because
>> eventually there will be no more entries on /proc/self/fd, so either you
>> will need to certify that kernel adds new tasks at the end of the getdents
>> call (used by readdir, or lseek and keep track of all tasks already signaled.
> 
> That is not needed, see above.
> 
> 
>> While it might work on the JVM where you can not fully control who change 
>> SIGUSR1  disposition (and I am not sure JVM would prevent a JNI call to do so),
>> so you can't really make it generic without explicit reserve a signal to do so, 
>> similar to what glibc does for SIGCANCEL and SIGSETXID (used to synchronize 
>> setuid functions over threads).  Meaning that callers of sigaction can't 
>> not explicit set such reserved signal.
>>
>> This is similar to what we do for SIGSETXID, so I think a proper way to
>> do it would to do always install a new signal handler to this on pthread_create,
>> on signal handle synchronize with proper async-signal-safe interface
>> (pthread_mutex_lock is not, you might accomplish with sem_post but most likely
>> you will need a atomic+futex way similar to a barrier), iterate over all 
>> dl_stack_used (so the interface can work without access to procfs), issue the 
>> signal handler or each thread, operate on the maps, then synchronize to resume
>> threads.  We can't really make it generic without accessing the internal
>> glibc thread states.
> 
> Good to know, if the patch gets a serious consideration for upstreaming
> I understand the signal number needs to be handled better.
> 

For libc point of view I don't think this interface make sense without it itself
do the stop-the-world; since potentially calling while ifunc might be resolved 
adds another kind of concurrency that is far from ideal (we already have to handle
lazy binding, dl_profile, and other concurrent notes that touch the GOT and
associated data for symbol resolution).

> 
>> And you will also need to reallocate not only glibc, but potentially *all*
>> libraries (since ifunc can be used by any function).
> 
> This is what the patch already does by _dl_relocate_object().
> 
> 
>>> So the only remaining option is that all the programs will be doing
>>> setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...") and re-exec(). That is
>>> a peformance kill and definitely not nice compared to any method of an IFUNC
>>> reset.
>>
>> Assuming you don't reset env variable on process spawning, you can set it as 
>> default for the session.
> 
> The /usr/bin/java program needs to setenv("GLIBC_TUNABLES=glibc.cpu.hwcaps=...")
> and then it can either system("itself") or exec("itself"). This is what you
> mean by the session?

I was thinking invoking java itself with GLIBC_TUNABLES, but I think it would
make sense to add a way to globally set tunables without the need to setting
the environment variable.

We already some ideas in mind [1], where I have in mind a system defined file
(as glibc already have for preload and ld.so.cache) that sets the defaults.
Siddhesh idea is to have a global one that is *not* overwritten by users,
while my idea is the opposite where the global sets the default, and users
can override it.  This is open to discussion on how this would work, if we
we will need an extra flag to disable it (similar to DF_1_NODEFLIB), how 
user define GLIBC_TUNABLES will interact with it, etc.

[1] https://sourceware.org/pipermail/libc-alpha/2023-March/146405.html

> 
> 
>> Another option would to deploy a glibc built with 
>> --disable-multi-arch; it will disable ifunc generation.
> 
> That is not an option. OpenJDK must be compatible with normal existing Linux
> OSes.
> 

I do not know exactly how criu works internally, but from you answer and if I
understood it correctly when criu restores a dump it will only restore private 
mappings from the process and not shared ones?  If so, how does it handle 
restoring on a system on an older glibc (or on a system with any dependency
older than the one that the process was dumped)? Does it only restore file
backed shared mappings based on naming?

Using a custom built glibc with --disable-multi-arch along with rpath, you
have a dissociated glibc from the system.  You can even tune to use an 
specific x86_64-vX abi if you see that the workload benefits from some
specific string/mem/math routine.

> 
>> And IMHO this is way nicer because this IFUNC reset as-is, without a proper 
>> stop-the-word support, is not safe and adds another corner case for the already
>> over-complicated ifunc interface.
> 
> stop-the-world is already implemented modulo possible bugfixes.
> 	https://github.com/openjdk/crac/pull/41/files#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6029
> 

Again, from a Java perspective it might be fine.  But for a generic C interface
it still has a lot of corner cases.

>>> In Java world the other libraries (in general, there are some JNI exceptions)
>>> do not matter as they are a Java code JIT-compiled by JVM.
>>
>> And this won't be a Java specific interface, but rather a GNU extension for C
>> library.  So we must make it as concise as possible, without adding any other
>> security or undefined behavior.  
> 
> I agree. Handling IFUNC for other libraries is also possible but it has to be
> a next step. It does not make sense to handle IFUNC in other libraries when
> glibc still crashes first.
> 
> 
> On Thu, 09 Mar 2023 18:43:08 +0100, Adhemerval Zanella Netto wrote:
>> And the 'handler' signal handler has some potential shortcomings as well: 
>>
>>   * backtrace is not async-signal-safe: glibc implementation on first call
>>     issues dlopen, which calls malloc; and libgcc_eh.so *might* also calls malloc.
> 
> I do not see it in practice:
> 	Temporary breakpoint 1, main () at backtrace.c:5
> 	5	  backtrace(buf, sizeof(buf)/sizeof(*buf));
> 	(gdb) b dlopen
> 	Breakpoint 2 at 0x7ffff7c88f20: file dlopen.c, line 77.
> 	(gdb) c
> 	Continuing.
> 	[Inferior 1 (process 825695) exited normally]
> 
> And neither in the sources:
> glibc$ grep dlopen $(find -iname "*backtrace*")

Check misc/unwind-link.c:

 48   /* Initialize a copy of the data, so that we do not need about
 49      unlocking in case the dynamic loader somehow triggers
 50      unwinding.  */
 51   void *local_libgcc_handle = __libc_dlopen (LIBGCC_S_SO);
 52   if (local_libgcc_handle == NULL)
 53     {
 54       __libc_lock_unlock (lock);
 55       return NULL;
 56     }

The backtrace calls it through UNWIND_LINK_PTR macro.  You might not seeing it
because we cache libgcc_s.so loading and once the programs calls backtrace or
pthread_cancel, it won't call __libc_unwind_link_get anymore.

> 
> 
>>   * pthread calls are not async-signal-safe either.
> 
> There are no pthread_* calls, everything is based on kernel tasks.

The SIGUSR1 handler it installs on freeze() uses both mutexes and cond variables,
and this is quite easy to deadload if a thread sends a SIGUSR1 if its not suppose
to (and both pthread_mutex_lock and pthread_cond_signal are not async-signal-safe).

> 
> 
>>   * it only handles libc.so, other libraries that uses ifunc for function
>>     selection also fails.
> 
> You are right, I have mostly implemented this hard-coded "libc.so.6" to make
> it general (for any libraries containing at least one STT_GNU_IFUNC) although
> I haven't finished this implementation due to the last paragraph below.
> 
> 
>>   * the syscall heuristics do not handle partial results (for instance if
>>     write syscall returns do EINTR).
> 
> I do not think EINTR would matter. The syscall heuristics is there expecting
> that any library function which contains syscalls is not an IFUNC function.
> 
> 
>> So this code has the potential of deadlock, specially if you have another
>> thread issuing malloc.
> 
> I may have missed something but I do not see it so according to the answers
> above.
>
  
Jan Kratochvil March 29, 2023, 12:12 p.m. UTC | #17
On Tue, 21 Mar 2023 00:47:53 +0800, Adhemerval Zanella Netto wrote:
> if I
> understood it correctly when criu restores a dump it will only restore private
> mappings from the process and not shared ones?  If so, how does it handle
> restoring on a system on an older glibc (or on a system with any dependency
> older than the one that the process was dumped)? Does it only restore file
> backed shared mappings based on naming?

Different OS (components) version is not supported, similar to:
	https://cr.openjdk.org/~heidinga/crac/Portability_of_checkpoints.pdf
	2. Same hardware, different distro


> Using a custom built glibc with --disable-multi-arch along with rpath, you
> have a dissociated glibc from the system.  You can even tune to use an
> specific x86_64-vX abi if you see that the workload benefits from some
> specific string/mem/math routine.

Then I need to build also all other used libraries and that is what is called
nowadays a container. Which is a perfectly valid use case. But building
a custom distribution (=glibc) is not a supported way of running any software.
[This is my personal opinion, I do not speak for any company.]


Thanks for your review,
Jan
  
Adhemerval Zanella March 29, 2023, 1:14 p.m. UTC | #18
On 29/03/23 09:12, Jan Kratochvil wrote:
> On Tue, 21 Mar 2023 00:47:53 +0800, Adhemerval Zanella Netto wrote:
>> if I
>> understood it correctly when criu restores a dump it will only restore private
>> mappings from the process and not shared ones?  If so, how does it handle
>> restoring on a system on an older glibc (or on a system with any dependency
>> older than the one that the process was dumped)? Does it only restore file
>> backed shared mappings based on naming?
> 
> Different OS (components) version is not supported, similar to:
> 	https://cr.openjdk.org/~heidinga/crac/Portability_of_checkpoints.pdf
> 	2. Same hardware, different distro

Thanks for the link, so it does seems that CRIU does not save anonymous
file-backed maps, this is what I understood from CRIU memory restore doc [1]
where open file mappings are done by 'files engines' (where I presume it is
either an external process or a library, correct if I am wrong).

Another issue you might see (even on same distro) is that Intel is doing a 
*lot* of performance backports on release branches which means that even 
glibc in the same release version might ended up with different sets
of ifunc variants (since this does not really affect ABI).  I am not sure
how it would reflect on distribution itself, but assuming a distro that
follows the glibc release branch it might indeed happen.

[1] https://criu.org/Memory_dumping_and_restoring	

> 
> 
>> Using a custom built glibc with --disable-multi-arch along with rpath, you
>> have a dissociated glibc from the system.  You can even tune to use an
>> specific x86_64-vX abi if you see that the workload benefits from some
>> specific string/mem/math routine.
> 
> Then I need to build also all other used libraries and that is what is called
> nowadays a container. Which is a perfectly valid use case. But building
> a custom distribution (=glibc) is not a supported way of running any software.
> [This is my personal opinion, I do not speak for any company.]

In fact I think for this specific problem is the best solution.  Along with
a JVM built with rpath, it means you will pack everything it requires and assure
that once it is installed from same package, the glibc won't enable any
optimization not supported and make sure that any distro backport won't
break these assumptions.  It also works with all scenarios you raised, assuming
that 3rd party libraries (outside the JVM package itself) does play nice
in such scenarios.
  

Patch

diff --git a/elf/Makefile b/elf/Makefile
index 0d19964d42..1948d51407 100644
--- a/elf/Makefile
+++ b/elf/Makefile
@@ -39,6 +39,7 @@  routines = \
   dl-origin \
   dl-profstub \
   dl-reloc-static-pie \
+  dl-reset-ifunc \
   dl-support \
   dl-sym \
   dl-sysdep \
@@ -309,6 +310,7 @@  tests := \
   tst-auxv \
   tst-dl-hash \
   tst-leaks1 \
+  tst-reset-ifunc \
   tst-stringtable \
   tst-tls9 \
   # tests
diff --git a/elf/Versions b/elf/Versions
index 4614acea3e..f04f81efbf 100644
--- a/elf/Versions
+++ b/elf/Versions
@@ -23,6 +23,9 @@  libc {
   GLIBC_2.35 {
     _dl_find_object;
   }
+  GLIBC_2.38 {
+    _dl_reset_ifunc;
+  }
   GLIBC_ABI_DT_RELR {
     # This symbol is used only for empty version map and will be removed
     # by scripts/versions.awk.
@@ -78,5 +81,11 @@  ld {
 
     # Set value of a tunable.
     __tunable_get_val;
+
+    # For dl-reset-ifunc.c.
+    __tunable_get_start;
+    __tunable_get_end;
+    _dl_relocate_object;
+    _dl_x86_init_cpu_features;
   }
 }
diff --git a/elf/dl-reloc.c b/elf/dl-reloc.c
index 1d558c1e0c..7aed2d44fa 100644
--- a/elf/dl-reloc.c
+++ b/elf/dl-reloc.c
@@ -349,6 +349,8 @@  _dl_relocate_object (struct link_map *l, struct r_scope_elem *scope[],
     _dl_protect_relro (l);
 }
 
+rtld_hidden_def (_dl_relocate_object)
+
 
 void
 _dl_protect_relro (struct link_map *l)
diff --git a/elf/dl-reset-ifunc.c b/elf/dl-reset-ifunc.c
new file mode 100644
index 0000000000..04a17088e4
--- /dev/null
+++ b/elf/dl-reset-ifunc.c
@@ -0,0 +1,392 @@ 
+/* Reset IFUNC symbols for their new detection.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+/* Very special case: This object is built into the static libc, but
+   must know the layout of _rtld_global_ro.  */
+#undef SHARED
+#define SHARED
+#include <ldsodefs.h>
+
+#include <cpu-features.h>
+
+#if HAVE_TUNABLES
+# define TUNABLES_INTERNAL 1
+# include "dl-tunables.h"
+static tunable_t tunable_list[] attribute_relro __attribute_maybe_unused__;
+#endif // HAVE_TUNABLES
+
+#define strcmp strcmp_local
+static int strcmp_local(const char *s1, const char *s2) {
+  const unsigned char *a = (const unsigned char *) s1;
+  const unsigned char *b = (const unsigned char *) s2;
+  for (;; ++a, ++b) {
+    if (*a != *b)
+      return *b > *a ? +1 : -1;
+    if (*a == 0)
+      return 0;
+  }
+}
+
+#define strchr strchr_local
+static char *strchr_local(const char *s, int c) {
+  for (; *s; ++s)
+    if ((uint8_t)*s == (uint8_t)c)
+      return (char *)s;
+  return NULL;
+}
+
+#define strlen strlen_local
+__attribute__ ((optimize(0))) // otherwise GCC will replace the function by system strlen() again
+static size_t strlen_local(const char *cs) {
+  size_t retval = 0;
+  while (*cs++)
+    ++retval;
+  return retval;
+}
+
+// FIXME: Why volatile?
+#define memset memset_local
+static volatile void *memset_local(volatile void *m, int c, size_t n) {
+  for (volatile uint8_t *d = (volatile uint8_t *)m; n--; ++d)
+    *d = c;
+  return m;
+}
+
+static char *file_read(const char *fn) {
+  int fd = open(fn, O_RDONLY);
+  assert(fd != -1);
+  // realloc() calls memmove().
+  size_t buf_size = 0x100000;
+  char *buf = (char *)malloc(buf_size);
+  assert(buf);
+  size_t buf_have = 0;
+  for (;;) {
+    assert(buf_have < buf_size);
+    ssize_t got = read(fd, buf + buf_have, buf_size - buf_have);
+    assert(got != -1);
+    if (got == 0)
+      break;
+    assert(got > 0);
+    assert(buf_have + got <= buf_size);
+    buf_have += got;
+  }
+  assert(buf_have < buf_size);
+  buf[buf_have] = 0;
+  assert(strlen(buf) == buf_have);
+  return buf;
+}
+
+static uint64_t read_hex(const char **cs_p) {
+  uint64_t retval = 0;
+  for (;;)
+#define cs (*cs_p)
+    switch (*cs) {
+    case '0' ... '9':
+      retval <<= 4;
+      retval |= *cs++ - '0';
+      continue;
+    case 'a' ... 'f':
+      retval <<= 4;
+      retval |= *cs++ - 'a' + 0xa;
+      continue;
+    default:
+      return retval;
+      break;
+    }
+#undef cs
+}
+
+static int mprotect_read(const void *addr, const void **addr_end_return) {
+  uint64_t addr_u = (uintptr_t)addr;
+  char *file = file_read("/proc/self/maps");
+  int retval = -1;
+  for (const char *cs = file; *cs;) {
+    // sscanf() calls rawmemchr().
+    uint64_t start = read_hex(&cs);
+    assert(*cs == '-');
+    ++cs;
+    uint64_t end = read_hex(&cs);
+    assert(*cs == ' ');
+    ++cs;
+    assert(start < end);
+    int rwxp = 0;
+    assert(*cs == 'r' || *cs == '-');
+    if (*cs++ == 'r')
+      rwxp |= 04;
+    assert(*cs == 'w' || *cs == '-');
+    if (*cs++ == 'w')
+      rwxp |= 02;
+    assert(*cs == 'x' || *cs == '-');
+    if (*cs++ == 'x')
+      rwxp |= 01;
+    assert(*cs == 's' || *cs == 'p');
+    ++cs;
+    assert(*cs == ' ');
+    ++cs;
+    if (start <= addr_u && addr_u < end) {
+      if (addr_end_return)
+        *addr_end_return = (const void *)(uintptr_t)end;
+      retval = rwxp;
+      break;
+    }
+    cs = strchr(cs, '\n');
+    assert(cs);
+    ++cs;
+  }
+  if (retval == -1) {
+    fprintf(stderr, "Not found an address: %p\n", addr);
+    assert(0);
+  }
+  free(file);
+  return retval;
+}
+
+static void verify_rwxp(const void *start, const void *end, int rwxp_want) {
+  assert((((uintptr_t)start) & (PAGE_SIZE - 1)) == 0);
+  assert((((uintptr_t)end  ) & (PAGE_SIZE - 1)) == 0);
+  assert(start < end);
+  while (start < end) {
+    int rwxp_found = mprotect_read(start, &start);
+    if (rwxp_found != rwxp_want) { printf("sudo gdb -p %d\n",getpid()); pause(); }
+    assert(rwxp_found == rwxp_want);
+  }
+}
+
+const struct link_map *phdr_info_to_link_map(struct dl_phdr_info *phdr_info) {
+  Dl_info info;
+  const struct link_map *link_map = NULL;
+  int err = dladdr1(phdr_info->dlpi_phdr, &info, (void **)&link_map, RTLD_DL_LINKMAP);
+  assert(err == 1);
+  assert(link_map);
+  return link_map;
+}
+
+static void page_align(const void **start_p, const void **end_p) {
+  *start_p = (const void *)(((uintptr_t)*start_p) & -PAGE_SIZE);
+  assert(*start_p);
+  *end_p = (const void *)((((uintptr_t)*end_p) + PAGE_SIZE - 1) & -PAGE_SIZE);
+}
+
+static void readonly_unset(const void *start, const void *end) {
+  assert((((uintptr_t)start) & (PAGE_SIZE - 1)) == 0);
+  assert((((uintptr_t)end  ) & (PAGE_SIZE - 1)) == 0);
+  assert(start <= end);
+  if (start == end)
+    return;
+  verify_rwxp(start, end, 04/*r--*/);
+  int err = mprotect((void *)start, (const uint8_t *)end - (const uint8_t *)start, PROT_READ | PROT_WRITE);
+  assert(!err);
+  verify_rwxp(start, end, 06/*rw-*/);
+}
+
+static void readonly_reset(const void *start, const void *end) {
+  assert((((uintptr_t)start) & (PAGE_SIZE - 1)) == 0);
+  assert((((uintptr_t)end  ) & (PAGE_SIZE - 1)) == 0);
+  assert(start <= end);
+  if (start == end)
+    return;
+  verify_rwxp(start, end, 06/*rw-*/);
+  int err = mprotect((void *)start, (const uint8_t *)end - (const uint8_t *)start, PROT_READ);
+  assert(!err);
+  verify_rwxp(start, end, 04/*r--*/);
+}
+
+static void swap(const void **a_p, const void **b_p) {
+  const void *p = *a_p;
+  *a_p = *b_p;
+  *b_p = p;
+}
+
+static void intersect(const void **first_start_p, const void **first_end_p, const void **second_start_p, const void **second_end_p) {
+  assert((((uintptr_t)*first_start_p) & (PAGE_SIZE - 1)) == 0);
+  assert((((uintptr_t)*first_end_p  ) & (PAGE_SIZE - 1)) == 0);
+  assert(*first_start_p <= *first_end_p);
+  assert((((uintptr_t)*second_start_p) & (PAGE_SIZE - 1)) == 0);
+  assert((((uintptr_t)*second_end_p  ) & (PAGE_SIZE - 1)) == 0);
+  assert(*second_start_p <= *second_end_p);
+  if (*first_start_p > *second_start_p) {
+    swap(first_start_p, second_start_p);
+    swap(second_start_p, second_start_p);
+  }
+  if (*second_start_p < *first_end_p) {
+    *second_start_p = *first_end_p;
+    if (*second_start_p > *second_end_p)
+      *second_end_p = *second_start_p;
+  }
+}
+
+static int reset_ifunc_iterate_phdr(struct dl_phdr_info *info, size_t size, void *data_unused) {
+  if (strcmp(info->dlpi_name, "/lib64/ld-linux-x86-64.so.2") == 0) // _dl_relocate_object would crash on scope == NULL.
+    return 0; // unused
+  const void *relro = NULL;
+  const void *relro_end = NULL;
+  assert(size >= offsetof(struct dl_phdr_info, dlpi_adds));
+  for (size_t phdr_ix = 0; phdr_ix < info->dlpi_phnum; ++phdr_ix) {
+    const Elf64_Phdr *phdr = info->dlpi_phdr + phdr_ix;
+    if (phdr->p_type == PT_GNU_RELRO) {
+      // It does not apply: assert(phdr->p_offset == phdr->p_vaddr);
+      assert(phdr->p_paddr == phdr->p_vaddr);
+      // /lib64/libz.so.1: p_filesz=0x538 > p_memsz=0x550
+      assert(!relro);
+      relro = (const void *)(uintptr_t)(phdr->p_vaddr + info->dlpi_addr);
+      relro_end = (const void *)(((const uint8_t *)relro) + phdr->p_memsz);
+      page_align(&relro, &relro_end);
+      assert(relro);
+    }
+  }
+  if (relro)
+    readonly_unset(relro, relro_end);
+  const struct link_map *map = phdr_info_to_link_map(info);
+  Elf64_Dyn *dynamic = map->l_ld;
+  Elf64_Xword *relxsz_p = NULL;
+  Elf64_Xword *relrsz_p = NULL;
+  Elf64_Xword *relxcount_p = NULL;
+  for (; dynamic->d_tag != DT_NULL; ++dynamic)
+    switch (dynamic->d_tag) {
+    case DT_RELASZ:
+    case DT_RELSZ:
+      assert(!relxsz_p);
+      relxsz_p = &dynamic->d_un.d_val;
+      break;
+    case DT_RELRSZ:
+      assert(!relrsz_p);
+      relrsz_p = &dynamic->d_un.d_val;
+      break;
+    case DT_RELCOUNT:
+    case DT_RELACOUNT:
+      assert(!relxcount_p);
+      relxcount_p = &dynamic->d_un.d_val;
+      break;
+    case DT_PLTREL:
+      // It is impossible to relocate DT_REL twice.
+      assert(dynamic->d_un.d_val == DT_RELA);
+      break;
+    }
+  Elf64_Xword relxsz_saved = -1; // may be used uninitialized [-Werror=maybe-uninitialized]
+  if (relxsz_p) {
+    relxsz_saved = *relxsz_p;
+    *relxsz_p = 0;
+  }
+  Elf64_Xword relrsz_saved = -1; // may be used uninitialized [-Werror=maybe-uninitialized]
+  if (relrsz_p) {
+    relrsz_saved = *relrsz_p;
+    *relrsz_p = 0;
+  }
+  Elf64_Xword relxcount_saved = -1; // may be used uninitialized [-Werror=maybe-uninitialized]
+  if (relxcount_p) {
+    relxcount_saved = *relxcount_p;
+    *relxcount_p = 0;
+  }
+  struct link_map *link_map_p = (struct link_map *)map;
+  assert(link_map_p->l_relocated);
+  link_map_p->l_relocated = 0;
+  // FIXME: skip ifuncs
+  _dl_relocate_object((struct link_map *)map, link_map_p->l_scope, 0/*lazy*/, 0/*consider_profiling*/);
+  // It was read/write before but _dl_relocate_object made it read-only.
+  const void *dynamic_start = NULL;
+  const void *dynamic_end;
+  if (relxsz_p) {
+    dynamic_start = relxsz_p;
+    dynamic_end = relxsz_p + 1;
+  }
+  if (relrsz_p) {
+    if (!dynamic_start) {
+      dynamic_start = relrsz_p;
+      dynamic_end = relrsz_p + 1;
+    } else {
+      if (dynamic_start > (const void *)relrsz_p)
+        dynamic_start = relrsz_p;
+      if (dynamic_end < (const void *)(relrsz_p + 1))
+        dynamic_end = relrsz_p + 1;
+    }
+  }
+  if (relxcount_p) {
+    if (!dynamic_start) {
+      dynamic_start = relxcount_p;
+      dynamic_end = relxcount_p + 1;
+    } else {
+      if (dynamic_start > (const void *)relxcount_p)
+        dynamic_start = relxcount_p;
+      if (dynamic_end < (const void *)(relxcount_p + 1))
+        dynamic_end = relxcount_p + 1;
+    }
+  }
+  if (dynamic_start) {
+    page_align(&dynamic_start, &dynamic_end);
+    // _dl_relocate_object made it already readonly: readonly_reset(relro, relro_end);
+    intersect(&relro, &relro_end, &dynamic_start, &dynamic_end);
+    readonly_unset(relro, relro_end);
+    readonly_unset(dynamic_start, dynamic_end);
+  }
+  if (relxsz_p)
+    *relxsz_p = relxsz_saved;
+  if (relrsz_p)
+    *relrsz_p = relrsz_saved;
+  if (relxcount_p)
+    *relxcount_p = relxcount_saved;
+  if (dynamic_start)
+    readonly_reset(dynamic_start, dynamic_end);
+  if (relro)
+    readonly_reset(relro, relro_end);
+  return 0; // unused
+}
+
+static void reset_glibc(void) {
+  const void *rtld_global_ro = &_rtld_global_ro;
+  const void *rtld_global_ro_end = &_rtld_global_ro + 1;
+  page_align(&rtld_global_ro, &rtld_global_ro_end);
+#if HAVE_TUNABLES
+  const void *tunable_list = __tunable_get_start();
+  const void *tunable_list_end = __tunable_get_end();
+  page_align(&tunable_list, &tunable_list_end);
+  intersect(&rtld_global_ro, &rtld_global_ro_end, &tunable_list, &tunable_list_end);
+  readonly_unset(tunable_list, tunable_list_end);
+#endif
+  readonly_unset(rtld_global_ro, rtld_global_ro_end);
+  const struct cpu_features *cpu_features_const = &GLRO(dl_x86_cpu_features);
+  // FIXME: Why volatile?
+  volatile struct cpu_features *cpu_features = (struct cpu_features *)cpu_features_const;
+  assert(cpu_features->basic.kind != arch_kind_unknown);
+  unsigned long int xsave_state_size = cpu_features->xsave_state_size;
+  unsigned int xsave_state_full_size = cpu_features->xsave_state_full_size;
+  memset((volatile uint8_t *)cpu_features, 0, sizeof(*cpu_features));
+  cpu_features->xsave_state_size = xsave_state_size;
+  cpu_features->xsave_state_full_size = xsave_state_full_size;
+  assert(cpu_features->basic.kind == arch_kind_unknown);
+  _dl_x86_init_cpu_features();
+  assert(cpu_features->basic.kind != arch_kind_unknown);
+  readonly_reset(rtld_global_ro, rtld_global_ro_end);
+#if HAVE_TUNABLES
+  readonly_reset(tunable_list, tunable_list_end);
+#endif
+}
+
+void
+_dl_reset_ifunc(void)
+{
+  // _dl_relocate_object() from reset_ifunc_iterate_phdr may be calling glibc ifunc resolvers already.
+  reset_glibc();
+  int i = dl_iterate_phdr(reset_ifunc_iterate_phdr, NULL/*data*/);
+  assert(!i);
+}
+rtld_hidden_def (_dl_reset_ifunc)
diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index 327b9eb52f..d8c0617bd1 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -433,3 +433,17 @@  __tunable_get_val (tunable_id_t id, void *valp, tunable_callback_t callback)
 }
 
 rtld_hidden_def (__tunable_get_val)
+
+const void *
+__tunable_get_start(void)
+{
+  return &tunable_list;
+}
+rtld_hidden_def (__tunable_get_start)
+
+const void *
+__tunable_get_end(void)
+{
+  return tunable_list + sizeof (tunable_list) / sizeof (tunable_t);
+}
+rtld_hidden_def (__tunable_get_end)
diff --git a/elf/dl-tunables.h b/elf/dl-tunables.h
index ae6e014b95..3e3783b9f7 100644
--- a/elf/dl-tunables.h
+++ b/elf/dl-tunables.h
@@ -56,10 +56,14 @@  extern void __tunables_print (void);
 extern void __tunable_get_val (tunable_id_t, void *, tunable_callback_t);
 extern void __tunable_set_val (tunable_id_t, tunable_val_t *, tunable_num_t *,
 			       tunable_num_t *);
+extern const void *__tunable_get_start (void);
+extern const void *__tunable_get_end (void);
 rtld_hidden_proto (__tunables_init)
 rtld_hidden_proto (__tunables_print)
 rtld_hidden_proto (__tunable_get_val)
 rtld_hidden_proto (__tunable_set_val)
+rtld_hidden_proto (__tunable_get_start)
+rtld_hidden_proto (__tunable_get_end)
 
 /* Define TUNABLE_GET and TUNABLE_SET in short form if TOP_NAMESPACE and
    TUNABLE_NAMESPACE are defined.  This is useful shorthand to get and set
diff --git a/elf/link.h b/elf/link.h
index 3b5954d981..b30a736191 100644
--- a/elf/link.h
+++ b/elf/link.h
@@ -185,6 +185,8 @@  extern int dl_iterate_phdr (int (*__callback) (struct dl_phdr_info *,
 					       size_t, void *),
 			    void *__data);
 
+extern void _dl_reset_ifunc (void);
+
 
 /* Prototypes for the ld.so auditing interfaces.  These are not
    defined anywhere in ld.so but instead have to be provided by the
diff --git a/elf/tst-reset-ifunc.c b/elf/tst-reset-ifunc.c
new file mode 100644
index 0000000000..8c185e269d
--- /dev/null
+++ b/elf/tst-reset-ifunc.c
@@ -0,0 +1,70 @@ 
+/* Test resetting of IFUNC symbols for their new detection.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <support/check.h>
+#include <link.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <assert.h>
+
+static int func_count;
+
+void
+my_testifunc (void)
+{
+  puts ("my_testifunc");
+  ++func_count;
+}
+
+static int resolve_count;
+
+static void (*
+resolve_testifunc (void)) (void)
+{
+  puts ("resolve_testifunc");
+  ++resolve_count;
+  return (void (*)(void)) my_testifunc; // we'll just always select this routine
+}
+
+void testifunc (void) __attribute__ ((ifunc ("resolve_testifunc")));
+
+static int
+do_test (void)
+{
+  // resolver could be called or not yet.
+  assert (func_count == 0);
+  puts("testifunc");
+  testifunc ();
+  assert (resolve_count == 1);
+  assert (func_count == 1);
+  puts("_dl_reset_ifunc");
+  _dl_reset_ifunc ();
+  // resolver could be called or not yet.
+  assert (func_count == 1);
+  puts("testifunc");
+  testifunc ();
+  assert (resolve_count == 2);
+  assert (func_count == 2);
+  puts("testifunc");
+  testifunc ();
+  assert (resolve_count == 2);
+  assert (func_count == 3);
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index c99dad77cc..5cf15a797d 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1015,8 +1015,8 @@  extern struct link_map *_dl_new_object (char *realname, const char *libname,
    If RTLD_LAZY is set in RELOC-MODE, don't relocate its PLT.  */
 extern void _dl_relocate_object (struct link_map *map,
 				 struct r_scope_elem *scope[],
-				 int reloc_mode, int consider_profiling)
-     attribute_hidden;
+				 int reloc_mode, int consider_profiling);
+rtld_hidden_proto (_dl_relocate_object)
 
 /* Protect PT_GNU_RELRO area.  */
 extern void _dl_protect_relro (struct link_map *map) attribute_hidden;
diff --git a/sysdeps/x86/dl-get-cpu-features.c b/sysdeps/x86/dl-get-cpu-features.c
index 5e3bb04d11..46fba307f0 100644
--- a/sysdeps/x86/dl-get-cpu-features.c
+++ b/sysdeps/x86/dl-get-cpu-features.c
@@ -68,6 +68,7 @@  Fatal glibc error: CPU does not support x86-64-v%d\n", 4);
     }
 }
 
+rtld_hidden_def (_dl_x86_init_cpu_features)
 __ifunc (__x86_cpu_features, __x86_cpu_features, NULL, void,
 	 _dl_x86_init_cpu_features);
 #endif
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index fa91a23129..ad2bcc883f 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -929,7 +929,8 @@  extern const struct cpu_features *_dl_x86_get_cpu_features (void)
 /* Unused for x86.  */
 # define INIT_ARCH()
 # define _dl_x86_get_cpu_features() (&GLRO(dl_x86_cpu_features))
-extern void _dl_x86_init_cpu_features (void) attribute_hidden;
+extern void _dl_x86_init_cpu_features (void);
+rtld_hidden_proto (_dl_x86_init_cpu_features)
 #endif
 
 #ifdef __x86_64__