[v7,1/3] Add an internal wrapper for clone, clone2 and clone3

Message ID 20210526123956.2712353-2-hjl.tools@gmail.com
State Superseded
Delegated to: Florian Weimer
Series Add an internal wrapper for clone, clone2 and clone3

Checks

Context                 Check    Description
dj/TryBot-apply_patch   success  Patch applied to master at the time it was sent

Commit Message

H.J. Lu May 26, 2021, 12:39 p.m. UTC
  The clone3 system call provides a superset of the functionality of clone
and clone2.  It also provides a number of API improvements, including
the ability to specify the size of the child's stack area, which the
kernel can use to compute the shadow stack size when allocating the
shadow stack.  Add:

extern int __clone_internal (struct clone_args *__cl_args,
			     int (*__func) (void *__arg), void *__arg);

to provide an abstract interface for clone, clone2 and clone3.

1. Simplify stack management for thread creation by passing both stack
base and size to create_thread.
2. Consolidate clone vs clone2 differences into a single file.
3. Call __clone3 if HAVE_CLONE3_WAPPER is defined.  If __clone3 returns
-1 with ENOSYS, fall back to clone or clone2.
4. Use only __clone_internal to clone a thread.  Since the stack size
argument for create_thread is now unconditional, always pass stack size
to create_thread.
5. Enable the public clone3 wrapper in the future after it has been
added to all targets.
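
As a rough illustration of item 3, the try-then-fall-back pattern can be
sketched in isolation.  This is not the glibc code: call_with_fallback and
the three stand-in callbacks are hypothetical names that only model a
preferred call which may fail with ENOSYS and a legacy call which takes over
in that case.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for "create a child".  */
typedef int (*spawn_fn) (void *ctx);

/* Stand-in for a preferred call that the kernel or a filter reports as
   not implemented.  */
int
new_call_unavailable (void *ctx)
{
  (void) ctx;
  errno = ENOSYS;
  return -1;
}

/* Stand-in for a preferred call that a sandbox rejects with EPERM.  */
int
new_call_denied (void *ctx)
{
  (void) ctx;
  errno = EPERM;
  return -1;
}

/* Stand-in for the legacy call; 1234 is an arbitrary "child id".  */
int
old_call_ok (void *ctx)
{
  (void) ctx;
  return 1234;
}

/* Try PREFERRED first.  Only a -1/ENOSYS result triggers the fallback;
   any other result, including other errors, is returned unchanged.  */
int
call_with_fallback (spawn_fn preferred, spawn_fn legacy, void *ctx)
{
  int saved_errno = errno;
  int ret = preferred (ctx);
  if (ret != -1 || errno != ENOSYS)
    return ret;

  /* Hide the ENOSYS from the caller before trying the legacy call.  */
  errno = saved_errno;
  return legacy (ctx);
}
```

With these stand-ins, an ENOSYS failure is invisible to the caller, while an
EPERM failure is reported as-is and the legacy call is never attempted.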
---
 include/clone_internal.h                 | 16 +++++
 nptl/allocatestack.c                     | 59 ++-------------
 nptl/createthread.c                      |  3 +-
 nptl/pthread_create.c                    | 17 ++---
 sysdeps/unix/sysv/linux/Makefile         |  2 +-
 sysdeps/unix/sysv/linux/clone-internal.c | 91 ++++++++++++++++++++++++
 sysdeps/unix/sysv/linux/clone3.c         |  1 +
 sysdeps/unix/sysv/linux/clone3.h         | 60 ++++++++++++++++
 sysdeps/unix/sysv/linux/createthread.c   | 25 ++++---
 sysdeps/unix/sysv/linux/spawni.c         | 26 +++----
 10 files changed, 209 insertions(+), 91 deletions(-)
 create mode 100644 include/clone_internal.h
 create mode 100644 sysdeps/unix/sysv/linux/clone-internal.c
 create mode 100644 sysdeps/unix/sysv/linux/clone3.c
 create mode 100644 sysdeps/unix/sysv/linux/clone3.h
  

Comments

Florian Weimer May 26, 2021, 1:05 p.m. UTC | #1
* H. J. Lu:

> +int
> +__clone_internal (struct clone_args *cl_args,
> +		  int (*func) (void *arg), void *arg)
> +{
> +  int ret;
> +#ifdef HAVE_CLONE3_WAPPER
> +  /* Try clone3 first.  */
> +  int saved_errno = errno;
> +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> +  if (ret != -1 || errno != ENOSYS)
> +    return ret;

How much breakage is this causing once there is a __clone3
implementation that can return errors that aren't ENOSYS?

Do Firefox and Chromium work after this change?  What about
Docker/Kubernetes?

Thanks,
Florian
  
Adhemerval Zanella Netto May 26, 2021, 1:08 p.m. UTC | #2
On 26/05/2021 10:05, Florian Weimer via Libc-alpha wrote:
> * H. J. Lu:
> 
>> +int
>> +__clone_internal (struct clone_args *cl_args,
>> +		  int (*func) (void *arg), void *arg)
>> +{
>> +  int ret;
>> +#ifdef HAVE_CLONE3_WAPPER
>> +  /* Try clone3 first.  */
>> +  int saved_errno = errno;
>> +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
>> +  if (ret != -1 || errno != ENOSYS)
>> +    return ret;
> 
> How much breakage is this causing once there is a __clone3
> implementation that can return errors that aren't ENOSYS?
> 

I think we have discussed before that syscall filters that do not
return ENOSYS for blocked syscalls are essentially broken, and trying
to support them is not really possible for some usages.
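
For reference, a seccomp filter that behaves the way the fallback expects
could look roughly like the sketch below.  It is a minimal example under
several assumptions: the names clone3_enosys_filter and
install_clone3_enosys_filter are made up, the fallback value 435 for
__NR_clone3 is only used when the installed headers predate clone3, and the
check of seccomp_data.arch that a production filter needs is omitted for
brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
/* Assumption: headers older than clone3; 435 is its number on most
   architectures.  */
# define __NR_clone3 435
#endif

/* Fail the syscall with errno set to ENOSYS instead of killing the
   process or returning EPERM.  */
#define CLONE3_ENOSYS_ACTION \
  (SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA))

static struct sock_filter clone3_enosys_filter[] = {
  /* Load the syscall number.  */
  BPF_STMT (BPF_LD | BPF_W | BPF_ABS, offsetof (struct seccomp_data, nr)),
  /* If it is clone3, fall through to the ENOSYS return; otherwise skip
     one instruction and allow the call.  */
  BPF_JUMP (BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 0, 1),
  BPF_STMT (BPF_RET | BPF_K, CLONE3_ENOSYS_ACTION),
  BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter for the calling thread.  Returns 0 on success and
   -1 with errno set on failure.  */
int
install_clone3_enosys_filter (void)
{
  struct sock_fprog prog = {
    .len = sizeof clone3_enosys_filter / sizeof clone3_enosys_filter[0],
    .filter = clone3_enosys_filter,
  };

  /* Needed so that an unprivileged process may install a filter.  */
  if (prctl (PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
    return -1;
  return prctl (PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

The filter only sees the syscall number and the raw argument registers.  It
cannot look inside struct clone_args, so it has to accept or reject clone3
as a whole.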

> Do Firefox and Chromium work after this change?  What about
> Docker/Kubernetes?
> 
> Thanks,
> Florian
>
  
H.J. Lu May 26, 2021, 1:19 p.m. UTC | #3
On Wed, May 26, 2021 at 6:05 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > +int
> > +__clone_internal (struct clone_args *cl_args,
> > +               int (*func) (void *arg), void *arg)
> > +{
> > +  int ret;
> > +#ifdef HAVE_CLONE3_WAPPER
> > +  /* Try clone3 first.  */
> > +  int saved_errno = errno;
> > +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > +  if (ret != -1 || errno != ENOSYS)
> > +    return ret;
>
> How much breakage is this causing once there is a __clone3
> implementation that can return errors that aren't ENOSYS?

Isn't that __clone3 implementation broken?

> Do Firefox and Chromium work after this change?  What about

Firefox and Chromium won't work with clone3.

> Docker/Kubernetes?

I don't know.

> Thanks,
> Florian
>
  
Florian Weimer May 26, 2021, 1:42 p.m. UTC | #4
* H. J. Lu:

> On Wed, May 26, 2021 at 6:05 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * H. J. Lu:
>>
>> > +int
>> > +__clone_internal (struct clone_args *cl_args,
>> > +               int (*func) (void *arg), void *arg)
>> > +{
>> > +  int ret;
>> > +#ifdef HAVE_CLONE3_WAPPER
>> > +  /* Try clone3 first.  */
>> > +  int saved_errno = errno;
>> > +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
>> > +  if (ret != -1 || errno != ENOSYS)
>> > +    return ret;
>>
>> How much breakage is this causing once there is a __clone3
>> implementation that can return errors that aren't ENOSYS?
>
> Isn't that __clone3 implementation broken?

I completely agree, but that wasn't my question really.

>> Do Firefox and Chromium work after this change?  What about
>
> Firefox and Chromium won't work with clone3.

Could we please engage with their developers *before* putting this into
glibc?

I think Chromium is a priority because the sandbox is one of the
Chromium components that Firefox inherits, or something like that.  I
don't see a clone3 bug for Chromium yet.

For Chromium specifically, making glibc changes and requesting them to
fix their broken sandbox does not work:

  most chromium text rendering broken when built with glibc 2.32.9000 (Fedora Rawhide)
  <https://bugs.chromium.org/p/chromium/issues/detail?id=1164975>

The pointer argument indirection that affects fstatat also affects
clone3.  So maybe Chromium developers will de-facto refuse to fix the
clone3 bug as well.  But at least we should give them a chance to
comment.

I want a fairly smooth transition to the new glibc in Fedora and CentOS
Stream 9, and this issue looks like it could be a huge obstacle.

Thanks,
Florian
  
H.J. Lu May 26, 2021, 1:58 p.m. UTC | #5
On Wed, May 26, 2021 at 6:42 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > On Wed, May 26, 2021 at 6:05 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * H. J. Lu:
> >>
> >> > +int
> >> > +__clone_internal (struct clone_args *cl_args,
> >> > +               int (*func) (void *arg), void *arg)
> >> > +{
> >> > +  int ret;
> >> > +#ifdef HAVE_CLONE3_WAPPER
> >> > +  /* Try clone3 first.  */
> >> > +  int saved_errno = errno;
> >> > +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> >> > +  if (ret != -1 || errno != ENOSYS)
> >> > +    return ret;
> >>
> >> How much breakage is this causing once there is a __clone3
> >> implementation that can return errors that aren't ENOSYS?
> >
> > Isn't that __clone3 implementation broken?
>
> I completely agree, but that wasn't my question really.
>
> >> Do Firefox and Chromium work after this change?  What about
> >
> > Firefox and Chromium won't work with clone3.
>
> Could we please engage with their developers *before* putting this into
> glibc?
>
> I think Chromium is a priority because the sandbox is one of the
> Chromium components that Firefox inherits, or something like that.  I
> don't see a clone3 bug for Chromium yet.
>
> For Chromium specifically, making glibc changes and requesting them to
> fix their broken sandbox does not work:
>
>   most chromium text rendering broken when built with glibc 2.32.9000 (Fedora Rawhide)
>   <https://bugs.chromium.org/p/chromium/issues/detail?id=1164975>
>
> The pointer argument indirection that affects fstatat also affects
> clone3.  So maybe Chromium developers will de-facto refuse to fix the
> clone3 bug as well.  But at least we should give them a chance to
> comment.

I opened:

https://bugs.chromium.org/p/chromium/issues/detail?id=1213452

> I want a fairly smooth transition to the new glibc in Fedora and CentOS
> Stream 9, and this issue looks like it could be a huge obstacle.
>
> Thanks,
> Florian
>
  
Adhemerval Zanella Netto May 26, 2021, 2:09 p.m. UTC | #6
On 26/05/2021 10:42, Florian Weimer via Libc-alpha wrote:
> * H. J. Lu:
> 
>> On Wed, May 26, 2021 at 6:05 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>
>>> * H. J. Lu:
>>>
>>>> +int
>>>> +__clone_internal (struct clone_args *cl_args,
>>>> +               int (*func) (void *arg), void *arg)
>>>> +{
>>>> +  int ret;
>>>> +#ifdef HAVE_CLONE3_WAPPER
>>>> +  /* Try clone3 first.  */
>>>> +  int saved_errno = errno;
>>>> +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
>>>> +  if (ret != -1 || errno != ENOSYS)
>>>> +    return ret;
>>>
>>> How much breakage is this causing once there is a __clone3
>>> implementation that can return errors that aren't ENOSYS?
>>
>> Isn't that __clone3 implementation broken?
> 
> I completely agree, but that wasn't my question really.
> 
>>> Do Firefox and Chromium work after this change?  What about
>>
>> Firefox and Chromium won't work with clone3.
> 
> Could we please engage with their developers *before* putting this into
> glibc?
> 
> I think Chromium is a priority because the sandbox is one of the
> Chromium components that Firefox inherits, or something like that.  I
> don't see a clone3 bug for Chromium yet.
> 
> For Chromium specifically, making glibc changes and requesting them to
> fix their broken sandbox does not work:
> 
>   most chromium text rendering broken when built with glibc 2.32.9000 (Fedora Rawhide)
>   <https://bugs.chromium.org/p/chromium/issues/detail?id=1164975>
> 
> The pointer argument indirection that affects fstatat also affects
> clone3.  So maybe Chromium developers will de-facto refuse to fix the
> clone3 bug as well.  But at least we should give them a chance to
> comment.
> 
> I want a fairly smooth transition to the new glibc in Fedora and CentOS
> Stream 9, and this issue looks like it could be a huge obstacle.

So basically the current sandbox strategy does not work on any architecture
that only supports fstatat (from the bug report it seems they do not really
care about it anyway).  The bug report also points to an inaccessible bug
that triggered the reversion of the fix [1].  Any idea what that bug says?

For clone3, I am not sure it would be possible to fix this with a similar hack:
from the hack's comment it seems that a seccomp filter can't dereference
argument pointers.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1199431
  
H.J. Lu May 31, 2021, 12:14 p.m. UTC | #7
On Wed, May 26, 2021 at 7:09 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 26/05/2021 10:42, Florian Weimer via Libc-alpha wrote:
> > * H. J. Lu:
> >
> >> On Wed, May 26, 2021 at 6:05 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>>
> >>> * H. J. Lu:
> >>>
> >>>> +int
> >>>> +__clone_internal (struct clone_args *cl_args,
> >>>> +               int (*func) (void *arg), void *arg)
> >>>> +{
> >>>> +  int ret;
> >>>> +#ifdef HAVE_CLONE3_WAPPER
> >>>> +  /* Try clone3 first.  */
> >>>> +  int saved_errno = errno;
> >>>> +  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> >>>> +  if (ret != -1 || errno != ENOSYS)
> >>>> +    return ret;
> >>>
> >>> How much breakage is this causing once there is a __clone3
> >>> implementation that can return errors that aren't ENOSYS?
> >>
> >> Isn't that __clone3 implementation broken?
> >
> > I completely agree, but that wasn't my question really.
> >
> >>> Do Firefox and Chromium work after this change?  What about
> >>
> >> Firefox and Chromium won't work with clone3.
> >
> > Could we please engage with their developers *before* putting this into
> > glibc?
> >
> > I think Chromium is a priority because the sandbox is one of the
> > Chromium components that Firefox inherits, or something like that.  I
> > don't see a clone3 bug for Chromium yet.
> >
> > For Chromium specifically, making glibc changes and requesting them to
> > fix their broken sandbox does not work:
> >
> >   most chromium text rendering broken when built with glibc 2.32.9000 (Fedora Rawhide)
> >   <https://bugs.chromium.org/p/chromium/issues/detail?id=1164975>
> >
> > The pointer argument indirection that affects fstatat also affects
> > clone3.  So maybe Chromium developers will de-facto refuse to fix the
> > clone3 bug as well.  But at least we should give them a chance to
> > comment.
> >
> > I want a fairly smooth transition to the new glibc in Fedora and CentOS
> > Stream 9, and this issue looks like it could be a huge obstacle.
>
> So basically the current sandbox strategy does not work on any architecture
> that only supports fstatat (from the bug report it seems they do not really
> care about it anyway).  The bug report also points to an inaccessible bug
> that triggered the reversion of the fix [1].  Any idea what that bug says?
>
> For clone3, I am not sure it would be possible to fix this with a similar hack:
> from the hack's comment it seems that a seccomp filter can't dereference
> argument pointers.
>
> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=1199431

From:

https://bugs.chromium.org/p/chromium/issues/detail?id=1213452#c5

They can modify the sandbox to return ENOSYS on clone3.
  
Florian Weimer May 31, 2021, 12:16 p.m. UTC | #8
* H. J. Lu:

> From:
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=1213452#c5
>
> They can modify the sandbox to return ENOSYS on clone3.

Is this sufficient if we have detected before that the process supports
CET and should enable it?

I think browsers activate the sandbox *after* process initialization
(unlike containers, where it happens before startup).

Thanks,
Florian
  
H.J. Lu May 31, 2021, 12:23 p.m. UTC | #9
On Mon, May 31, 2021 at 5:16 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > From:
> >
> > https://bugs.chromium.org/p/chromium/issues/detail?id=1213452#c5
> >
> > They can modify the sandbox to return ENOSYS on clone3.
>
> Is this sufficient if we have detected before that the process supports

Did you mean we could skip the following clone3 calls by caching
the first ENOSYS clone3 result?

> CET and should enable it?
>
> I think browsers activate the sandbox *after* process initialization
> (unlike containers, where it happens before startup).
>
> Thanks,
> Florian
>
  
Florian Weimer May 31, 2021, 12:28 p.m. UTC | #10
* H. J. Lu:

> On Mon, May 31, 2021 at 5:16 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * H. J. Lu:
>>
>> > From:
>> >
>> > https://bugs.chromium.org/p/chromium/issues/detail?id=1213452#c5
>> >
>> > They can modify the sandbox to return ENOSYS on clone3.
>>
>> Is this sufficient if we have detected before that the process supports
>
> Did you mean we could skip the following clone3 calls by caching
> the first ENOSYS clone3 result?

No, I'm worried the clone (not clone3) will fail because the process has
enabled CET for some reason.

Thanks,
Florian
  
H.J. Lu May 31, 2021, 12:40 p.m. UTC | #11
On Mon, May 31, 2021 at 5:28 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > On Mon, May 31, 2021 at 5:16 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * H. J. Lu:
> >>
> >> > From:
> >> >
> >> > https://bugs.chromium.org/p/chromium/issues/detail?id=1213452#c5
> >> >
> >> > They can modify the sandbox to return ENOSYS on clone3.
> >>
> >> Is this sufficient if we have detected before that the process supports
> >
> > Did you mean we could skip the following clone3 calls by caching
> > the first ENOSYS clone3 result?
>
> No, I'm worried the clone (not clone3) will fail because the process has
> enabled CET for some reason.
>

In the kernel, clone3 and clone go to the same piece of code.  clone won't
fail just because of CET.
  
Florian Weimer May 31, 2021, 1:01 p.m. UTC | #12
* H. J. Lu:

> In the kernel, clone3 and clone go to the same piece of code.  clone won't
> fail just because of CET.

But clone won't have access to the stack boundaries.  Won't this create
issues for setting up the shadow stack?
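
To make the difference concrete: clone3 receives the lowest address of the
stack area plus its size, while clone receives a single stack pointer.  The
sketch below mirrors the mapping done in the fallback path of the patch, but
struct stack_range and legacy_stack_pointer are hypothetical names used only
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a child stack in the clone3 style:
   lowest address of the area and its size in bytes.  */
struct stack_range
{
  uint64_t base;
  uint64_t size;
};

/* Compute the single stack pointer that a clone-style call takes.  When
   the stack grows down, the pointer is the top of the area; when it
   grows up, it is the base.  The size itself is not passed on.  */
void *
legacy_stack_pointer (const struct stack_range *range, int stack_grows_down)
{
  uintptr_t sp = (uintptr_t) range->base;
  if (stack_grows_down)
    sp += (uintptr_t) range->size;
  return (void *) sp;
}
```

Once the range has been collapsed into one pointer, the callee can no longer
tell where the stack area ends, which is the information being asked about
here.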

Thanks,
Florian
  
H.J. Lu May 31, 2021, 1:16 p.m. UTC | #13
On Mon, May 31, 2021 at 6:01 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > In the kernel, clone3 and clone go to the same piece of code.  clone won't
> > fail just because of CET.
>
> But clone won't have access to the stack boundaries.  Won't this create
> issues for setting up the shadow stack?
>

No.  There are:

        /* Cap shadow stack size to 4 GB */
        size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
        size = min(size, stack_size);

where stack_size is passed in clone3.
  
Adhemerval Zanella Netto May 31, 2021, 1:53 p.m. UTC | #14
On 31/05/2021 10:16, H.J. Lu wrote:
> On Mon, May 31, 2021 at 6:01 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * H. J. Lu:
>>
>>> In the kernel, clone3 and clone go to the same piece of code.  clone won't
>>> fail just because of CET.
>>
>> But clone won't have access to the stack boundaries.  Won't this create
>> issues for setting up the shadow stack?
>>
> 
> No.  There are:
> 
>         /* Cap shadow stack size to 4 GB */
>         size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
>         size = min(size, stack_size);
> 
> where stack_size is passed in clone3.

Right, so CET support does not really require clone3 to be used internally 
then? Or am I missing something?
  
H.J. Lu May 31, 2021, 2:01 p.m. UTC | #15
On Mon, May 31, 2021 at 6:53 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 31/05/2021 10:16, H.J. Lu wrote:
> > On Mon, May 31, 2021 at 6:01 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * H. J. Lu:
> >>
> >>> In the kernel, clone3 and clone go to the same piece of code.  clone won't
> >>> fail just because of CET.
> >>
> >> But clone won't have access to the stack boundaries.  Won't this create
> >> issues for setting up the shadow stack?
> >>
> >
> > No.  There are:
> >
> >         /* Cap shadow stack size to 4 GB */
> >         size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
> >         size = min(size, stack_size);
> >
> > where stack_size is passed in clone3.
>
> Right, so CET support does not really require clone3 to be used internally
> then? Or am I missing something?

Shadow stack size shouldn't be more than normal stack size.  The current
CET kernel shadow stack size may not be optimal.  My original code did

if (stack_size != 0)
  size = stack_size;
else
  size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);

But

1. I don't want to disturb it before CET changes are upstreamed.
2. It can be updated AFTER it has been upstreamed.
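
The two size computations quoted in this thread can be compared side by side
with a small user-space sketch.  It takes both snippets literally and says
nothing about the surrounding kernel code, which is not shown here.  The
function names are hypothetical, SZ_4G is redefined locally, and the
RLIMIT_STACK value is passed in as a plain parameter so the functions stay
pure.

```c
#include <stdint.h>

/* Local stand-in for the kernel's SZ_4G constant.  */
#define SZ_4G (UINT64_C (4) << 30)

static uint64_t
min_u64 (uint64_t a, uint64_t b)
{
  return a < b ? a : b;
}

/* First quoted variant: start from the stack rlimit capped at 4 GB,
   then cap again by the stack size passed to clone3.  */
uint64_t
shadow_size_capped (uint64_t stack_rlimit, uint64_t stack_size)
{
  uint64_t size = min_u64 (stack_rlimit, SZ_4G);
  return min_u64 (size, stack_size);
}

/* Second quoted variant ("my original code"): use the stack size when
   one was given, otherwise the stack rlimit capped at 4 GB.  */
uint64_t
shadow_size_preferring_stack (uint64_t stack_rlimit, uint64_t stack_size)
{
  if (stack_size != 0)
    return stack_size;
  return min_u64 (stack_rlimit, SZ_4G);
}
```

For an 8 MiB rlimit and a 2 MiB thread stack, both variants yield 2 MiB.
They differ when the stack size exceeds the capped rlimit: the first variant
stays at the cap, the second follows the stack size.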


--
H.J.
  
Adhemerval Zanella Netto May 31, 2021, 3:57 p.m. UTC | #16
On 31/05/2021 11:01, H.J. Lu wrote:
> On Mon, May 31, 2021 at 6:53 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 31/05/2021 10:16, H.J. Lu wrote:
>>> On Mon, May 31, 2021 at 6:01 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>
>>>> * H. J. Lu:
>>>>
>>>>> In the kernel, clone3 and clone go to the same piece of code.  clone won't
>>>>> fail just because of CET.
>>>>
>>>> But clone won't have access to the stack boundaries.  Won't this create
>>>> issues for setting up the shadow stack?
>>>>
>>>
>>> No.  There are:
>>>
>>>         /* Cap shadow stack size to 4 GB */
>>>         size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
>>>         size = min(size, stack_size);
>>>
>>> where stack_size is passed in clone3.
>>
>> Right, so CET support does not really require clone3 to be used internally
>> then? Or am I missing something?
> 
> Shadow stack size shouldn't be more than normal stack size.  The current
> CET kernel shadow stack size may not be optimal.  My original code did
> 
> if (stack_size != 0)
>   size = stack_size;
> else
>   size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
> 
> But
> 
> 1. I don't want to disturb it before CET changes are upstreamed.
> 2. It can be updated AFTER it has been upstreamed.

Right, so I take it this is just an optimization, assuming that the extra
size would be unused, right?  I am still failing to see why clone3 is a
requirement for CET enablement (if I understood this correctly).

I still think supporting clone3 is a nice thing to have, especially
for possible newer architectures and to support newer flags and
functionality.
  
H.J. Lu May 31, 2021, 4 p.m. UTC | #17
On Mon, May 31, 2021 at 8:57 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 31/05/2021 11:01, H.J. Lu wrote:
> > On Mon, May 31, 2021 at 6:53 AM Adhemerval Zanella
> > <adhemerval.zanella@linaro.org> wrote:
> >>
> >>
> >>
> >> On 31/05/2021 10:16, H.J. Lu wrote:
> >>> On Mon, May 31, 2021 at 6:01 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>>>
> >>>> * H. J. Lu:
> >>>>
> >>>>> In the kernel, clone3 and clone go to the same piece of code.  clone won't
> >>>>> fail just because of CET.
> >>>>
> >>>> But clone won't have access to the stack boundaries.  Won't this create
> >>>> issues for setting up the shadow stack?
> >>>>
> >>>
> >>> No.  There are:
> >>>
> >>>         /* Cap shadow stack size to 4 GB */
> >>>         size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
> >>>         size = min(size, stack_size);
> >>>
> >>> where stack_size is passed in clone3.
> >>
> >> Right, so CET support does not really require clone3 to be used internally
> >> then? Or am I missing something?
> >
> > Shadow stack size shouldn't be more than normal stack size.  The current
> > CET kernel shadow stack size may not be optimal.  My original code did
> >
> > if (stack_size != 0)
> >   size = stack_size;
> > else
> >   size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
> >
> > But
> >
> > 1. I don't want to disturb it before CET changes are upstreamed.
> > 2. It can be updated AFTER it has been upstreamed.
>
> Right, so I take it this is just an optimization, assuming that the extra
> size would be unused, right?  I am still failing to see why clone3 is a

Correct.

> requirement for CET enablement (if I understood this correctly).

It isn't a MUST have.  It is an improvement for CET.

> I still think supporting clone3 is a nice thing to have, especially
> for possible newer architectures and to support newer flags and
> functionality.
  

Patch

diff --git a/include/clone_internal.h b/include/clone_internal.h
new file mode 100644
index 0000000000..4b23ef33ce
--- /dev/null
+++ b/include/clone_internal.h
@@ -0,0 +1,16 @@ 
+#ifndef _CLONE3_H
+#include_next <clone3.h>
+
+extern __typeof (clone3) __clone3;
+
+/* The internal wrapper of clone/clone2 and clone3.  If __clone3 returns
+   -1 with ENOSYS, fall back to clone or clone2.  */
+extern int __clone_internal (struct clone_args *__cl_args,
+			     int (*__func) (void *__arg), void *__arg);
+
+#ifndef _ISOMAC
+libc_hidden_proto (__clone3)
+libc_hidden_proto (__clone_internal)
+#endif
+
+#endif
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index dc81a2ca73..eebf9c2c3c 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -33,47 +33,6 @@ 
 #include <kernel-features.h>
 #include <nptl-stack.h>
 
-#ifndef NEED_SEPARATE_REGISTER_STACK
-
-/* Most architectures have exactly one stack pointer.  Some have more.  */
-# define STACK_VARIABLES void *stackaddr = NULL
-
-/* How to pass the values to the 'create_thread' function.  */
-# define STACK_VARIABLES_ARGS stackaddr
-
-/* How to declare function which gets there parameters.  */
-# define STACK_VARIABLES_PARMS void *stackaddr
-
-/* How to declare allocate_stack.  */
-# define ALLOCATE_STACK_PARMS void **stack
-
-/* This is how the function is called.  We do it this way to allow
-   other variants of the function to have more parameters.  */
-# define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)
-
-#else
-
-/* We need two stacks.  The kernel will place them but we have to tell
-   the kernel about the size of the reserved address space.  */
-# define STACK_VARIABLES void *stackaddr = NULL; size_t stacksize = 0
-
-/* How to pass the values to the 'create_thread' function.  */
-# define STACK_VARIABLES_ARGS stackaddr, stacksize
-
-/* How to declare function which gets there parameters.  */
-# define STACK_VARIABLES_PARMS void *stackaddr, size_t stacksize
-
-/* How to declare allocate_stack.  */
-# define ALLOCATE_STACK_PARMS void **stack, size_t *stacksize
-
-/* This is how the function is called.  We do it this way to allow
-   other variants of the function to have more parameters.  */
-# define ALLOCATE_STACK(attr, pd) \
-  allocate_stack (attr, pd, &stackaddr, &stacksize)
-
-#endif
-
-
 /* Default alignment of stack.  */
 #ifndef STACK_ALIGN
 # define STACK_ALIGN __alignof__ (long double)
@@ -249,7 +208,7 @@  advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
    PDP must be non-NULL.  */
 static int
 allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
-		ALLOCATE_STACK_PARMS)
+		void **stack, size_t *stacksize)
 {
   struct pthread *pd;
   size_t size;
@@ -600,25 +559,17 @@  allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
   /* We place the thread descriptor at the end of the stack.  */
   *pdp = pd;
 
-#if _STACK_GROWS_DOWN
   void *stacktop;
 
-# if TLS_TCB_AT_TP
+#if TLS_TCB_AT_TP
   /* The stack begins before the TCB and the static TLS block.  */
   stacktop = ((char *) (pd + 1) - tls_static_size_for_stack);
-# elif TLS_DTV_AT_TP
+#elif TLS_DTV_AT_TP
   stacktop = (char *) (pd - 1);
-# endif
+#endif
 
-# ifdef NEED_SEPARATE_REGISTER_STACK
+  *stacksize = stacktop - pd->stackblock;
   *stack = pd->stackblock;
-  *stacksize = stacktop - *stack;
-# else
-  *stack = stacktop;
-# endif
-#else
-  *stack = pd->stackblock;
-#endif
 
   return 0;
 }
diff --git a/nptl/createthread.c b/nptl/createthread.c
index 46943b33fe..2ac83111ec 100644
--- a/nptl/createthread.c
+++ b/nptl/createthread.c
@@ -25,7 +25,8 @@ 
 
 static int
 create_thread (struct pthread *pd, const struct pthread_attr *attr,
-	       bool *stopped_start, STACK_VARIABLES_PARMS, bool *thread_ran)
+	       bool *stopped_start, void *stackaddr, size_t stacksize,
+	       bool *thread_ran)
 {
   /* If the implementation needs to do some tweaks to the thread after
      it has been created at the OS level, it can set STOPPED_START here.  */
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 5680687efe..5faf1654e0 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -243,8 +243,8 @@  late_init (void)
    be set to true iff the thread actually started up and then got
    canceled before calling user code (*PD->start_routine).  */
 static int create_thread (struct pthread *pd, const struct pthread_attr *attr,
-			  bool *stopped_start, STACK_VARIABLES_PARMS,
-			  bool *thread_ran);
+			  bool *stopped_start, void *stackaddr,
+			  size_t stacksize, bool *thread_ran);
 
 #include <createthread.c>
 
@@ -498,7 +498,8 @@  int
 __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
 		      void *(*start_routine) (void *), void *arg)
 {
-  STACK_VARIABLES;
+  void *stackaddr = NULL;
+  size_t stacksize = 0;
 
   /* Avoid a data race in the multi-threaded case, and call the
      deferred initialization only once.  */
@@ -522,7 +523,7 @@  __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
     }
 
   struct pthread *pd = NULL;
-  int err = ALLOCATE_STACK (iattr, &pd);
+  int err = allocate_stack (iattr, &pd, &stackaddr, &stacksize);
   int retval = 0;
 
   if (__glibc_unlikely (err != 0))
@@ -667,8 +668,8 @@  __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
 
       /* We always create the thread stopped at startup so we can
 	 notify the debugger.  */
-      retval = create_thread (pd, iattr, &stopped_start,
-			      STACK_VARIABLES_ARGS, &thread_ran);
+      retval = create_thread (pd, iattr, &stopped_start, stackaddr,
+			      stacksize, &thread_ran);
       if (retval == 0)
 	{
 	  /* We retain ownership of PD until (a) (see CONCURRENCY NOTES
@@ -699,8 +700,8 @@  __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
 	}
     }
   else
-    retval = create_thread (pd, iattr, &stopped_start,
-			    STACK_VARIABLES_ARGS, &thread_ran);
+    retval = create_thread (pd, iattr, &stopped_start, stackaddr,
+			    stacksize, &thread_ran);
 
   /* Return to the previous signal mask, after creating the new
      thread.  */
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index e9566e028a..fcc52763cd 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -64,7 +64,7 @@  sysdep_routines += adjtimex clone umount umount2 readahead sysctl \
 		   time64-support pselect32 \
 		   xstat fxstat lxstat xstat64 fxstat64 lxstat64 \
 		   fxstatat fxstatat64 \
-		   xmknod xmknodat
+		   xmknod xmknodat clone3 clone-internal
 
 CFLAGS-gethostid.c = -fexceptions
 CFLAGS-tee.c = -fexceptions -fasynchronous-unwind-tables
diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
new file mode 100644
index 0000000000..1e7a8f6b35
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/clone-internal.c
@@ -0,0 +1,91 @@ 
+/* The internal wrapper of clone and clone3.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <stddef.h>
+#include <errno.h>
+#include <sched.h>
+#include <clone_internal.h>
+#include <libc-pointer-arith.h>	/* For cast_to_pointer.  */
+#include <stackinfo.h>		/* For _STACK_GROWS_{UP,DOWN}.  */
+
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+
+#define sizeof_field(TYPE, MEMBER) sizeof ((((TYPE *)0)->MEMBER))
+#define offsetofend(TYPE, MEMBER) \
+  (offsetof (TYPE, MEMBER) + sizeof_field (TYPE, MEMBER))
+
+_Static_assert (__alignof (struct clone_args) == 8,
+		"__alignof (struct clone_args) != 8");
+_Static_assert (offsetofend (struct clone_args, tls) == CLONE_ARGS_SIZE_VER0,
+		"offsetofend (struct clone_args, tls) != CLONE_ARGS_SIZE_VER0");
+_Static_assert (offsetofend (struct clone_args, set_tid_size) == CLONE_ARGS_SIZE_VER1,
+		"offsetofend (struct clone_args, set_tid_size) != CLONE_ARGS_SIZE_VER1");
+_Static_assert (offsetofend (struct clone_args, cgroup) == CLONE_ARGS_SIZE_VER2,
+		"offsetofend (struct clone_args, cgroup) != CLONE_ARGS_SIZE_VER2");
+_Static_assert (sizeof (struct clone_args) == CLONE_ARGS_SIZE_VER2,
+		"sizeof (struct clone_args) != CLONE_ARGS_SIZE_VER2");
+
+int
+__clone_internal (struct clone_args *cl_args,
+		  int (*func) (void *arg), void *arg)
+{
+  int ret;
+#ifdef HAVE_CLONE3_WAPPER
+  /* Try clone3 first.  */
+  int saved_errno = errno;
+  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
+  if (ret != -1 || errno != ENOSYS)
+    return ret;
+
+  /* NB: Restore errno so that the ENOSYS set by clone3 is not seen by
+     callers that check errno after the fallback below returns.  */
+  __set_errno (saved_errno);
+#endif
+
+  /* Map clone3 arguments to clone arguments.  NB: There is no need to
+     check for invalid clone3-specific bits in flags or exit_signal,
+     since this is an internal function.  */
+  int flags = cl_args->flags | cl_args->exit_signal;
+  void *stack = cast_to_pointer (cl_args->stack);
+
+#ifdef __ia64__
+  ret = __clone2 (func, stack, cl_args->stack_size,
+		  flags, arg,
+		  cast_to_pointer (cl_args->parent_tid),
+		  cast_to_pointer (cl_args->tls),
+		  cast_to_pointer (cl_args->child_tid));
+#else
+# if !_STACK_GROWS_DOWN && !_STACK_GROWS_UP
+#  error "Define either _STACK_GROWS_DOWN or _STACK_GROWS_UP"
+# endif
+
+# if _STACK_GROWS_DOWN
+  stack += cl_args->stack_size;
+# endif
+  ret = __clone (func, stack, flags, arg,
+		 cast_to_pointer (cl_args->parent_tid),
+		 cast_to_pointer (cl_args->tls),
+		 cast_to_pointer (cl_args->child_tid));
+#endif
+  return ret;
+}
+
+libc_hidden_def (__clone_internal)
diff --git a/sysdeps/unix/sysv/linux/clone3.c b/sysdeps/unix/sysv/linux/clone3.c
new file mode 100644
index 0000000000..de963ef89d
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/clone3.c
@@ -0,0 +1 @@ 
+/* An empty placeholder.  */
diff --git a/sysdeps/unix/sysv/linux/clone3.h b/sysdeps/unix/sysv/linux/clone3.h
new file mode 100644
index 0000000000..0488884d59
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/clone3.h
@@ -0,0 +1,60 @@ 
+/* The wrapper of clone3.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef _CLONE3_H
+#define _CLONE3_H	1
+
+#include <features.h>
+#include <stdint.h>
+#include <stddef.h>
+
+__BEGIN_DECLS
+
+/* This struct should only be used in an argument to the clone3 system
+   call (along with its size argument).  It may be extended with new
+   fields in the future.  */
+
+struct clone_args
+{
+  uint64_t flags;	 /* Flags bit mask.  */
+  uint64_t pidfd;	 /* Where to store PID file descriptor
+			    (int *).  */
+  uint64_t child_tid;	 /* Where to store child TID, in child's memory
+			    (pid_t *).  */
+  uint64_t parent_tid;	 /* Where to store child TID, in parent's memory
+			    (pid_t *).  */
+  uint64_t exit_signal;	 /* Signal to deliver to parent on child
+			    termination.  */
+  uint64_t stack;	 /* The lowest address of stack.  */
+  uint64_t stack_size;	 /* Size of stack.  */
+  uint64_t tls;		 /* Location of new TLS.  */
+  uint64_t set_tid;	 /* Pointer to a pid_t array
+			    (since Linux 5.5).  */
+  uint64_t set_tid_size; /* Number of elements in set_tid
+			    (since Linux 5.5).  */
+  uint64_t cgroup;	 /* File descriptor for target cgroup
+			    of child (since Linux 5.7).  */
+} __attribute__ ((aligned (8)));
+
+/* The wrapper of clone3.  */
+extern int clone3 (struct clone_args *__cl_args, size_t __size,
+		   int (*__func) (void *__arg), void *__arg);
+
+__END_DECLS
+
+#endif /* clone3.h */
diff --git a/sysdeps/unix/sysv/linux/createthread.c b/sysdeps/unix/sysv/linux/createthread.c
index bc3409b326..406c73ba00 100644
--- a/sysdeps/unix/sysv/linux/createthread.c
+++ b/sysdeps/unix/sysv/linux/createthread.c
@@ -25,15 +25,10 @@ 
 #include <ldsodefs.h>
 #include <tls.h>
 #include <stdint.h>
+#include <clone_internal.h>
 
 #include <arch-fork.h>
 
-#ifdef __NR_clone2
-# define ARCH_CLONE __clone2
-#else
-# define ARCH_CLONE __clone
-#endif
-
 /* See the comments in pthread_create.c for the requirements for these
    two macros and the create_thread function.  */
 
@@ -47,7 +42,8 @@  static int start_thread (void *arg) __attribute__ ((noreturn));
 
 static int
 create_thread (struct pthread *pd, const struct pthread_attr *attr,
-	       bool *stopped_start, STACK_VARIABLES_PARMS, bool *thread_ran)
+	       bool *stopped_start, void *stackaddr, size_t stacksize,
+	       bool *thread_ran)
 {
   /* Determine whether the newly created threads has to be started
      stopped since we have to set the scheduling parameters or set the
@@ -100,9 +96,18 @@  create_thread (struct pthread *pd, const struct pthread_attr *attr,
 
   TLS_DEFINE_INIT_TP (tp, pd);
 
-  if (__glibc_unlikely (ARCH_CLONE (&start_thread, STACK_VARIABLES_ARGS,
-				    clone_flags, pd, &pd->tid, tp, &pd->tid)
-			== -1))
+  struct clone_args args =
+    {
+      .flags = clone_flags,
+      .pidfd = (uintptr_t) &pd->tid,
+      .parent_tid = (uintptr_t) &pd->tid,
+      .child_tid = (uintptr_t) &pd->tid,
+      .stack = (uintptr_t) stackaddr,
+      .stack_size = stacksize,
+      .tls = (uintptr_t) tp,
+    };
+  int ret = __clone_internal (&args, &start_thread, pd);
+  if (__glibc_unlikely (ret == -1))
     return errno;
 
   /* It's started now, so if we fail below, we'll have to cancel it
diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index 501f8fbccd..fd29858cf5 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -31,6 +31,7 @@ 
 #include <dl-sysdep.h>
 #include <libc-pointer-arith.h>
 #include <ldsodefs.h>
+#include <clone_internal.h>
 #include "spawn_int.h"
 
 /* The Linux implementation of posix_spawn{p} uses the clone syscall directly
@@ -59,21 +60,6 @@ 
    normal program exit with the exit code 127.  */
 #define SPAWN_ERROR	127
 
-#ifdef __ia64__
-# define CLONE(__fn, __stackbase, __stacksize, __flags, __args) \
-  __clone2 (__fn, __stackbase, __stacksize, __flags, __args, 0, 0, 0)
-#else
-# define CLONE(__fn, __stack, __stacksize, __flags, __args) \
-  __clone (__fn, __stack, __flags, __args)
-#endif
-
-/* Since ia64 wants the stackbase w/clone2, re-use the grows-up macro.  */
-#if _STACK_GROWS_UP || defined (__ia64__)
-# define STACK(__stack, __stack_size) (__stack)
-#elif _STACK_GROWS_DOWN
-# define STACK(__stack, __stack_size) (__stack + __stack_size)
-#endif
-
 
 struct posix_spawn_args
 {
@@ -378,8 +364,14 @@  __spawnix (pid_t * pid, const char *file,
      need for CLONE_SETTLS.  Although parent and child share the same TLS
      namespace, there will be no concurrent access for TLS variables (errno
      for instance).  */
-  new_pid = CLONE (__spawni_child, STACK (stack, stack_size), stack_size,
-		   CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
+  struct clone_args clone_args =
+    {
+      .flags = CLONE_VM | CLONE_VFORK,
+      .exit_signal = SIGCHLD,
+      .stack = (uintptr_t) stack,
+      .stack_size = stack_size,
+    };
+  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
 
   /* It needs to collect the case where the auxiliary process was created
      but failed to execute the file (due either any preparation step or