Implement the mlock2 function
Commit Message
Fallback using mlock is provided if the flags argument is zero.
2017-11-24 Florian Weimer <fweimer@redhat.com>
* sysdeps/unix/sysv/linux/mlock2.c: New file.
* sysdeps/unix/sysv/linux/tst-mlock2.c: Likewise.
* sysdeps/unix/sysv/linux/Makefile (routines): Add mlock2.
(tests): Add tst-mlock2.
* sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Export mlock2.
* sysdeps/unix/sysv/linux/libc**.abilist: Update.
* manual/memory.texi (Page Lock Functions): Move @end deftypefun
for mlock. Document mlock2.
Comments
On Fri, 24 Nov 2017, Florian Weimer wrote:
> Fallback using mlock is provided if the flags argument is zero.
>
> 2017-11-24 Florian Weimer <fweimer@redhat.com>
>
> * sysdeps/unix/sysv/linux/mlock2.c: New file.
> * sysdeps/unix/sysv/linux/tst-mlock2.c: Likewise.
> * sysdeps/unix/sysv/linux/Makefile (routines): Add mlock2.
> (tests): Add tst-mlock2.
> * sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Export mlock2.
> * sysdeps/unix/sysv/linux/libc**.abilist: Update.
> * manual/memory.texi (Page Lock Functions): Move @end deftypefun
> for mlock. Document mlock2.
The ChangeLog entry is missing the kernel-features.h change.
The patch is missing a NEWS update.
On 11/24/2017 06:02 PM, Joseph Myers wrote:
> The ChangeLog entry is missing the kernel-features.h change.
>
> The patch is missing a NEWS update.
Thanks, I've fixed those locally.
Florian
On 24/11/2017 14:59, Florian Weimer wrote:
> diff --git a/sysdeps/unix/sysv/linux/mlock2.c b/sysdeps/unix/sysv/linux/mlock2.c
> new file mode 100644
> index 0000000000..1646cfb9e1
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/mlock2.c
> @@ -0,0 +1,40 @@
> +/* Wrapper for the mlock2 system call with fallback to mlock.
> + Copyright (C) 2017 Free Software Foundation, Inc.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sys/mman.h>
> +#include <errno.h>
> +#include <sysdep.h>
> +
> +int
> +mlock2 (const void *addr, size_t length, unsigned int flags)
> +{
> +#ifdef __ASSUME_MLOCK2
> + return INLINE_SYSCALL (mlock2, 3, addr, length, flags);
> +#else
> + if (flags == 0)
> + return INLINE_SYSCALL (mlock, 2, addr, length);
> +# ifdef __NR_mlock2
> + int ret = INLINE_SYSCALL (mlock2, 3, addr, length, flags);
> + if (ret == 0 || errno != ENOSYS)
> + return ret;
> +# endif /* __NR_mlock2 */
> + /* Treat the missing system call as an invalid (non-zero) flag
> + argument. */
> + __set_errno (EINVAL);
> + return -1;
> +#endif /* __ASSUME_MLOCK2 */
> +}
We have INLINE_SYSCALL_CALL, which simplifies this and avoids mismatches
between the stated argument count and the actual arguments (not a problem
in this case).
I am not sure it is better to report EINVAL in place of ENOSYS, mainly
because the substitution will not be visible in a syscall trace. But I do
not have a strong opinion here.
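The idea behind a macro like INLINE_SYSCALL_CALL is that the preprocessor derives the argument count, so it cannot drift from the actual argument list. The following is a minimal sketch of that counting technique only, not glibc's actual macro definition. SKETCH_NARGS and SKETCH_NARGS_PICK are hypothetical names, and the first macro argument stands in for the system call name.

```c
/* Hypothetical sketch of preprocessor argument counting; these are not
   glibc's real macros.  The first argument plays the role of the
   system call name and is not counted.  Supports 0 to 6 arguments.  */
#define SKETCH_NARGS_PICK(name, a1, a2, a3, a4, a5, a6, n, ...) n
#define SKETCH_NARGS(...) \
  SKETCH_NARGS_PICK (__VA_ARGS__, 6, 5, 4, 3, 2, 1, 0, 0)

/* The count is a compile-time constant, so a dispatcher macro can use
   it to select the matching N-argument system call helper.  */
_Static_assert (SKETCH_NARGS (mlock2, addr, length, flags) == 3,
                "mlock2 takes three arguments");
_Static_assert (SKETCH_NARGS (mlock, addr, length) == 2,
                "mlock takes two arguments");
```

The list of descending numbers is shifted right by however many real arguments are present, so the value that lands in the `n` slot is the argument count.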
On 11/24/2017 06:24 PM, Adhemerval Zanella wrote:
>> +int
>> +mlock2 (const void *addr, size_t length, unsigned int flags)
>> +{
>> +#ifdef __ASSUME_MLOCK2
>> + return INLINE_SYSCALL (mlock2, 3, addr, length, flags);
>> +#else
>> + if (flags == 0)
>> + return INLINE_SYSCALL (mlock, 2, addr, length);
>> +# ifdef __NR_mlock2
>> + int ret = INLINE_SYSCALL (mlock2, 3, addr, length, flags);
>> + if (ret == 0 || errno != ENOSYS)
>> + return ret;
>> +# endif /* __NR_mlock2 */
>> + /* Treat the missing system call as an invalid (non-zero) flag
>> + argument. */
>> + __set_errno (EINVAL);
>> + return -1;
>> +#endif /* __ASSUME_MLOCK2 */
>> +}
>
> We have the INLINE_SYSCALL_CALL to simplify and avoid issue with mismatch
> input number and arguments (which is not this case).
I'll switch to that, thanks.
> I am not sure if it is better to advertise EINVAL for ENOSYS mainly
> because it won't be transparent on a syscall trace. But I do not have
> a strong opinion here.
It's for consistency. If you specify a (yet unsupported) flag, you get
EINVAL with the kernel implementation, but you would get ENOSYS with the
userspace implementation. This matters if more flags are added. It
would be just another error case for the application to check and a
potential for application bugs (say because it is only tested on
mlock2-with-new-flag and mlock2 kernels, not on mlock2-less kernels).
In case this isn't obvious (and you're objecting on principle), I'll add
a comment. 8-)
Thanks,
Florian
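To illustrate the consistency argument from the caller's side, here is a minimal sketch, assuming the wrapper behaves as described in this thread (a missing system call is reported as EINVAL, and flags == 0 always works). The helper lock_prefer_onfault, the mlock2_fn typedef, and the test double fake_mlock2_old_kernel are hypothetical names introduced for illustration; they are not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Same fallback definition as in the patch, for older headers.  */
#ifndef MLOCK_ONFAULT
# define MLOCK_ONFAULT 1U
#endif

/* Signature of mlock2, so that a test double can be substituted.  */
typedef int (*mlock2_fn) (const void *, size_t, unsigned int);

/* Hypothetical caller-side helper (not part of the patch): prefer
   on-fault locking and retry with flags == 0 if the flag is rejected.
   Because the wrapper reports a missing system call as EINVAL, one
   errno check covers both "flag unknown" and "mlock2 unavailable".  */
int
lock_prefer_onfault (mlock2_fn lock2, const void *addr, size_t length)
{
  if (lock2 (addr, length, MLOCK_ONFAULT) == 0)
    return 0;
  if (errno != EINVAL)
    return -1;
  return lock2 (addr, length, 0);
}

/* Records the flags of the most recent call to the test double.  */
unsigned int fake_last_flags;

/* Test double modelling the wrapper on a kernel without mlock2, as
   described in the thread: flags == 0 succeeds, anything else fails
   with EINVAL.  */
int
fake_mlock2_old_kernel (const void *addr, size_t length, unsigned int flags)
{
  (void) addr;
  (void) length;
  fake_last_flags = flags;
  if (flags == 0)
    return 0;
  errno = EINVAL;
  return -1;
}
```

A real caller would pass mlock2 itself as the first argument. If the wrapper let ENOSYS through instead, the helper would need a second errno check that is only ever exercised on mlock2-less kernels.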
On 11/28/2017 10:25 AM, Rical Jasan wrote:
>> +Like @code{mlock}, @code{mlock2} returns zero on success and @code{-1}
>
> I thought you were in the @math camp for return values. :P
mlock uses @code, so it was a difficult decision. 8-/
>> @deftypefun int munlock (const void *@var{addr}, size_t @var{len})
>> @standards{POSIX.1b, sys/mman.h}
>> @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
>
> The rest of the patch looked OK to me, but I'm not able to give an
> authoritative ACK on it. Seemed straightforward though.
I'm waiting for more feedback on the ENOSYS masking behavior.
Thanks,
Florian
On 24/11/2017 17:20, Florian Weimer wrote:
> It's for consistency. If you specify a (yet unsupported) flag, you get EINVAL with the kernel implementation, but you would get ENOSYS with the userspace implementation. This matters if more flags are added. It would be just another error case for the application to check and a potential for application bugs (say because it is only tested on mlock2-with-new-flag and mlock2 kernels, not on mlock2-less kernels).
>
> In case this isn't obvious (and you're objecting on principle), I'll add a comment. 8-)
This is a reasonable approach and I am OK with this patch with the
INLINE_SYSCALL_CALL change. I wonder whether it is worth adding a similar
change to p{read,write}v2 so that it returns ENOTSUP in the case of ENOSYS.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
On 11/27/2017 02:11 PM, Adhemerval Zanella wrote:
> This is a reasonable approach and I am ok with this patch with the
> INLINE_SYSCALL_CALL change. I wonder if it is worth to add a similar
> change to p{read,write}v2 to return ENOSUP in the case of ENOSYS.
My copy of the manual page says that EINVAL is used there as well.
Thanks,
Florian
On 27/11/2017 14:07, Florian Weimer wrote:
> On 11/27/2017 02:11 PM, Adhemerval Zanella wrote:
>> This is a reasonable approach and I am ok with this patch with the
>> INLINE_SYSCALL_CALL change. I wonder if it is worth to add a similar
>> change to p{read,write}v2 to return ENOSUP in the case of ENOSYS.
>
> My copy of the manual page says that EINVAL is used there as well.
>
> Thanks,
> Florian
Indeed, the man page [1] states that EINVAL is returned, but our
documentation states otherwise:
manual/llio.texi
@item EOPNOTSUPP

@c The default sysdeps/posix code will return it for any flags value
@c different than 0.
An unsupported @var{flags} was used.
Also, "tst-preadvwritev" on a 4.13.0-17-generic kernel indeed gets
EOPNOTSUPP (ENOTSUP) back from Linux:
[pid 7896] preadv2(3, <unfinished ...>
[pid 7895] <... rt_sigaction resumed> {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
[pid 7896] <... preadv2 resumed> [{iov_base=0x7ffca77846d0, iov_len=32}], 1, 0, 0x10 /* RWF_??? */) = -1 EOPNOTSUPP (Operation not supported)
[pid 7895] wait4(7896, <unfinished ...>
[pid 7896] pwritev2(3, [{iov_base="\0\0\0\0\0\0\0\0\241H\363?\244\177\0\0@Gx\247\374\177\0\0\0\0\0\0\0\0\0\0", iov_len=32}], 1, 0, 0x10 /* RWF_??? */) = -1 EOPNOTSUPP (Operation not supported)
So I think it would be worth changing p{read,write}v2 in glibc to
return EINVAL for invalid flags. I will prepare a patch.
[1] http://man7.org/linux/man-pages/man2/preadv2.2.html
On 11/27/2017 05:35 PM, Adhemerval Zanella wrote:
> So I think it would be worth to change p{read,write}v2 on GLIBC to
> return EINVAL for invalid flags. I will prepare a patch.
>
> [1] http://man7.org/linux/man-pages/man2/preadv2.2.html
Typo? Shouldn't we match the kernel behavior, so fail with EOPNOTSUPP?
(I double-checked, and for mlock2, the kernel does return EINVAL.)
Thanks,
Florian
On 27/11/2017 14:43, Florian Weimer wrote:
> Typo? Shouldn't we match the kernel behavior, so fail with EOPNOTSUPP?
Not for ENOSYS. I thought about following the man page definition, but
on second thought I agree that following the kernel would be better. I
still think it would be a small improvement to handle ENOSYS as ENOTSUP,
as you did for mlock2 with EINVAL.
Michael, I think you should update the man pages with the correct errno
for invalid flags.
>
> (I double-checked, and for mlock2, the kernel does return EINVAL.)
>
> Thanks,
> Florian
On 11/27/2017 07:40 PM, Adhemerval Zanella wrote:
> Not for ENOSYS. And I though about following manpages definition, but
> thinking twice I agree following the kernel would be better. I still
> think it will be a small improvement to handle ENOSYS as ENOSUP as you
> did for mlock2 and EINVAL.
Agreed: if you do zero-flag emulation using pwritev in user space,
returning the error code the kernel's pwritev2 uses for unknown flags
(here: EOPNOTSUPP) is the right approach when pwritev2 is not implemented,
because in that case *any* non-zero flag is unknown to the kernel.
Thanks,
Florian
On 11/24/2017 08:59 AM, Florian Weimer wrote:
> diff --git a/manual/memory.texi b/manual/memory.texi
> index 3f5dd90260..1b431bf5da 100644
> --- a/manual/memory.texi
> +++ b/manual/memory.texi
> @@ -3337,6 +3337,36 @@ The calling process is not superuser.
> The kernel does not provide @code{mlock} capability.
>
> @end table
> +@end deftypefun
> +
> +@deftypefun int mlock2 (const void *@var{addr}, size_t @var{len}, unsigned int @var{flags})
> +@standards{Linux, sys/mman.h}
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
> +
> +This function is similar to @code{mlock}. If @var{flags} is zero, a
> +call to @code{mlock2} behaves exactly as the equivalent call to @code{mlock}.
This is fine. As a matter of form, I think it would be nice to say
something about how mlock2 is different from mlock, but I'm not sure
there is much else to say than, "..., but it accepts a flags
argument", so I guess that's obvious enough.
> +
> +The @var{flags} argument must be a combination of zero or more of the
> +following flags:
> +
> +@vtable @code
> +@item MLOCK_ONFAULT
> +@standards{Linux, sys/mman.h}
> +Only those pages in the specified address range which are already in
> +memory are locked immediately. Additional pages in the range are
> +automatically locked in case of a page fault and allocation of memory.
> +@end vtable
> +
> +Like @code{mlock}, @code{mlock2} returns zero on success and @code{-1}
I thought you were in the @math camp for return values. :P
> +on failure, setting @code{errno} accordingly. Additional @code{errno}
> +values defined for @code{mlock2} are:
> +
> +@table @code
> +@item EINVAL
> +The specified (non-zero) @var{flags} argument is not supported by this
> +system.
> +@end table
> +@end deftypefun
>
> You can lock @emph{all} a process' memory with @code{mlockall}. You
> unlock memory with @code{munlock} or @code{munlockall}.
> @@ -3346,8 +3376,6 @@ To avoid all page faults in a C program, you have to use
> from the C code, e.g. the stack and automatic variables, and you
> wouldn't know what address to tell @code{mlock}.
>
> -@end deftypefun
> -
Good catch.
> @deftypefun int munlock (const void *@var{addr}, size_t @var{len})
> @standards{POSIX.1b, sys/mman.h}
> @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
The rest of the patch looked OK to me, but I'm not able to give an
authoritative ACK on it. Seemed straightforward though.
Rical
On 27/11/2017 16:46, Florian Weimer wrote:
> Agreed: if you do zero-flag emulation using pwritev in user space, returning the kernel unknown flag used by pwritev2 (here: EOPNOTSUPP) when *any* non-zero flag is unknown by the kernel (because pwritev2 is not implemented) is the right approach.
We are already doing the correct thing for p{read,write}v2: if the
system call fails with ENOSYS, the Linux implementation returns ENOTSUP
for non-zero flags and otherwise calls p{read,write}v.
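The fallback path described above can be sketched as follows. This is an illustration under stated assumptions, not glibc's actual source: sketch_pwritev2_fallback is a hypothetical name, and it models only the branch taken after the kernel has reported ENOSYS for pwritev2.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical model of the user-space path taken once pwritev2 is
   known to be unavailable.  A non-zero FLAGS value cannot be honoured,
   so it is rejected with the same error code the kernel uses for
   unknown flags; FLAGS == 0 is emulated with plain pwritev.  */
ssize_t
sketch_pwritev2_fallback (int fd, const struct iovec *iov, int count,
                          off_t offset, int flags)
{
  if (flags != 0)
    {
      errno = EOPNOTSUPP;
      return -1;
    }
  return pwritev (fd, iov, count, offset);
}
```

With this shape, an application sees EOPNOTSUPP for an unsupported flag regardless of whether the kernel implements pwritev2, which mirrors the EINVAL handling chosen for mlock2.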
@@ -3337,6 +3337,36 @@ The calling process is not superuser.
The kernel does not provide @code{mlock} capability.
@end table
+@end deftypefun
+
+@deftypefun int mlock2 (const void *@var{addr}, size_t @var{len}, unsigned int @var{flags})
+@standards{Linux, sys/mman.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+This function is similar to @code{mlock}. If @var{flags} is zero, a
+call to @code{mlock2} behaves exactly as the equivalent call to @code{mlock}.
+
+The @var{flags} argument must be a combination of zero or more of the
+following flags:
+
+@vtable @code
+@item MLOCK_ONFAULT
+@standards{Linux, sys/mman.h}
+Only those pages in the specified address range which are already in
+memory are locked immediately. Additional pages in the range are
+automatically locked in case of a page fault and allocation of memory.
+@end vtable
+
+Like @code{mlock}, @code{mlock2} returns zero on success and @code{-1}
+on failure, setting @code{errno} accordingly. Additional @code{errno}
+values defined for @code{mlock2} are:
+
+@table @code
+@item EINVAL
+The specified (non-zero) @var{flags} argument is not supported by this
+system.
+@end table
+@end deftypefun
You can lock @emph{all} a process' memory with @code{mlockall}. You
unlock memory with @code{munlock} or @code{munlockall}.
@@ -3346,8 +3376,6 @@ To avoid all page faults in a C program, you have to use
from the C code, e.g. the stack and automatic variables, and you
wouldn't know what address to tell @code{mlock}.
-@end deftypefun
-
@deftypefun int munlock (const void *@var{addr}, size_t @var{len})
@standards{POSIX.1b, sys/mman.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@@ -18,7 +18,7 @@ sysdep_routines += clone umount umount2 readahead \
setfsuid setfsgid epoll_pwait signalfd \
eventfd eventfd_read eventfd_write prlimit \
personality epoll_wait tee vmsplice splice \
- open_by_handle_at
+ open_by_handle_at mlock2
CFLAGS-gethostid.c = -fexceptions
CFLAGS-tee.c = -fexceptions -fasynchronous-unwind-tables
@@ -44,7 +44,7 @@ sysdep_headers += sys/mount.h sys/acct.h sys/sysctl.h \
tests += tst-clone tst-clone2 tst-clone3 tst-fanotify tst-personality \
tst-quota tst-sync_file_range tst-sysconf-iov_max tst-ttyname \
- test-errno-linux tst-memfd_create
+ test-errno-linux tst-memfd_create tst-mlock2
# Generate the list of SYS_* macros for the system calls (__NR_*
# macros). The file syscall-names.list contains all possible system
@@ -168,6 +168,7 @@ libc {
}
GLIBC_2.27 {
memfd_create;
+ mlock2;
}
GLIBC_PRIVATE {
# functions used in other libraries
@@ -2107,6 +2107,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -2018,6 +2018,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -108,6 +108,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.4 GLIBC_2.4 A
GLIBC_2.4 _Exit F
GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
@@ -28,12 +28,21 @@
# define MFD_HUGETLB 4U
# endif
+/* Flags for mlock2. */
+# ifndef MLOCK_ONFAULT
+# define MLOCK_ONFAULT 1U
+# endif
+
__BEGIN_DECLS
/* Create a new memory file descriptor. NAME is a name for debugging.
FLAGS is a combination of the MFD_* constants. */
int memfd_create (const char *__name, unsigned int __flags) __THROW;
+/* Lock pages from ADDR (inclusive) to ADDR + LENGTH (exclusive) into
+ memory. FLAGS is a combination of the MLOCK_* flags above. */
+int mlock2 (const void *__addr, size_t __length, unsigned int __flags) __THROW;
+
__END_DECLS
#endif /* __USE_GNU */
@@ -1872,6 +1872,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -2037,6 +2037,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -1901,6 +1901,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -107,3 +107,7 @@
#if __LINUX_KERNEL_VERSION >= 0x031300
# define __ASSUME_EXECVEAT 1
#endif
+
+#if __LINUX_KERNEL_VERSION >= 0x040400
+# define __ASSUME_MLOCK2 1
+#endif
@@ -109,6 +109,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.4 GLIBC_2.4 A
GLIBC_2.4 _Exit F
GLIBC_2.4 _IO_2_1_stderr_ D 0x98
@@ -1986,6 +1986,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -2107,3 +2107,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
@@ -1961,6 +1961,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -1959,6 +1959,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -1957,6 +1957,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -1952,6 +1952,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
new file mode 100644
@@ -0,0 +1,40 @@
+/* Wrapper for the mlock2 system call with fallback to mlock.
+ Copyright (C) 2017 Free Software Foundation, Inc.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sys/mman.h>
+#include <errno.h>
+#include <sysdep.h>
+
+int
+mlock2 (const void *addr, size_t length, unsigned int flags)
+{
+#ifdef __ASSUME_MLOCK2
+ return INLINE_SYSCALL (mlock2, 3, addr, length, flags);
+#else
+ if (flags == 0)
+ return INLINE_SYSCALL (mlock, 2, addr, length);
+# ifdef __NR_mlock2
+ int ret = INLINE_SYSCALL (mlock2, 3, addr, length, flags);
+ if (ret == 0 || errno != ENOSYS)
+ return ret;
+# endif /* __NR_mlock2 */
+ /* Treat the missing system call as an invalid (non-zero) flag
+ argument. */
+ __set_errno (EINVAL);
+ return -1;
+#endif /* __ASSUME_MLOCK2 */
+}
@@ -2148,3 +2148,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
@@ -1990,6 +1990,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -1995,6 +1995,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -2202,3 +2202,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
@@ -109,6 +109,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 _Exit F
GLIBC_2.3 _IO_2_1_stderr_ D 0xe0
@@ -1990,6 +1990,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -1891,6 +1891,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -1876,6 +1876,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -1983,6 +1983,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -1920,6 +1920,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.27 strfromf128 F
GLIBC_2.27 strtof128 F
GLIBC_2.27 strtof128_l F
@@ -2114,3 +2114,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
@@ -2114,3 +2114,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
@@ -2114,3 +2114,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
new file mode 100644
@@ -0,0 +1,66 @@
+/* Test the mlock2 function.
+ Copyright (C) 2017 Free Software Foundation, Inc.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <errno.h>
+#include <stdio.h>
+#include <support/check.h>
+#include <support/xunistd.h>
+#include <sys/mman.h>
+
+/* Allocate a page using mmap. */
+static void *
+get_page (void)
+{
+ return xmmap (NULL, 1, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1);
+}
+
+static int
+do_test (void)
+{
+ /* Current kernels have a small reserve of locked memory, so this
+ test does not need any privileges to run. */
+
+ void *page = get_page ();
+ if (mlock (page, 1) != 0)
+ FAIL_EXIT1 ("mlock: %m\n");
+ xmunmap (page, 1);
+
+ page = get_page ();
+ if (mlock2 (page, 1, 0) != 0)
+ /* Should be implemented using mlock if necessary. */
+ FAIL_EXIT1 ("mlock2 (0): %m\n");
+ xmunmap (page, 1);
+
+ page = get_page ();
+ int ret = mlock2 (page, 1, MLOCK_ONFAULT);
+ if (ret != 0)
+ {
+ TEST_VERIFY (ret == -1);
+ if (errno != EINVAL)
+ /* EINVAL means the system does not support the mlock2 system
+ call. */
+ FAIL_EXIT1 ("mlock2 (MLOCK_ONFAULT): %m\n");
+ else
+ puts ("warning: mlock2 system call not supported");
+ }
+ xmunmap (page, 1);
+
+ return 0;
+}
+
+#include <support/test-driver.c>
@@ -1878,6 +1878,7 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F
GLIBC_2.3 GLIBC_2.3 A
GLIBC_2.3 __ctype_b_loc F
GLIBC_2.3 __ctype_tolower_loc F
@@ -2121,3 +2121,4 @@ GLIBC_2.27 GLIBC_2.27 A
GLIBC_2.27 glob F
GLIBC_2.27 glob64 F
GLIBC_2.27 memfd_create F
+GLIBC_2.27 mlock2 F