posix_fallocate.3: Mention glibc emulation caveats.

Message ID 560E0567.7040204@redhat.com
State Not applicable

Commit Message

Carlos O'Donell Oct. 2, 2015, 4:17 a.m. UTC
  Michael,

You're going to really enjoy reading this patch ;-)

Patch applies to master.

When the glibc implementation of posix_fallocate detects
that the underlying filesystem does not support fallocate
it uses an emulation function to attempt to allocate the
space requested. The most common case is calling
posix_fallocate for a file that is on NFS where the
NFS server is not new enough to support the recent fallocate
extensions. This emulation has various serious caveats that
must be understood in order to use posix_fallocate robustly
on all filesystems. This change documents the caveats in the
glibc implementation.

Lastly, we expand the meaning of EINVAL to match POSIX
2013 (Issue 7). If the underlying filesystem doesn't support
posix_fallocate the implementation can return EINVAL, but
glibc does not do this; it emulates the operation instead.

Signed-off-by: Carlos O'Donell <carlos@redhat.com>

---

Cheers,
Carlos.
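
For readers following along, here is a minimal sketch of calling posix_fallocate and checking its result. One detail that is easy to miss: the function returns the error number directly instead of setting errno. The helper names reserve_space and create_preallocated are hypothetical and not part of the patch; the descriptor is opened O_RDWR without O_APPEND so that the glibc emulation, if it kicks in, is not rejected with EBADF.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve len bytes starting at offset.
 * Returns 0 on success or a positive error number. */
int reserve_space(int fd, off_t offset, off_t len)
{
    /* posix_fallocate returns the error number; it does not set errno. */
    int err = posix_fallocate(fd, offset, len);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}

/* Hypothetical usage: create (or open) path, reserve size bytes from
 * offset 0, and return the resulting file size, or -1 on failure. */
off_t create_preallocated(const char *path, off_t size)
{
    /* O_RDWR, no O_APPEND: the emulation needs to read and to write
     * at explicit offsets. */
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    struct stat st;
    off_t result = -1;
    if (reserve_space(fd, 0, size) == 0 && fstat(fd, &st) == 0)
        result = st.st_size;

    close(fd);
    return result;
}
```

On a filesystem without fallocate support the same call still succeeds through the emulation, so a zero return value alone does not tell the caller which path was taken.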
  

Comments

Michael Kerrisk (man-pages) Oct. 5, 2015, 9:06 a.m. UTC | #1
Hi Carlos,

On 10/02/2015 05:17 AM, Carlos O'Donell wrote:
> Michael,
> 
> You're going to really enjoy reading this patch ;-)

Thanks for the patch. What a sad story :-{

> Patch applies to master.
> 
> When the glibc implementation of posix_fallocate detects
> that the underlying filesystem does not support fallocate
> it uses an emulation function to attempt to allocate the
> space requested. The most common case is calling
> posix_fallocate for a file that is on NFS where the
> NFS server is not new enough to support the recent fallocate
> extensions. This emulation has various serious caveats that
> must be understood in order to use posix_fallocate robustly
> on all filesystems. This change documents the caveats in the
> glibc implementation.
> 
> Lastly, we expand the meaning of EINVAL to match POSIX
> 2013 (Issue 7). If the underlying filesystem doesn't support
> posix_fallocate the implementation can return EINVAL, but
> glibc does not do this; it emulates the operation instead.

Thanks. I've applied. I tweaked the wording a bit in a further
commit, and then made a further commit where I tried to fine tune
the technical details a little. Could you please check commit
624fbe44d9c1ef54eb3fd36328f59a5037b87986 and let me know if there
is any technical misstep there?

Thanks,

Michael

> Signed-off-by: Carlos O'Donell <carlos@redhat.com>
> 
> diff --git a/man3/posix_fallocate.3 b/man3/posix_fallocate.3
> index e35dcb9..1b91a37 100644
> --- a/man3/posix_fallocate.3
> +++ b/man3/posix_fallocate.3
> @@ -83,7 +83,8 @@ exceeds the maximum file size.
>  .I offset
>  was less than 0, or
>  .I len
> -was less than or equal to 0.
> +was less than or equal to 0, or the underlying filesystem does not
> +support the operation.
>  .TP
>  .B ENODEV
>  .I fd
> @@ -142,6 +143,30 @@ In the glibc implementation,
>  .BR posix_fallocate ()
>  is implemented using
>  .BR fallocate (2).
> +If the underlying filesystem does not support the
> +.BR fallocate (2)
> +syscall then the operation is emulated with the following caveats:
> +.IP * 2
> +The emulation is inefficient.
> +.IP *
> +There is a race condition where concurrent writes from another thread or
> +process could be overwritten with null bytes.
> +.IP *
> +There is a race condition where concurrent file size increase by
> +another thread or process could result in a file whose size is smaller
> +than expected.
> +.IP *
> +If fd has been opened with the O_APPEND or O_WRONLY flags the function
> +will fail with
> +.B EBADF.
> +.PP
> +In general the emulation is not MT-safe. On Linux, applications may use
> +.BR fallocate (2)
> +if they cannot work around the emulation caveats. In general this is
> +only recommended if the application plans to terminate the operation if
> +.B EOPNOTSUPP
> +is returned, otherwise the application itself will need to implement an
> +fallback with all the same problems as the emulation provided by glibc.
>  .SH SEE ALSO
>  .BR fallocate (1),
>  .BR fallocate (2),
> ---
> 
> Cheers,
> Carlos.
>
  
Carlos O'Donell Oct. 7, 2015, 1:44 p.m. UTC | #2
On 10/05/2015 05:06 AM, Michael Kerrisk (man-pages) wrote:
> Hi Carlos,
> 
> On 10/02/2015 05:17 AM, Carlos O'Donell wrote:
>> Michael,
>>
>> You're going to really enjoy reading this patch ;-)
> 
> Thanks for the patch. What a sad story :-{

I've gotten at least one hate mail for documenting how broken
it is when the underlying filesystem doesn't support it ;-)

Florian Weimer (Red Hat) started a rather long and interesting
discussion on libc-alpha about removing the emulation layer,
but we found that it was impossible to do without breaking a
lot of userspace applications that operate over NFS, are
single-threaded, and expect posix_fallocate to work correctly.

The best compromise was to document the behaviour, and wait
for everyone to use NFS 4.2, at which point the issue goes
away. Until then we need to help users cope.

The worst case scenario would be that we remove the fallback
and all the downstream users start implementing their own
incorrect and poorly tested fallback. One fallback
in one project, reviewed by a dozen people is sane.

> Thanks. I've applied. I tweaked the wording a bit in a further
> commit, and then made a further commit where I tried to fine tune
> the technical details a little. Could you please check commit
> 624fbe44d9c1ef54eb3fd36328f59a5037b87986 and let me know if there
> is any technical misstep there?

Looks perfect. The goal is to scare you into reviewing your code ;-)

Cheers,
Carlos.
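
To see where the null-byte race in the patch's caveats comes from, here is a simplified sketch of a block-touching fallback. It is an illustration written for this thread, not glibc's actual code: the names touch_byte and emulate_allocate_sketch are hypothetical, the block size is taken from st_blksize as an assumption, and overflow checks on offset + len are omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Make sure the byte at pos is backed by storage.
 * known_size is a snapshot of the file size and may already be stale. */
static int touch_byte(int fd, off_t pos, off_t known_size)
{
    if (pos < known_size) {
        unsigned char c;
        ssize_t n = pread(fd, &c, 1, pos);
        if (n < 0)
            return errno;
        if (n == 1 && c != 0)
            return 0;           /* non-zero data: leave it alone */
    }
    /* RACE WINDOW: another thread or process may write real data at pos
     * right here; the pwrite below then replaces it with a null byte. */
    ssize_t w = pwrite(fd, "", 1, pos);
    if (w == 1)
        return 0;
    return w < 0 ? errno : EIO;
}

/* Hypothetical fallback: touch one byte per block in [offset, offset+len).
 * Returns 0 on success or a positive error number. */
int emulate_allocate_sketch(int fd, off_t offset, off_t len)
{
    struct stat st;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    off_t block = st.st_blksize > 0 ? (off_t)st.st_blksize : 512;
    off_t end = offset + len;

    for (off_t pos = offset; pos < end; pos += block) {
        int err = touch_byte(fd, pos, st.st_size);
        if (err != 0)
            return err;
    }
    /* Touch the last byte so the file size reaches at least end. */
    return touch_byte(fd, end - 1, st.st_size);
}
```

Because the sketch has to read an existing byte before deciding whether to write, a descriptor opened with O_WRONLY fails in pread with EBADF, which is consistent with the caveat listed in the patch. The one-syscall-per-block loop also shows why the emulation is described as inefficient.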
  
Michael Kerrisk (man-pages) Oct. 8, 2015, 9:10 p.m. UTC | #3
On 10/07/2015 02:44 PM, Carlos O'Donell wrote:
> On 10/05/2015 05:06 AM, Michael Kerrisk (man-pages) wrote:
>> Hi Carlos,
>>
>> On 10/02/2015 05:17 AM, Carlos O'Donell wrote:
>>> Michael,
>>>
>>> You're going to really enjoy reading this patch ;-)
>>
>> Thanks for the patch. What a sad story :-{
> 
> I've gotten at least one hate mail for documenting how broken
> it is when the underlying filesystem doesn't support it ;-)
> 
> Florian Weimer (Red Hat) started a rather long and interesting
> discussion on libc-alpha about removing the emulation layer,
> but we found that it was impossible to do without breaking a
> lot of userspace applications that operate over NFS, are
> single-threaded, and expect posix_fallocate to work correctly.
> 
> The best compromise was to document the behaviour, and wait
> for everyone to use NFS 4.2, at which point the issue goes
> away. Until then we need to help users cope.
> 
> The worst case scenario would be that we remove the fallback
> and all the downstream users start implementing their own
> incorrect and poorly tested fallback. One fallback
> in one project, reviewed by a dozen people is sane.

<nod>

>> Thanks. I've applied. I tweaked the wording a bit in a further
>> commit, and then made a further commit where I tried to fine tune
>> the technical details a little. Could you please check commit
>> 624fbe44d9c1ef54eb3fd36328f59a5037b87986 and let me know if there
>> is any technical misstep there?
> 
> Looks perfect. The goal is to scare you into reviewing your code ;-)

Thanks for checking it, Carlos.

Cheers,

Michael
  

Patch

diff --git a/man3/posix_fallocate.3 b/man3/posix_fallocate.3
index e35dcb9..1b91a37 100644
--- a/man3/posix_fallocate.3
+++ b/man3/posix_fallocate.3
@@ -83,7 +83,8 @@  exceeds the maximum file size.
 .I offset
 was less than 0, or
 .I len
-was less than or equal to 0.
+was less than or equal to 0, or the underlying filesystem does not
+support the operation.
 .TP
 .B ENODEV
 .I fd
@@ -142,6 +143,30 @@  In the glibc implementation,
 .BR posix_fallocate ()
 is implemented using
 .BR fallocate (2).
+If the underlying filesystem does not support the
+.BR fallocate (2)
+syscall then the operation is emulated with the following caveats:
+.IP * 2
+The emulation is inefficient.
+.IP *
+There is a race condition where concurrent writes from another thread or
+process could be overwritten with null bytes.
+.IP *
+There is a race condition where concurrent file size increase by
+another thread or process could result in a file whose size is smaller
+than expected.
+.IP *
+If fd has been opened with the O_APPEND or O_WRONLY flags the function
+will fail with
+.B EBADF.
+.PP
+In general the emulation is not MT-safe. On Linux, applications may use
+.BR fallocate (2)
+if they cannot work around the emulation caveats. In general this is
+only recommended if the application plans to terminate the operation if
+.B EOPNOTSUPP
+is returned, otherwise the application itself will need to implement an
+fallback with all the same problems as the emulation provided by glibc.
 .SH SEE ALSO
 .BR fallocate (1),
 .BR fallocate (2),
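
The advice in the last added paragraph of the patch (call fallocate(2) directly and stop if EOPNOTSUPP comes back) might look like the sketch below. The helper name allocate_strict is hypothetical; the code assumes Linux with glibc, hence _GNU_SOURCE, and it deliberately has no fallback path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: allocate with fallocate(2) only.
 * Returns 0 on success or a positive error number. Unlike
 * posix_fallocate(), fallocate(2) reports errors via -1 and errno. */
int allocate_strict(int fd, off_t offset, off_t len)
{
    int rc;

    /* Mode 0 is the default allocate operation; retry if a signal
     * interrupts the call. */
    do {
        rc = fallocate(fd, 0, offset, len);
    } while (rc == -1 && errno == EINTR);

    if (rc == 0)
        return 0;

    int err = errno;            /* save before stdio can change it */
    if (err == EOPNOTSUPP)
        fprintf(stderr,
                "fallocate: not supported by this filesystem, "
                "not falling back\n");
    return err;
}
```

A caller that treats any non-zero result as fatal avoids the emulation races entirely, at the cost of refusing to run on filesystems that lack fallocate support.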