From patchwork Mon May 18 10:43:06 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Florian Weimer X-Patchwork-Id: 6779 Received: (qmail 19103 invoked by alias); 18 May 2015 10:43:23 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 19093 invoked by uid 89); 18 May 2015 10:43:22 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.9 required=5.0 tests=AWL, BAYES_50, KAM_LAZY_DOMAIN_SECURITY, SPF_HELO_PASS, T_RP_MATCHES_RCVD autolearn=no version=3.3.2 X-HELO: mx1.redhat.com Message-ID: <5559C23A.4070406@redhat.com> Date: Mon, 18 May 2015 12:43:06 +0200 From: Florian Weimer User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Roland McGrath , Rich Felker CC: Paul Eggert , libc-alpha@sourceware.org, "Carlos O'Donell" , Mark Wielaard Subject: Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661] References: <20150424134516.6795441F484D0@oldenburg.str.redhat.com> <554927F9.7080509@redhat.com> <5549C097.50505@redhat.com> <554A9A46.2050806@cs.ucla.edu> <20150506233055.GQ17573@brightrain.aerifal.cx> <20150507181942.E71202C3B93@topped-with-meat.com> <554BB77D.1040805@redhat.com> In-Reply-To: <554BB77D.1040805@redhat.com> On 05/07/2015 09:05 PM, Florian Weimer wrote: > On 05/07/2015 08:19 PM, Roland McGrath wrote: >>> If I'm not mistaken ftruncate could still reduce the file size if it >>> races with another operation that would extend the file. This is also >>> a data loss bug. >> >> I concur. > > It happens with length == 0. We could error out with EINVAL instead of > calling ftruncate. > > Daniel Berrange pointed me to these bugs: > > https://sourceware.org/bugzilla/show_bug.cgi?id=17322 > https://bugzilla.redhat.com/show_bug.cgi?id=1140250 > https://bugzilla.redhat.com/show_bug.cgi?id=1077068 Another very recent example is here: https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html > This suggests that people actually rely on the current allocation > behavior. Combined with my previous analysis that applications will > start to fail if we remove the fallback and return EINVAL, I now think > we need to keep the allocation loop. This is the patch I currently have. It fixes the avoidable bugs. I still think we are in a bad situation here, that even a compatibility symbol cannot fix. From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001 Message-Id: From: Florian Weimer Date: Mon, 18 May 2015 11:32:44 +0100 Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ #15661] To: libc-alpha@sourceware.org Handle signed integer overflow correctly. Detect and reject O_APPEND. Document drawbacks of emulation. This does not completely address bug 15661, but improves the situation somewhat. --- ChangeLog | 9 ++++ manual/filesys.texi | 94 +++++++++++++++++++++++++++++++++++++++ sysdeps/posix/posix_fallocate.c | 67 ++++++++++++++++++++-------- sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++-------- 4 files changed, 199 insertions(+), 38 deletions(-) diff --git a/ChangeLog b/ChangeLog index 4de8a25..603847b 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,12 @@ +2015-05-18 Florian Weimer + + [BZ #15661] + * sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64): + Check for overflow properly. Check for O_APPEND. Ignore large + file system block sizes. Add comments about problems. + * sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise. + * manual/filesys.texi (Storage Allocation): New node. + 2015-05-18 Arjun Shankar * include/stdio.h: Define __need_wint_t. diff --git a/manual/filesys.texi b/manual/filesys.texi index 7d55b43..0f2e3dc 100644 --- a/manual/filesys.texi +++ b/manual/filesys.texi @@ -1723,6 +1723,7 @@ modify the attributes of a file. access a file. * File Times:: About the time attributes of a file. * File Size:: Manually changing the size of a file. +* Storage Allocation:: Allocate backing storage for files. @end menu @node Attribute Meanings @@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}. The program has to keep track of the real size, and when it has finished a final @code{ftruncate} call should set the real size of the file. +@node Storage Allocation +@subsection Storage Allocation +@cindex allocating file storage +@cindex file allocation +@cindex storage allocating + +@cindex file fragmentation +@cindex fragmentation of files +@cindex sparse files +@cindex files, sparse +Most file systems support allocating large files in a non-contiguous +fashion: the file is split into @emph{fragments} which are allocated +sequentially, but the fragments themselves can be scattered across the +disk. File systems generally try to avoid such fragmentation because it +decreases performance, but if a file gradually increases in size, there +might be no other option than to fragment it. In addition, many file +systems support @emph{sparse files} with @emph{holes}: regions of null +bytes for which no backing storage has been allocated by the file +system. When the holes are finally overwritten with data, fragmentation +can occur as well. + +Explicit allocation of storage for yet-unwritten parts of the file can +help the system to avoid fragmentation. Additionally, if storage +pre-allocation fails, it is possible to report the out-of-disk error +early, often without filling up the entire disk. However, due to +deduplication, copy-on-write semantics, and file compression, such +pre-allocation may not reliably prevent the out-of-disk-space error from +occurring later. Checking for write errors is still required, and +writes to memory-mapped regions created with @code{mmap} can still +result in @code{SIGBUS}. + +@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length}) +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} +@c If the file system does not support allocation, +@c @code{posix_fallocate} has a race with file extension (if +@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if +@c @var{length} is positive). + +Allocate backing store for the region of @var{length} bytes starting at +byte @var{offset} in the file for the descriptor @var{fd}. The file +length is increased to @samp{@var{length} + @var{offset}} if necessary. + +@var{fd} must be a regular file opened for writing, or @code{EBADF} is +returned. If there is insufficient disk space to fulfill the allocation +request, @code{ENOSPC} is returned. + +@strong{Note:} If @code{fallocate} is not available (because the file +system does not support it), @code{posix_fallocate} is emulated, which +has the following drawbacks: + +@itemize @bullet +@item +It is very inefficient because all file system blocks in the requested +range need to be examined (even if they have been allocated before) and +potentially rewritten. In contrast, with proper @code{fallocate} +support (see below), the file system can examine the internal file +allocation data structures and eliminate holes directly, maybe even +using unwritten extents (which are pre-allocated but uninitialized on +disk). + +@item +There is a race condition if another thread or process modifies the +underlying file in the to-be-allocated area. Non-null bytes could be +overwritten with null bytes. + +@item +If @var{fd} has been opened with the @code{O_APPEND} flag, the function +will fail with an @code{errno} value of @code{EBADF}. + +@item +If @var{length} is zero, @code{ftruncate} is used to increase the file +size as requested, without allocating file system blocks. There is a +race condition which means that @code{ftruncate} can accidentally +truncate the file if it has been extended concurrently. +@end itemize + +On Linux, if an application does not benefit from emulation or if the +emulation is harmful due to its inherent race conditions, the +application can use the Linux-specific @code{fallocate} function, with a +zero flag argument. For the @code{fallocate} function, @theglibc{} does +not perform allocation emulation if the file system does not support +allocation. Instead, an @code{EOPNOTSUPP} is returned to the caller. + +@end deftypefun + +@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset}) +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} + +This function is a variant of @code{posix_fallocate64} which accepts +64-bit file offsets on all platforms. + +@end deftypefun + @node Making Special Files @section Making Special Files @cindex creating special files diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c index d15d603..e7fe201 100644 --- a/sysdeps/posix/posix_fallocate.c +++ b/sysdeps/posix/posix_fallocate.c @@ -18,26 +18,36 @@ #include #include #include +#include +#include #include #include -/* Reserve storage for the data of the file associated with FD. */ +/* Reserve storage for the data of the file associated with FD. This + emulation is far from perfect, but the kernel cannot do not much + better for network file systems, either. */ int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; - struct statfs f; - /* `off_t' is a signed type. Therefore we can determine whether - OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; - if (offset + len < 0) + + /* Perform overflow check. The outer cast relies on a GCC + extension. */ + if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0) return EFBIG; - /* First thing we have to make sure is that this is really a regular - file. */ + /* pwrite below will not do the right thing in O_APPEND mode. */ + { + int flags = __fcntl (fd, F_GETFL, 0); + if (flags < 0 || (flags & O_APPEND) != 0) + return EBADF; + } + + /* We have to make sure that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) @@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len) if (len == 0) { + /* This is racy, but there is no good way to satisfy a + zero-length allocation request. */ if (st.st_size < offset) { int ret = __ftruncate (fd, offset); @@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len) return 0; } - /* We have to know the block size of the filesystem to get at least some - sort of performance. */ - if (__fstatfs (fd, &f) != 0) - return errno; - - /* Try to play safe. */ - if (f.f_bsize == 0) - f.f_bsize = 512; - - /* Write something to every block. */ - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize) + /* Minimize data transfer for network file systems, by issuing + single-byte write requests spaced by the file system block size. + (Most local file systems have fallocate support, so this fallback + code is not used there.) */ + + unsigned increment; + { + struct statfs64 f; + + if (__fstatfs64 (fd, &f) != 0) + return errno; + if (f.f_bsize == 0) + increment = 512; + else if (f.f_bsize < 4096) + increment = f.f_bsize; + else + /* NFS does not propagate the block size of the underlying + storage and may report a much larger value which would still + leave holes after the loop below, so we cap the increment at + 4096. */ + increment = 4096; + } + + /* Write a null byte to every block. This is racy; we currently + lack a better option. Compare-and-swap against a file mapping + might additional local races, but requires interposition of a + signal handler to catch SIGBUS. */ + for (offset += (len - 1) % increment; len > 0; offset += increment) { - len -= f.f_bsize; + len -= increment; if (offset < st.st_size) { diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c index b845df7..ee32679 100644 --- a/sysdeps/posix/posix_fallocate64.c +++ b/sysdeps/posix/posix_fallocate64.c @@ -18,26 +18,36 @@ #include #include #include +#include +#include #include #include -/* Reserve storage for the data of the file associated with FD. */ +/* Reserve storage for the data of the file associated with FD. This + emulation is far from perfect, but the kernel cannot do not much + better for network file systems, either. */ int __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) { struct stat64 st; - struct statfs64 f; - /* `off64_t' is a signed type. Therefore we can determine whether - OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; - if (offset + len < 0) + + /* Perform overflow check. The outer cast relies on a GCC + extension. */ + if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0) return EFBIG; - /* First thing we have to make sure is that this is really a regular - file. */ + /* pwrite64 below will not do the right thing in O_APPEND mode. */ + { + int flags = __fcntl (fd, F_GETFL, 0); + if (flags < 0 || (flags & O_APPEND) != 0) + return EBADF; + } + + /* We have to make sure that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) @@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) if (len == 0) { + /* This is racy, but there is no good way to satisfy a + zero-length allocation request. */ if (st.st_size < offset) { int ret = __ftruncate64 (fd, offset); @@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) return 0; } - /* We have to know the block size of the filesystem to get at least some - sort of performance. */ - if (__fstatfs64 (fd, &f) != 0) - return errno; - - /* Try to play safe. */ - if (f.f_bsize == 0) - f.f_bsize = 512; - - /* Write something to every block. */ - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize) + /* Minimize data transfer for network file systems, by issuing + single-byte write requests spaced by the file system block size. + (Most local file systems have fallocate support, so this fallback + code is not used there.) */ + + unsigned increment; + { + struct statfs64 f; + + if (__fstatfs64 (fd, &f) != 0) + return errno; + if (f.f_bsize == 0) + increment = 512; + else if (f.f_bsize < 4096) + increment = f.f_bsize; + else + /* NFS clients do not propagate the block size of the underlying + storage and may report a much larger value which would still + leave holes after the loop below, so we cap the increment at + 4096. */ + increment = 4096; + } + + /* Write a null byte to every block. This is racy; we currently + lack a better option. Compare-and-swap against a file mapping + might address local races, but requires interposition of a signal + handler to catch SIGBUS. */ + for (offset += (len - 1) % increment; len > 0; offset += increment) { - len -= f.f_bsize; + len -= increment; if (offset < st.st_size) { -- 2.1.0