[v2,4/9] linux: Use getdents64 on non-LFS readdir

Message ID 20201002170620.1611673-5-adhemerval.zanella@linaro.org
State Superseded
Headers
Series Fix getdents{64} regression on some FS |

Commit Message

Adhemerval Zanella Netto Oct. 2, 2020, 5:06 p.m. UTC
  It reserves some space on the allocated internal buffer to be
used as a the returned dirent struct.  The kernel obtained dirent64
struct are copied to the temporary buffer on each readdir call.

The overflow test is moved once the dirent64 entry is copied
to the temporary buffer, and a subsequent readdir will obtain the
next entry.  The idea is an overflow fails to return the entry on
readdir, but a next readdir might still obtain the next entry.
(for filesystem that does not have the concept of sequential d_off,
such as ext4).

Checked on x86_64-linux-gnu and i686-linux-gnu.
---
 sysdeps/unix/sysv/linux/opendir.c |   6 +-
 sysdeps/unix/sysv/linux/readdir.c |  25 +++----
 sysdeps/unix/sysv/linux/readdir.h | 117 ++++++++++++++++++++++++++++++
 3 files changed, 131 insertions(+), 17 deletions(-)
 create mode 100644 sysdeps/unix/sysv/linux/readdir.h
  

Comments

Florian Weimer Oct. 13, 2020, 3:59 p.m. UTC | #1
* Adhemerval Zanella via Libc-alpha:

> It reserves some space on the allocated internal buffer to be
> used as a the returned dirent struct.  The kernel obtained dirent64
> struct are copied to the temporary buffer on each readdir call.
>
> The overflow test is moved once the dirent64 entry is copied
> to the temporary buffer, and a subsequent readdir will obtain the
> next entry.  The idea is an overflow fails to return the entry on
> readdir, but a next readdir might still obtain the next entry.
> (for filesystem that does not have the concept of sequential d_off,
> such as ext4).

Can you make the translation buffer a separate allocation?  Then you can
resize it to accommodate long names.

Thanks,
Florian
  
Adhemerval Zanella Netto Oct. 15, 2020, 2:25 p.m. UTC | #2
On 13/10/2020 12:59, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> It reserves some space on the allocated internal buffer to be
>> used as a the returned dirent struct.  The kernel obtained dirent64
>> struct are copied to the temporary buffer on each readdir call.
>>
>> The overflow test is moved once the dirent64 entry is copied
>> to the temporary buffer, and a subsequent readdir will obtain the
>> next entry.  The idea is an overflow fails to return the entry on
>> readdir, but a next readdir might still obtain the next entry.
>> (for filesystem that does not have the concept of sequential d_off,
>> such as ext4).
> 
> Can you make the translation buffer a separate allocation?  Then you can
> resize it to accommodate long names.

I followed the idea of mips64 getdents64 (which came from your suggestion)
and it seems that at least for most majority of filesystems [1], the buffer
size should accommodate the maximum supported file size (ReiserFS seems to
be an outlier, but it also seems that Linux VFS limits the maximum size
to 255 characters).

Using a single allocation make the code slight faster (which not sure if
plays much difference since a lot of memory copy would be required for 
non-LFS) and it does not require to handle the possible reallocation
failure on readdir.

[1] https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits (Not
    sure how exact this reflect the latest Linux version though)
  
Florian Weimer Oct. 19, 2020, 8:18 a.m. UTC | #3
* Adhemerval Zanella:

> Using a single allocation make the code slight faster (which not sure if
> plays much difference since a lot of memory copy would be required for 
> non-LFS) and it does not require to handle the possible reallocation
> failure on readdir.

But you'd still have to handle the case where name is longer than the
allocation, to avoid a buffer overflow.  So an error case still exists.

Thanks,
Florian
  
Adhemerval Zanella Netto Oct. 19, 2020, 8 p.m. UTC | #4
On 19/10/2020 05:18, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> Using a single allocation make the code slight faster (which not sure if
>> plays much difference since a lot of memory copy would be required for 
>> non-LFS) and it does not require to handle the possible reallocation
>> failure on readdir.
> 
> But you'd still have to handle the case where name is longer than the
> allocation, to avoid a buffer overflow.  So an error case still exists.
> 
> Thanks,
> Florian

The dirstream_ret_entry already handles it by returning EOVERFLOW for such
case.  Also keep in mind this limitation is only for the old non-LFS
interface, for the LFS one it will use all the available allocated buffer
created by the opendir (it still have the 32kb limit for getdents64 call).
I can add a more complex buffer handling to resize the auxiliary buffer,
but I am not sure it really pays of the complexity.
  
Florian Weimer Oct. 19, 2020, 8:50 p.m. UTC | #5
* Adhemerval Zanella:

> On 19/10/2020 05:18, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> Using a single allocation make the code slight faster (which not sure if
>>> plays much difference since a lot of memory copy would be required for 
>>> non-LFS) and it does not require to handle the possible reallocation
>>> failure on readdir.
>> 
>> But you'd still have to handle the case where name is longer than the
>> allocation, to avoid a buffer overflow.  So an error case still exists.
>> 
>> Thanks,
>> Florian
>
> The dirstream_ret_entry already handles it by returning EOVERFLOW for such
> case.  Also keep in mind this limitation is only for the old non-LFS
> interface, for the LFS one it will use all the available allocated buffer
> created by the opendir (it still have the 32kb limit for getdents64 call).
> I can add a more complex buffer handling to resize the auxiliary buffer,
> but I am not sure it really pays of the complexity.

Hmm.  The error code should be ENAMETOOLONG.  Ideally, it should be
delayed until the end of the stream, so that the rest of the directory
can be listed (similar to what we do for readdir_r).  It's probably okay
to report an ENOMEM failure immediately.

Anyway, I'll look at the overflow check and see if we can move this
forward, and then revisit the translation buffer allocation in a
separate patch.

Thanks,
Florian
  
Adhemerval Zanella Netto Oct. 19, 2020, 9:09 p.m. UTC | #6
On 19/10/2020 17:50, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 19/10/2020 05:18, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> Using a single allocation make the code slight faster (which not sure if
>>>> plays much difference since a lot of memory copy would be required for 
>>>> non-LFS) and it does not require to handle the possible reallocation
>>>> failure on readdir.
>>>
>>> But you'd still have to handle the case where name is longer than the
>>> allocation, to avoid a buffer overflow.  So an error case still exists.
>>>
>>> Thanks,
>>> Florian
>>
>> The dirstream_ret_entry already handles it by returning EOVERFLOW for such
>> case.  Also keep in mind this limitation is only for the old non-LFS
>> interface, for the LFS one it will use all the available allocated buffer
>> created by the opendir (it still have the 32kb limit for getdents64 call).
>> I can add a more complex buffer handling to resize the auxiliary buffer,
>> but I am not sure it really pays of the complexity.
> 
> Hmm.  The error code should be ENAMETOOLONG.  Ideally, it should be
> delayed until the end of the stream, so that the rest of the directory
> can be listed (similar to what we do for readdir_r).  It's probably okay
> to report an ENOMEM failure immediately.

ENAMETOOLONG does make more sense, I have fixed it locally.  I am not
sure I understood by 'delayed', it it report on the readdir call, but
the next entry will still be reported in the next readdir call.  The
construction such as won't work:

  while (readdir (dp) != NULL)
    {
      [...]
    }

But this will work as intended:

  while (1)
    {
      errno = 0;
      struct dirent *entry = readdir (dp);
      if (entry == NULL)
	{
	  if (errno == ENAMETOOLONG)
	    continue;
	  break;
	}
    }

Even if go to buffer resize, application will still need to handle
ENOMEM so I am not sure if using separated buffer is really an
improvement here (at least on trying to adequate usual way to call
opendir functions).


> 
> Anyway, I'll look at the overflow check and see if we can move this
> forward, and then revisit the translation buffer allocation in a
> separate patch.

Thanks.
  
Florian Weimer Oct. 20, 2020, 7:38 a.m. UTC | #7
* Adhemerval Zanella via Libc-alpha:

> ENAMETOOLONG does make more sense, I have fixed it locally.  I am not
> sure I understood by 'delayed', it it report on the readdir call, but
> the next entry will still be reported in the next readdir call.  The
> construction such as won't work:
>
>   while (readdir (dp) != NULL)
>     {
>       [...]
>     }
>
> But this will work as intended:
>
>   while (1)
>     {
>       errno = 0;
>       struct dirent *entry = readdir (dp);
>       if (entry == NULL)
> 	{
> 	  if (errno == ENAMETOOLONG)
> 	    continue;
> 	  break;
> 	}
>     }

I think it's unreasonable to expect that this pattern will work.  It
assumes that readdir updates dp despite the error.  In other
implementations, this may produce an endless loop.  This is why for
readdir_r, we record the ENAMETOOLONG error as pending, and report it
only at the end.  (We should do the same for EOVERFLOW errors.)

> Even if go to buffer resize, application will still need to handle
> ENOMEM so I am not sure if using separated buffer is really an
> improvement here (at least on trying to adequate usual way to call
> opendir functions).

ENOMEM is a temporary error, ENAMETOOLONG depends on file system content
and may be much more difficult to resolve.

Thanks,
Florian
  
Adhemerval Zanella Netto Oct. 20, 2020, 12:05 p.m. UTC | #8
On 20/10/2020 04:38, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> ENAMETOOLONG does make more sense, I have fixed it locally.  I am not
>> sure I understood by 'delayed', it it report on the readdir call, but
>> the next entry will still be reported in the next readdir call.  The
>> construction such as won't work:
>>
>>   while (readdir (dp) != NULL)
>>     {
>>       [...]
>>     }
>>
>> But this will work as intended:
>>
>>   while (1)
>>     {
>>       errno = 0;
>>       struct dirent *entry = readdir (dp);
>>       if (entry == NULL)
>> 	{
>> 	  if (errno == ENAMETOOLONG)
>> 	    continue;
>> 	  break;
>> 	}
>>     }
> 
> I think it's unreasonable to expect that this pattern will work.  It
> assumes that readdir updates dp despite the error.  In other
> implementations, this may produce an endless loop.  This is why for
> readdir_r, we record the ENAMETOOLONG error as pending, and report it
> only at the end.  (We should do the same for EOVERFLOW errors.)

This is what POSIX states as the way to check for errors:

  "Applications wishing to check for error situations should set errno 
   to 0 before calling readdir(). If errno is set to non-zero on return, 
   an error occurred."

What the standard is not clear is whether the position as the directory 
stream is updated in a case of failure.  I think it is a fair assumption
since it also states that 'When the end of the directory is encountered, 
a null pointer shall be returned and errno is not changed.'.  So there
is a distinction between end of directory stream and a failure (in the
example above checking for 'errno == 0' should indicate the end of the
stream).

And it also defines EOVERFLOW as possible transient error.  It
is possible to do for readdir_r because the function return is different
than the returned 'direct' itself, for readdir what we have is the
'errno' to signal such failure.

> 
>> Even if go to buffer resize, application will still need to handle
>> ENOMEM so I am not sure if using separated buffer is really an
>> improvement here (at least on trying to adequate usual way to call
>> opendir functions).
> 
> ENOMEM is a temporary error, ENAMETOOLONG depends on file system content
> and may be much more difficult to resolve.

Even with ENOMEM, application will require to actually handle a possible
readdir failure and acts accordingly (either by retrying or skip the
file).  But I consider failing with ENOMEM in fact much more confusing,
it exposes internal implementation that needs to be handled by the
application (the internal memory reallocation), where ENAMETOOLONG is
indeed a system limitation that can be resolved by using LFS support.

That's why this limitation is only for non-LFS interface.  This whole fix
is a best effort to a deprecated interface.
  
Florian Weimer Oct. 20, 2020, 12:35 p.m. UTC | #9
* Adhemerval Zanella:

> On 20/10/2020 04:38, Florian Weimer wrote:
>> * Adhemerval Zanella via Libc-alpha:
>> 
>>> ENAMETOOLONG does make more sense, I have fixed it locally.  I am not
>>> sure I understood by 'delayed', it it report on the readdir call, but
>>> the next entry will still be reported in the next readdir call.  The
>>> construction such as won't work:
>>>
>>>   while (readdir (dp) != NULL)
>>>     {
>>>       [...]
>>>     }
>>>
>>> But this will work as intended:
>>>
>>>   while (1)
>>>     {
>>>       errno = 0;
>>>       struct dirent *entry = readdir (dp);
>>>       if (entry == NULL)
>>> 	{
>>> 	  if (errno == ENAMETOOLONG)
>>> 	    continue;
>>> 	  break;
>>> 	}
>>>     }
>> 
>> I think it's unreasonable to expect that this pattern will work.  Itg
>> assumes that readdir updates dp despite the error.  In other
>> implementations, this may produce an endless loop.  This is why for
>> readdir_r, we record the ENAMETOOLONG error as pending, and report it
>> only at the end.  (We should do the same for EOVERFLOW errors.)
>
> This is what POSIX states as the way to check for errors:
>
>   "Applications wishing to check for error situations should set errno 
>    to 0 before calling readdir(). If errno is set to non-zero on return, 
>    an error occurred."
>
> What the standard is not clear is whether the position as the directory 
> stream is updated in a case of failure.

Yes, that was my concern.  I found this in POSIX (in 2.3 Error Numbers):

| If an error condition is detected, the action requested may have been
| partially performed, unless otherwise stated.

But in general we would be rather concerned if a regular read or write
on a file descriptor failed *and* still updated the file offset.  Not
sure about directory streams.

> I think it is a fair assumption
> since it also states that 'When the end of the directory is encountered, 
> a null pointer shall be returned and errno is not changed.'.  So there
> is a distinction between end of directory stream and a failure (in the
> example above checking for 'errno == 0' should indicate the end of the
> stream).

The question is whether you are supposed to be able to continue reading
and encounter errno == 0 eventually.  In general, this is simply not
true, consider EIO or EUCLEAN.  This is why I went with delayed error
reporting for readdir_r: it is simply not reasonable to expect that
applications continue calling readddir_r or readdir after the first
error because it's not clear how to can avoid an endless loop (even
assuming that the failing readdir call somehow advanced the read
pointer).

I hope this explains my point of view.

> And it also defines EOVERFLOW as possible transient error.  It
> is possible to do for readdir_r because the function return is different
> than the returned 'direct' itself, for readdir what we have is the
> 'errno' to signal such failure.

I don't understand this point.  The approach to detect EOF vs error is
different for readdir_r, but the question whether continue after the
first error is equally relevant.

>>> Even if go to buffer resize, application will still need to handle
>>> ENOMEM so I am not sure if using separated buffer is really an
>>> improvement here (at least on trying to adequate usual way to call
>>> opendir functions).
>> 
>> ENOMEM is a temporary error, ENAMETOOLONG depends on file system content
>> and may be much more difficult to resolve.
>
> Even with ENOMEM, application will require to actually handle a possible
> readdir failure and acts accordingly (either by retrying or skip the
> file).  But I consider failing with ENOMEM in fact much more confusing,
> it exposes internal implementation that needs to be handled by the
> application (the internal memory reallocation), where ENAMETOOLONG is
> indeed a system limitation that can be resolved by using LFS support.

I think readdir can already fail with ENOMEM if the error comes from the
kernel, for example from ext4_ind_map_blocks and ext4_alloc_branch.  So
adding another corner case where ENOMEM can result does not seem
egregiously bad to me.

Thanks,
Florian
  
Adhemerval Zanella Netto Oct. 20, 2020, 2:09 p.m. UTC | #10
On 20/10/2020 09:35, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 20/10/2020 04:38, Florian Weimer wrote:
>>> * Adhemerval Zanella via Libc-alpha:
>>>
>>>> ENAMETOOLONG does make more sense, I have fixed it locally.  I am not
>>>> sure I understood by 'delayed', it it report on the readdir call, but
>>>> the next entry will still be reported in the next readdir call.  The
>>>> construction such as won't work:
>>>>
>>>>   while (readdir (dp) != NULL)
>>>>     {
>>>>       [...]
>>>>     }
>>>>
>>>> But this will work as intended:
>>>>
>>>>   while (1)
>>>>     {
>>>>       errno = 0;
>>>>       struct dirent *entry = readdir (dp);
>>>>       if (entry == NULL)
>>>> 	{
>>>> 	  if (errno == ENAMETOOLONG)
>>>> 	    continue;
>>>> 	  break;
>>>> 	}
>>>>     }
>>>
>>> I think it's unreasonable to expect that this pattern will work.  Itg
>>> assumes that readdir updates dp despite the error.  In other
>>> implementations, this may produce an endless loop.  This is why for
>>> readdir_r, we record the ENAMETOOLONG error as pending, and report it
>>> only at the end.  (We should do the same for EOVERFLOW errors.)
>>
>> This is what POSIX states as the way to check for errors:
>>
>>   "Applications wishing to check for error situations should set errno 
>>    to 0 before calling readdir(). If errno is set to non-zero on return, 
>>    an error occurred."
>>
>> What the standard is not clear is whether the position as the directory 
>> stream is updated in a case of failure.
> 
> Yes, that was my concern.  I found this in POSIX (in 2.3 Error Numbers):
> 
> | If an error condition is detected, the action requested may have been
> | partially performed, unless otherwise stated.
> 
> But in general we would be rather concerned if a regular read or write
> on a file descriptor failed *and* still updated the file offset.  Not
> sure about directory streams.

I think the snippet you stated does not really prevent us to return an
empty entry with a transient error (ENAMETOOLONG), since it is clear
different than entries read error for getdents itself (EINVAL, ENOENT).

Former mean that only the referred entry could not be obtained, where
latter means that either there is a EOF (if the directory has been
removed) or the buffer is too small.  For latter we might still try to
resize the internal buffer to accommodate the required size, but this
incur in the memory failure possible error nonetheless.

> 
>> I think it is a fair assumption
>> since it also states that 'When the end of the directory is encountered, 
>> a null pointer shall be returned and errno is not changed.'.  So there
>> is a distinction between end of directory stream and a failure (in the
>> example above checking for 'errno == 0' should indicate the end of the
>> stream).
> 
> The question is whether you are supposed to be able to continue reading
> and encounter errno == 0 eventually.  In general, this is simply not
> true, consider EIO or EUCLEAN.  This is why I went with delayed error
> reporting for readdir_r: it is simply not reasonable to expect that
> applications continue calling readddir_r or readdir after the first
> error because it's not clear how to can avoid an endless loop (even
> assuming that the failing readdir call somehow advanced the read
> pointer).
> 
> I hope this explains my point of view.

I understood your point, however we are constraint by opendir interface.
Different than readdir_r, for readdir we can't signal that some error
has occurred in obtained the entry while returning the partial entry
information (for the case of ENAMETOOLONG for instance).

We can just skip the ENAMETOOLONG entries is set the errno only for EOF, 
but even  with this approach we might encounter different errors (due 
getdents failure for instance) which might shadow this specific interface
error glibc might trigger.

> 
>> And it also defines EOVERFLOW as possible transient error.  It
>> is possible to do for readdir_r because the function return is different
>> than the returned 'direct' itself, for readdir what we have is the
>> 'errno' to signal such failure.
> 
> I don't understand this point.  The approach to detect EOF vs error is
> different for readdir_r, but the question whether continue after the
> first error is equally relevant.

My point is if we decide to ignore entries with ENAMETOOLONG, how to signal 
the readdir caller this has happened?

> 
>>>> Even if go to buffer resize, application will still need to handle
>>>> ENOMEM so I am not sure if using separated buffer is really an
>>>> improvement here (at least on trying to adequate usual way to call
>>>> opendir functions).
>>>
>>> ENOMEM is a temporary error, ENAMETOOLONG depends on file system content
>>> and may be much more difficult to resolve.
>>
>> Even with ENOMEM, application will require to actually handle a possible
>> readdir failure and acts accordingly (either by retrying or skip the
>> file).  But I consider failing with ENOMEM in fact much more confusing,
>> it exposes internal implementation that needs to be handled by the
>> application (the internal memory reallocation), where ENAMETOOLONG is
>> indeed a system limitation that can be resolved by using LFS support.
> 
> I think readdir can already fail with ENOMEM if the error comes from the
> kernel, for example from ext4_ind_map_blocks and ext4_alloc_branch.  So
> adding another corner case where ENOMEM can result does not seem
> egregiously bad to me.

Is it a possible error for getdents? I couldn't really track it, but I
haven't dig into kernel code that deep. In any case it would be good
to update the getdents implementation.
  
Adhemerval Zanella Netto Oct. 20, 2020, 5:42 p.m. UTC | #11
On 20/10/2020 09:35, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 20/10/2020 04:38, Florian Weimer wrote:
>>> * Adhemerval Zanella via Libc-alpha:
>>>
>>>> ENAMETOOLONG does make more sense, I have fixed it locally.  I am not
>>>> sure I understood by 'delayed', it it report on the readdir call, but
>>>> the next entry will still be reported in the next readdir call.  The
>>>> construction such as won't work:
>>>>
>>>>   while (readdir (dp) != NULL)
>>>>     {
>>>>       [...]
>>>>     }
>>>>
>>>> But this will work as intended:
>>>>
>>>>   while (1)
>>>>     {
>>>>       errno = 0;
>>>>       struct dirent *entry = readdir (dp);
>>>>       if (entry == NULL)
>>>> 	{
>>>> 	  if (errno == ENAMETOOLONG)
>>>> 	    continue;
>>>> 	  break;
>>>> 	}
>>>>     }
>>>
>>> I think it's unreasonable to expect that this pattern will work.  Itg
>>> assumes that readdir updates dp despite the error.  In other
>>> implementations, this may produce an endless loop.  This is why for
>>> readdir_r, we record the ENAMETOOLONG error as pending, and report it
>>> only at the end.  (We should do the same for EOVERFLOW errors.)
>>
>> This is what POSIX states as the way to check for errors:
>>
>>   "Applications wishing to check for error situations should set errno 
>>    to 0 before calling readdir(). If errno is set to non-zero on return, 
>>    an error occurred."
>>
>> What the standard is not clear is whether the position as the directory 
>> stream is updated in a case of failure.
> 
> Yes, that was my concern.  I found this in POSIX (in 2.3 Error Numbers):
> 
> | If an error condition is detected, the action requested may have been
> | partially performed, unless otherwise stated.
> 
> But in general we would be rather concerned if a regular read or write
> on a file descriptor failed *and* still updated the file offset.  Not
> sure about directory streams.
> 
>> I think it is a fair assumption
>> since it also states that 'When the end of the directory is encountered, 
>> a null pointer shall be returned and errno is not changed.'.  So there
>> is a distinction between end of directory stream and a failure (in the
>> example above checking for 'errno == 0' should indicate the end of the
>> stream).
> 
> The question is whether you are supposed to be able to continue reading
> and encounter errno == 0 eventually.  In general, this is simply not
> true, consider EIO or EUCLEAN.  This is why I went with delayed error
> reporting for readdir_r: it is simply not reasonable to expect that
> applications continue calling readddir_r or readdir after the first
> error because it's not clear how to can avoid an endless loop (even
> assuming that the failing readdir call somehow advanced the read
> pointer).
> 
> I hope this explains my point of view.

Thinking twice I think a resizable buffer does make sense in this
situation.  I will send an updated set with this modification.
  

Patch

diff --git a/sysdeps/unix/sysv/linux/opendir.c b/sysdeps/unix/sysv/linux/opendir.c
index 2198224588..24bd63d2ba 100644
--- a/sysdeps/unix/sysv/linux/opendir.c
+++ b/sysdeps/unix/sysv/linux/opendir.c
@@ -22,6 +22,7 @@ 
 #include <sys/param.h>	/* For MIN and MAX.  */
 
 #include <not-cancel.h>
+#include <readdir.h>    /* For return_buffer_size.  */
 
 enum {
   opendir_oflags = O_RDONLY|O_NDELAY|O_DIRECTORY|O_LARGEFILE|O_CLOEXEC
@@ -103,8 +104,9 @@  __alloc_dir (int fd, bool close_fd, int flags, const struct stat64 *statp)
   enum { max_buffer_size = 1048576 };
 
   const size_t allocation_size = 32768;
-  _Static_assert (allocation_size >= sizeof (struct dirent64),
-		  "allocation_size < sizeof (struct dirent64)");
+  _Static_assert (allocation_size >= sizeof (struct dirent64)
+				     + return_buffer_size,
+		  "opendir buffer size smaller than required");
 
   /* Increase allocation if requested, but not if the value appears to
      be bogus.  It will be between 32Kb and 1Mb.  */
diff --git a/sysdeps/unix/sysv/linux/readdir.c b/sysdeps/unix/sysv/linux/readdir.c
index ca2a8964e9..8eab0f4c9b 100644
--- a/sysdeps/unix/sysv/linux/readdir.c
+++ b/sysdeps/unix/sysv/linux/readdir.c
@@ -19,7 +19,7 @@ 
 #include <dirent.h>
 
 #if !_DIRENT_MATCHES_DIRENT64
-#include <dirstream.h>
+#include <readdir.h>
 
 /* Read a directory entry from DIRP.  */
 struct dirent *
@@ -30,16 +30,12 @@  __readdir_unlocked (DIR *dirp)
 
   do
     {
-      size_t reclen;
-
       if (dirp->offset >= dirp->size)
 	{
 	  /* We've emptied out our buffer.  Refill it.  */
 
-	  size_t maxread = dirp->allocation;
-	  ssize_t bytes;
-
-	  bytes = __getdents (dirp->fd, dirp->data, maxread);
+	  ssize_t bytes = __getdents64 (dirp->fd, dirstream_data (dirp),
+					dirstream_alloc_size (dirp));
 	  if (bytes <= 0)
 	    {
 	      /* On some systems getdents fails with ENOENT when the
@@ -54,19 +50,18 @@  __readdir_unlocked (DIR *dirp)
 	      dp = NULL;
 	      break;
 	    }
-	  dirp->size = (size_t) bytes;
+	  dirp->size = bytes;
 
 	  /* Reset the offset into the buffer.  */
 	  dirp->offset = 0;
 	}
 
-      dp = (struct dirent *) &dirp->data[dirp->offset];
-
-      reclen = dp->d_reclen;
-
-      dirp->offset += reclen;
-
-      dirp->filepos = dp->d_off;
+      dp = dirstream_ret_entry (dirp);
+      if (dp == NULL)
+	{
+	  __set_errno (EOVERFLOW);
+	  break;
+	}
 
       /* Skip deleted files.  */
     } while (dp->d_ino == 0);
diff --git a/sysdeps/unix/sysv/linux/readdir.h b/sysdeps/unix/sysv/linux/readdir.h
new file mode 100644
index 0000000000..4dc219e220
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/readdir.h
@@ -0,0 +1,117 @@ 
+/* Linux readdir internal implementation details.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef _DIRSTREAM_NOLFS_H
+#define _DIRSTREAM_NOLFS_H
+
+#if !_DIRENT_MATCHES_DIRENT64
+# include <dirstream.h>
+
+/* getdents64 is used internally for both LFS and non-LFS implementations.
+   The non-LFS interface reserves part of the allocated buffer to return the
+   non-LFS 'struct dirent' entry.  */
+
+/* This defines the reserved space size on DIR internal buffer to use as the
+   returned 'struct dirent' from a 'readdir' call.
+
+   The largest possible practical length of the d_name member are 255
+   Unicode characters in UTF-8 encoding, so d_name is 766 bytes long, plus
+   10 bytes from header, for a total of 776 bytes total.
+
+   Also it should take in cosideration the alignment requirement for
+   getdents64 call.  */
+enum { return_buffer_size = 1024
+			    + sizeof (off64_t)
+			    - _Alignof (((struct __dirstream) {0}).data) };
+
+_Static_assert ((_Alignof (((struct __dirstream) {0}).data)
+		 + return_buffer_size) % sizeof (off64_t) == 0,
+		"return_buffer_size does not align the buffer properly");
+
+/* Return the avaliable buffer size to use with getdents64 calls.  */
+static inline size_t
+dirstream_alloc_size (struct __dirstream *ds)
+{
+  return ds->allocation - return_buffer_size;
+}
+
+/* Return the start of the allocated buffer minus the reserved part to use on
+   non-LFS readdir call.  */
+static inline void *
+dirstream_data (struct __dirstream *ds)
+{
+  return (char *) ds->data + return_buffer_size;
+}
+
+/* Return the allocated buffer used on non-LFS readdir call.  */
+static inline struct dirent *
+dirstream_ret (struct __dirstream *ds)
+{
+  return (struct dirent *) ds->data;
+}
+
+/* Return the current dirent64 entry from the reserved buffer used on
+   getdent64.  */
+static inline struct dirent64 *
+dirstream_entry (struct __dirstream *ds)
+{
+  size_t offset = return_buffer_size + ds->offset;
+  return (struct dirent64 *) ((char *) ds->data + offset);
+}
+
+/* Copy one obtained entry from 'getdents64' call to the reserved space
+   on DS allocated buffer and updated its internal state.  */
+static inline struct dirent *
+dirstream_ret_entry (struct __dirstream *ds)
+{
+  struct dirent64 *dp64 = dirstream_entry (ds);
+  struct dirent *dp = dirstream_ret (ds);
+
+  dp->d_ino = dp64->d_ino;
+
+  dp->d_off = dp64->d_off;
+  if (dp->d_off != dp64->d_off)
+    /* Overflow.  */
+    return NULL;
+
+  const size_t size_diff = (offsetof (struct dirent64, d_name)
+			    - offsetof (struct dirent, d_name));
+  const size_t alignment = _Alignof (struct dirent);
+  size_t new_reclen  = (dp64->d_reclen - size_diff + alignment - 1)
+			& ~(alignment - 1);
+  if (new_reclen > return_buffer_size)
+    /* Overflow.  */
+    return NULL;
+  dp->d_reclen = new_reclen;
+
+  dp->d_type = dp64->d_type;
+
+  memcpy (dp->d_name, dp64->d_name,
+	  dp64->d_reclen - offsetof (struct dirent64, d_name));
+
+  ds->offset += dp64->d_reclen;
+  ds->filepos = dp64->d_off;
+
+  return dp;
+}
+#else
+/* No need to reserve an buffer space if dirent has already LFS support.  */
+enum { return_buffer_size = 0 };
+#endif /* _DIRENT_MATCHES_DIRENT64  */
+
+#endif