gdb/solib-rocm: limit the number of opened file descriptors

Message ID 20230628144425.8714-1-lancelot.six@amd.com
State New
Headers
Series gdb/solib-rocm: limit the number of opened file descriptors |

Commit Message

Lancelot SIX June 28, 2023, 2:44 p.m. UTC
  ROCm programs can load a high number of programs on GPU devices,
especially if lazy code-object loading have been disabled.  Each code
object containing such program is loaded once for each device available,
and each instance is reported by GDB as individual shared library.

We came across situations where the number of shared libraries opened by
GDB gets higher than the allowed number of opened files for the process.
Increasing the opened files limit works around the problem, but there is a
better way this patch proposes to follow.

Under the hood, the GPU code objects are embedded inside the host
application binary and shared library binaries.  GDB currently opens the
underlying file once for each shared library it sees.  That means that
the same file is re-opened every time a code object is loaded on a GPU.

This patch proposes to only open each underlying file once.  This is
done by implementing a reference counting mechanism so the underlying
file is opened when the underlying file first needs to be opened, and
closed when the last BFD using the underlying file is closed.

On a program where GDB used to open about 1500 files to load all shared
libraries, this patch makes it so only 54 opened file descriptors are
needed.

I have tested this patch on downstream ROCgdb's full testsuite and
upstream GDB testsuite with no regression.
---
 gdb/solib-rocm.c | 125 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 115 insertions(+), 10 deletions(-)


base-commit: 40aa274a75900c368a806991da0026b3a13a65ec
  

Comments

Pedro Alves July 14, 2023, 6:58 p.m. UTC | #1
On 2023-06-28 15:44, Lancelot Six via Gdb-patches wrote:
> ROCm programs can load a high number of programs on GPU devices,

programs can load a high number of ... programs?  :-)

> especially if lazy code-object loading have been disabled.  Each code
> object containing such program is loaded once for each device available,
> and each instance is reported by GDB as individual shared library.

"as" -> "as an"

> 
> We came across situations where the number of shared libraries opened by
> GDB gets higher than the allowed number of opened files for the process.
> Increasing the opened files limit works around the problem, but there is a
> better way this patch proposes to follow.

Very nice.

> 
> Under the hood, the GPU code objects are embedded inside the host
> application binary and shared library binaries.  GDB currently opens the
> underlying file once for each shared library it sees.  That means that
> the same file is re-opened every time a code object is loaded on a GPU.
> 
> This patch proposes to only open each underlying file once.  This is
> done by implementing a reference counting mechanism so the underlying
> file is opened when the underlying file first needs to be opened, and
> closed when the last BFD using the underlying file is closed.
> 
> On a program where GDB used to open about 1500 files to load all shared
> libraries, this patch makes it so only 54 opened file descriptors are
> needed.
> 
> I have tested this patch on downstream ROCgdb's full testsuite and
> upstream GDB testsuite with no regression.
> ---
>  gdb/solib-rocm.c | 125 +++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 115 insertions(+), 10 deletions(-)
> 
> diff --git a/gdb/solib-rocm.c b/gdb/solib-rocm.c
> index 0ea3702e9e3..5dcf8eb71d7 100644
> --- a/gdb/solib-rocm.c
> +++ b/gdb/solib-rocm.c
> @@ -32,12 +32,113 @@
>  #include "solist.h"
>  #include "symfile.h"
>  
> +#include <unordered_map>
> +
> +namespace {
> +
> +/* Per inferior cache of opened file descriptors.  */
> +struct rocm_solib_fd_cache
> +{
> +  rocm_solib_fd_cache (inferior *inf) : m_inferior (inf) {}

explicit.

> +  DISABLE_COPY_AND_ASSIGN (rocm_solib_fd_cache);
> +
> +  /* Return a read-only file descriptor to FILENAME and increment the
> +     associated reference count.
> +
> +     Open the file FILENAME if it is not already opened, reuse the existing file
> +     descriptor otherwise.
> +
> +     On error -1 is returned, and TARGET_ERRNO is set.  */
> +  int open (const std::string &filename, fileio_error *target_errno);
> +
> +  /* Decrement the reference count to FD and close FD if the reference count
> +     reaches 0.
> +
> +     On success, return 0.  On error, return -1 and set TARGET_ERRNO.  */
> +  int close (int fd, fileio_error *target_errno);
> +
> +protected:

Could be private?

> +  struct refcnt_fd
> +  {
> +    DISABLE_COPY_AND_ASSIGN (refcnt_fd);
> +    refcnt_fd (int fd, int refcnt) : fd (fd), refcnt (refcnt) {}
> +
> +    int fd = 0;
> +    int refcnt = 0;
> +  };
> +
> +  inferior *m_inferior;
> +  std::unordered_map<std::string, refcnt_fd> m_cache;
> +};
> +
> +int
> +rocm_solib_fd_cache::open (const std::string &filename,
> +			   fileio_error *target_errno)
> +{
> +  auto it = m_cache.find (filename);
> +  if (it == m_cache.end ())
> +    {
> +      /* The file is not yet opened on the target.  */
> +      int fd
> +	= target_fileio_open (m_inferior, filename.c_str (), FILEIO_O_RDONLY,
> +			      false, 0, target_errno);
> +      if (fd != 0)

Surely you mean (fd != -1) ?

> +	m_cache.emplace (std::piecewise_construct,
> +			 std::forward_as_tuple (filename),
> +			 std::forward_as_tuple (fd, 1));
> +      return fd;
> +    }
> +  else
> +    {
> +      /* The file is already opened.  Increment the refcnt and return the
> +	 already opened FD.  */
> +      it->second.refcnt++;
> +      gdb_assert (it->second.fd != 0);

Ditto, -1.

> +      return it->second.fd;
> +    }
> +}
> +
> +int
> +rocm_solib_fd_cache::close (int fd, fileio_error *target_errno)
> +{
> +  auto it = std::find_if (m_cache.begin (), m_cache.end (),
> +			  [fd](auto &&s) { return s.second.fd == fd; });

Generic lambdas are a C++14 feature:

  https://isocpp.org/wiki/faq/cpp14-language#generic-lambdas

> +
> +  if (it == m_cache.end ())
> +    {
> +      *target_errno = FILEIO_EBADF;
> +      return -1;

Is this case ever not a bug?  Should we assert instead?

Also, AFAICT, no caller actually check the return value of close.  Looks like
we could simplify things a bit.  There isn't usually much you can do if
closing fails, so maybe here ...


> +    }
> +
> +  it->second.refcnt--;
> +  if (it->second.refcnt == 0)
> +    {
> +      int ret = target_fileio_close (it->second.fd, target_errno);
> +      m_cache.erase (it);
> +      return ret;

... issue a warning, and return void?

> +    }
> +  else
> +    {
> +      /* Keep the FD open for the oteh users, return success.  */

"oteh" -> "other".

> +      return 0;
> +    }
> +}
> +
> +} /* Anonymous namespace.  */
> +
>  /* ROCm-specific inferior data.  */
>  
>  struct solib_info
>  {
> +  solib_info (inferior *inf)

explicit.

Thanks,
Pedro Alves

> +    : solib_list (nullptr), fd_cache (inf)
> +  {};
> +
>    /* List of code objects loaded into the inferior.  */
>    so_list *solib_list;
> +
> +  /* Cache of opened FD in the inferior.  */
> +  rocm_solib_fd_cache fd_cache;
>  };
>  
>  /* Per-inferior data key.  */
> @@ -70,7 +171,7 @@ get_solib_info (inferior *inf)
>    solib_info *info = rocm_solib_data.get (inf);
>  
>    if (info == nullptr)
> -    info = rocm_solib_data.emplace (inf);
> +    info = rocm_solib_data.emplace (inf, inf);
>  
>    return info;
>  }
> @@ -217,7 +318,8 @@ struct rocm_code_object_stream_file final : rocm_code_object_stream
>  {
>    DISABLE_COPY_AND_ASSIGN (rocm_code_object_stream_file);
>  
> -  rocm_code_object_stream_file (int fd, ULONGEST offset, ULONGEST size);
> +  rocm_code_object_stream_file (inferior *inf, int fd, ULONGEST offset,
> +				ULONGEST size);
>  
>    file_ptr read (void *buf, file_ptr size, file_ptr offset) override;
>  
> @@ -227,6 +329,9 @@ struct rocm_code_object_stream_file final : rocm_code_object_stream
>  
>  protected:
>  
> +  /* The inferior owning this code object stream.  */
> +  inferior *m_inf;
> +
>    /* The target file descriptor for this stream.  */
>    int m_fd;
>  
> @@ -239,8 +344,8 @@ struct rocm_code_object_stream_file final : rocm_code_object_stream
>  };
>  
>  rocm_code_object_stream_file::rocm_code_object_stream_file
> -  (int fd, ULONGEST offset, ULONGEST size)
> -  : m_fd (fd), m_offset (offset), m_size (size)
> +  (inferior *inf, int fd, ULONGEST offset, ULONGEST size)
> +  : m_inf (inf), m_fd (fd), m_offset (offset), m_size (size)
>  {
>  }
>  
> @@ -305,8 +410,9 @@ rocm_code_object_stream_file::size ()
>  
>  rocm_code_object_stream_file::~rocm_code_object_stream_file ()
>  {
> +  auto info = get_solib_info (m_inf);
>    fileio_error target_errno;
> -  target_fileio_close (m_fd, &target_errno);
> +  info->fd_cache.close (m_fd, &target_errno);
>  }
>  
>  /* Interface to a code object which lives in the inferior's memory.  */
> @@ -446,11 +552,9 @@ rocm_bfd_iovec_open (bfd *abfd, void *inferior_void)
>  
>        if (protocol == "file")
>  	{
> +	  auto info = get_solib_info (inferior);
>  	  fileio_error target_errno;
> -	  int fd
> -	    = target_fileio_open (static_cast<struct inferior *> (inferior),
> -				  decoded_path.c_str (), FILEIO_O_RDONLY,
> -				  false, 0, &target_errno);
> +	  int fd = info->fd_cache.open (decoded_path, &target_errno);
>  
>  	  if (fd == -1)
>  	    {
> @@ -459,7 +563,8 @@ rocm_bfd_iovec_open (bfd *abfd, void *inferior_void)
>  	      return nullptr;
>  	    }
>  
> -	  return new rocm_code_object_stream_file (fd, offset, size);
> +	  return new rocm_code_object_stream_file (inferior, fd, offset,
> +						   size);
>  	}
>  
>        if (protocol == "memory")
> 
> base-commit: 40aa274a75900c368a806991da0026b3a13a65ec
  
Lancelot SIX July 19, 2023, 5:17 p.m. UTC | #2
Hi Pedro,

Thanks for the review.  I have addressed you minor remarks locally.

>> +
>> +  if (it == m_cache.end ())
>> +    {
>> +      *target_errno = FILEIO_EBADF;
>> +      return -1;
> 
> Is this case ever not a bug?  Should we assert instead?
> 
> Also, AFAICT, no caller actually check the return value of close.  Looks like
> we could simplify things a bit.  There isn't usually much you can do if
> closing fails, so maybe here ...
> 
> 
>> +    }
>> +
>> +  it->second.refcnt--;
>> +  if (it->second.refcnt == 0)
>> +    {
>> +      int ret = target_fileio_close (it->second.fd, target_errno);
>> +      m_cache.erase (it);
>> +      return ret;
> 
> ... issue a warning, and return void?
> 

Since this close method is similar to close(2) or target_fileio_close, I 
think I prefer to keep its interface similar to the others, if that's ok 
with you.  I have updated the caller of this method to print a warning 
if -1 is returned.

I'll send a V2 with those changed.

Best,
Lancelot.
  
Pedro Alves July 19, 2023, 7:23 p.m. UTC | #3
On 2023-07-19 18:17, Lancelot SIX wrote:

> Since this close method is similar to close(2) or target_fileio_close, I think I prefer to keep its interface similar to the others, if that's ok with you.  I have updated the caller of this method to print a warning if -1 is returned.
> 
> I'll send a V2 with those changed.

It just seems to me that if we ever add another place that calls .close(), it'll either end
up with duplicated code to emit the warning, or we'll forget to add a warning.  But 
it's an easy thing to change if we get to that situation, so I won't insist, and whatever
you end up with is OK.
  

Patch

diff --git a/gdb/solib-rocm.c b/gdb/solib-rocm.c
index 0ea3702e9e3..5dcf8eb71d7 100644
--- a/gdb/solib-rocm.c
+++ b/gdb/solib-rocm.c
@@ -32,12 +32,113 @@ 
 #include "solist.h"
 #include "symfile.h"
 
+#include <unordered_map>
+
+namespace {
+
+/* Per inferior cache of opened file descriptors.  */
+struct rocm_solib_fd_cache
+{
+  rocm_solib_fd_cache (inferior *inf) : m_inferior (inf) {}
+  DISABLE_COPY_AND_ASSIGN (rocm_solib_fd_cache);
+
+  /* Return a read-only file descriptor to FILENAME and increment the
+     associated reference count.
+
+     Open the file FILENAME if it is not already opened, reuse the existing file
+     descriptor otherwise.
+
+     On error -1 is returned, and TARGET_ERRNO is set.  */
+  int open (const std::string &filename, fileio_error *target_errno);
+
+  /* Decrement the reference count to FD and close FD if the reference count
+     reaches 0.
+
+     On success, return 0.  On error, return -1 and set TARGET_ERRNO.  */
+  int close (int fd, fileio_error *target_errno);
+
+protected:
+  struct refcnt_fd
+  {
+    DISABLE_COPY_AND_ASSIGN (refcnt_fd);
+    refcnt_fd (int fd, int refcnt) : fd (fd), refcnt (refcnt) {}
+
+    int fd = 0;
+    int refcnt = 0;
+  };
+
+  inferior *m_inferior;
+  std::unordered_map<std::string, refcnt_fd> m_cache;
+};
+
+int
+rocm_solib_fd_cache::open (const std::string &filename,
+			   fileio_error *target_errno)
+{
+  auto it = m_cache.find (filename);
+  if (it == m_cache.end ())
+    {
+      /* The file is not yet opened on the target.  */
+      int fd
+	= target_fileio_open (m_inferior, filename.c_str (), FILEIO_O_RDONLY,
+			      false, 0, target_errno);
+      if (fd != 0)
+	m_cache.emplace (std::piecewise_construct,
+			 std::forward_as_tuple (filename),
+			 std::forward_as_tuple (fd, 1));
+      return fd;
+    }
+  else
+    {
+      /* The file is already opened.  Increment the refcnt and return the
+	 already opened FD.  */
+      it->second.refcnt++;
+      gdb_assert (it->second.fd != 0);
+      return it->second.fd;
+    }
+}
+
+int
+rocm_solib_fd_cache::close (int fd, fileio_error *target_errno)
+{
+  auto it = std::find_if (m_cache.begin (), m_cache.end (),
+			  [fd](auto &&s) { return s.second.fd == fd; });
+
+  if (it == m_cache.end ())
+    {
+      *target_errno = FILEIO_EBADF;
+      return -1;
+    }
+
+  it->second.refcnt--;
+  if (it->second.refcnt == 0)
+    {
+      int ret = target_fileio_close (it->second.fd, target_errno);
+      m_cache.erase (it);
+      return ret;
+    }
+  else
+    {
+      /* Keep the FD open for the oteh users, return success.  */
+      return 0;
+    }
+}
+
+} /* Anonymous namespace.  */
+
 /* ROCm-specific inferior data.  */
 
 struct solib_info
 {
+  solib_info (inferior *inf)
+    : solib_list (nullptr), fd_cache (inf)
+  {};
+
   /* List of code objects loaded into the inferior.  */
   so_list *solib_list;
+
+  /* Cache of opened FD in the inferior.  */
+  rocm_solib_fd_cache fd_cache;
 };
 
 /* Per-inferior data key.  */
@@ -70,7 +171,7 @@  get_solib_info (inferior *inf)
   solib_info *info = rocm_solib_data.get (inf);
 
   if (info == nullptr)
-    info = rocm_solib_data.emplace (inf);
+    info = rocm_solib_data.emplace (inf, inf);
 
   return info;
 }
@@ -217,7 +318,8 @@  struct rocm_code_object_stream_file final : rocm_code_object_stream
 {
   DISABLE_COPY_AND_ASSIGN (rocm_code_object_stream_file);
 
-  rocm_code_object_stream_file (int fd, ULONGEST offset, ULONGEST size);
+  rocm_code_object_stream_file (inferior *inf, int fd, ULONGEST offset,
+				ULONGEST size);
 
   file_ptr read (void *buf, file_ptr size, file_ptr offset) override;
 
@@ -227,6 +329,9 @@  struct rocm_code_object_stream_file final : rocm_code_object_stream
 
 protected:
 
+  /* The inferior owning this code object stream.  */
+  inferior *m_inf;
+
   /* The target file descriptor for this stream.  */
   int m_fd;
 
@@ -239,8 +344,8 @@  struct rocm_code_object_stream_file final : rocm_code_object_stream
 };
 
 rocm_code_object_stream_file::rocm_code_object_stream_file
-  (int fd, ULONGEST offset, ULONGEST size)
-  : m_fd (fd), m_offset (offset), m_size (size)
+  (inferior *inf, int fd, ULONGEST offset, ULONGEST size)
+  : m_inf (inf), m_fd (fd), m_offset (offset), m_size (size)
 {
 }
 
@@ -305,8 +410,9 @@  rocm_code_object_stream_file::size ()
 
 rocm_code_object_stream_file::~rocm_code_object_stream_file ()
 {
+  auto info = get_solib_info (m_inf);
   fileio_error target_errno;
-  target_fileio_close (m_fd, &target_errno);
+  info->fd_cache.close (m_fd, &target_errno);
 }
 
 /* Interface to a code object which lives in the inferior's memory.  */
@@ -446,11 +552,9 @@  rocm_bfd_iovec_open (bfd *abfd, void *inferior_void)
 
       if (protocol == "file")
 	{
+	  auto info = get_solib_info (inferior);
 	  fileio_error target_errno;
-	  int fd
-	    = target_fileio_open (static_cast<struct inferior *> (inferior),
-				  decoded_path.c_str (), FILEIO_O_RDONLY,
-				  false, 0, &target_errno);
+	  int fd = info->fd_cache.open (decoded_path, &target_errno);
 
 	  if (fd == -1)
 	    {
@@ -459,7 +563,8 @@  rocm_bfd_iovec_open (bfd *abfd, void *inferior_void)
 	      return nullptr;
 	    }
 
-	  return new rocm_code_object_stream_file (fd, offset, size);
+	  return new rocm_code_object_stream_file (inferior, fd, offset,
+						   size);
 	}
 
       if (protocol == "memory")