Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)

Message ID 87cz69tyla.fsf@dem-tschwing-1.ger.mentorg.com
State Superseded
Series Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)

Commit Message

Thomas Schwinge Feb. 16, 2023, 3:32 p.m. UTC
Hi!

On 2022-06-09T11:38:22+0200, I wrote:
> On 2022-06-07T13:28:33+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 07/06/2022 13:10, Jakub Jelinek wrote:
>>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>>> Following some feedback from users of the OG11 branch I think I need to
>>>> withdraw this patch, for now.
>>>>
>>>> The memory pinned via the mlock call does not give the expected performance
>>>> boost. I had not expected that it would do much in my test setup, given that
>>>> the machine has a lot of RAM and my benchmarks are small, but others have
>>>> tried more and on varying machines and architectures.
>>>
>>> I don't understand why there should be any expected performance boost (at
>>> least not unless the machine starts swapping out pages),
>>> { omp_atk_pinned, true } is solely about the requirement that the memory
>>> can't be swapped out.
>>
>> It seems like it takes a faster path through the NVidia drivers. This is
>> a black box, for me, but that seems like a plausible explanation. The
>> results are different on x86_64 and powerpc hosts (such as the Summit
>> supercomputer).
>
> For example, it's documented that 'cuMemHostAlloc',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> "Allocates page-locked host memory".  The crucial thing, though, what
> makes this different from 'malloc' plus 'mlock' is, that "The driver
> tracks the virtual memory ranges allocated with this function and
> automatically accelerates calls to functions such as cuMemcpyHtoD().
> Since the memory can be accessed directly by the device, it can be read
> or written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".
>
> Similar, for example, for 'cuMemAllocHost',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
>
> This, to me, would explain why "the mlock call does not give the expected
> performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
> with 'mlock' you're missing the "tracks the virtual memory ranges"
> aspect.
>
> Also, since the Nvidia Driver itself allocates the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?  I get this impression because the documentation goes on
> to state that "Allocating excessive amounts of memory with
> cuMemAllocHost() may degrade system performance, since it reduces the
> amount of memory available to the system for paging.  As a result, this
> function is best used sparingly to allocate staging areas for data
> exchange between host and device".
>
>>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>>> pinned using the Cuda API to get the performance boost.
>>>
>>> For performance boost of what kind of code?
>>> I don't understand how Cuda API could be useful (or can be used at all) if
>>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>>> obviously, but then mlock isn't available, so sure, if there is something
>>> CUDA can provide for that case, nice).
>>
>> This is specifically for NVPTX offload, of course, but then that's what
>> our customer is paying for.
>>
>> The expectation, from users, is that memory pinning will give the
>> benefits specific to the active device. We can certainly make that
>> happen when there is only one (flavour of) offload device present. I had
>> hoped it could be one way for all, but it looks like not.
>
> Aren't there CUDA Driver interfaces for that?  That is:
>
>>>> I had not done this
>>>> because it was difficult to resolve the code abstraction
>>>> difficulties and anyway the implementation was supposed to be device
>>>> independent, but it seems we need a specific pinning mechanism for each
>>>> device.
>
> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing is involved in that case either; presumably this
> interface likewise circumvents any "annoying" 'ulimit' limitations?)
>
> Such a *register* abstraction can then be implemented by all the libgomp
> offloading plugins: they just call the respective
> CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> memory.
>
> ..., but maybe I'm missing some crucial "detail" here?

Indeed this does appear to work; see attached
"[WIP] Attempt to register OpenMP pinned memory using a device instead of 'mlock'".
Any comments (aside from the TODOs that I'm still working on)?
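
For reference, the register-based approach boils down to something like
the following minimal sketch (CUDA Driver API; 'cuInit' and context
setup are assumed to have happened already, error handling is minimal,
and the 'mlock' fallback comment describes the intended policy rather
than what the attached patch literally does):

    #include <stdlib.h>
    #include <cuda.h>

    /* Pin an ordinary heap allocation by *registering* it with the
       CUDA Driver, instead of allocating it via 'cuMemHostAlloc'.  */
    static void *
    alloc_pinned (size_t size)
    {
      void *ptr = malloc (size);
      if (ptr != NULL
          && cuMemHostRegister (ptr, size, 0) != CUDA_SUCCESS)
        {
          /* Registration failed; a fallback to plain 'mlock' would
             go here.  */
          free (ptr);
          return NULL;
        }
      return ptr;
    }

    static void
    free_pinned (void *ptr)
    {
      if (ptr != NULL)
        cuMemHostUnregister (ptr);
      free (ptr);
    }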


Regards
 Thomas



Comments

Li, Pan2 via Gcc-patches Feb. 16, 2023, 4:17 p.m. UTC | #1
> -----Original Message-----
> From: Thomas Schwinge <thomas@codesourcery.com>
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs <ams@codesourcery.com>; Jakub Jelinek <jakub@redhat.com>;
> Tobias Burnus <tobias@codesourcery.com>; gcc-patches@gcc.gnu.org
> Subject: Attempt to register OpenMP pinned memory using a device instead of
> 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
> 
> [...]
> 
> Indeed this does appear to work; see attached
> "[WIP] Attempt to register OpenMP pinned memory using a device instead of
> 'mlock'".
> Any comments (aside from the TODOs that I'm still working on)?

The mmap implementation was not optimized for large numbers of small allocations, and I can't see that issue changing here, so I don't know whether this can serve as an mlockall replacement.

I had assumed that using the Cuda allocator would fix that limitation.

Andrew
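
(One conceivable mitigation, sketched here purely for illustration and
not implemented by either patch: register one large slab with the
driver once and carve small pinned allocations out of it, amortizing
the per-registration cost.  All names below are hypothetical.)

    #include <stdlib.h>
    #include <cuda.h>

    static char *slab_base;
    static size_t slab_used, slab_size;

    /* Register one big block up front...  */
    static int
    slab_init (size_t size)
    {
      slab_base = malloc (size);
      if (slab_base == NULL)
        return 0;
      if (cuMemHostRegister (slab_base, size, 0) != CUDA_SUCCESS)
        {
          free (slab_base);
          slab_base = NULL;
          return 0;
        }
      slab_size = size;
      return 1;
    }

    /* ...then hand out small pinned pieces without further driver
       calls.  (Bump allocation only, no reuse, to keep the sketch
       short.)  */
    static void *
    slab_alloc (size_t size)
    {
      size = (size + 63) & ~(size_t) 63;  /* 64-byte alignment.  */
      if (slab_used + size > slab_size)
        return NULL;
      void *p = slab_base + slab_used;
      slab_used += size;
      return p;
    }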
  

Patch

From 97707db8602430e57b9f1c9c34da6a54ad9e2da9 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 16 Feb 2023 15:57:37 +0100
Subject: [PATCH] [WIP] Attempt to register OpenMP pinned memory using a device
 instead of 'mlock'

Implemented for nvptx offloading via 'cuMemHostRegister'.

This re-works og12 commit ab7520b3b4cd9fdabfd63652badde478955bd3b5
"libgomp: pinned memory".
---
 include/cuda/cuda.h                          |   3 +
 libgomp/config/linux/allocator.c             |  74 +++++++++-
 libgomp/libgomp-plugin.h                     |   2 +
 libgomp/libgomp.h                            |   4 +
 libgomp/plugin/cuda-lib.def                  |   3 +
 libgomp/plugin/plugin-nvptx.c                |  48 +++++++
 libgomp/target.c                             | 137 +++++++++++++++++++
 libgomp/testsuite/libgomp.c/alloc-pinned-1.c |  25 ++++
 libgomp/testsuite/libgomp.c/alloc-pinned-2.c |  25 ++++
 libgomp/testsuite/libgomp.c/alloc-pinned-3.c |  43 +++++-
 libgomp/testsuite/libgomp.c/alloc-pinned-4.c |  43 +++++-
 libgomp/testsuite/libgomp.c/alloc-pinned-5.c |  25 ++++
 libgomp/testsuite/libgomp.c/alloc-pinned-6.c |  34 ++++-
 13 files changed, 447 insertions(+), 19 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 062d394b95f..b0c7636d318 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -183,6 +183,9 @@  CUresult cuMemAlloc (CUdeviceptr *, size_t);
 CUresult cuMemAllocHost (void **, size_t);
 CUresult cuMemAllocManaged(CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemHostAlloc (void **, size_t, unsigned int);
+#define cuMemHostRegister cuMemHostRegister_v2
+CUresult cuMemHostRegister(void *, size_t, unsigned int);
+CUresult cuMemHostUnregister(void *);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index f278e5cdf14..81e64b268e9 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -24,6 +24,10 @@ 
 
 /* Implement malloc routines that can handle pinned memory on Linux.
 
+   Given that pinned memory is typically used to help host <-> device memory
+   transfers, we attempt to register such using a device (really: libgomp
+   plugin), but fall back to mlock if no suitable device is available.
+
    It's possible to use mlock on any heap memory, but using munlock is
    problematic if there are multiple pinned allocations on the same page.
    Tracking all that manually would be possible, but adds overhead. This may
@@ -37,6 +41,7 @@ 
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <string.h>
+#include <assert.h>
 #include "libgomp.h"
 
 static bool always_pinned_mode = false;
@@ -53,9 +58,15 @@  GOMP_enable_pinned_mode ()
     always_pinned_mode = true;
 }
 
+static int using_device_for_register_page_locked
+  = /* uninitialized */ -1;
+
 static void *
 linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
@@ -71,11 +82,32 @@  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
       if (addr == MAP_FAILED)
 	return NULL;
 
-      if (mlock (addr, size))
+      int using_device
+	= __atomic_load_n (&using_device_for_register_page_locked,
+			   MEMMODEL_RELAXED);
+      gomp_debug (0, "  using_device=%d\n",
+		  using_device);
+      if (using_device != 0)
+	{
+	  using_device = gomp_register_page_locked (addr, size);
+	  int using_device_old
+	    = __atomic_exchange_n (&using_device_for_register_page_locked,
+				   using_device, MEMMODEL_RELAXED);
+	  gomp_debug (0, "  using_device=%d, using_device_old=%d\n",
+		      using_device, using_device_old);
+	  assert (using_device_old == -1
+		  /* We shouldn't have concurrently changed our mind.  */
+		  || using_device_old == using_device);
+	}
+      if (using_device == 0)
 	{
-	  gomp_debug (0, "libgomp: failed to pin memory (ulimit too low?)\n");
-	  munmap (addr, size);
-	  return NULL;
+	  gomp_debug (0, "  mlock\n");
+	  if (mlock (addr, size))
+	    {
+	      gomp_debug (0, "libgomp: failed to pin memory (ulimit too low?)\n");
+	      munmap (addr, size);
+	      return NULL;
+	    }
 	}
 
       return addr;
@@ -87,6 +119,9 @@  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 static void *
 linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
@@ -107,13 +142,28 @@  static void
 linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
 		     int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, addr=%p, size=%llu, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) size, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
   if (memspace == ompx_unified_shared_mem_space)
     gomp_usm_free (addr, GOMP_DEVICE_ICV);
   else if (pin)
-    munmap (addr, size);
+    {
+      int using_device
+	= __atomic_load_n (&using_device_for_register_page_locked,
+			   MEMMODEL_RELAXED);
+      gomp_debug (0, "  using_device=%d\n",
+		  using_device);
+      if (using_device == 1)
+	gomp_unregister_page_locked (addr, size);
+      else
+	/* 'munlock'ing is implicit with following 'munmap'.  */
+	;
+      munmap (addr, size);
+    }
   else
     free (addr);
 }
@@ -122,6 +172,9 @@  static void *
 linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 			size_t oldsize, size_t size, int oldpin, int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, addr=%p, oldsize=%llu, size=%llu, oldpin=%d, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) oldsize, (unsigned long long) size, oldpin, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
@@ -129,6 +182,17 @@  linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
     goto manual_realloc;
   else if (oldpin && pin)
     {
+      /* We can only expect to be able to just 'mremap' if not using a device
+	 for registering page-locked memory.  */
+      int using_device
+	= __atomic_load_n (&using_device_for_register_page_locked,
+		       MEMMODEL_RELAXED);
+      gomp_debug (0, "  using_device=%d\n",
+		  using_device);
+      if (using_device != 0)
+	goto manual_realloc;
+
+      gomp_debug (0, "  mremap\n");
       void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
       if (newaddr == MAP_FAILED)
 	return NULL;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index bb79ef8d9d7..345fc62d4f5 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -144,6 +144,8 @@  extern bool GOMP_OFFLOAD_free (int, void *);
 extern void *GOMP_OFFLOAD_usm_alloc (int, size_t);
 extern bool GOMP_OFFLOAD_usm_free (int, void *);
 extern bool GOMP_OFFLOAD_is_usm_ptr (void *);
+extern bool GOMP_OFFLOAD_register_page_locked (void *, size_t);
+extern bool GOMP_OFFLOAD_unregister_page_locked (void *, size_t);
 extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_host2dev (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index f6fab788519..f8cf04746ac 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1136,6 +1136,8 @@  extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
 			     void *);
 extern void * gomp_usm_alloc (size_t size, int device_num);
 extern void gomp_usm_free (void *device_ptr, int device_num);
+extern bool gomp_register_page_locked (void *, size_t);
+extern void gomp_unregister_page_locked (void *, size_t);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
@@ -1395,6 +1397,8 @@  struct gomp_device_descr
   __typeof (GOMP_OFFLOAD_usm_alloc) *usm_alloc_func;
   __typeof (GOMP_OFFLOAD_usm_free) *usm_free_func;
   __typeof (GOMP_OFFLOAD_is_usm_ptr) *is_usm_ptr_func;
+  __typeof (GOMP_OFFLOAD_register_page_locked) *register_page_locked_func;
+  __typeof (GOMP_OFFLOAD_unregister_page_locked) *unregister_page_locked_func;
   __typeof (GOMP_OFFLOAD_dev2host) *dev2host_func;
   __typeof (GOMP_OFFLOAD_host2dev) *host2dev_func;
   __typeof (GOMP_OFFLOAD_dev2dev) *dev2dev_func;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index 9b786c9f2f6..8dbaadf848e 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -31,6 +31,9 @@  CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
 CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemHostAlloc)
+CUDA_ONE_CALL_MAYBE_NULL (cuMemHostRegister_v2)
+CUDA_ONE_CALL (cuMemHostRegister)
+CUDA_ONE_CALL (cuMemHostUnregister)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 2ebf17728fa..cbdf466dd05 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -77,11 +77,14 @@  extern CUresult cuGetErrorString (CUresult, const char **);
 CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t,
 			const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *);
+#undef cuMemHostRegister
+CUresult cuMemHostRegister (void *, size_t, unsigned int);
 #else
 typedef size_t (*CUoccupancyB2DSize)(int);
 CUresult cuLinkAddData_v2 (CUlinkState, CUjitInputType, void *, size_t,
 			   const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate_v2 (unsigned, CUjit_option *, void **, CUlinkState *);
+CUresult cuMemHostRegister_v2 (void *, size_t, unsigned int);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
 					  CUoccupancyB2DSize, size_t, int);
 #endif
@@ -361,6 +364,9 @@  nvptx_thread (void)
 static bool
 nvptx_init (void)
 {
+  GOMP_PLUGIN_debug (0, "%s\n",
+		     __FUNCTION__);
+
   int ndevs;
 
   if (instantiated_devices != 0)
@@ -614,6 +620,9 @@  nvptx_close_device (struct ptx_device *ptx_dev)
 static int
 nvptx_get_num_devices (void)
 {
+  GOMP_PLUGIN_debug (0, "%s\n",
+		     __FUNCTION__);
+
   int n;
 
   /* This function will be called before the plugin has been initialized in
@@ -1704,6 +1713,45 @@  GOMP_OFFLOAD_is_usm_ptr (void *ptr)
   return managed;
 }
 
+bool
+GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
+{
+  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu\n",
+		     __FUNCTION__, ptr, (unsigned long long) size);
+
+  // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223
+  // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c
+
+  /* 'cuMemHostRegister' "page-locks the memory range specified".  */
+
+  unsigned int flags = /*TODO*/ 0;
+#if 0
+  //TODO
+#define CU_MEMHOSTREGISTER_PORTABLE 0x01
+  flags |= CU_MEMHOSTREGISTER_PORTABLE;
+#endif
+  //TODO Do we need some more elaborate error management instead of this 'return false' for '!CUDA_SUCCESS'?
+  if (CUDA_CALL_EXISTS (cuMemHostRegister_v2))
+    CUDA_CALL (cuMemHostRegister_v2, ptr, size, flags);
+  else
+    CUDA_CALL (cuMemHostRegister, ptr, size, flags);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_unregister_page_locked (void *ptr, size_t size)
+{
+  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu\n",
+		     __FUNCTION__, ptr, (unsigned long long) size);
+
+  // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g63f450c8125359be87b7623b1c0b2a14
+  // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g81fd4101862bbefdb42a62d60e515eea
+
+  //TODO Do we need some more elaborate error management instead of this 'return false' for '!CUDA_SUCCESS'?
+  CUDA_CALL (cuMemHostUnregister, ptr);
+  return true;
+}
+
 void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 			   void **hostaddrs, void **devaddrs,
diff --git a/libgomp/target.c b/libgomp/target.c
index 1b911c9bdb9..e7285188d1e 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4584,6 +4584,141 @@  gomp_usm_free (void *device_ptr, int device_num)
   gomp_mutex_unlock (&devicep->lock);
 }
 
+
+/* Device (really: libgomp plugin) for registering paged-locked memory.  We
+   assume there is either none or exactly one such device for the lifetime of
+   the process.  */
+
+static struct gomp_device_descr *device_for_register_page_locked
+  = /* uninitialized */ (void *) -1;
+
+static struct gomp_device_descr *
+get_device_for_register_page_locked (void)
+{
+  gomp_debug (0, "%s\n",
+	      __FUNCTION__);
+
+  struct gomp_device_descr *device;
+#ifdef HAVE_SYNC_BUILTINS
+  device
+    = __atomic_load_n (&device_for_register_page_locked, MEMMODEL_RELAXED);
+  if (device == (void *) -1)
+    {
+      gomp_debug (0, "  init\n");
+
+      gomp_init_targets_once ();
+
+      device = NULL;
+      for (int i = 0; i < num_devices; ++i)
+	{
+	  gomp_debug (0, "  i=%d, target_id=%d\n",
+		      i, devices[i].target_id);
+
+	  /* We consider only the first device of potentially several of the
+	     same type as this functionality is not specific to an individual
+	     offloading device, but instead relates to the host-side
+	     implementation of the respective offloading implementation.  */
+	  if (devices[i].target_id != 0)
+	    continue;
+
+	  if (!devices[i].register_page_locked_func)
+	    continue;
+
+	  gomp_debug (0, "  found device: %p (%s)\n",
+		      &devices[i], devices[i].name);
+	  if (device)
+	    gomp_fatal ("Unclear how %s and %s libgomp plugins may"
+			" simultaneously provide functionality"
+			" to register page-locked memory",
+			device->name, devices[i].name);
+	  else
+	    device = &devices[i];
+	}
+
+      struct gomp_device_descr *device_old
+	= __atomic_exchange_n (&device_for_register_page_locked, device,
+			       MEMMODEL_RELAXED);
+      gomp_debug (0, "  old device_for_register_page_locked: %p\n",
+		  device_old);
+      assert (device_old == (void *) -1
+	      /* We shouldn't have concurrently found a different or no
+		 device.  */
+	      || device_old == device);
+    }
+#else /* !HAVE_SYNC_BUILTINS */
+  gomp_debug (0, "  not implemented for '!HAVE_SYNC_BUILTINS'\n");
+  (void) &device_for_register_page_locked;
+  device = NULL;
+#endif /* HAVE_SYNC_BUILTINS */
+
+  gomp_debug (0, "  -> device=%p (%s)\n",
+	      device, device ? device->name : "[none]");
+  return device;
+}
+
+/* Register page-locked memory region.
+   Returns whether we have a device capable of that.  */
+
+attribute_hidden bool
+gomp_register_page_locked (void *ptr, size_t size)
+{
+  gomp_debug (0, "%s: ptr=%p, size=%llu\n",
+	      __FUNCTION__, ptr, (unsigned long long) size);
+
+  struct gomp_device_descr *device = get_device_for_register_page_locked ();
+  gomp_debug (0, "  device=%p (%s)\n",
+	      device, device ? device->name : "[none]");
+  if (device)
+    {
+      gomp_mutex_lock (&device->lock);
+      if (device->state == GOMP_DEVICE_UNINITIALIZED)
+	gomp_init_device (device);
+      else if (device->state == GOMP_DEVICE_FINALIZED)
+	{
+	  gomp_mutex_unlock (&device->lock);
+	  gomp_fatal ("Device %s for registering page-locked memory"
+		      " is finalized", device->name);
+	}
+      gomp_mutex_unlock (&device->lock);
+
+      if (!device->register_page_locked_func (ptr, size))
+	gomp_fatal ("Failed to register page-locked memory"
+		    " via %s libgomp plugin",
+		    device->name);
+    }
+  return device != NULL;
+}
+
+/* Unregister page-locked memory region.
+   This must only be called if 'gomp_register_page_locked' returned 'true'.  */
+
+attribute_hidden void
+gomp_unregister_page_locked (void *ptr, size_t size)
+{
+  gomp_debug (0, "%s: ptr=%p\n",
+	      __FUNCTION__, ptr);
+
+  struct gomp_device_descr *device = get_device_for_register_page_locked ();
+  gomp_debug (0, "  device=%p (%s)\n",
+	      device, device ? device->name : "[none]");
+  assert (device);
+
+  gomp_mutex_lock (&device->lock);
+  assert (device->state != GOMP_DEVICE_UNINITIALIZED);
+  if (device->state == GOMP_DEVICE_FINALIZED)
+    {
+      gomp_mutex_unlock (&device->lock);
+      return;
+    }
+  gomp_mutex_unlock (&device->lock);
+
+  if (!device->unregister_page_locked_func (ptr, size))
+    gomp_fatal ("Failed to unregister page-locked memory"
+		" via %s libgomp plugin",
+		device->name);
+}
+
+
 int
 omp_target_is_present (const void *ptr, int device_num)
 {
@@ -5268,6 +5403,8 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM_OPT (usm_alloc, usm_alloc);
   DLSYM_OPT (usm_free, usm_free);
   DLSYM_OPT (is_usm_ptr, is_usm_ptr);
+  DLSYM_OPT (register_page_locked, register_page_locked);
+  DLSYM_OPT (unregister_page_locked, unregister_page_locked);
   DLSYM (dev2host);
   DLSYM (host2dev);
   DLSYM (evaluate_device);
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
index fb7ac8b0080..bd71e22b003 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
@@ -2,6 +2,8 @@ 
 
 /* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory works.  */
 
 #include <stdio.h>
@@ -67,9 +69,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE;
   CHECK_SIZE (SIZE*3);
+#endif
 
   const omp_alloctrait_t traits[] = {
       { omp_atk_pinned, 1 }
@@ -85,19 +92,37 @@  main ()
     abort ();
 
   int amount = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount != 0)
+    abort ();
+#else
   if (amount == 0)
     abort ();
+#endif
 
   p = omp_realloc (p, SIZE*2, allocator, allocator);
 
   int amount2 = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount2 != 0)
+    abort ();
+#else
   if (amount2 <= amount)
     abort ();
+#endif
 
   p = omp_calloc (1, SIZE, allocator);
 
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (get_pinned_mem () != 0)
+    abort ();
+#else
   if (get_pinned_mem () <= amount2)
     abort ();
+#endif
 
   verify0 (p, SIZE);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
index 651b89fb42f..c71248b046d 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
@@ -2,6 +2,8 @@ 
 
 /* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory works (pool_size code path).  */
 
 #include <stdio.h>
@@ -67,9 +69,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE;
   CHECK_SIZE (SIZE*3);
+#endif
 
   const omp_alloctrait_t traits[] = {
       { omp_atk_pinned, 1 },
@@ -87,23 +94,41 @@  main ()
     abort ();
 
   int amount = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount != 0)
+    abort ();
+#else
   if (amount == 0)
     abort ();
+#endif
 
   p = omp_realloc (p, SIZE*2, allocator, allocator);
   if (!p)
     abort ();
 
   int amount2 = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount2 != 0)
+    abort ();
+#else
   if (amount2 <= amount)
     abort ();
+#endif
 
   p = omp_calloc (1, SIZE, allocator);
   if (!p)
     abort ();
 
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (get_pinned_mem () != 0)
+    abort ();
+#else
   if (get_pinned_mem () <= amount2)
     abort ();
+#endif
 
   verify0 (p, SIZE);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
index f41797881ef..26b0c352d85 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
@@ -1,5 +1,7 @@ 
 /* { dg-do run } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory fails correctly.  */
 
 #include <stdio.h>
@@ -74,8 +76,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* This needs to be large enough to cover multiple pages.  */
   const int SIZE = PAGE_SIZE*4;
+#endif
+  const int PIN_LIMIT = PAGE_SIZE*2;
 
   /* Pinned memory, no fallback.  */
   const omp_alloctrait_t traits1[] = {
@@ -92,21 +100,33 @@  main ()
   omp_allocator_handle_t allocator2 = omp_init_allocator (omp_default_mem_space, 2, traits2);
 
   /* Ensure that the limit is smaller than the allocation.  */
-  set_pin_limit (SIZE/2);
+  set_pin_limit (PIN_LIMIT);
 
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
 
-  // Should fail
   void *p = omp_alloc (SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail
   p = omp_calloc (1, SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
   // Should fall back
   p = omp_alloc (SIZE, allocator2);
@@ -119,16 +139,29 @@  main ()
     abort ();
   verify0 (p, SIZE);
 
-  // Should fail to realloc
   void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
   p = omp_realloc (notpinned, SIZE, allocator1, omp_default_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'; does reallocate.
+  if (!notpinned || !p || p == notpinned)
+    abort ();
+#else
+  // Should fail to realloc
   if (!notpinned || p)
     abort ();
+#endif
 
-  // Should fall back to no realloc needed
+#ifdef OFFLOAD_DEVICE_NVPTX
+  void *p_ = omp_realloc (p, SIZE, allocator2, allocator1);
+  // Does reallocate.
+  if (p_ == p)
+    abort ();
+#else
   p = omp_realloc (notpinned, SIZE, allocator2, omp_default_mem_alloc);
+  // Should fall back to no realloc needed
   if (p != notpinned)
     abort ();
+#endif
 
   // No memory should have been pinned
   int amount = get_pinned_mem ();
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
index a878da8c558..0bd6a552d94 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
@@ -1,5 +1,7 @@ 
 /* { dg-do run } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory fails correctly, pool_size code path.  */
 
 #include <stdio.h>
@@ -74,8 +76,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* This needs to be large enough to cover multiple pages.  */
   const int SIZE = PAGE_SIZE*4;
+#endif
+  const int PIN_LIMIT = PAGE_SIZE*2;
 
   /* Pinned memory, no fallback.  */
   const omp_alloctrait_t traits1[] = {
@@ -94,21 +102,33 @@  main ()
   omp_allocator_handle_t allocator2 = omp_init_allocator (omp_default_mem_space, 3, traits2);
 
   /* Ensure that the limit is smaller than the allocation.  */
-  set_pin_limit (SIZE/2);
+  set_pin_limit (PIN_LIMIT);
 
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
 
-  // Should fail
   void *p = omp_alloc (SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail
   p = omp_calloc (1, SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
   // Should fall back
   p = omp_alloc (SIZE, allocator2);
@@ -121,16 +141,29 @@  main ()
     abort ();
   verify0 (p, SIZE);
 
-  // Should fail to realloc
   void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
   p = omp_realloc (notpinned, SIZE, allocator1, omp_default_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'; does reallocate.
+  if (!notpinned || !p || p == notpinned)
+    abort ();
+#else
+  // Should fail to realloc
   if (!notpinned || p)
     abort ();
+#endif
 
-  // Should fall back to no realloc needed
+#ifdef OFFLOAD_DEVICE_NVPTX
+  void *p_ = omp_realloc (p, SIZE, allocator2, allocator1);
+  // Does reallocate.
+  if (p_ == p)
+    abort ();
+#else
   p = omp_realloc (notpinned, SIZE, allocator2, omp_default_mem_alloc);
+  // Should fall back to no realloc needed
   if (p != notpinned)
     abort ();
+#endif
 
   // No memory should have been pinned
   int amount = get_pinned_mem ();
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
index 65983b3d03d..623c96a78e3 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
@@ -2,6 +2,8 @@ 
 
 /* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that ompx_pinned_mem_alloc works.  */
 
 #include <stdio.h>
@@ -67,9 +69,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE;
   CHECK_SIZE (SIZE*3);
+#endif
 
   // Sanity check
   if (get_pinned_mem () != 0)
@@ -80,19 +87,37 @@  main ()
     abort ();
 
   int amount = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount != 0)
+    abort ();
+#else
   if (amount == 0)
     abort ();
+#endif
 
   p = omp_realloc (p, SIZE*2, ompx_pinned_mem_alloc, ompx_pinned_mem_alloc);
 
   int amount2 = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount2 != 0)
+    abort ();
+#else
   if (amount2 <= amount)
     abort ();
+#endif
 
   p = omp_calloc (1, SIZE, ompx_pinned_mem_alloc);
 
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (get_pinned_mem () != 0)
+    abort ();
+#else
   if (get_pinned_mem () <= amount2)
     abort ();
+#endif
 
   verify0 (p, SIZE);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
index bbe20c04875..c0f8b260e37 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
@@ -1,5 +1,7 @@ 
 /* { dg-do run } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that ompx_pinned_mem_alloc fails correctly.  */
 
 #include <stdio.h>
@@ -66,31 +68,55 @@  set_pin_limit ()
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE*4;
+#endif
+  const int PIN_LIMIT = PAGE_SIZE*2;
 
   /* Ensure that the limit is smaller than the allocation.  */
-  set_pin_limit (SIZE/2);
+  set_pin_limit (PIN_LIMIT);
 
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
 
-  // Should fail
   void *p = omp_alloc (SIZE, ompx_pinned_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail
   p = omp_calloc (1, SIZE, ompx_pinned_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail to realloc
   void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
   p = omp_realloc (notpinned, SIZE, ompx_pinned_mem_alloc, omp_default_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'; does reallocate.
+  if (!notpinned || !p || p == notpinned)
+    abort ();
+#else
+  // Should fail to realloc
   if (!notpinned || p)
     abort ();
+#endif
 
   // No memory should have been pinned
   int amount = get_pinned_mem ();
-- 
2.25.1