libgomp, openmp: pinned memory

Message ID f5260c95-6c71-99a7-3bf2-774380444082@codesourcery.com
State Superseded
Series libgomp, openmp: pinned memory

Commit Message

Andrew Stubbs Jan. 4, 2022, 3:32 p.m. UTC
  This patch implements the OpenMP pinned memory trait for Linux hosts. On 
other hosts and on devices the trait becomes a no-op (instead of being 
rejected).

The memory is locked via the mlock syscall, which is both the "correct" 
way to do it on Linux, and a problem because the default ulimit for 
pinned memory is very small (and most users don't have permission to 
increase it (much?)). Therefore the code emits a non-fatal warning 
message if locking fails.

Another approach might be to use cudaHostAlloc to allocate the memory in 
the first place, which bypasses the ulimit somehow, but this would not 
help non-NVidia users.

The tests work on Linux and will xfail on other hosts; neither libgomp 
nor the test knows how to allocate or query pinned memory elsewhere.

The patch applies on top of the text of my previously submitted patches, 
but does not actually depend on the functionality of those patches.

OK for stage 1?

I'll commit a backport to OG11 shortly.

Andrew
libgomp: pinned memory

Implement the OpenMP pinned memory trait on Linux hosts using the mlock
syscall.

libgomp/ChangeLog:

	* allocator.c (MEMSPACE_PIN): New macro.
	(xmlock): New function.
	(omp_init_allocator): Don't disallow the pinned trait.
	(omp_aligned_alloc): Add pinning via MEMSPACE_PIN.
	(omp_aligned_calloc): Likewise.
	(omp_realloc): Likewise.
	* testsuite/libgomp.c/alloc-pinned-1.c: New test.
	* testsuite/libgomp.c/alloc-pinned-2.c: New test.
  

Comments

Jakub Jelinek Jan. 4, 2022, 3:55 p.m. UTC | #1
On Tue, Jan 04, 2022 at 03:32:17PM +0000, Andrew Stubbs wrote:
> This patch implements the OpenMP pinned memory trait for Linux hosts. On
> other hosts and on devices the trait becomes a no-op (instead of being
> rejected).
> 
> The memory is locked via the mlock syscall, which is both the "correct" way
> to do it on Linux, and a problem because the default ulimit for pinned
> memory is very small (and most users don't have permission to increase it
> (much?)). Therefore the code emits a non-fatal warning message if locking
> fails.
> 
> Another approach might be to use cudaHostAlloc to allocate the memory in the
> first place, which bypasses the ulimit somehow, but this would not help
> non-NVidia users.
> 
> The tests work on Linux and will xfail on other hosts; neither libgomp nor
> the test knows how to allocate or query pinned memory elsewhere.
> 
> The patch applies on top of the text of my previously submitted patches, but
> does not actually depend on the functionality of those patches.
> 
> OK for stage 1?
> 
> I'll commit a backport to OG11 shortly.
> 
> Andrew

> libgomp: pinned memory
> 
> Implement the OpenMP pinned memory trait on Linux hosts using the mlock
> syscall.
> 
> libgomp/ChangeLog:
> 
> 	* allocator.c (MEMSPACE_PIN): New macro.
> 	(xmlock): New function.
> 	(omp_init_allocator): Don't disallow the pinned trait.
> 	(omp_aligned_alloc): Add pinning via MEMSPACE_PIN.
> 	(omp_aligned_calloc): Likewise.
> 	(omp_realloc): Likewise.
> 	* testsuite/libgomp.c/alloc-pinned-1.c: New test.
> 	* testsuite/libgomp.c/alloc-pinned-2.c: New test.
> 
> diff --git a/libgomp/allocator.c b/libgomp/allocator.c
> index b1f5fe0a5e2..671b91e7ff8 100644
> --- a/libgomp/allocator.c
> +++ b/libgomp/allocator.c
> @@ -51,6 +51,25 @@
>  #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
>    ((void)MEMSPACE, (void)SIZE, free (ADDR))
>  #endif
> +#ifndef MEMSPACE_PIN
> +/* Only define this on supported host platforms.  */
> +#ifdef __linux__
> +#define MEMSPACE_PIN(MEMSPACE, ADDR, SIZE) \
> +  ((void)MEMSPACE, xmlock (ADDR, SIZE))
> +
> +#include <sys/mman.h>
> +#include <stdio.h>
> +void
> +xmlock (void *addr, size_t size)
> +{
> +  if (mlock (addr, size))
> +      perror ("libgomp: failed to pin memory (ulimit too low?)");
> +}
> +#else
> +#define MEMSPACE_PIN(MEMSPACE, ADDR, SIZE) \
> +  ((void)MEMSPACE, (void)ADDR, (void)SIZE)
> +#endif
> +#endif

The usual libgomp way of doing this wouldn't be to use #ifdef __linux__, but
instead add libgomp/config/linux/allocator.c that includes some headers,
defines some macros and then includes the generic allocator.c.

I think perror is the wrong thing to do, omp_alloc etc. has a well defined
interface what to do in such cases - the allocation should just fail (not be
allocated) and depending on user's choice that can be fatal, or return NULL,
or chain to some other allocator with other properties etc.

Other issues in the patch are that it doesn't munlock on deallocation and
that because of that deallocation we need to figure out what to do on page
boundaries.  As documented, mlock can be passed address and/or address +
size that aren't at page boundaries and pinning happens even just for
partially touched pages.  But munlock unpins also even the partially
overlapping pages and we don't know at that point whether some other pinned
allocations don't appear in those pages.
Some bad options are only pin pages wholly contained within the allocation
and don't pin partial pages around it, force at least page alignment and
size so that everything can be pinned, somehow ensure that we never allocate
more than one pinned allocation in such partial pages (but can allocate
there non-pinned allocations), or e.g. use some internal data structure to
track how many pinned allocations are on the partial pages (say a hash map
from page start address to a counter how many pinned allocations are there,
if it goes to 0 munlock even that page, otherwise munlock just the wholly
contained pages), or perhaps use page size aligned allocation and size and
just remember in some data structure that the partial pages could be used
for other pinned (small) allocations.

	Jakub
  
Andrew Stubbs Jan. 4, 2022, 4:58 p.m. UTC | #2
On 04/01/2022 15:55, Jakub Jelinek wrote:
> The usual libgomp way of doing this wouldn't be to use #ifdef __linux__, but
> instead add libgomp/config/linux/allocator.c that includes some headers,
> defines some macros and then includes the generic allocator.c.

OK, good point, I can do that.

> I think perror is the wrong thing to do, omp_alloc etc. has a well defined
> interface what to do in such cases - the allocation should just fail (not be
> allocated) and depending on user's choice that can be fatal, or return NULL,
> or chain to some other allocator with other properties etc.

I did it this way because pinning feels more like an optimization, and 
falling back to "just works" seemed like what users would want to 
happen. The perror was added because it turns out the default ulimit is 
tiny and I wanted to hint at the solution.

I guess you're right that the consistent behaviour would be to silently 
switch to the fallback allocator, but it still feels like users will be 
left in the dark about why it failed.

> Other issues in the patch are that it doesn't munlock on deallocation and
> that because of that deallocation we need to figure out what to do on page
> boundaries.  As documented, mlock can be passed address and/or address +
> size that aren't at page boundaries and pinning happens even just for
> partially touched pages.  But munlock unpins also even the partially
> overlapping pages and we don't know at that point whether some other pinned
> allocations don't appear in those pages.

Right, it doesn't munlock because of these issues. I don't know of any 
way to solve this that wouldn't involve building tables of locked ranges 
(and knowing what the page size is).

I considered using mmap with the lock flag instead, but the failure mode 
looked unhelpful. I guess we could mmap with the regular flags, then 
mlock after. That should bypass the regular heap and ensure each 
allocation has its own page. I'm not sure what the unintended
side-effects of that might be.

> Some bad options are only pin pages wholly contained within the allocation
> and don't pin partial pages around it, force at least page alignment and
> size so that everything can be pinned, somehow ensure that we never allocate
> more than one pinned allocation in such partial pages (but can allocate
> there non-pinned allocations), or e.g. use some internal data structure to
> track how many pinned allocations are on the partial pages (say a hash map
> from page start address to a counter how many pinned allocations are there,
> if it goes to 0 munlock even that page, otherwise munlock just the wholly
> contained pages), or perhaps use page size aligned allocation and size and
> just remember in some data structure that the partial pages could be used
> for other pinned (small) allocations.

Bad options indeed. If any part of the memory block is not pinned I 
expect no performance gains whatsoever. And all this other business adds 
complexity and runtime overhead.

For version 1.0 it feels reasonable to omit the unlock step and hope 
that a) pinned data will be long-lived, or b) short-lived pinned data 
will be replaced with more data that -- most likely -- occupies the same 
pages.

Similarly, it seems likely that serious HPC applications will run on 
devices with lots of RAM, and if not, any page swapping will destroy the
performance gains of using OpenMP.

For now I'll just fix the architectural issues.

Andrew
  
Jakub Jelinek Jan. 4, 2022, 6:28 p.m. UTC | #3
On Tue, Jan 04, 2022 at 04:58:19PM +0000, Andrew Stubbs wrote:
> > I think perror is the wrong thing to do, omp_alloc etc. has a well defined
> > interface what to do in such cases - the allocation should just fail (not be
> > allocated) and depending on user's choice that can be fatal, or return NULL,
> > or chain to some other allocator with other properties etc.
> 
> I did it this way because pinning feels more like an optimization, and
> falling back to "just works" seemed like what users would want to happen.
> The perror was added because it turns out the default ulimit is tiny and I
> wanted to hint at the solution.

Something like perror might be acceptable for GOMP_DEBUG mode, but not
normal operation.  So perhaps use gomp_debug there instead?

If it is just an optimization for the user, they should be using the
chaining to corresponding allocator without the pinning to make it clear
what they want and also standard conforming.

> > Other issues in the patch are that it doesn't munlock on deallocation and
> > that because of that deallocation we need to figure out what to do on page
> > boundaries.  As documented, mlock can be passed address and/or address +
> > size that aren't at page boundaries and pinning happens even just for
> > partially touched pages.  But munlock unpins also even the partially
> > overlapping pages and we don't know at that point whether some other pinned
> > allocations don't appear in those pages.
> 
> Right, it doesn't munlock because of these issues. I don't know of any way
> to solve this that wouldn't involve building tables of locked ranges (and
> knowing what the page size is).
> 
> I considered using mmap with the lock flag instead, but the failure mode
> looked unhelpful. I guess we could mmap with the regular flags, then mlock
> after. That should bypass the regular heap and ensure each allocation has
> its own page. I'm not sure what the unintended side-effects of that might
> be.

But the munlock is even more important because of the low ulimit -l, because
if munlock isn't done on deallocation, the by default I think 64KB limit
will be reached even much earlier.  If most users have just 64KB limit on
pinned memory per process, then that most likely asks for grabbing such memory
in whole pages and doing memory management on that resource.
Because wasting that precious memory on the partial pages which will most
likely get non-pinned allocations when we just have 16 such pages is a big
waste.

	Jakub
  
Jakub Jelinek Jan. 4, 2022, 6:47 p.m. UTC | #4
On Tue, Jan 04, 2022 at 07:28:29PM +0100, Jakub Jelinek via Gcc-patches wrote:
> > > Other issues in the patch are that it doesn't munlock on deallocation and
> > > that because of that deallocation we need to figure out what to do on page
> > > boundaries.  As documented, mlock can be passed address and/or address +
> > > size that aren't at page boundaries and pinning happens even just for
> > > partially touched pages.  But munlock unpins also even the partially
> > > overlapping pages and we don't know at that point whether some other pinned
> > > allocations don't appear in those pages.
> > 
> > Right, it doesn't munlock because of these issues. I don't know of any way
> > to solve this that wouldn't involve building tables of locked ranges (and
> > knowing what the page size is).
> > 
> > I considered using mmap with the lock flag instead, but the failure mode
> > looked unhelpful. I guess we could mmap with the regular flags, then mlock
> > after. That should bypass the regular heap and ensure each allocation has
> > its own page. I'm not sure what the unintended side-effects of that might
> > be.
> 
> But the munlock is even more important because of the low ulimit -l, because
> if munlock isn't done on deallocation, the by default I think 64KB limit
> will be reached even much earlier.  If most users have just 64KB limit on
> pinned memory per process, then that most likely asks for grabbing such memory
> in whole pages and doing memory management on that resource.
> Because wasting that precious memory on the partial pages which will most
> likely get non-pinned allocations when we just have 16 such pages is a big
> waste.

E.g. if we start using (dynamically, using dlopen/dlsym etc.) the memkind
library for some of the allocators, for the pinned memory we could use
e.g. the memkind_create_fixed API - on the first pinned allocation, check
what is the ulimit -l and if it is fairly small, mmap PROT_NONE the whole
pinned size (but don't pin it whole at start, just whatever we need as we
go).

	Jakub
  
Andrew Stubbs Jan. 5, 2022, 5:07 p.m. UTC | #5
On 04/01/2022 18:47, Jakub Jelinek wrote:
> On Tue, Jan 04, 2022 at 07:28:29PM +0100, Jakub Jelinek via Gcc-patches wrote:
>>>> Other issues in the patch are that it doesn't munlock on deallocation and
>>>> that because of that deallocation we need to figure out what to do on page
>>>> boundaries.  As documented, mlock can be passed address and/or address +
>>>> size that aren't at page boundaries and pinning happens even just for
>>>> partially touched pages.  But munlock unpins also even the partially
>>>> overlapping pages and we don't know at that point whether some other pinned
>>>> allocations don't appear in those pages.
>>>
>>> Right, it doesn't munlock because of these issues. I don't know of any way
>>> to solve this that wouldn't involve building tables of locked ranges (and
>>> knowing what the page size is).
>>>
>>> I considered using mmap with the lock flag instead, but the failure mode
>>> looked unhelpful. I guess we could mmap with the regular flags, then mlock
>>> after. That should bypass the regular heap and ensure each allocation has
>>> its own page. I'm not sure what the unintended side-effects of that might
>>> be.
>>
>> But the munlock is even more important because of the low ulimit -l, because
>> if munlock isn't done on deallocation, the by default I think 64KB limit
>> will be reached even much earlier.  If most users have just 64KB limit on
>> pinned memory per process, then that most likely asks for grabbing such memory
>> in whole pages and doing memory management on that resource.
>> Because wasting that precious memory on the partial pages which will most
>> likely get non-pinned allocations when we just have 16 such pages is a big
>> waste.
> 
> E.g. if we start using (dynamically, using dlopen/dlsym etc.) the memkind
> library for some of the allocators, for the pinned memory we could use
> e.g. the memkind_create_fixed API - on the first pinned allocation, check
> what is the ulimit -l and if it is fairly small, mmap PROT_NONE the whole
> pinned size (but don't pin it whole at start, just whatever we need as we
> go).

I don't believe 64KB will be anything like enough for any real HPC 
application. Is it really worth optimizing for this case?

Anyway, I'm working on an implementation using mmap instead of malloc 
for pinned allocations. I figure that will simplify the unpin algorithm 
(because it'll be munmap) and optimize for large allocations such as I 
imagine HPC applications will use. It won't fix the ulimit issue.

Andrew
  
Andrew Stubbs Jan. 13, 2022, 1:53 p.m. UTC | #6
On 05/01/2022 17:07, Andrew Stubbs wrote:
> I don't believe 64KB will be anything like enough for any real HPC 
> application. Is it really worth optimizing for this case?
> 
> Anyway, I'm working on an implementation using mmap instead of malloc 
> for pinned allocations. I figure that will simplify the unpin algorithm 
> (because it'll be munmap) and optimize for large allocations such as I 
> imagine HPC applications will use. It won't fix the ulimit issue.

Here's my new patch.

This version is intended to apply on top of the latest version of my 
low-latency allocator patch, although the dependency is mostly textual.

Pinned memory is allocated via mmap + mlock, and allocation fails 
(returns NULL) if the lock fails and there's no fallback configured.

This means that large allocations will now be page aligned and therefore 
pin the smallest number of pages for the size requested, and that that 
memory will be unpinned automatically when freed via munmap, or moved 
via mremap.

Obviously this is not ideal for allocations much smaller than one page. 
If that turns out to be a problem in the real world then we can add a 
special case fairly straightforwardly, and incur the extra page
tracking expense in those cases only, or maybe implement our own 
pinned-memory heap (something like already proposed for low-latency 
memory, perhaps).

Also new is a realloc implementation that works better when reallocation 
fails. This is confirmed by the new testcases.

OK for stage 1?

Thanks

Andrew
libgomp: pinned memory

Implement the OpenMP pinned memory trait on Linux hosts using the mlock
syscall.  Pinned allocations are performed using mmap, not malloc, to ensure
that they can be unpinned safely when freed.

libgomp/ChangeLog:

	* allocator.c (MEMSPACE_ALLOC): Add PIN.
	(MEMSPACE_CALLOC): Add PIN.
	(MEMSPACE_REALLOC): Add PIN.
	(MEMSPACE_FREE): Add PIN.
	(xmlock): New function.
	(omp_init_allocator): Don't disallow the pinned trait.
	(omp_aligned_alloc): Add pinning to all MEMSPACE_* calls.
	(omp_aligned_calloc): Likewise.
	(omp_realloc): Likewise.
	(omp_free): Likewise.
	* config/linux/allocator.c: New file.
	* config/nvptx/allocator.c (MEMSPACE_ALLOC): Add PIN.
	(MEMSPACE_CALLOC): Add PIN.
	(MEMSPACE_REALLOC): Add PIN.
	(MEMSPACE_FREE): Add PIN.
	* testsuite/libgomp.c/alloc-pinned-1.c: New test.
	* testsuite/libgomp.c/alloc-pinned-2.c: New test.
	* testsuite/libgomp.c/alloc-pinned-3.c: New test.
	* testsuite/libgomp.c/alloc-pinned-4.c: New test.

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index 1cc7486fc4c..5ab161b6314 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -36,16 +36,20 @@
 
 /* These macros may be overridden in config/<target>/allocator.c.  */
 #ifndef MEMSPACE_ALLOC
-#define MEMSPACE_ALLOC(MEMSPACE, SIZE) malloc (SIZE)
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \
+  (PIN ? NULL : malloc (SIZE))
 #endif
 #ifndef MEMSPACE_CALLOC
-#define MEMSPACE_CALLOC(MEMSPACE, SIZE) calloc (1, SIZE)
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \
+  (PIN ? NULL : calloc (1, SIZE))
 #endif
 #ifndef MEMSPACE_REALLOC
-#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) realloc (ADDR, SIZE)
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \
+  ((PIN) || (OLDPIN) ? NULL : realloc (ADDR, SIZE))
 #endif
 #ifndef MEMSPACE_FREE
-#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) free (ADDR)
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \
+  (PIN ? NULL : free (ADDR))
 #endif
 
 /* Map the predefined allocators to the correct memory space.
@@ -208,7 +212,7 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
     data.alignment = sizeof (void *);
 
   /* No support for these so far (for hbw will use memkind).  */
-  if (data.pinned || data.memspace == omp_high_bw_mem_space)
+  if (data.memspace == omp_high_bw_mem_space)
     return omp_null_allocator;
 
   ret = gomp_malloc (sizeof (struct omp_allocator_data));
@@ -309,7 +313,8 @@ retry:
       allocator_data->used_pool_size = used_pool_size;
       gomp_mutex_unlock (&allocator_data->lock);
 #endif
-      ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size);
+      ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size,
+			    allocator_data->pinned);
       if (ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -329,7 +334,8 @@ retry:
 	= (allocator_data
 	   ? allocator_data->memspace
 	   : predefined_alloc_mapping[allocator]);
-      ptr = MEMSPACE_ALLOC (memspace, new_size);
+      ptr = MEMSPACE_ALLOC (memspace, new_size,
+			    allocator_data && allocator_data->pinned);
       if (ptr == NULL)
 	goto fail;
     }
@@ -356,9 +362,9 @@ fail:
     {
     case omp_atv_default_mem_fb:
       if ((new_alignment > sizeof (void *) && new_alignment > alignment)
-	  || (allocator_data
-	      && allocator_data->pool_size < ~(uintptr_t) 0)
-	  || !allocator_data)
+	  || !allocator_data
+	  || allocator_data->pool_size < ~(uintptr_t) 0
+	  || allocator_data->pinned)
 	{
 	  allocator = omp_default_mem_alloc;
 	  goto retry;
@@ -410,6 +416,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator)
   struct omp_mem_header *data;
   omp_memspace_handle_t memspace __attribute__((unused))
     = omp_default_mem_space;
+  int pinned __attribute__((unused)) = false;
 
   if (ptr == NULL)
     return;
@@ -432,11 +439,12 @@ omp_free (void *ptr, omp_allocator_handle_t allocator)
 	}
 
       memspace = allocator_data->memspace;
+      pinned = allocator_data->pinned;
     }
   else
     memspace = predefined_alloc_mapping[data->allocator];
 
-  MEMSPACE_FREE (memspace, data->ptr, data->size);
+  MEMSPACE_FREE (memspace, data->ptr, data->size, pinned);
 }
 
 ialias (omp_free)
@@ -524,7 +532,8 @@ retry:
       allocator_data->used_pool_size = used_pool_size;
       gomp_mutex_unlock (&allocator_data->lock);
 #endif
-      ptr = MEMSPACE_CALLOC (allocator_data->memspace, new_size);
+      ptr = MEMSPACE_CALLOC (allocator_data->memspace, new_size,
+			     allocator_data->pinned);
       if (ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -544,7 +553,8 @@ retry:
 	= (allocator_data
 	   ? allocator_data->memspace
 	   : predefined_alloc_mapping[allocator]);
-      ptr = MEMSPACE_CALLOC (memspace, new_size);
+      ptr = MEMSPACE_CALLOC (memspace, new_size,
+			     allocator_data && allocator_data->pinned);
       if (ptr == NULL)
 	goto fail;
     }
@@ -571,9 +581,9 @@ fail:
     {
     case omp_atv_default_mem_fb:
       if ((new_alignment > sizeof (void *) && new_alignment > alignment)
-	  || (allocator_data
-	      && allocator_data->pool_size < ~(uintptr_t) 0)
-	  || !allocator_data)
+	  || !allocator_data
+	  || allocator_data->pool_size < ~(uintptr_t) 0
+	  || allocator_data->pinned)
 	{
 	  allocator = omp_default_mem_alloc;
 	  goto retry;
@@ -710,9 +720,13 @@ retry:
 #endif
       if (prev_size)
 	new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr,
-				    data->size, new_size);
+				    data->size, new_size,
+				    (free_allocator_data
+				     && free_allocator_data->pinned),
+				    allocator_data->pinned);
       else
-	new_ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size);
+	new_ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size,
+				  allocator_data->pinned);
       if (new_ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -744,9 +758,13 @@ retry:
 	= (allocator_data
 	   ? allocator_data->memspace
 	   : predefined_alloc_mapping[allocator]);
-      new_ptr = MEMSPACE_REALLOC (memspace, data->ptr, data->size, new_size);
+      new_ptr = MEMSPACE_REALLOC (memspace, data->ptr, data->size, new_size,
+				  (free_allocator_data
+				   && free_allocator_data->pinned),
+				  allocator_data && allocator_data->pinned);
       if (new_ptr == NULL)
 	goto fail;
+
       ret = (char *) new_ptr + sizeof (struct omp_mem_header);
       ((struct omp_mem_header *) ret)[-1].ptr = new_ptr;
       ((struct omp_mem_header *) ret)[-1].size = new_size;
@@ -759,7 +777,8 @@ retry:
 	= (allocator_data
 	   ? allocator_data->memspace
 	   : predefined_alloc_mapping[allocator]);
-      new_ptr = MEMSPACE_ALLOC (memspace, new_size);
+      new_ptr = MEMSPACE_ALLOC (memspace, new_size,
+				allocator_data && allocator_data->pinned);
       if (new_ptr == NULL)
 	goto fail;
     }
@@ -802,9 +821,9 @@ fail:
     {
     case omp_atv_default_mem_fb:
       if (new_alignment > sizeof (void *)
-	  || (allocator_data
-	      && allocator_data->pool_size < ~(uintptr_t) 0)
-	  || !allocator_data)
+	  || !allocator_data
+	  || allocator_data->pool_size < ~(uintptr_t) 0
+	  || allocator_data->pinned)
 	{
 	  allocator = omp_default_mem_alloc;
 	  goto retry;
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
new file mode 100644
index 00000000000..5f3ae491f07
--- /dev/null
+++ b/libgomp/config/linux/allocator.c
@@ -0,0 +1,124 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Implement malloc routines that can handle pinned memory on Linux.
+   
+   It's possible to use mlock on any heap memory, but using munlock is
+   problematic if there are multiple pinned allocations on the same page.
+   Tracking all that manually would be possible, but adds overhead. This may
+   be worth it if there are a lot of small allocations getting pinned, but
+   this seems less likely in a HPC application.
+
+   Instead we optimize for large pinned allocations, and use mmap to ensure
+   that two pinned allocations don't share the same page.  This also means
+   that large allocations don't pin extra pages by being poorly aligned.  */
+
+#define _GNU_SOURCE
+#include <sys/mman.h>
+#include <string.h>
+#include "libgomp.h"
+
+static void *
+linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
+{
+  (void)memspace;
+
+  if (pin)
+    {
+      void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+      if (addr == MAP_FAILED)
+	return NULL;
+
+      if (mlock (addr, size))
+	{
+	  gomp_debug (0, "libgomp: failed to pin memory (ulimit too low?)\n");
+	  munmap (addr, size);
+	  return NULL;
+	}
+
+      return addr;
+    }
+  else
+    return malloc (size);
+}
+
+static void *
+linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
+{
+  if (pin)
+    return linux_memspace_alloc (memspace, size, pin);
+  else
+    return calloc (1, size);
+}
+
+static void
+linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
+		     int pin)
+{
+  (void)memspace;
+
+  if (pin)
+    munmap (addr, size);
+  else
+    free (addr);
+}
+
+static void *
+linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
+			size_t oldsize, size_t size, int oldpin, int pin)
+{
+  if (oldpin && pin)
+    {
+      void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
+      if (newaddr == MAP_FAILED)
+	return NULL;
+
+      return newaddr;
+    }
+  else if (oldpin || pin)
+    {
+      void *newaddr = linux_memspace_alloc (memspace, size, pin);
+      if (newaddr)
+	{
+	  memcpy (newaddr, addr, oldsize < size ? oldsize : size);
+	  linux_memspace_free (memspace, addr, oldsize, oldpin);
+	}
+
+      return newaddr;
+    }
+  else
+    return realloc (addr, size);
+}
+
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \
+  linux_memspace_alloc (MEMSPACE, SIZE, PIN)
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \
+  linux_memspace_calloc (MEMSPACE, SIZE, PIN)
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \
+  linux_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN)
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \
+  linux_memspace_free (MEMSPACE, ADDR, SIZE, PIN)
+
+#include "../../allocator.c"
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
index 6bc2ea48043..f740b97f6ac 100644
--- a/libgomp/config/nvptx/allocator.c
+++ b/libgomp/config/nvptx/allocator.c
@@ -358,13 +358,13 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
     return realloc (addr, size);
 }
 
-#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \
   nvptx_memspace_alloc (MEMSPACE, SIZE)
-#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \
   nvptx_memspace_calloc (MEMSPACE, SIZE)
-#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \
   nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
-#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \
   nvptx_memspace_free (MEMSPACE, ADDR, SIZE)
 
 #include "../../allocator.c"
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
new file mode 100644
index 00000000000..0a6360cda29
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
@@ -0,0 +1,81 @@
+/* { dg-do run } */
+
+/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
+
+/* Test that pinned memory works.  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+#endif
+
+#include <omp.h>
+
+/* Allocate more than a page each time, but stay within the ulimit.  */
+#define SIZE 10*1024
+
+int
+main ()
+{
+  const omp_alloctrait_t traits[] = {
+      { omp_atk_pinned, 1 }
+  };
+  omp_allocator_handle_t allocator = omp_init_allocator (omp_default_mem_space, 1, traits);
+
+  // Sanity check
+  if (get_pinned_mem () != 0)
+    abort ();
+
+  void *p = omp_alloc (SIZE, allocator);
+  if (!p)
+    abort ();
+
+  int amount = get_pinned_mem ();
+  if (amount == 0)
+    abort ();
+
+  p = omp_realloc (p, SIZE*2, allocator, allocator);
+
+  int amount2 = get_pinned_mem ();
+  if (amount2 <= amount)
+    abort ();
+
+  p = omp_calloc (1, SIZE, allocator);
+
+  if (get_pinned_mem () <= amount2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
new file mode 100644
index 00000000000..8fdb4ff5cfd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
@@ -0,0 +1,87 @@
+/* { dg-do run } */
+
+/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
+
+/* Test that pinned memory works (pool_size code path).  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+#endif
+
+#include <omp.h>
+
+/* Allocate more than a page each time, but stay within the ulimit.  */
+#define SIZE 10*1024
+
+int
+main ()
+{
+  const omp_alloctrait_t traits[] = {
+      { omp_atk_pinned, 1 },
+      { omp_atk_pool_size, SIZE*8 }
+  };
+  omp_allocator_handle_t allocator = omp_init_allocator (omp_default_mem_space,
+							 2, traits);
+
+  // Sanity check
+  if (get_pinned_mem () != 0)
+    abort ();
+
+  void *p = omp_alloc (SIZE, allocator);
+  if (!p)
+    abort ();
+
+  int amount = get_pinned_mem ();
+  if (amount == 0)
+    abort ();
+
+  p = omp_realloc (p, SIZE*2, allocator, allocator);
+  if (!p)
+    abort ();
+
+  int amount2 = get_pinned_mem ();
+  if (amount2 <= amount)
+    abort ();
+
+  p = omp_calloc (1, SIZE, allocator);
+  if (!p)
+    abort ();
+
+  if (get_pinned_mem () <= amount2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
new file mode 100644
index 00000000000..943dfea5c9b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
@@ -0,0 +1,125 @@
+/* { dg-do run } */
+
+/* Test that pinned memory fails correctly.  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+#include <sys/resource.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+
+void
+set_pin_limit (int size)
+{
+  struct rlimit limit;
+  if (getrlimit (RLIMIT_MEMLOCK, &limit))
+    abort ();
+  limit.rlim_cur = (limit.rlim_max < size ? limit.rlim_max : size);
+  if (setrlimit (RLIMIT_MEMLOCK, &limit))
+    abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+
+void
+set_pin_limit ()
+{
+}
+#endif
+
+#include <omp.h>
+
+/* This should be large enough to cover multiple pages.  */
+#define SIZE 10000*1024
+
+int
+main ()
+{
+  /* Pinned memory, no fallback.  */
+  const omp_alloctrait_t traits1[] = {
+      { omp_atk_pinned, 1 },
+      { omp_atk_fallback, omp_atv_null_fb }
+  };
+  omp_allocator_handle_t allocator1 = omp_init_allocator (omp_default_mem_space, 2, traits1);
+
+  /* Pinned memory, plain memory fallback.  */
+  const omp_alloctrait_t traits2[] = {
+      { omp_atk_pinned, 1 },
+      { omp_atk_fallback, omp_atv_default_mem_fb }
+  };
+  omp_allocator_handle_t allocator2 = omp_init_allocator (omp_default_mem_space, 2, traits2);
+
+  /* Ensure that the limit is smaller than the allocation.  */
+  set_pin_limit (SIZE/2);
+
+  // Sanity check
+  if (get_pinned_mem () != 0)
+    abort ();
+
+  // Should fail
+  void *p = omp_alloc (SIZE, allocator1);
+  if (p)
+    abort ();
+
+  // Should fail
+  p = omp_calloc (1, SIZE, allocator1);
+  if (p)
+    abort ();
+
+  // Should fall back
+  p = omp_alloc (SIZE, allocator2);
+  if (!p)
+    abort ();
+
+  // Should fall back
+  p = omp_calloc (1, SIZE, allocator2);
+  if (!p)
+    abort ();
+
+  // Should fail to realloc
+  void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
+  p = omp_realloc (notpinned, SIZE, allocator1, omp_default_mem_alloc);
+  if (!notpinned || p)
+    abort ();
+
+  // Should fall back to no realloc needed
+  p = omp_realloc (notpinned, SIZE, allocator2, omp_default_mem_alloc);
+  if (p != notpinned)
+    abort ();
+
+  // No memory should have been pinned
+  int amount = get_pinned_mem ();
+  if (amount != 0)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
new file mode 100644
index 00000000000..d9cb8dfe1fd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
@@ -0,0 +1,127 @@
+/* { dg-do run } */
+
+/* Test that pinned memory fails correctly, pool_size code path.  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+#include <sys/resource.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+
+void
+set_pin_limit (int size)
+{
+  struct rlimit limit;
+  if (getrlimit (RLIMIT_MEMLOCK, &limit))
+    abort ();
+  limit.rlim_cur = (limit.rlim_max < size ? limit.rlim_max : size);
+  if (setrlimit (RLIMIT_MEMLOCK, &limit))
+    abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+
+void
+set_pin_limit ()
+{
+}
+#endif
+
+#include <omp.h>
+
+/* This should be large enough to cover multiple pages.  */
+#define SIZE 10000*1024
+
+int
+main ()
+{
+  /* Pinned memory, no fallback.  */
+  const omp_alloctrait_t traits1[] = {
+      { omp_atk_pinned, 1 },
+      { omp_atk_fallback, omp_atv_null_fb },
+      { omp_atk_pool_size, SIZE*8 }
+  };
+  omp_allocator_handle_t allocator1 = omp_init_allocator (omp_default_mem_space, 3, traits1);
+
+  /* Pinned memory, plain memory fallback.  */
+  const omp_alloctrait_t traits2[] = {
+      { omp_atk_pinned, 1 },
+      { omp_atk_fallback, omp_atv_default_mem_fb },
+      { omp_atk_pool_size, SIZE*8 }
+  };
+  omp_allocator_handle_t allocator2 = omp_init_allocator (omp_default_mem_space, 3, traits2);
+
+  /* Ensure that the limit is smaller than the allocation.  */
+  set_pin_limit (SIZE/2);
+
+  // Sanity check
+  if (get_pinned_mem () != 0)
+    abort ();
+
+  // Should fail
+  void *p = omp_alloc (SIZE, allocator1);
+  if (p)
+    abort ();
+
+  // Should fail
+  p = omp_calloc (1, SIZE, allocator1);
+  if (p)
+    abort ();
+
+  // Should fall back
+  p = omp_alloc (SIZE, allocator2);
+  if (!p)
+    abort ();
+
+  // Should fall back
+  p = omp_calloc (1, SIZE, allocator2);
+  if (!p)
+    abort ();
+
+  // Should fail to realloc
+  void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
+  p = omp_realloc (notpinned, SIZE, allocator1, omp_default_mem_alloc);
+  if (!notpinned || p)
+    abort ();
+
+  // Should fall back to no realloc needed
+  p = omp_realloc (notpinned, SIZE, allocator2, omp_default_mem_alloc);
+  if (p != notpinned)
+    abort ();
+
+  // No memory should have been pinned
+  int amount = get_pinned_mem ();
+  if (amount != 0)
+    abort ();
+
+  return 0;
+}
  
Andrew Stubbs June 7, 2022, 11:05 a.m. UTC | #7
Following some feedback from users of the OG11 branch I think I need to 
withdraw this patch, for now.

The memory pinned via the mlock call does not give the expected 
performance boost. I had not expected that it would do much in my test 
setup, given that the machine has a lot of RAM and my benchmarks are 
small, but others have tried larger workloads on a variety of machines 
and architectures.

It seems that it isn't enough for the memory to be pinned, it has to be 
pinned using the Cuda API to get the performance boost. I had not done 
this because abstracting the code cleanly was difficult, and anyway the 
implementation was supposed to be device independent, but it seems we 
need a specific pinning mechanism for each device.

I will resubmit this patch with some kind of Cuda/plugin hook soonish, 
keeping the existing implementation for other device types. I don't know 
how that'll handle heterogenous systems, but those ought to be rare.

I don't think libmemkind will resolve this performance issue, although 
certainly it can be used for host implementations of low-latency 
memories, etc.

Andrew

On 13/01/2022 13:53, Andrew Stubbs wrote:
> On 05/01/2022 17:07, Andrew Stubbs wrote:
>> I don't believe 64KB will be anything like enough for any real HPC 
>> application. Is it really worth optimizing for this case?
>>
>> Anyway, I'm working on an implementation using mmap instead of malloc 
>> for pinned allocations. I figure that will simplify the unpin 
>> algorithm (because it'll be munmap) and optimize for large allocations 
>> such as I imagine HPC applications will use. It won't fix the ulimit 
>> issue.
> 
> Here's my new patch.
> 
> This version is intended to apply on top of the latest version of my 
> low-latency allocator patch, although the dependency is mostly textual.
> 
> Pinned memory is allocated via mmap + mlock, and allocation fails 
> (returns NULL) if the lock fails and there's no fallback configured.
> 
> This means that large allocations will now be page aligned and therefore 
> pin the smallest number of pages for the size requested, and that that 
> memory will be unpinned automatically when freed via munmap, or moved 
> via mremap.
> 
> Obviously this is not ideal for allocations much smaller than one page. 
> If that turns out to be a problem in the real world then we can add a 
> special case fairly straight-forwardly, and incur the extra page 
> tracking expense in those cases only, or maybe implement our own 
> pinned-memory heap (something like already proposed for low-latency 
> memory, perhaps).
> 
> Also new is a realloc implementation that works better when reallocation 
> fails. This is confirmed by the new testcases.
> 
> OK for stage 1?
> 
> Thanks
> 
> Andrew
  
Jakub Jelinek June 7, 2022, 12:10 p.m. UTC | #8
On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
> Following some feedback from users of the OG11 branch I think I need to
> withdraw this patch, for now.
> 
> The memory pinned via the mlock call does not give the expected performance
> boost. I had not expected that it would do much in my test setup, given that
> the machine has a lot of RAM and my benchmarks are small, but others have
> tried more and on varying machines and architectures.

I don't understand why there should be any expected performance boost (at
least not unless the machine starts swapping out pages),
{ omp_atk_pinned, true } is solely about the requirement that the memory
can't be swapped out.

> It seems that it isn't enough for the memory to be pinned, it has to be
> pinned using the Cuda API to get the performance boost. I had not done this

For performance boost of what kind of code?
I don't understand how Cuda API could be useful (or can be used at all) if
offloading to NVPTX isn't involved.  The fact that somebody asks for host
memory allocation with omp_atk_pinned set to true doesn't mean it will be
in any way related to NVPTX offloading (unless it is in NVPTX target region
obviously, but then mlock isn't available, so sure, if there is something
CUDA can provide for that case, nice).

> I don't think libmemkind will resolve this performance issue, although
> certainly it can be used for host implementations of low-latency memories,
> etc.

The reason for libmemkind is primarily its support of HBW memory (but
admittedly I need to find out what kind of such memory it does support),
or the various interleaving etc. the library has.
Plus, when we have such support, as it has its own customizable allocator,
it could be used to allocate larger chunks of memory that can be mlocked
and then just allocate from that pinned memory if user asks for small
allocations from that memory.

	Jakub
  
Andrew Stubbs June 7, 2022, 12:28 p.m. UTC | #9
On 07/06/2022 13:10, Jakub Jelinek wrote:
> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>> Following some feedback from users of the OG11 branch I think I need to
>> withdraw this patch, for now.
>>
>> The memory pinned via the mlock call does not give the expected performance
>> boost. I had not expected that it would do much in my test setup, given that
>> the machine has a lot of RAM and my benchmarks are small, but others have
>> tried more and on varying machines and architectures.
> 
> I don't understand why there should be any expected performance boost (at
> least not unless the machine starts swapping out pages),
> { omp_atk_pinned, true } is solely about the requirement that the memory
> can't be swapped out.

It seems like it takes a faster path through the NVidia drivers. This is 
a black box, for me, but that seems like a plausible explanation. The 
results are different on x86_64 and powerpc hosts (such as the Summit 
supercomputer).

>> It seems that it isn't enough for the memory to be pinned, it has to be
>> pinned using the Cuda API to get the performance boost. I had not done this
> 
> For performance boost of what kind of code?
> I don't understand how Cuda API could be useful (or can be used at all) if
> offloading to NVPTX isn't involved.  The fact that somebody asks for host
> memory allocation with omp_atk_pinned set to true doesn't mean it will be
> in any way related to NVPTX offloading (unless it is in NVPTX target region
> obviously, but then mlock isn't available, so sure, if there is something
> CUDA can provide for that case, nice).

This is specifically for NVPTX offload, of course, but then that's what 
our customer is paying for.

The expectation, from users, is that memory pinning will give the 
benefits specific to the active device. We can certainly make that 
happen when there is only one (flavour of) offload device present. I had 
hoped it could be one way for all, but it looks like not.

> 
>> I don't think libmemkind will resolve this performance issue, although
>> certainly it can be used for host implementations of low-latency memories,
>> etc.
> 
> The reason for libmemkind is primarily its support of HBW memory (but
> admittedly I need to find out what kind of such memory it does support),
> or the various interleaving etc. the library has.
> Plus, when we have such support, as it has its own costomizable allocator,
> it could be used to allocate larger chunks of memory that can be mlocked
> and then just allocate from that pinned memory if user asks for small
> allocations from that memory.

It should be straight-forward to switch the no-offload implementation to 
libmemkind when the time comes (the changes would be contained within 
config/linux/allocator.c), but I have no plans to do so myself (and no 
hardware to test it with). I'd prefer that it didn't impede the offload 
solution in the meantime.

Andrew
  
Jakub Jelinek June 7, 2022, 12:40 p.m. UTC | #10
On Tue, Jun 07, 2022 at 01:28:33PM +0100, Andrew Stubbs wrote:
> > For performance boost of what kind of code?
> > I don't understand how Cuda API could be useful (or can be used at all) if
> > offloading to NVPTX isn't involved.  The fact that somebody asks for host
> > memory allocation with omp_atk_pinned set to true doesn't mean it will be
> > in any way related to NVPTX offloading (unless it is in NVPTX target region
> > obviously, but then mlock isn't available, so sure, if there is something
> > CUDA can provide for that case, nice).
> 
> This is specifically for NVPTX offload, of course, but then that's what our
> customer is paying for.
> 
> The expectation, from users, is that memory pinning will give the benefits
> specific to the active device. We can certainly make that happen when there
> is only one (flavour of) offload device present. I had hoped it could be one
> way for all, but it looks like not.

I think that is just an expectation that isn't backed by anything in the
standard.
When users need something like that, it would be good to describe what
it is: memory that will be primarily used for interfacing with offloading
device 0 (or some specific device given by some number), memory that
can be used without remapping on some offloading device, or something else?
And when we know what exactly that is (e.g. what Cuda APIs or GCN APIs etc.
can provide), discuss on omp-lang whether there shouldn't be some standard
way to ask for such an allocator.  Or there is always the possibility of
extensions.  Not sure if one can just define ompx_atv_whatever, use some
large value for it (but the spec doesn't have a vendor range which would be
safe to use) and support it that way.

Plus a different thing is allocators in the offloading regions.
I think we should translate some omp_alloc etc. calls in such regions
when they use constant expression standard allocators to doing the
allocation through other means, or allocators.c can be overridden or
amended for the needs or possibilities of the offloading targets.

	Jakub
  
Thomas Schwinge June 9, 2022, 9:38 a.m. UTC | #11
Hi!

I'm not all too familiar with the "newish" CUDA Driver API, but maybe the
following is useful still:

On 2022-06-07T13:28:33+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> On 07/06/2022 13:10, Jakub Jelinek wrote:
>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>> Following some feedback from users of the OG11 branch I think I need to
>>> withdraw this patch, for now.
>>>
>>> The memory pinned via the mlock call does not give the expected performance
>>> boost. I had not expected that it would do much in my test setup, given that
>>> the machine has a lot of RAM and my benchmarks are small, but others have
>>> tried more and on varying machines and architectures.
>>
>> I don't understand why there should be any expected performance boost (at
>> least not unless the machine starts swapping out pages),
>> { omp_atk_pinned, true } is solely about the requirement that the memory
>> can't be swapped out.
>
> It seems like it takes a faster path through the NVidia drivers. This is
> a black box, for me, but that seems like a plausible explanation. The
> results are different on x86_64 and powerpc hosts (such as the Summit
> supercomputer).

For example, it's documented that 'cuMemHostAlloc',
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
"Allocates page-locked host memory".  The crucial thing, though, what
makes this different from 'malloc' plus 'mlock' is, that "The driver
tracks the virtual memory ranges allocated with this function and
automatically accelerates calls to functions such as cuMemcpyHtoD().
Since the memory can be accessed directly by the device, it can be read
or written with much higher bandwidth than pageable memory obtained with
functions such as malloc()".

Similar, for example, for 'cuMemAllocHost',
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.

This, to me, would explain why "the mlock call does not give the expected
performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
with 'mlock' you're missing the "tracks the virtual memory ranges"
aspect.
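In Driver API terms, the allocation path described here would look
something like the following untested sketch; it assumes the nvptx
plugin has already created and made current a CUDA context, and the
function names are invented for illustration:

```c
#include <cuda.h>
#include <stddef.h>

/* Hypothetical sketch: page-locked host allocation via the CUDA Driver
   API.  The driver tracks this range, so later cuMemcpyHtoD/DtoH calls
   on it take the accelerated path; it also sidesteps RLIMIT_MEMLOCK.  */
static void *
nvptx_pinned_alloc_sketch (size_t size)
{
  void *ptr;
  if (cuMemHostAlloc (&ptr, size, CU_MEMHOSTALLOC_PORTABLE) != CUDA_SUCCESS)
    return NULL;
  return ptr;
}

static void
nvptx_pinned_free_sketch (void *ptr)
{
  cuMemFreeHost (ptr);
}
```

CU_MEMHOSTALLOC_PORTABLE makes the pinning visible to all CUDA
contexts, which seems closest to what an OpenMP allocator would want; a
plain 0 flags value would pin for the allocating context only.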

Also, by means of the Nvidia Driver allocating the memory, I suppose
using this interface likely circumvents any "annoying" 'ulimit'
limitations?  I get this impression, because documentation continues
stating that "Allocating excessive amounts of memory with
cuMemAllocHost() may degrade system performance, since it reduces the
amount of memory available to the system for paging.  As a result, this
function is best used sparingly to allocate staging areas for data
exchange between host and device".

>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>> pinned using the Cuda API to get the performance boost.
>>
>> For performance boost of what kind of code?
>> I don't understand how Cuda API could be useful (or can be used at all) if
>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>> obviously, but then mlock isn't available, so sure, if there is something
>> CUDA can provide for that case, nice).
>
> This is specifically for NVPTX offload, of course, but then that's what
> our customer is paying for.
>
> The expectation, from users, is that memory pinning will give the
> benefits specific to the active device. We can certainly make that
> happen when there is only one (flavour of) offload device present. I had
> hoped it could be one way for all, but it looks like not.

Aren't there CUDA Driver interfaces for that?  That is:

>>> I had not done this
>>> this because it was difficult to resolve the code abstraction
>>> difficulties and anyway the implementation was supposed to be device
>>> independent, but it seems we need a specific pinning mechanism for each
>>> device.

If not directly *allocating and registering* such memory via
'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
*register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
"Page-locks the memory range specified [...] and maps it for the
device(s) [...].  This memory range also is added to the same tracking
mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
manual 'mlock'ing involved in that case, too; presumably again using this
interface likely circumvents any "annoying" 'ulimit' limitations?)

Such a *register* abstraction can then be implemented by all the libgomp
offloading plugins: they just call the respective
CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
memory.
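Such a hook might be sketched as below; the names are invented for
illustration (they follow the GOMP_OFFLOAD_* convention of the existing
plugin interface but are not part of it):

```c
#include <cuda.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical plugin entry point: page-lock and register an existing
   host range so the driver treats it like cuMemHostAlloc'ed memory.  */
bool
GOMP_OFFLOAD_register_pinned_sketch (void *ptr, size_t size)
{
  return cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
	 == CUDA_SUCCESS;
}

bool
GOMP_OFFLOAD_unregister_pinned_sketch (void *ptr)
{
  return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
}
```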

..., but maybe I'm missing some crucial "detail" here?


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Tobias Burnus June 9, 2022, 10:09 a.m. UTC | #12
On 09.06.22 11:38, Thomas Schwinge wrote:
> On 2022-06-07T13:28:33+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 07/06/2022 13:10, Jakub Jelinek wrote:
>>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>>> The memory pinned via the mlock call does not give the expected performance
>>>> boost. I had not expected that it would do much in my test setup, given that
>>>> the machine has a lot of RAM and my benchmarks are small, but others have
>>>> tried more and on varying machines and architectures.
>>> I don't understand why there should be any expected performance boost (at
>>> least not unless the machine starts swapping out pages),
>>> { omp_atk_pinned, true } is solely about the requirement that the memory
>>> can't be swapped out.
>> It seems like it takes a faster path through the NVidia drivers. [...]

I think this conflates two parts:

* User-defined allocators in general – there CUDA does not make much
sense, and without unified-shared memory the allocation will always be
inaccessible on the device (w/o explicit/implicit mapping).

* Memory which is supposed to be accessible both on the host and on the
device. That's most obvious by  explicitly allocating to be accessible
on both – it is less clear cut when just creating an allocator with
unified-shared memory, as it is not clear when it is only used on the
host (e.g. with host-based thread parallelization) – and when it is also
relevant for the device.

Currently, the user has no means to express the intent that it should be
accessible on both the host and one/several devices, except for 'omp
requires unified_shared_memory'.

The next OpenMP version will likely permit a means to create an
allocator which permits this →
https://github.com/OpenMP/spec/issues/1843 (not publicly available;
slides (last comment) are slightly outdated).

  * * *

The question is only what to do with 'requires unified_shared_memory' –
and a non-multi-device allocator.

Probably: unified_shared_memory or no nvptx device: just use mlock.
Otherwise (i.e. both nvptx device and (unified_shared_memory or a
multi-device-allocator)), use the CUDA one.

For the latter, I think Thomas' remarks are helpful.

Tobias

  
Stubbs, Andrew June 9, 2022, 10:22 a.m. UTC | #13
> The question is only what to do with 'requires unified_shared_memory' –
> and a non-multi-device allocator.

The compiler emits an error at compile time if you attempt to use both -foffload-memory=pinned and USM, because they're not compatible. You're fine to use both explicit allocators in the same program, but the "pinnedness" of USM allocations is a matter for Cuda to care about (cudaMallocManaged) and has nothing to do with this discussion.

The OpenMP pinned memory feature is intended to accelerate normal mappings, as far as I can tell.

Andrew
  
Stubbs, Andrew June 9, 2022, 10:31 a.m. UTC | #14
> For example, it's documented that 'cuMemHostAlloc',
> <https://docs.nvidia.com/cuda/cuda-driver-
> api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b
> 9>,
> "Allocates page-locked host memory".  The crucial thing, though, what
> makes this different from 'malloc' plus 'mlock' is, that "The driver
> tracks the virtual memory ranges allocated with this function and
> automatically accelerates calls to functions such as cuMemcpyHtoD().
> Since the memory can be accessed directly by the device, it can be read
> or written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".

OK, interesting. I had not seen this, but I think it confirms that the performance difference comes from within Cuda, and that regular locked memory does not gain so much.

> Also, by means of the Nvidia Driver allocating the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?

Yes, this is the case.

> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> <https://docs.nvidia.com/cuda/cuda-driver-
> api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b5422
> 3>:
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing involved in that case, too; presumably again using this
> interface likely circumvents any "annoying" 'ulimit' limitations?)
> 
> Such a *register* abstraction can then be implemented by all the libgomp
> offloading plugins: they just call the respective
> CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> memory.
> 
> ..., but maybe I'm missing some crucial "detail" here?

I'm investigating this stuff for the AMD USM implementation as well right now. It might be a good way to handle static and stack data too. Or not.

Andrew
  
Thomas Schwinge Feb. 10, 2023, 3:11 p.m. UTC | #15
Hi!

Re OpenMP 'pinned' memory allocator trait semantics vs. 'omp_realloc':

On 2022-01-13T13:53:03+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
> On 05/01/2022 17:07, Andrew Stubbs wrote:
>> [...], I'm working on an implementation using mmap instead of malloc
>> for pinned allocations.  [...]

> This means that large allocations will now be page aligned and therefore
> pin the smallest number of pages for the size requested, and that that
> memory will be unpinned automatically when freed via munmap, or moved
> via mremap.

> --- /dev/null
> +++ b/libgomp/config/linux/allocator.c

> +static void *
> +linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
> +                     size_t oldsize, size_t size, int oldpin, int pin)
> +{
> +  if (oldpin && pin)
> +    {
> +      void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
> +      if (newaddr == MAP_FAILED)
> +     return NULL;
> +
> +      return newaddr;
> +    }
> +  else if (oldpin || pin)
> +    {
> +      void *newaddr = linux_memspace_alloc (memspace, size, pin);
> +      if (newaddr)
> +     {
> +       memcpy (newaddr, addr, oldsize < size ? oldsize : size);
> +       linux_memspace_free (memspace, addr, oldsize, oldpin);
> +     }
> +
> +      return newaddr;
> +    }
> +  else
> +    return realloc (addr, size);
> +}

I did wonder if 'mremap' with 'MREMAP_MAYMOVE' is really acceptable here,
given OpenMP 5.2, 6.2 "Memory Allocators": "Allocators with the 'pinned'
trait defined to be 'true' ensure that their allocations remain in the
same storage resource at the same location for their entire lifetime."
I'd have read into this that 'realloc' may shrink or enlarge the region
(unless even that considered faulty), but the region must not be moved
("same location"), thus no 'MREMAP_MAYMOVE'; see 'man 2 mremap'
(2019-03-06):

    'MREMAP_MAYMOVE'
        By  default, if there is not sufficient space to expand a mapping at its current location, then 'mremap()' fails.  If this flag is specified, then the kernel is permitted to relocate the mapping to a new virtual address, if necessary.  If the mapping is relocated, then absolute pointers into the old mapping location become invalid (offsets relative to the starting address of the mapping should be employed).

..., but then I saw that OpenMP 5.2, 18.13.9 'omp_realloc' is specified
such that it isn't expected to 'realloc' in-place, but rather it
"deallocates previously allocated memory and requests a memory
allocation", which I understand that it does end a "lifetime" and then
establish a new "lifetime", which means that 'MREMAP_MAYMOVE' in fact is
fine (as implemented)?
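The difference is easy to demonstrate in a small Linux-only experiment
(a sketch, not from the patch): without MREMAP_MAYMOVE the kernel will
only grow a mapping in place, while with the flag it may relocate it,
which is what the "lifetime ends at omp_realloc" reading permits:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Returns NEWSIZE if a fresh OLDSIZE mapping could be grown to NEWSIZE
   (in place, or relocated when ALLOW_MOVE), 0 on failure.  */
static size_t
try_grow (size_t oldsize, size_t newsize, int allow_move)
{
  void *p = mmap (NULL, oldsize, PROT_READ | PROT_WRITE,
		  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return 0;
  /* Without MREMAP_MAYMOVE the kernel must grow in place and may fail
     with ENOMEM; with it, "same location" is no longer guaranteed.  */
  void *q = mremap (p, oldsize, newsize, allow_move ? MREMAP_MAYMOVE : 0);
  if (q == MAP_FAILED)
    {
      munmap (p, oldsize);
      return 0;
    }
  munmap (q, newsize);
  return newsize;
}
```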


Further I read in 'man 2 mremap' (2019-03-06):

    If  the  memory segment specified by *old_address* and *old_size* is locked (using 'mlock(2)' or similar), then this lock is maintained when the segment is resized and/or relocated.  As a consequence, the amount of memory locked by the process may change.

(The current proposed code evidently does make use of that; OK.)

But then in 'NOTES' I read:

    If 'mremap()' is used to move or expand an area locked with 'mlock(2)' or equivalent, the 'mremap()' call will make a best effort to populate the new area but will not fail with 'ENOMEM' if the area cannot be populated.

What exactly is that supposed to tell us: "will make a best effort [...]
but will not fail"?  Isn't that in conflict with the earlier statement?
So can we rely on 'mremap' together with 'mlock' or not?


(This topic remains valid even if we follow through the idea of using
CUDA to register page-locked memory, because that's not available in all
configurations, and we then still want to do the 'mmap'/'mlock' thing, I
suppose.)


Grüße
 Thomas
  
Andrew Stubbs Feb. 10, 2023, 3:55 p.m. UTC | #16
On 10/02/2023 15:11, Thomas Schwinge wrote:
> Hi!
> 
> Re OpenMP 'pinned' memory allocator trait semantics vs. 'omp_realloc':
> 
> On 2022-01-13T13:53:03+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 05/01/2022 17:07, Andrew Stubbs wrote:
>>> [...], I'm working on an implementation using mmap instead of malloc
>>> for pinned allocations.  [...]
> 
>> This means that large allocations will now be page aligned and therefore
>> pin the smallest number of pages for the size requested, and that that
>> memory will be unpinned automatically when freed via munmap, or moved
>> via mremap.
> 
>> --- /dev/null
>> +++ b/libgomp/config/linux/allocator.c
> 
>> +static void *
>> +linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>> +                     size_t oldsize, size_t size, int oldpin, int pin)
>> +{
>> +  if (oldpin && pin)
>> +    {
>> +      void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
>> +      if (newaddr == MAP_FAILED)
>> +     return NULL;
>> +
>> +      return newaddr;
>> +    }
>> +  else if (oldpin || pin)
>> +    {
>> +      void *newaddr = linux_memspace_alloc (memspace, size, pin);
>> +      if (newaddr)
>> +     {
>> +       memcpy (newaddr, addr, oldsize < size ? oldsize : size);
>> +       linux_memspace_free (memspace, addr, oldsize, oldpin);
>> +     }
>> +
>> +      return newaddr;
>> +    }
>> +  else
>> +    return realloc (addr, size);
>> +}
> 
> I did wonder if 'mremap' with 'MREMAP_MAYMOVE' is really acceptable here,
> given OpenMP 5.2, 6.2 "Memory Allocators": "Allocators with the 'pinned'
> trait defined to be 'true' ensure that their allocations remain in the
> same storage resource at the same location for their entire lifetime."
> I'd have read into this that 'realloc' may shrink or enlarge the region
> (unless even that considered faulty), but the region must not be moved
> ("same location"), thus no 'MREMAP_MAYMOVE'; see 'man 2 mremap'

I don't think the OpenMP specification really means that any program 
using omp_realloc should abort randomly depending on the vagaries of 
chaos. What are we supposed to do instead? Hugely over-allocate in case 
realloc is ever called?

Andrew
  

Patch

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index b1f5fe0a5e2..671b91e7ff8 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -51,6 +51,25 @@ 
 #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
   ((void)MEMSPACE, (void)SIZE, free (ADDR))
 #endif
+#ifndef MEMSPACE_PIN
+/* Only define this on supported host platforms.  */
+#ifdef __linux__
+#define MEMSPACE_PIN(MEMSPACE, ADDR, SIZE) \
+  ((void)MEMSPACE, xmlock (ADDR, SIZE))
+
+#include <sys/mman.h>
+#include <stdio.h>
+void
+xmlock (void *addr, size_t size)
+{
+  if (mlock (addr, size))
+    perror ("libgomp: failed to pin memory (ulimit too low?)");
+}
+#else
+#define MEMSPACE_PIN(MEMSPACE, ADDR, SIZE) \
+  ((void)MEMSPACE, (void)ADDR, (void)SIZE)
+#endif
+#endif
 
 /* Map the predefined allocators to the correct memory space.
    The index to this table is the omp_allocator_handle_t enum value.  */
@@ -212,7 +231,7 @@  omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
     data.alignment = sizeof (void *);
 
   /* No support for these so far (for hbw will use memkind).  */
-  if (data.pinned || data.memspace == omp_high_bw_mem_space)
+  if (data.memspace == omp_high_bw_mem_space)
     return omp_null_allocator;
 
   ret = gomp_malloc (sizeof (struct omp_allocator_data));
@@ -326,6 +345,9 @@  retry:
 #endif
 	  goto fail;
 	}
+
+      if (allocator_data->pinned)
+	MEMSPACE_PIN (allocator_data->memspace, ptr, new_size);
     }
   else
     {
@@ -335,6 +357,9 @@  retry:
       ptr = MEMSPACE_ALLOC (memspace, new_size);
       if (ptr == NULL)
 	goto fail;
+
+      if (allocator_data && allocator_data->pinned)
+	MEMSPACE_PIN (allocator_data->memspace, ptr, new_size);
     }
 
   if (new_alignment > sizeof (void *))
@@ -539,6 +564,9 @@  retry:
 #endif
 	  goto fail;
 	}
+
+      if (allocator_data->pinned)
+	MEMSPACE_PIN (allocator_data->memspace, ptr, new_size);
     }
   else
     {
@@ -548,6 +576,9 @@  retry:
       ptr = MEMSPACE_CALLOC (memspace, new_size);
       if (ptr == NULL)
 	goto fail;
+
+      if (allocator_data && allocator_data->pinned)
+	MEMSPACE_PIN (allocator_data->memspace, ptr, new_size);
     }
 
   if (new_alignment > sizeof (void *))
@@ -727,7 +758,11 @@  retry:
 #endif
 	  goto fail;
 	}
-      else if (prev_size)
+
+      if (allocator_data->pinned)
+	MEMSPACE_PIN (allocator_data->memspace, new_ptr, new_size);
+
+      if (prev_size)
 	{
 	  ret = (char *) new_ptr + sizeof (struct omp_mem_header);
 	  ((struct omp_mem_header *) ret)[-1].ptr = new_ptr;
@@ -747,6 +782,10 @@  retry:
       new_ptr = MEMSPACE_REALLOC (memspace, data->ptr, data->size, new_size);
       if (new_ptr == NULL)
 	goto fail;
+
+      if (allocator_data && allocator_data->pinned)
+	MEMSPACE_PIN (allocator_data->memspace, new_ptr, new_size);
+
       ret = (char *) new_ptr + sizeof (struct omp_mem_header);
       ((struct omp_mem_header *) ret)[-1].ptr = new_ptr;
       ((struct omp_mem_header *) ret)[-1].size = new_size;
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
new file mode 100644
index 00000000000..0a6360cda29
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
@@ -0,0 +1,81 @@ 
+/* { dg-do run } */
+
+/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
+
+/* Test that pinned memory works.  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+#endif
+
+#include <omp.h>
+
+/* Allocate more than a page each time, but stay within the ulimit.  */
+#define SIZE 10*1024
+
+int
+main ()
+{
+  const omp_alloctrait_t traits[] = {
+      { omp_atk_pinned, 1 }
+  };
+  omp_allocator_handle_t allocator = omp_init_allocator (omp_default_mem_space, 1, traits);
+
+  // Sanity check
+  if (get_pinned_mem () != 0)
+    abort ();
+
+  void *p = omp_alloc (SIZE, allocator);
+  if (!p)
+    abort ();
+
+  int amount = get_pinned_mem ();
+  if (amount == 0)
+    abort ();
+
+  p = omp_realloc (p, SIZE*2, allocator, allocator);
+
+  int amount2 = get_pinned_mem ();
+  if (amount2 <= amount)
+    abort ();
+
+  p = omp_calloc (1, SIZE, allocator);
+
+  if (get_pinned_mem () <= amount2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
new file mode 100644
index 00000000000..8fdb4ff5cfd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
@@ -0,0 +1,87 @@ 
+/* { dg-do run } */
+
+/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
+
+/* Test that pinned memory works (pool_size code path).  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+#endif
+
+#include <omp.h>
+
+/* Allocate more than a page each time, but stay within the ulimit.  */
+#define SIZE 10*1024
+
+int
+main ()
+{
+  const omp_alloctrait_t traits[] = {
+      { omp_atk_pinned, 1 },
+      { omp_atk_pool_size, SIZE*8 }
+  };
+  omp_allocator_handle_t allocator = omp_init_allocator (omp_default_mem_space,
+							 2, traits);
+
+  // Sanity check
+  if (get_pinned_mem () != 0)
+    abort ();
+
+  void *p = omp_alloc (SIZE, allocator);
+  if (!p)
+    abort ();
+
+  int amount = get_pinned_mem ();
+  if (amount == 0)
+    abort ();
+
+  p = omp_realloc (p, SIZE*2, allocator, allocator);
+  if (!p)
+    abort ();
+
+  int amount2 = get_pinned_mem ();
+  if (amount2 <= amount)
+    abort ();
+
+  p = omp_calloc (1, SIZE, allocator);
+  if (!p)
+    abort ();
+
+  if (get_pinned_mem () <= amount2)
+    abort ();
+
+  return 0;
+}