pthread wastes memory with mlockall(MCL_FUTURE)

Message ID 20150918102734.GA27881@eper
State Superseded

Commit Message

Balazs Kezes Sept. 18, 2015, 10:27 a.m. UTC
  Hi!

I've run into the following problem: Whenever a new thread is created,
pthread creates some guard pages next to its stack. These guard pages
are usually empty zero pages, and have all their permissions removed --
nothing can read/write/execute on these pages.
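
Roughly speaking (a simplified sketch, not glibc's actual code; the
helper name is made up), the allocation maps the whole region
read/write and then switches the guard pages to PROT_NONE:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *
alloc_stack_with_guard (size_t stacksize, size_t guardsize)
{
  size_t total = stacksize + guardsize;
  void *mem = mmap (NULL, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED)
    return NULL;

  /* Revoke all access on the guard pages at the low end.  */
  if (mprotect (mem, guardsize, PROT_NONE) != 0)
    {
      munmap (mem, total);
      return NULL;
    }

  /* The usable stack starts above the guard.  */
  return (char *) mem + guardsize;
}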

The problem is that the application I use has a large number of threads
and uses mlockall(MCL_FUTURE), so this messes up the application's
memory usage accounting (RSS), which then leads to wasted memory.

Would it make sense for glibc to munlock these pages? I'm thinking
something like this (although I haven't tested it yet):


Thanks!
  

Comments

Rich Felker Sept. 18, 2015, 2:38 p.m. UTC | #1
On Fri, Sep 18, 2015 at 11:27:34AM +0100, Balazs Kezes wrote:
> Hi!
> 
> I've run into the following problem: Whenever a new thread is created,
> pthread creates some guard pages next to its stack. These guard pages
> are usually empty zero pages, and have all their permissions removed --
> nothing can read/write/execute on these pages.
> 
> The problem is that the application I use has a large number of threads
> and uses mlockall(MCL_FUTURE), so this messes up the application's
> memory usage accounting (RSS), which then leads to wasted memory.
> 
> Would it make sense for glibc to munlock these pages? I'm thinking
> something like this (although I haven't tested it yet):
> 
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index 753da61..1fc715c 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -659,6 +659,11 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>  
>  	      return errno;
>  	    }
> +	  /* The guard pages shouldn't be locked into memory.  Otherwise a
> +	     lot of memory would be wasted unnecessarily if the process has
> +	     many threads and mlockall(MCL_FUTURE) in effect.  We ignore
> +	     errors because we can't do anything about them anyway.  */
> +	  (void) munlock (guard, guardsize);

I would say it's a kernel bug for PROT_NONE pages to actually occupy
resources when locked, if they actually do? How did you test/measure
this?

Rich
  
Balazs Kezes Sept. 18, 2015, 4:38 p.m. UTC | #2
On 2015-09-18 10:38 -0400, Rich Felker wrote:
> I would say it's a kernel bug for PROT_NONE pages to actually occupy
> resources when locked, if they actually do?

It could make sense to have some pages with data in them and then, at a
later stage, remove their permissions to trap data accesses. I think
some debuggers (e.g. for watchpoints) or some cache simulators work
this way.

> How did you test/measure this?

There's a pthread_attr_setguardsize function. Basically I set the guard
size to a large value (to amplify the effect of this behavior), created
a bunch of threads, and checked what happens. I could try to create a
simple repro if needed.
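
For concreteness, a hypothetical sketch of the kind of repro I mean
(untested; the guard size and thread count are made up, and it needs a
generous RLIMIT_MEMLOCK or root because of mlockall):

// gcc -Wall -Wextra -std=c99 guardlock.c -o guardlock -pthread
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

static void *idle(void *arg)
{
	(void) arg;
	pause();
	return NULL;
}

int main(void)
{
	int r = mlockall(MCL_CURRENT | MCL_FUTURE);
	assert(r == 0);

	pthread_attr_t attr;
	pthread_attr_init(&attr);
	// 64 MiB guard per thread, to make the effect easy to see.
	pthread_attr_setguardsize(&attr, 64 << 20);

	for (int i = 0; i < 16; i++) {
		pthread_t t;
		r = pthread_create(&t, &attr, idle, NULL);
		assert(r == 0);
	}

	// Inspect RSS in htop or /proc/self/status in the meantime.
	sleep(100);
	return 0;
}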
  
Rich Felker Sept. 18, 2015, 5:08 p.m. UTC | #3
On Fri, Sep 18, 2015 at 05:38:42PM +0100, Balazs Kezes wrote:
> On 2015-09-18 10:38 -0400, Rich Felker wrote:
> > I would say it's a kernel bug for PROT_NONE pages to actually occupy
> > resources when locked, if they actually do?
> 
> It could make sense to have some pages with data in them and then, at a
> later stage, remove their permissions to trap data accesses. I think
> some debuggers (e.g. for watchpoints) or some cache simulators work
> this way.

I'm talking about new PROT_NONE pages. The kernel certainly accounts
for them differently as commit charge. New PROT_NONE pages consume no
commit charge. Anonymous pages with data in them, which would become
available again if you mprotect them readable, do consume commit
charge. (For this reason, you have to mmap MAP_FIXED+PROT_NONE to
uncommit memory rather than just using mprotect PROT_NONE, even if you
already used madvise MADV_DONTNEED on it.)
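
Spelled out as code, a minimal sketch of those two approaches (the
helper names are made up):

#include <stddef.h>
#include <sys/mman.h>

/* Keeps the commit charge: the data would still be there if the range
   were later made readable again.  */
static int
hide_but_keep_committed (void *p, size_t len)
{
  return mprotect (p, len, PROT_NONE);
}

/* Gives the commit charge back: the old pages are thrown away and
   replaced by a fresh PROT_NONE mapping.  */
static int
uncommit (void *p, size_t len)
{
  void *q = mmap (p, len, PROT_NONE,
                  MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return q == MAP_FAILED ? -1 : 0;
}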

> > How did you test/measure this?
> 
> There's a pthread_attr_setguardsize function. Basically I set the guard
> size to a large value (to amplify the effect of this behavior), created
> a bunch of threads, and checked what happens. I could try to create a
> simple repro if needed.

But were you able to measure them actually using physical resources? Or
might it possibly just be bad accounting? Utilities like 'top' are
notorious for misrepresenting the memory usage of a process.

Rich
  
Balazs Kezes Sept. 18, 2015, 7:29 p.m. UTC | #4
On 2015-09-18 13:08 -0400, Rich Felker wrote:
> I'm talking about new PROT_NONE pages.

That's not how pthread does the allocation: it mmaps the region
read/write first, and only then mprotects the guard pages with
PROT_NONE.

> The kernel certainly accounts for them differently as commit charge.
> New PROT_NONE pages consume no commit charge. Anonymous pages with
> data in them, which would become available again if you mprotect them
> readable, do consume commit charge. (For this reason, you have to mmap
> MAP_FIXED+PROT_NONE to uncommit memory rather than just using mprotect
> PROT_NONE, even if you already used madvise MADV_DONTNEED on it.)

So while working on the repro I've looked deeper and created a simple
app which demonstrates the mmap behavior:

// gcc -Wall -Wextra -std=c99 mapping.c -o mapping
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int r;
	r = mlockall(MCL_CURRENT | (getenv("M") ? MCL_FUTURE : 0));
	assert(r == 0);

	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *mem = mmap(NULL, 8LL << 30, PROT_WRITE, flags, -1, 0);
	assert(mem != MAP_FAILED);
	sleep(100);

	return 0;
}

All it does is mmap some memory, and if I have the envvar M set then it
also does the mlocking part. When I run this application without
mlocking it barely uses any RSS. However, when I set M I can see in htop
that RSS is 8 GB and that "cat /proc/meminfo | grep MemAvailable" shows
8 GB less memory. In fact, when I look at the number of minor page
faults I get this:

	$ /usr/bin/time -f %R ./mapping
	102
	$ M=1 /usr/bin/time -f %R ./mapping
	4709

So I think the kernel preallocates all the memory in this case.

However if I set the protection to PROT_NONE then the kernel doesn't do
the preallocation.

Interestingly, it does *not* preallocate even if I mmap with PROT_NONE
first and then do a mprotect(mem, 8LL<<30, PROT_WRITE). I do see the
page faults if I do a memset(mem, 0, 8LL<<30) afterwards, though.
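
A sketch of that variant, along the same lines as the program above
(untested as written here; the TOUCH environment variable is just for
illustration):

// gcc -Wall -Wextra -std=c99 mapping2.c -o mapping2
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int r;
	r = mlockall(MCL_CURRENT | (getenv("M") ? MCL_FUTURE : 0));
	assert(r == 0);

	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	// Map with no access first...
	void *mem = mmap(NULL, 8LL << 30, PROT_NONE, flags, -1, 0);
	assert(mem != MAP_FAILED);

	// ...then grant write access. No preallocation happens here,
	// even with MCL_FUTURE in effect.
	r = mprotect(mem, 8LL << 30, PROT_WRITE);
	assert(r == 0);

	// The minor page faults only show up once the memory is touched.
	if (getenv("TOUCH"))
		memset(mem, 0, 8LL << 30);

	sleep(100);
	return 0;
}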

So here's what I think pthreads should do: first mmap with PROT_NONE,
and only then mprotect the stack pages read/write.

Does that sound reasonable?

Thanks!
  
Rich Felker Sept. 18, 2015, 7:45 p.m. UTC | #5
On Fri, Sep 18, 2015 at 08:29:52PM +0100, Balazs Kezes wrote:
> On 2015-09-18 13:08 -0400, Rich Felker wrote:
> > I'm talking about new PROT_NONE pages.
> 
> That's not how pthread does the allocation: it mmaps the region
> read/write first, and only then mprotects the guard pages with
> PROT_NONE.

Ah, that explains it then. I did the opposite in musl for exactly this
reason: first mmap PROT_NONE then mprotect the non-guard part
PROT_READ|PROT_WRITE.
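
A simplified sketch of that ordering, the mirror image of the
allocation shown earlier in the thread (not musl's actual code; the
helper name is made up):

#include <stddef.h>
#include <sys/mman.h>

static void *
map_stack_none_first (size_t stacksize, size_t guardsize)
{
  size_t total = stacksize + guardsize;

  /* Map the whole region with no access; nothing is populated yet.  */
  void *mem = mmap (NULL, total, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED)
    return NULL;

  /* Grant access to the non-guard part only; the guard pages stay
     PROT_NONE for their whole lifetime.  */
  if (mprotect ((char *) mem + guardsize, stacksize,
                PROT_READ | PROT_WRITE) != 0)
    {
      munmap (mem, total);
      return NULL;
    }

  return (char *) mem + guardsize;
}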

> > The kernel certainly accounts for them differently as commit charge.
> > New PROT_NONE pages consume no commit charge. Anonymous pages with
> > data in them, which would become available again if you mprotect them
> > readable, do consume commit charge. (For this reason, you have to mmap
> > MAP_FIXED+PROT_NONE to uncommit memory rather than just using mprotect
> > PROT_NONE, even if you already used madvise MADV_DONTNEED on it.)
> 
> So while working on the repro I've looked deeper and created a simple
> app which demonstrates the mmap behavior:
> 
> // gcc -Wall -Wextra -std=c99 mapping.c -o mapping
> #define _GNU_SOURCE
> #include <assert.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> int main(void)
> {
> 	int r;
> 	r = mlockall(MCL_CURRENT | (getenv("M") ? MCL_FUTURE : 0));
> 	assert(r == 0);
> 
> 	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
> 	void *mem = mmap(NULL, 8LL << 30, PROT_WRITE, flags, -1, 0);
> 	assert(mem != MAP_FAILED);
> 	sleep(100);
> 
> 	return 0;
> }
> 
> All it does is mmap some memory, and if I have the envvar M set then it
> also does the mlocking part. When I run this application without
> mlocking it barely uses any RSS. However, when I set M I can see in htop
> that RSS is 8 GB and that "cat /proc/meminfo | grep MemAvailable" shows
> 8 GB less memory. In fact, when I look at the number of minor page
> faults I get this:
> 
> 	$ /usr/bin/time -f %R ./mapping
> 	102
> 	$ M=1 /usr/bin/time -f %R ./mapping
> 	4709
> 
> So I think the kernel preallocates all the memory in this case.
> 
> However if I set the protection to PROT_NONE then the kernel doesn't do
> the preallocation.
> 
> Interestingly, it does *not* preallocate even if I mmap with PROT_NONE
> first and then do a mprotect(mem, 8LL<<30, PROT_WRITE). I do see the
> page faults if I do a memset(mem, 0, 8LL<<30) afterwards, though.
> 
> So here's what I think pthreads should do: first mmap with PROT_NONE,
> and only then mprotect the stack pages read/write.
> 
> Does that sound reasonable?

Yes.

Rich
  

Patch

diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 753da61..1fc715c 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -659,6 +659,11 @@  allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 
 	      return errno;
 	    }
+	  /* The guard pages shouldn't be locked into memory.  Otherwise a
+	     lot of memory would be wasted unnecessarily if the process has
+	     many threads and mlockall(MCL_FUTURE) in effect.  We ignore
+	     errors because we can't do anything about them anyway.  */
+	  (void) munlock (guard, guardsize);
 
 	  pd->guardsize = guardsize;
 	}