Add LD_NUMA_REPLICATION for glibc

Message ID 20210903121434.12162-1-shijie@os.amperecomputing.com
State Rejected
Headers
Series Add LD_NUMA_REPLICATION for glibc |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Huang Shijie Sept. 3, 2021, 12:14 p.m. UTC
  This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.

If LD_NUMA_REPLICATION is set for program foo like this:
	#LD_NUMA_REPLICATION=1 ./foo

At the time ld.so mmaps the shared libraries, it will uses
	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
for them, and the mmap will trigger COW(copy on write) for the shared libraries
at the NUMA node which the program `foo` runs. After the COW, the foo will have a copy of
the shared library segment(mmap covered) which belong to the same NUMA node.

So when enable LD_NUMA_REPLICATION, it will consume more memory,
but it will reduce the remote-access in NUMA.

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 elf/dl-map-segments.h      | 28 ++++++++++++++++++++++++----
 elf/dl-support.c           |  4 ++++
 elf/rtld.c                 |  4 ++++
 sysdeps/generic/ldsodefs.h |  4 ++++
 4 files changed, 36 insertions(+), 4 deletions(-)
  

Comments

Florian Weimer Sept. 3, 2021, 6:28 a.m. UTC | #1
* Huang Shijie via Libc-alpha:

> This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
>
> If LD_NUMA_REPLICATION is set for program foo like this:
> 	#LD_NUMA_REPLICATION=1 ./foo
>
> At the time ld.so mmaps the shared libraries, it will uses
> 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> for them, and the mmap will trigger COW(copy on write) for the shared
> libraries at the NUMA node which the program `foo` runs. After the
> COW, the foo will have a copy of the shared library segment(mmap
> covered) which belong to the same NUMA node.
>
> So when enable LD_NUMA_REPLICATION, it will consume more memory,
> but it will reduce the remote-access in NUMA.

I think the kernel could do this in a much better way, avoiding
duplicating the pages within the same NUMA node.

The other issue is the temporary RWX mapping, which does not
interoperate well with some security hardening features.

Thanks,
Florian
  
Huang Shijie Sept. 3, 2021, 3:15 p.m. UTC | #2
On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
> * Huang Shijie via Libc-alpha:
> 
> > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
> >
> > If LD_NUMA_REPLICATION is set for program foo like this:
> > 	#LD_NUMA_REPLICATION=1 ./foo
> >
> > At the time ld.so mmaps the shared libraries, it will uses
> > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> > for them, and the mmap will trigger COW(copy on write) for the shared
> > libraries at the NUMA node which the program `foo` runs. After the
> > COW, the foo will have a copy of the shared library segment(mmap
> > covered) which belong to the same NUMA node.
> >
> > So when enable LD_NUMA_REPLICATION, it will consume more memory,
> > but it will reduce the remote-access in NUMA.
> 
> I think the kernel could do this in a much better way, avoiding
> duplicating the pages within the same NUMA node.
I think maybe only the per-NUMA-node page cache can avoid it..

> 
> The other issue is the temporary RWX mapping, which does not
> interoperate well with some security hardening features.
> 
Thanks for the feedback.
I CC more people who may  have interesting about this.


Thanks
Huang Shijie
  
Song Bao Hua \(Barry Song\) Sept. 3, 2021, 10:16 p.m. UTC | #3
> -----Original Message-----
> From: Huang Shijie [mailto:shijie@os.amperecomputing.com]
> Sent: Saturday, September 4, 2021 3:16 AM
> To: Florian Weimer <fweimer@redhat.com>
> Cc: Huang Shijie via Libc-alpha <libc-alpha@sourceware.org>;
> carlos@systemhalted.org; zwang@amperecomputing.com;
> patches@amperecomputing.com; akpm@linux-foundation.org;
> torvalds@linux-foundation.org; viro@zeniv.linux.org.uk; willy@infradead.org;
> Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; linux-mm@kvack.org
> Subject: Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
> 
> On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
> > * Huang Shijie via Libc-alpha:
> >
> > > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared
> libraries at run time.
> > >
> > > If LD_NUMA_REPLICATION is set for program foo like this:
> > > 	#LD_NUMA_REPLICATION=1 ./foo
> > >
> > > At the time ld.so mmaps the shared libraries, it will uses
> > > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> > > for them, and the mmap will trigger COW(copy on write) for the shared
> > > libraries at the NUMA node which the program `foo` runs. After the
> > > COW, the foo will have a copy of the shared library segment(mmap
> > > covered) which belong to the same NUMA node.
> > >
> > > So when enable LD_NUMA_REPLICATION, it will consume more memory,
> > > but it will reduce the remote-access in NUMA.
> >
> > I think the kernel could do this in a much better way, avoiding
> > duplicating the pages within the same NUMA node.
> I think maybe only the per-NUMA-node page cache can avoid it..
> 
> >
> > The other issue is the temporary RWX mapping, which does not
> > interoperate well with some security hardening features.
> >
> Thanks for the feedback.
> I CC more people who may  have interesting about this.
> 

Hi Shijie,
Thanks!

Could we have some benchmark data for this patch? I assume it
will benefit frontend-bound large workset which might face
relatively more icache miss?

> 
> Thanks
> Huang Shijie

Thanks
Barry
  
Huang Shijie Sept. 6, 2021, 9:14 a.m. UTC | #4
Hi Barry,
On Fri, Sep 03, 2021 at 10:16:14PM +0000, Song Bao Hua (Barry Song) wrote:
> Could we have some benchmark data for this patch? I assume it
> will benefit frontend-bound large workset which might face
> relatively more icache miss?
I did not find a proper program for the benchmark test..

Thanks
Huang Shijie
  
Huang Shijie Sept. 9, 2021, 10:19 a.m. UTC | #5
Hi Florian,
On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
> * Huang Shijie via Libc-alpha:
> 
> > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
> >
> > If LD_NUMA_REPLICATION is set for program foo like this:
> > 	#LD_NUMA_REPLICATION=1 ./foo
> >
> > At the time ld.so mmaps the shared libraries, it will uses
> > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> > for them, and the mmap will trigger COW(copy on write) for the shared
> > libraries at the NUMA node which the program `foo` runs. After the
> > COW, the foo will have a copy of the shared library segment(mmap
> > covered) which belong to the same NUMA node.
> >
> > So when enable LD_NUMA_REPLICATION, it will consume more memory,
> > but it will reduce the remote-access in NUMA.
> 
> I think the kernel could do this in a much better way, avoiding
> duplicating the pages within the same NUMA node.

https://marc.info/?l=linux-kernel&m=163070220429222&w=2
Since Linus did not think it a good choice to do it in kernel,
glibc is the only place to do it now.
So could you please re-evaluate this patch?

> 
> The other issue is the temporary RWX mapping, which does not
> interoperate well with some security hardening features.
Could you please tell me in detail? I am confused at it.

Thanks
Huang Shijie
  
Florian Weimer Sept. 10, 2021, 11:01 a.m. UTC | #6
* Huang Shijie:

> Hi Florian,
> On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
>> * Huang Shijie via Libc-alpha:
>> 
>> > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
>> >
>> > If LD_NUMA_REPLICATION is set for program foo like this:
>> > 	#LD_NUMA_REPLICATION=1 ./foo
>> >
>> > At the time ld.so mmaps the shared libraries, it will uses
>> > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
>> > for them, and the mmap will trigger COW(copy on write) for the shared
>> > libraries at the NUMA node which the program `foo` runs. After the
>> > COW, the foo will have a copy of the shared library segment(mmap
>> > covered) which belong to the same NUMA node.
>> >
>> > So when enable LD_NUMA_REPLICATION, it will consume more memory,
>> > but it will reduce the remote-access in NUMA.
>> 
>> I think the kernel could do this in a much better way, avoiding
>> duplicating the pages within the same NUMA node.
>
> https://marc.info/?l=linux-kernel&m=163070220429222&w=2
> Since Linus did not think it a good choice to do it in kernel,
> glibc is the only place to do it now.
> So could you please re-evaluate this patch?

The name of the environment variable is quite misleading.  It should
refer to MAP_POPULATE, not NUMA.

As far as I can tell, it does not necessarily have the desired effect
for multi-threaded applications (if some threads end up running on other
NUMA nodes).

And it would be helpful to have some performance numbers.

And I wonder if a FUSE file system could do better, by making one
backing copy per NUMA node instead of one copy per process.

>> The other issue is the temporary RWX mapping, which does not
>> interoperate well with some security hardening features.
> Could you please tell me in detail? I am confused at it.

Some environments block mapping files with PROT_WRITE | PROT_EXEC.

Thanks,
Florian
  
Huang Shijie Sept. 13, 2021, 2:40 p.m. UTC | #7
On Fri, Sep 10, 2021 at 01:01:46PM +0200, Florian Weimer wrote:
Hi Florian,
> > https://marc.info/?l=linux-kernel&m=163070220429222&w=2
> > Since Linus did not think it a good choice to do it in kernel,
> > glibc is the only place to do it now.
> > So could you please re-evaluate this patch?
> 
> The name of the environment variable is quite misleading.  It should
> refer to MAP_POPULATE, not NUMA.
In the kernel, in do_cow_fault:
https://lxr.missinglinkelectronics.com/linux/mm/memory.c#L3508
alloc_page_vma() will be passed with the numa_node_id() which will return
the current NUMA node.

This patch just uses MAP_POPULATE to implement the per-NUMA replication 
of the shared libraries.

So its name should have "NUMA". (I cannot think out a better name now.)

> 
> As far as I can tell, it does not necessarily have the desired effect
> for multi-threaded applications (if some threads end up running on other
> NUMA nodes).
Yes. Not suit for that case.

> 
> And it would be helpful to have some performance numbers.
I hope some one can test this patch, since I cannot find a proper benchmark
program. :(

> 
> And I wonder if a FUSE file system could do better, by making one
> backing copy per NUMA node instead of one copy per process.
Interesting...

> 
> >> The other issue is the temporary RWX mapping, which does not
> >> interoperate well with some security hardening features.
> > Could you please tell me in detail? I am confused at it.
> 
> Some environments block mapping files with PROT_WRITE | PROT_EXEC.
We can filer these environments by adding more check in the "if".

Thanks
Huang Shijie
  

Patch

diff --git a/elf/dl-map-segments.h b/elf/dl-map-segments.h
index f9fb110e..ae6661a7 100644
--- a/elf/dl-map-segments.h
+++ b/elf/dl-map-segments.h
@@ -52,13 +52,33 @@  _dl_map_segments (struct link_map *l, int fd,
                                   c->mapstart & GLRO(dl_use_load_bias))
            - MAP_BASE_ADDR (l));
 
-      /* Remember which part of the address space this object uses.  */
-      l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
+      if (__glibc_unlikely(GLRO(dl_numa_replication)))
+      {
+	/* Trigger the linux kernel COW(copy on write) on purpose */
+        l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
+                                            c->prot|PROT_WRITE,
+                                            MAP_COPY|MAP_FILE|MAP_POPULATE,
+                                            fd, c->mapoff);
+        if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
+          return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
+
+	/* Change back to c->prot if needed */
+	if (!(c->prot & PROT_WRITE))
+        {
+	  if (__mprotect((caddr_t)l->l_map_start, maplength, c->prot))
+            return DL_MAP_SEGMENTS_ERROR_MPROTECT;
+	}
+      }
+      else
+      {
+        /* Remember which part of the address space this object uses.  */
+        l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
                                             c->prot,
                                             MAP_COPY|MAP_FILE,
                                             fd, c->mapoff);
-      if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
-        return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
+        if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
+          return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
+      }
 
       l->l_map_end = l->l_map_start + maplength;
       l->l_addr = l->l_map_start - c->mapstart;
diff --git a/elf/dl-support.c b/elf/dl-support.c
index 01557181..d2eb3164 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -79,6 +79,10 @@  const char *_dl_origin_path;
 /* Nonzero if runtime lookup should not update the .got/.plt.  */
 int _dl_bind_not;
 
+  /* Do we want to do the replication(by linux copy on write) for shared libraries in NUMA?
+   Only valid in the linux system. */
+int _dl_numa_replication;
+
 /* A dummy link map for the executable, used by dlopen to access the global
    scope.  We don't export any symbols ourselves, so this can be minimal.  */
 static struct link_map _dl_main_map =
diff --git a/elf/rtld.c b/elf/rtld.c
index d733359e..10378c00 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -2788,7 +2788,11 @@  process_envvars (struct dl_main_state *state)
 	      GLRO(dl_verbose) = 1;
 	      GLRO(dl_debug_mask) |= DL_DEBUG_PRELINK;
 	      GLRO(dl_trace_prelink) = &envline[17];
+	      break;
 	    }
+
+	  if (memcmp (envline, "NUMA_REPLICATION", 16) == 0)
+	    GLRO(dl_numa_replication) = true;
 	  break;
 
 	case 20:
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 9c152592..f6114522 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -569,6 +569,10 @@  struct rtld_global_ro
   /* Nonzero if runtime lookups should not update the .got/.plt.  */
   EXTERN int _dl_bind_not;
 
+  /* Do we want to do the replication(by linux copy on write) for shared libraries in NUMA?
+     Only valid in the linux system. */
+  EXTERN int _dl_numa_replication;
+
   /* Nonzero if references should be treated as weak during runtime
      linking.  */
   EXTERN int _dl_dynamic_weak;